All of lore.kernel.org
 help / color / mirror / Atom feed
From: Willem de Bruijn <willemb@google.com>
To: Martin KaFai Lau <kafai@fb.com>
Cc: netdev@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	David Miller <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>,
	kernel-team@fb.com
Subject: Re: [RFC PATCH v2 net-next] net: Preserve skb delivery time during forward
Date: Thu, 16 Dec 2021 10:32:23 -0500	[thread overview]
Message-ID: <CA+FuTSdR0yPwXAZZjziGOeujJ_Ac19fX1DsqfRXX3Dsn1uFPAQ@mail.gmail.com> (raw)
In-Reply-To: <20211215201158.271976-1-kafai@fb.com>

h

On Wed, Dec 15, 2021 at 3:12 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> The skb->skb_mstamp_ns is used as the EDT (Earliest Department Time)
> in TCP.  skb->skb_mstamp_ns is a union member of skb->tstamp.
>
> When the skb traveling veth and being forwarded like below, the skb->tstamp
> is reset to 0 at multiple points.
>
>                                                                         (c: skb->tstamp = 0)
>                                                                          vv
> tcp-sender => veth@netns => veth@hostns(b: rx: skb->tstamp = real_clock) => fq@eth0
>                          ^^
>                         (a: skb->tstamp = 0)
>
> (a) veth@netns TX to veth@hostns:
>     skb->tstamp (mono clock) is a EDT and it is in future time.
>     Reset to 0 so that it won't skip the net_timestamp_check at the
>     RX side in (b).
> (b) RX (netif_rx) in veth@hostns:
>     net_timestamp_check puts a current time (real clock) in skb->tstamp.
> (c) veth@hostns forward to fq@eth0:
>     skb->tstamp is reset back to 0 again because fq is using
>     mono clock.
>
> This leads to an unstable TCP throughput issue described by Daniel in [0].
>
> We also have a use case that a bpf runs at ingress@veth@hostns
> to set EDT in skb->tstamp to limit the bandwidth usage
> of a particular netns.  This EDT currently also gets
> reset in step (c) as described above.
>
> Unlike RFC v1 trying to migrate rx tstamp to mono first,
> this patch is to preserve the EDT in skb->skb_mstamp_ns during forward.

Sucks we have to do this complex dance, because there is no room for
an skb->delivery_time.

> The idea is to temporarily store skb->skb_mstamp_ns during forward.
> skb_shinfo(skb)->hwtstamps is used as a temporary store and
> it is union-ed with the newly added "u64 tx_delivery_tstamp".
> hwtstamps should only be used when a packet is received or
> sent out of a hw device.
>
> During forward, skb->tstamp will be temporarily stored in
> skb_shinfo(skb)->tx_delivery_tstamp and a new bit
> (SKBTX_DELIVERY_TSTAMP) in skb_shinfo(skb)->tx_flags
> will also be set to tell tx_delivery_tstamp is in use.
> hwtstamps is accessed through the skb_hwtstamps() getter,
> so unlikely(tx_flags & SKBTX_DELIVERY_TSTAMP) can
> be tested in there and reset tx_delivery_tstamp to 0
> before hwtstamps is used.
>
> After moving the skb->tstamp to skb_shinfo(skb)->tx_delivery_tstamp,
> the skb->tstamp will still be reset to 0 during forward.  Thus,
> on the RX side (__netif_receive_skb_core), all existing code paths
> will still get the received time in real clock and will work as-is.
>
> When this skb finally xmit-ing out in __dev_queue_xmit(),
> it will check the SKBTX_DELIVERY_TSTAMP bit in skb_shinfo(skb)->tx_flags
> and restore the skb->tstamp from skb_shinfo(skb)->tx_delivery_tstamp
> if needed.  This bit test is done immediately after another existing
> bit test 'skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP'.
>
> Another bit SKBTX_DELIVERY_TSTAMP_ALLOW_FWD is added
> to skb_shinfo(skb)->tx_flags.  It is used to specify
> the skb->tstamp is set as a delivery time and can be
> temporarily stored during forward.  This bit is now set
> when EDT is stored in skb->skb_mstamp_ns in tcp_output.c
> This will avoid packet received from a NIC with real-clock
> in skb->tstamp being forwarded without reset.
>
> The change in af_packet.c is to avoid it calling skb_hwtstamps()
> which will reset the skb_shinfo(skb)->tx_delivery_tstamp.
> af_packet.c only wants to read the hwtstamps instead of
> storing a time in it, so a new read only getter skb_hwtstamps_ktime()
> is added.  Otherwise, a tcpdump will trigger this code path
> and unnecessarily reset the EDT stored in tx_delivery_tstamp.
>
> [Note: not all skb->tstamp=0 reset has been changed in this RFC yet]
>
> [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  include/linux/skbuff.h  | 52 ++++++++++++++++++++++++++++++++++++++++-
>  net/bridge/br_forward.c |  2 +-
>  net/core/dev.c          |  1 +
>  net/core/filter.c       |  6 ++---
>  net/core/skbuff.c       |  2 +-
>  net/ipv4/ip_forward.c   |  2 +-
>  net/ipv4/tcp_output.c   | 21 +++++++++++------
>  net/ipv6/ip6_output.c   |  2 +-
>  net/packet/af_packet.c  |  8 +++----
>  9 files changed, 77 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 6535294f6a48..9bf0a1e2a1bd 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -435,9 +435,17 @@ enum {
>         /* device driver is going to provide hardware time stamp */
>         SKBTX_IN_PROGRESS = 1 << 2,
>
> +       /* shinfo stores a future tx_delivery_tstamp instead of hwtstamps */
> +       SKBTX_DELIVERY_TSTAMP = 1 << 3,
> +
>         /* generate wifi status information (where possible) */
>         SKBTX_WIFI_STATUS = 1 << 4,
>
> +       /* skb->tstamp stored a future delivery time which
> +        * was set by a local sk and it can be fowarded.
> +        */
> +       SKBTX_DELIVERY_TSTAMP_ALLOW_FWD = 1 << 5,
> +
>         /* generate software time stamp when entering packet scheduling */
>         SKBTX_SCHED_TSTAMP = 1 << 6,
>  };
> @@ -530,7 +538,14 @@ struct skb_shared_info {
>         /* Warning: this field is not always filled in (UFO)! */
>         unsigned short  gso_segs;
>         struct sk_buff  *frag_list;
> -       struct skb_shared_hwtstamps hwtstamps;
> +       union {
> +               /* If SKBTX_DELIVERY_TSTAMP is set in tx_flags,
> +                * tx_delivery_tstamp is stored instead of
> +                * hwtstamps.
> +                */

Should we just encode the timebase and/or type { timestamp,
delivery_time } in th lower bits of the timestamp field? Its
resolution is higher than actual clock precision.

> +               struct skb_shared_hwtstamps hwtstamps;
> +               u64 tx_delivery_tstamp;
> +       };
>         unsigned int    gso_type;
>         u32             tskey;
>
> @@ -1463,9 +1478,44 @@ static inline unsigned int skb_end_offset(const struct sk_buff *skb)
>
>  static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
>  {
> +       if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP)) {
> +               skb_shinfo(skb)->tx_flags &= ~SKBTX_DELIVERY_TSTAMP;
> +               skb_shinfo(skb)->tx_delivery_tstamp = 0;
> +       }
>         return &skb_shinfo(skb)->hwtstamps;
>  }
>
> +/* Caller only needs to read the hwtstamps as ktime.
> + * To update hwtstamps,  HW device driver should call the writable
> + * version skb_hwtstamps() that returns a pointer.
> + */
> +static inline ktime_t skb_hwtstamps_ktime(const struct sk_buff *skb)
> +{
> +       if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP))
> +               return 0;
> +       return skb_shinfo(skb)->hwtstamps.hwtstamp;
> +}
> +
> +static inline void skb_scrub_tstamp(struct sk_buff *skb)

skb_save_delivery_time?

is non-zero skb->tstamp test not sufficient, instead of
SKBTX_DELIVERY_TSTAMP_ALLOW_FWD.

It is if only called on the egress path. Is bpf on ingress the only
reason for this?

> +{
> +       if (skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP_ALLOW_FWD) {
> +               skb_shinfo(skb)->tx_delivery_tstamp = skb->tstamp;
> +               skb_shinfo(skb)->tx_flags |= SKBTX_DELIVERY_TSTAMP;
> +               skb_shinfo(skb)->tx_flags &= ~SKBTX_DELIVERY_TSTAMP_ALLOW_FWD;
> +       }

Is this only called when there are no clones/shares?

> +       skb->tstamp = 0;
> +}
> +
> +static inline void skb_restore_delivery_time(struct sk_buff *skb)
> +{
> +       if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP)) {
> +               skb->tstamp = skb_shinfo(skb)->tx_delivery_tstamp;
> +               skb_shinfo(skb)->tx_delivery_tstamp = 0;
> +               skb_shinfo(skb)->tx_flags &= ~SKBTX_DELIVERY_TSTAMP;
> +               skb_shinfo(skb)->tx_flags |= SKBTX_DELIVERY_TSTAMP_ALLOW_FWD;
> +       }
> +}
> +
>  static inline struct ubuf_info *skb_zcopy(struct sk_buff *skb)
>  {
>         bool is_zcopy = skb && skb_shinfo(skb)->flags & SKBFL_ZEROCOPY_ENABLE;
> diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
> index ec646656dbf1..a3ba6195f2e3 100644
> --- a/net/bridge/br_forward.c
> +++ b/net/bridge/br_forward.c
> @@ -62,7 +62,7 @@ EXPORT_SYMBOL_GPL(br_dev_queue_push_xmit);
>
>  int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
>  {
> -       skb->tstamp = 0;
> +       skb_scrub_tstamp(skb);
>         return NF_HOOK(NFPROTO_BRIDGE, NF_BR_POST_ROUTING,
>                        net, sk, skb, NULL, skb->dev,
>                        br_dev_queue_push_xmit);
> diff --git a/net/core/dev.c b/net/core/dev.c
> index a855e41bbe39..e9e7de758cba 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c

  parent reply	other threads:[~2021-12-16 15:33 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-12-15 20:11 [RFC PATCH v2 net-next] net: Preserve skb delivery time during forward Martin KaFai Lau
2021-12-15 22:07 ` Julian Anastasov
2021-12-16 15:32 ` Willem de Bruijn [this message]
2021-12-16 22:23   ` Martin KaFai Lau
2021-12-16 22:58     ` Willem de Bruijn
2021-12-16 23:42       ` Daniel Borkmann
2021-12-17  7:33         ` Martin KaFai Lau
2021-12-17 11:13           ` Daniel Borkmann
2021-12-17 17:57             ` Martin KaFai Lau
2021-12-17  0:33       ` Martin KaFai Lau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CA+FuTSdR0yPwXAZZjziGOeujJ_Ac19fX1DsqfRXX3Dsn1uFPAQ@mail.gmail.com \
    --to=willemb@google.com \
    --cc=ast@kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=kafai@fb.com \
    --cc=kernel-team@fb.com \
    --cc=kuba@kernel.org \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.