All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mina Almasry <almasrymina@google.com>
To: Pavel Begunkov <asml.silence@gmail.com>
Cc: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>,
	"Martin KaFai Lau" <martin.lau@linux.dev>,
	"Song Liu" <song@kernel.org>,
	"Yonghong Song" <yonghong.song@linux.dev>,
	"John Fastabend" <john.fastabend@gmail.com>,
	"KP Singh" <kpsingh@kernel.org>,
	"Stanislav Fomichev" <sdf@google.com>,
	"Hao Luo" <haoluo@google.com>, "Jiri Olsa" <jolsa@kernel.org>,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org,
	linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org,
	sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
	linux-arch@vger.kernel.org, bpf@vger.kernel.org,
	linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org,
	dri-devel@lists.freedesktop.org,
	"David S. Miller" <davem@davemloft.net>,
	"Eric Dumazet" <edumazet@google.com>,
	"Jakub Kicinski" <kuba@kernel.org>,
	"Paolo Abeni" <pabeni@redhat.com>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Richard Henderson" <richard.henderson@linaro.org>,
	"Ivan Kokshaysky" <ink@jurassic.park.msu.ru>,
	"Matt Turner" <mattst88@gmail.com>,
	"Thomas Bogendoerfer" <tsbogend@alpha.franken.de>,
	"James E.J. Bottomley" <James.Bottomley@hansenpartnership.com>,
	"Helge Deller" <deller@gmx.de>,
	"Jesper Dangaard Brouer" <hawk@kernel.org>,
	"Ilias Apalodimas" <ilias.apalodimas@linaro.org>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Masami Hiramatsu" <mhiramat@kernel.org>,
	"Arnd Bergmann" <arnd@arndb.de>,
	"Alexei Starovoitov" <ast@kernel.org>,
	"Daniel Borkmann" <daniel@iogearbox.net>,
	"Andrii Nakryiko" <andrii@kernel.org>,
	"David Ahern" <dsahern@kernel.org>,
	"Willem de Bruijn" <willemdebruijn.kernel@gmail.com>,
	"Shuah Khan" <shuah@kernel.org>,
	"Sumit Semwal" <sumit.semwal@linaro.org>,
	"Christian König" <christian.koenig@amd.com>,
	"David Wei" <dw@davidwei.uk>, "Jason Gunthorpe" <jgg@ziepe.ca>,
	"Yunsheng Lin" <linyunsheng@huawei.com>,
	"Shailend Chand" <shailend@google.com>,
	"Harshitha Ramamurthy" <hramamurthy@google.com>,
	"Shakeel Butt" <shakeelb@google.com>,
	"Jeroen de Borst" <jeroendb@google.com>,
	"Praveen Kaligineedi" <pkaligineedi@google.com>
Subject: Re: [RFC PATCH net-next v5 07/14] page_pool: devmem support
Date: Tue, 13 Feb 2024 13:11:28 -0800	[thread overview]
Message-ID: <CAHS8izO2zARuMovrYU3kdwSXsQAM6+SajQjDT3ckSvVOfHwaCQ@mail.gmail.com> (raw)
In-Reply-To: <3374356e-5f4b-4a6f-bb19-8cb7c56103bc@gmail.com>

On Tue, Feb 13, 2024 at 5:28 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 12/18/23 02:40, Mina Almasry wrote:
> > Convert netmem to be a union of struct page and struct netmem. Overload
> > the LSB of struct netmem* to indicate that it's a net_iov, otherwise
> > it's a page.
> >
> > Currently these entries in struct page are rented by the page_pool and
> > used exclusively by the net stack:
> >
> > struct {
> >       unsigned long pp_magic;
> >       struct page_pool *pp;
> >       unsigned long _pp_mapping_pad;
> >       unsigned long dma_addr;
> >       atomic_long_t pp_ref_count;
> > };
> >
> > Mirror these (and only these) entries into struct net_iov and implement
> > netmem helpers that can access these common fields regardless of
> > whether the underlying type is page or net_iov.
> > Implement checks for net_iov in netmem helpers which delegate to mm
> > APIs, to ensure net_iov are never passed to the mm stack.
> >
> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> >
> > ---
> >
> > RFCv5:
> > - Use netmem instead of page* with LSB set.
> > - Use pp_ref_count for refcounting net_iov.
> > - Removed many of the custom checks for netmem.
> >
> > v1:
> > - Disable fragmentation support for iov properly.
> > - fix napi_pp_put_page() path (Yunsheng).
> > - Use pp_frag_count for devmem refcounting.
> >
> > ---
> >   include/net/netmem.h            | 145 ++++++++++++++++++++++++++++++--
> >   include/net/page_pool/helpers.h |  25 +++---
> >   net/core/page_pool.c            |  26 +++---
> >   net/core/skbuff.c               |   9 +-
> >   4 files changed, 164 insertions(+), 41 deletions(-)
> >
> > diff --git a/include/net/netmem.h b/include/net/netmem.h
> > index 31f338f19da0..7557aecc0f78 100644
> > --- a/include/net/netmem.h
> > +++ b/include/net/netmem.h
> > @@ -12,11 +12,47 @@
> >
> >   /* net_iov */
> >
> > +DECLARE_STATIC_KEY_FALSE(page_pool_mem_providers);
> > +
> > +/*  We overload the LSB of the struct page pointer to indicate whether it's
> > + *  a page or net_iov.
> > + */
> > +#define NET_IOV 0x01UL
> > +
> >   struct net_iov {
> > +     unsigned long __unused_padding;
> > +     unsigned long pp_magic;
> > +     struct page_pool *pp;
> >       struct dmabuf_genpool_chunk_owner *owner;
> >       unsigned long dma_addr;
> > +     atomic_long_t pp_ref_count;
> >   };
>
> I wonder if it would be better to extract a common sub-struct
> used in struct page, struct_group_tagged can help to avoid
> touching old code:
>
> struct page {
>         unsigned long flags;
>         union {
>                 ...
>                 struct_group_tagged(<struct_name>, ...,
>                         /**
>                          * @pp_magic: magic value to avoid recycling non
>                          * page_pool allocated pages.
>                          */
>                         unsigned long pp_magic;
>                         struct page_pool *pp;
>                         unsigned long _pp_mapping_pad;
>                         unsigned long dma_addr;
>                         atomic_long_t pp_ref_count;
>                 );
>         };
> }
>
> struct net_iov {
>         unsigned long pad;
>         struct <struct_name> p;
> };
>
>
> A bit of a churn with the padding and nesting net_iov but looks
> sturdier. No duplication, and you can just check positions of the
> structure instead of per-field NET_IOV_ASSERT_OFFSET, which you
> have to not forget to update e.g. when adding a new field. Also,

Yes, this is nicer. If possible I'll punt it to a minor cleanup as a
follow up change. Logistically I think if this series need-not touch
code outside of net/, that's better.

> with the change __netmem_clear_lsb can return a pointer to that
> structure, casting struct net_iov when it's a page is a bit iffy.
>
> And the next question would be whether it'd be a good idea to encode
> iov vs page not by setting a bit but via one of the fields in the
> structure, maybe pp_magic.
>

I will push back against this, for 2 reasons:

1. I think pp_magic's first 2 bits (and maybe more) are used by mm
code and thus I think extending usage of pp_magic in this series is a
bit iffy and I would like to avoid it. I just don't want to touch the
semantics of struct page if I don't have to.
2. I think this will be a measurable perf regression. Currently we can
tell if a pointer is a page or net_iov without dereferencing the
pointer and dirtying the cache-line. This will cause us to possibly
dereference the pointer in areas where we don't need to. I think I had
an earlier version of this code that required a dereference to tell if
a page was devmem and Eric pointed to me it was a perf regression.

I also don't see any upside of using pp_magic, other than making the
code slightly more readable, maybe.

> With that said I'm a bit concerned about the net_iov size. If each
> represents 4096 bytes and you're registering 10MB, then you need
> 30 pages worth of memory just for the iov array. Makes kvmalloc
> a must even for relatively small sizes.
>

This I think is an age-old challenge with pages. 1.6% of the machine's
memory is 'wasted' on every machine because a struct page needs to be
allocated for each PAGE_SIZE region. We're running into the same issue
here where if we want to refer to PAGE_SIZE regions of memory we need
to allocate some reference to it. Note that net_iov can be relatively
easily extended to support N order pages. Also note that in the devmem
TCP use case it's not really an issue; the minor increase in mem
utilization is more than offset by the saving in memory bw as compared
to using host memory as a bounce buffer. All in all I vote this is
something that can be tuned or improved in the future if someone finds
the extra memory usage a hurdle to using devmem TCP or this net_iov
infra.

> And the final bit, I don't believe the overlay is necessary in
> this series. Optimisations are great, but this one is a bit more on
> the controversial side. Unless I missed something and it does make
> things easier, it might make sense to do it separately later.
>

I completely agree, the overlay is not necessary. I implemented the
overlay in response to Yunsheng's  strong requests for more 'unified'
processing between page and devmem. This is the most unification I can
do IMO without violating the requirements from Jason. I'm prepared to
remove the overlay if it turns out controversial, but so far I haven't
seen any complaints. Jason, please do take a look if you have not
already.

>
> > +/* These fields in struct page are used by the page_pool and net stack:
> > + *
> > + *   struct {
> > + *           unsigned long pp_magic;
> > + *           struct page_pool *pp;
> > + *           unsigned long _pp_mapping_pad;
> > + *           unsigned long dma_addr;
> > + *           atomic_long_t pp_ref_count;
> > + *   };
> > + *
> > + * We mirror the page_pool fields here so the page_pool can access these fields
> > + * without worrying whether the underlying fields belong to a page or net_iov.
> > + *
> > + * The non-net stack fields of struct page are private to the mm stack and must
> > + * never be mirrored to net_iov.
> > + */
> > +#define NET_IOV_ASSERT_OFFSET(pg, iov)             \
> > +     static_assert(offsetof(struct page, pg) == \
> > +                   offsetof(struct net_iov, iov))
> > +NET_IOV_ASSERT_OFFSET(pp_magic, pp_magic);
> > +NET_IOV_ASSERT_OFFSET(pp, pp);
> > +NET_IOV_ASSERT_OFFSET(dma_addr, dma_addr);
> > +NET_IOV_ASSERT_OFFSET(pp_ref_count, pp_ref_count);
> > +#undef NET_IOV_ASSERT_OFFSET
> > +
> >   static inline struct dmabuf_genpool_chunk_owner *
> >   net_iov_owner(const struct net_iov *niov)
> >   {
> > @@ -47,19 +83,25 @@ net_iov_binding(const struct net_iov *niov)
> >   struct netmem {
> >       union {
> >               struct page page;
> > -
> > -             /* Stub to prevent compiler implicitly converting from page*
> > -              * to netmem_t* and vice versa.
> > -              *
> > -              * Other memory type(s) net stack would like to support
> > -              * can be added to this union.
> > -              */
> > -             void *addr;
> > +             struct net_iov niov;
> >       };
> >   };
> >
> ...
>
> --
> Pavel Begunkov



--
Thanks,
Mina

  reply	other threads:[~2024-02-13 21:11 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-18  2:40 [RFC PATCH net-next v5 00/14] Device Memory TCP Mina Almasry
2023-12-18  2:40 ` Mina Almasry
2023-12-18  2:40 ` [RFC PATCH net-next v5 01/14] net: page_pool: create hooks for custom page providers Mina Almasry
2023-12-18  2:40   ` Mina Almasry
2023-12-18  2:40 ` [RFC PATCH net-next v5 02/14] net: page_pool: factor out page_pool recycle check Mina Almasry
2023-12-18  2:40   ` Mina Almasry
2023-12-18  2:40 ` [RFC PATCH net-next v5 03/14] net: netdev netlink api to bind dma-buf to a net device Mina Almasry
2023-12-18  2:40   ` Mina Almasry
2023-12-18  2:40 ` [RFC PATCH net-next v5 04/14] netdev: support binding dma-buf to netdevice Mina Almasry
2023-12-18  2:40   ` Mina Almasry
2023-12-18  2:40 ` [RFC PATCH net-next v5 05/14] netdev: netdevice devmem allocator Mina Almasry
2023-12-18  2:40   ` Mina Almasry
2024-02-13 13:15   ` Pavel Begunkov
2024-02-13 20:01     ` Mina Almasry
2023-12-18  2:40 ` [RFC PATCH net-next v5 06/14] page_pool: convert to use netmem Mina Almasry
2023-12-18  2:40   ` Mina Almasry
2023-12-18  2:40 ` [RFC PATCH net-next v5 07/14] page_pool: devmem support Mina Almasry
2023-12-18  2:40   ` Mina Almasry
2024-02-13 13:18   ` Pavel Begunkov
2024-02-13 21:11     ` Mina Almasry [this message]
2024-02-14 15:30       ` Pavel Begunkov
2023-12-18  2:40 ` [RFC PATCH net-next v5 08/14] memory-provider: dmabuf devmem memory provider Mina Almasry
2023-12-18  2:40   ` Mina Almasry
2024-02-13 13:19   ` Pavel Begunkov
2023-12-18  2:40 ` [RFC PATCH net-next v5 09/14] net: support non paged skb frags Mina Almasry
2023-12-18  2:40   ` Mina Almasry
2023-12-18  2:40 ` [RFC PATCH net-next v5 10/14] net: add support for skbs with unreadable frags Mina Almasry
2023-12-18  2:40   ` Mina Almasry
2023-12-18  2:40 ` [RFC PATCH net-next v5 11/14] tcp: RX path for devmem TCP Mina Almasry
2023-12-18  2:40   ` Mina Almasry
2023-12-18  2:40 ` [RFC PATCH net-next v5 12/14] net: add SO_DEVMEM_DONTNEED setsockopt to release RX frags Mina Almasry
2023-12-18  2:40   ` Mina Almasry
2023-12-28  8:46   ` Helge Deller
2023-12-18  2:40 ` [RFC PATCH net-next v5 13/14] net: add devmem TCP documentation Mina Almasry
2023-12-18  2:40   ` Mina Almasry
2023-12-18  2:40 ` [RFC PATCH net-next v5 14/14] selftests: add ncdevmem, netcat for devmem TCP Mina Almasry
2023-12-18  2:40   ` Mina Almasry

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAHS8izO2zARuMovrYU3kdwSXsQAM6+SajQjDT3ckSvVOfHwaCQ@mail.gmail.com \
    --to=almasrymina@google.com \
    --cc=James.Bottomley@hansenpartnership.com \
    --cc=andrii@kernel.org \
    --cc=arnd@arndb.de \
    --cc=asml.silence@gmail.com \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=christian.koenig@amd.com \
    --cc=corbet@lwn.net \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=deller@gmx.de \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=dsahern@kernel.org \
    --cc=dw@davidwei.uk \
    --cc=edumazet@google.com \
    --cc=haoluo@google.com \
    --cc=hawk@kernel.org \
    --cc=hramamurthy@google.com \
    --cc=ilias.apalodimas@linaro.org \
    --cc=ink@jurassic.park.msu.ru \
    --cc=jeroendb@google.com \
    --cc=jgg@ziepe.ca \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=kuba@kernel.org \
    --cc=linux-alpha@vger.kernel.org \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-media@vger.kernel.org \
    --cc=linux-mips@vger.kernel.org \
    --cc=linux-parisc@vger.kernel.org \
    --cc=linux-trace-kernel@vger.kernel.org \
    --cc=linyunsheng@huawei.com \
    --cc=martin.lau@linux.dev \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mattst88@gmail.com \
    --cc=mhiramat@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=pkaligineedi@google.com \
    --cc=richard.henderson@linaro.org \
    --cc=rostedt@goodmis.org \
    --cc=sdf@google.com \
    --cc=shailend@google.com \
    --cc=shakeelb@google.com \
    --cc=shuah@kernel.org \
    --cc=song@kernel.org \
    --cc=sparclinux@vger.kernel.org \
    --cc=sumit.semwal@linaro.org \
    --cc=tsbogend@alpha.franken.de \
    --cc=willemdebruijn.kernel@gmail.com \
    --cc=yonghong.song@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.