From: Stanislav Fomichev <sdf@google.com>
To: Mina Almasry <almasrymina@google.com>
Cc: "David Ahern" <dsahern@kernel.org>,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org,
linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org,
linaro-mm-sig@lists.linaro.org,
"David S. Miller" <davem@davemloft.net>,
"Eric Dumazet" <edumazet@google.com>,
"Jakub Kicinski" <kuba@kernel.org>,
"Paolo Abeni" <pabeni@redhat.com>,
"Jesper Dangaard Brouer" <hawk@kernel.org>,
"Ilias Apalodimas" <ilias.apalodimas@linaro.org>,
"Arnd Bergmann" <arnd@arndb.de>,
"Willem de Bruijn" <willemdebruijn.kernel@gmail.com>,
"Shuah Khan" <shuah@kernel.org>,
"Sumit Semwal" <sumit.semwal@linaro.org>,
"Christian König" <christian.koenig@amd.com>,
"Shakeel Butt" <shakeelb@google.com>,
"Jeroen de Borst" <jeroendb@google.com>,
"Praveen Kaligineedi" <pkaligineedi@google.com>,
"Willem de Bruijn" <willemb@google.com>,
"Kaiyuan Zhang" <kaiyuanz@google.com>
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags
Date: Mon, 6 Nov 2023 13:59:23 -0800
Message-ID: <ZUlhu4hlTaqR3CTh@google.com>
In-Reply-To: <CAHS8izMrnVUfbbS=OcJ6JT9SZRRfZ2MC7UnggthpZT=zf2BGLA@mail.gmail.com>

On 11/06, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 11:34 AM David Ahern <dsahern@kernel.org> wrote:
> >
> > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > On 11/05, Mina Almasry wrote:
> > >> For device memory TCP, we expect the skb headers to be available in host
> > >> memory for access, and we expect the skb frags to be in device memory
> > >> and unaccessible to the host. We expect there to be no mixing and
> > >> matching of device memory frags (unaccessible) with host memory frags
> > >> (accessible) in the same skb.
> > >>
> > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > >> are device memory frags or not.
> > >>
> > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > >> and marks the skb as skb->devmem accordingly.
> > >>
> > >> Add checks through the network stack to avoid accessing the frags of
> > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > >>
> > >> Signed-off-by: Willem de Bruijn <willemb@google.com>
> > >> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> > >> Signed-off-by: Mina Almasry <almasrymina@google.com>
> > >>
> > >> ---
> > >> include/linux/skbuff.h | 14 +++++++-
> > >> include/net/tcp.h | 5 +--
> > >> net/core/datagram.c | 6 ++++
> > >> net/core/gro.c | 5 ++-
> > >> net/core/skbuff.c | 77 ++++++++++++++++++++++++++++++++++++------
> > >> net/ipv4/tcp.c | 6 ++++
> > >> net/ipv4/tcp_input.c | 13 +++++--
> > >> net/ipv4/tcp_output.c | 5 ++-
> > >> net/packet/af_packet.c | 4 +--
> > >> 9 files changed, 115 insertions(+), 20 deletions(-)
> > >>
> > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > >> index 1fae276c1353..8fb468ff8115 100644
> > >> --- a/include/linux/skbuff.h
> > >> +++ b/include/linux/skbuff.h
> > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > >> * @csum_level: indicates the number of consecutive checksums found in
> > >> * the packet minus one that have been verified as
> > >> * CHECKSUM_UNNECESSARY (max 3)
> > >> + * @devmem: indicates that all the fragments in this skb are backed by
> > >> + * device memory.
> > >> * @dst_pending_confirm: need to confirm neighbour
> > >> * @decrypted: Decrypted SKB
> > >> * @slow_gro: state present at GRO time, slower prepare step required
> > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > >> #if IS_ENABLED(CONFIG_IP_SCTP)
> > >> __u8 csum_not_inet:1;
> > >> #endif
> > >> -
> > >> + __u8 devmem:1;
> > >> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > >> __u16 tc_index; /* traffic control index */
> > >> #endif
> > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > >> __skb_zcopy_downgrade_managed(skb);
> > >> }
> > >>
> > >> +/* Return true if frags in this skb are not readable by the host. */
> > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > >> +{
> > >> + return skb->devmem;
> > >
> > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > It better communicates the fact that the stack shouldn't dereference the
> > > frags (because it has 'devmem' fragments or for some other potential
> > > future reason).
> >
> > +1.
> >
> > Also, the flag on the skb is an optimization - a high level signal that
> > one or more frags is in unreadable memory. There is no requirement that
> > all of the frags are in the same memory type.

David: maybe there should be such a requirement (that all of the frags
are unreadable)? It might be easier to support initially; we can relax
it later on.

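To illustrate what such a strict rule could look like, here is a
minimal userspace sketch. It is not the patch's code: struct model_skb,
struct model_frag, model_skb_add_frag() and MODEL_MAX_FRAGS are all
hypothetical stand-ins, and the frag limit is arbitrary. The only point
is that the first frag decides the memory type and any later frag of
the other type is refused, so the per-skb bit always describes every
frag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_FRAGS 8	/* arbitrary limit for this sketch */

/* Hypothetical stand-in for a frag; it only records its memory type. */
struct model_frag {
	bool unreadable;	/* true for device memory */
};

/* Hypothetical stand-in for struct sk_buff. */
struct model_skb {
	size_t nr_frags;
	bool not_readable;	/* the per-skb bit under discussion */
	struct model_frag frags[MODEL_MAX_FRAGS];
};

/*
 * Append a frag under the strict rule: once the skb has a frag, every
 * later frag must have the same memory type.
 * Returns 0 on success, -1 if the skb is full or the types would mix.
 */
static int model_skb_add_frag(struct model_skb *skb, bool unreadable)
{
	assert(skb != NULL);

	if (skb->nr_frags >= MODEL_MAX_FRAGS)
		return -1;
	if (skb->nr_frags > 0 && skb->not_readable != unreadable)
		return -1;

	skb->frags[skb->nr_frags++].unreadable = unreadable;
	skb->not_readable = unreadable;
	return 0;
}
```

Relaxing the rule later would mean dropping the second check and
treating the bit as "at least one frag is unreadable".
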
> The flag indicates that the skb contains all devmem dma-buf memory
> specifically, not generic 'not_readable' frags as the comment says:
>
> + * @devmem: indicates that all the fragments in this skb are backed by
> + * device memory.
>
> The reason it's not a generic 'not_readable' flag is because handing
> off a generic not_readable skb to the userspace is semantically not
> what we're doing. recvmsg() is augmented in this patch series to
> return a devmem skb to the user via a cmsg_devmem struct which refers
> specifically to the memory in the dma-buf. recvmsg() in this patch
> series is not augmented to give any 'not_readable' skb to the
> userspace.
>
> IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> correct. If a new type of unreadable skbs are introduced to the stack,
> I imagine the stack would implement:
>
> 1. new header flag: skb->newmem
> 2.
>
> static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> {
> 	return skb->devmem || skb->newmem;
> }
>
> 3. tcp_recvmsg_devmem() would handle skb->devmem skbs as in this patch
> series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.

You copy it to userspace in a special way because your frags are
page_pool_iovs, i.e. page_is_page_pool_iov() is true for them. I agree
with David: the skb bit is just an optimization.

For most of the core stack, it doesn't matter why your skb is not
readable. For the few places where it does matter (recvmsg?), you can
double-check your frags (all or some) with page_is_page_pool_iov().
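
Roughly what I have in mind, as a userspace sketch rather than real
kernel code. Everything named rx_* or RX_* below is hypothetical and
exists only for illustration: rx_frag_is_dmabuf() plays the role of
page_is_page_pool_iov(), rx_skb_frags_not_readable() plays the role of
skb_frags_not_readable(), and the three path names are made up. The
generic bit answers "may the stack touch the frags?", and only the
receive path looks at the individual frags to decide how to hand the
data to userspace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RX_MAX_FRAGS 8	/* arbitrary limit for this sketch */

/* Hypothetical frag types; RX_FRAG_DMABUF models a page_pool_iov. */
enum rx_frag_kind { RX_FRAG_PAGE, RX_FRAG_DMABUF };

struct rx_frag {
	enum rx_frag_kind kind;
};

/* Hypothetical stand-in for struct sk_buff. */
struct rx_skb {
	bool not_readable;	/* generic hint, independent of the reason */
	size_t nr_frags;
	struct rx_frag frags[RX_MAX_FRAGS];
};

/* Made-up delivery paths for the receive side. */
enum rx_path { RX_PATH_COPY, RX_PATH_DMABUF, RX_PATH_UNSUPPORTED };

/* Per-frag check, modelling page_is_page_pool_iov(). */
static bool rx_frag_is_dmabuf(const struct rx_frag *frag)
{
	return frag->kind == RX_FRAG_DMABUF;
}

/* What most of the stack would consult: only the generic bit. */
static bool rx_skb_frags_not_readable(const struct rx_skb *skb)
{
	return skb->not_readable;
}

/* The receive path is one of the few places that inspects the frags. */
static enum rx_path rx_recvmsg_path(const struct rx_skb *skb)
{
	size_t i;

	assert(skb->nr_frags <= RX_MAX_FRAGS);

	if (!rx_skb_frags_not_readable(skb))
		return RX_PATH_COPY;

	for (i = 0; i < skb->nr_frags; i++)
		if (!rx_frag_is_dmabuf(&skb->frags[i]))
			return RX_PATH_UNSUPPORTED;

	return RX_PATH_DMABUF;
}
```

With this shape, a future kind of unreadable memory would only need a
new per-frag check in the receive path; the rest of the stack keeps
testing the same bit.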

Unrelated: we probably need a socket-to-dmabuf association as well (via
netlink or something).

We are fundamentally receiving into and sending from a dmabuf (devmem ==
dmabuf). Once you have this association, recvmsg shouldn't need any new
special flags.