Re: [RFC PATCH v2 03/21] mm: Allow DMA mapping of pages which are not online

All of lore.kernel.org
 help / color / mirror / Atom feed

* Re: [RFC PATCH v2 03/21] mm: Allow DMA mapping of pages which are not online
       [not found] ` <20200727052846.4070247-4-jonathan.lemon@gmail.com>
@ 2020-07-27  5:58   ` Christoph Hellwig
  2020-07-27 17:08     ` Jonathan Lemon
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2020-07-27  5:58 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, kernel-team, hch, robin.murphy, akpm, davem, kuba,
	willemb, edumazet, steffen.klassert, saeedm, maximmi,
	bjorn.topel, magnus.karlsson, borisp, david

On Sun, Jul 26, 2020 at 10:28:28PM -0700, Jonathan Lemon wrote:
> Change the system RAM check from 'valid' to 'online', so dummy
> pages which refer to external DMA resources can be mapped.


NAK.  This looks completely bogus.  Maybe you do have a good reason
somewhere (although I doubt it), but then you'd actualy need to both
explain it here, and also actually make sure I get the whole damn series.

Until you fix up how you send these stupid partial cc lists I'll just
ignore this crap.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
       [not found] ` <20200727052846.4070247-22-jonathan.lemon@gmail.com>
@ 2020-07-27  7:35   ` Christoph Hellwig
  2020-07-27 17:00     ` Jonathan Lemon
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2020-07-27  7:35 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, kernel-team, hch, robin.murphy, akpm, davem, kuba,
	willemb, edumazet, steffen.klassert, saeedm, maximmi,
	bjorn.topel, magnus.karlsson, borisp, david

Seriously?  If you only even considered this is something reasonable
to do you should not be anywhere near Linux kernel development.

Just go away!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 13/21] net/tcp: Pad TCP options out to a fixed size for netgpu
       [not found] ` <20200727052846.4070247-14-jonathan.lemon@gmail.com>
@ 2020-07-27 15:19   ` Eric Dumazet
  2020-07-27 17:20     ` Jonathan Lemon
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2020-07-27 15:19 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 12:51 AM Jonathan Lemon
<jonathan.lemon@gmail.com> wrote:
>
> From: Jonathan Lemon <bsd@fb.com>
>
> The "header splitting" feature used by netgpu doesn't actually parse
> the incoming packet header.  Instead, it splits the packet at a fixed
> offset.  In order for this to work, the sender needs to send packets
> with a fixed header size.
>
> Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> ---
>  net/ipv4/tcp_output.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index d8f16f6a9b02..e8a74d0f7ad2 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -438,6 +438,7 @@ struct tcp_out_options {
>         u8 ws;                  /* window scale, 0 to disable */
>         u8 num_sack_blocks;     /* number of SACK blocks to include */
>         u8 hash_size;           /* bytes in hash_location */
> +       u8 pad_size;            /* additional nops for padding */
>         __u8 *hash_location;    /* temporary pointer, overloaded */
>         __u32 tsval, tsecr;     /* need to include OPTION_TS */
>         struct tcp_fastopen_cookie *fastopen_cookie;    /* Fast open cookie */
> @@ -562,6 +563,17 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
>         smc_options_write(ptr, &options);
>
>         mptcp_options_write(ptr, opts);
> +
> +#if IS_ENABLED(CONFIG_NETGPU)
> +       /* pad out options */
> +       if (opts->pad_size) {
> +               int len = opts->pad_size;
> +               u8 *p = (u8 *)ptr;
> +
> +               while (len--)
> +                       *p++ = TCPOPT_NOP;
> +       }
> +#endif
>  }
>
>  static void smc_set_option(const struct tcp_sock *tp,
> @@ -826,6 +838,14 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
>                         opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
>         }
>
> +#if IS_ENABLED(CONFIG_NETGPU)
> +       /* force padding */
> +       if (size < 20) {
> +               opts->pad_size = 20 - size;
> +               size += opts->pad_size;
> +       }
> +#endif
> +

This is obviously wrong, as any kernel compiled with CONFIG_NETGPU
will fail all packetdrill tests suite.

Also the fixed 20 value is not pretty.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg()
       [not found] ` <20200727052846.4070247-16-jonathan.lemon@gmail.com>
@ 2020-07-27 15:19   ` Eric Dumazet
  2020-07-27 15:55     ` Jonathan Lemon
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2020-07-27 15:19 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 12:51 AM Jonathan Lemon
<jonathan.lemon@gmail.com> wrote:
>
> This flag indicates that the attached data is a zero-copy send,
> and the pages should be retrieved from the netgpu module.  The
> socket should should already have been attached to a netgpu queue.
>
> Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> ---
>  include/linux/socket.h | 1 +
>  net/ipv4/tcp.c         | 8 ++++++++
>  2 files changed, 9 insertions(+)
>
> diff --git a/include/linux/socket.h b/include/linux/socket.h
> index 04d2bc97f497..63816cc25dee 100644
> --- a/include/linux/socket.h
> +++ b/include/linux/socket.h
> @@ -310,6 +310,7 @@ struct ucred {
>                                           */
>
>  #define MSG_ZEROCOPY   0x4000000       /* Use user data in kernel path */
> +#define MSG_NETDMA     0x8000000
>  #define MSG_FASTOPEN   0x20000000      /* Send data in TCP SYN */
>  #define MSG_CMSG_CLOEXEC 0x40000000    /* Set close_on_exec for file
>                                            descriptor received through
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 261c28ccc8f6..340ce319edc9 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1214,6 +1214,14 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>                         uarg->zerocopy = 0;
>         }
>
> +       if (flags & MSG_NETDMA && size && sock_flag(sk, SOCK_ZEROCOPY)) {
> +               zc = sk->sk_route_caps & NETIF_F_SG;
> +               if (!zc) {
> +                       err = -EFAULT;
> +                       goto out_err;
> +               }
> +       }
>

Sorry, no, we can not allow adding yet another branch into TCP fast
path for yet another variant of zero copy.

Overall, I think your patch series desperately tries to add changes in
TCP stack, while there is yet no proof
that you have to use TCP transport between the peers.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 08/21] skbuff: add a zc_netgpu bitflag
       [not found] ` <20200727052846.4070247-9-jonathan.lemon@gmail.com>
@ 2020-07-27 15:24   ` Eric Dumazet
  2020-07-27 16:59     ` Jonathan Lemon
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2020-07-27 15:24 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 12:20 AM Jonathan Lemon
<jonathan.lemon@gmail.com> wrote:
>
> This could likely be moved elsewhere.  The presence of the flag on
> the skb indicates that one of the fragments may contain zerocopy
> RX data, where the data is not accessible to the cpu.

Why do we need yet another flag in skb exactly ?

Please define what means "data not accessible to the cpu" ?

This kind of change is a red flag for me.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg()
  2020-07-27 15:19   ` [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg() Eric Dumazet
@ 2020-07-27 15:55     ` Jonathan Lemon
  2020-07-27 16:09       ` Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Jonathan Lemon @ 2020-07-27 15:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 08:19:43AM -0700, Eric Dumazet wrote:
> On Mon, Jul 27, 2020 at 12:51 AM Jonathan Lemon
> <jonathan.lemon@gmail.com> wrote:
> >
> > This flag indicates that the attached data is a zero-copy send,
> > and the pages should be retrieved from the netgpu module.  The
> > socket should should already have been attached to a netgpu queue.
> >
> > Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> > ---
> >  include/linux/socket.h | 1 +
> >  net/ipv4/tcp.c         | 8 ++++++++
> >  2 files changed, 9 insertions(+)
> >
> > diff --git a/include/linux/socket.h b/include/linux/socket.h
> > index 04d2bc97f497..63816cc25dee 100644
> > --- a/include/linux/socket.h
> > +++ b/include/linux/socket.h
> > @@ -310,6 +310,7 @@ struct ucred {
> >                                           */
> >
> >  #define MSG_ZEROCOPY   0x4000000       /* Use user data in kernel path */
> > +#define MSG_NETDMA     0x8000000
> >  #define MSG_FASTOPEN   0x20000000      /* Send data in TCP SYN */
> >  #define MSG_CMSG_CLOEXEC 0x40000000    /* Set close_on_exec for file
> >                                            descriptor received through
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 261c28ccc8f6..340ce319edc9 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -1214,6 +1214,14 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> >                         uarg->zerocopy = 0;
> >         }
> >
> > +       if (flags & MSG_NETDMA && size && sock_flag(sk, SOCK_ZEROCOPY)) {
> > +               zc = sk->sk_route_caps & NETIF_F_SG;
> > +               if (!zc) {
> > +                       err = -EFAULT;
> > +                       goto out_err;
> > +               }
> > +       }
> >
> 
> Sorry, no, we can not allow adding yet another branch into TCP fast
> path for yet another variant of zero copy.

I'm not in disagreement with that statement, but the existing zerocopy
work makes some assumptions that aren't suitable.  I take it that you'd
rather have things folded together so the old/new code works together?

Allocating an extra structure for every skbuff isn't ideal in my book.


> Overall, I think your patch series desperately tries to add changes in
> TCP stack, while there is yet no proof
> that you have to use TCP transport between the peers.

The goal is having a reliable transport without resorting to RDMA.
-- 
Jonathan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg()
  2020-07-27 15:55     ` Jonathan Lemon
@ 2020-07-27 16:09       ` Eric Dumazet
  2020-07-27 17:35         ` Jonathan Lemon
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2020-07-27 16:09 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 8:56 AM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
>
> On Mon, Jul 27, 2020 at 08:19:43AM -0700, Eric Dumazet wrote:
> > On Mon, Jul 27, 2020 at 12:51 AM Jonathan Lemon
> > <jonathan.lemon@gmail.com> wrote:
> > >
> > > This flag indicates that the attached data is a zero-copy send,
> > > and the pages should be retrieved from the netgpu module.  The
> > > socket should should already have been attached to a netgpu queue.
> > >
> > > Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> > > ---
> > >  include/linux/socket.h | 1 +
> > >  net/ipv4/tcp.c         | 8 ++++++++
> > >  2 files changed, 9 insertions(+)
> > >
> > > diff --git a/include/linux/socket.h b/include/linux/socket.h
> > > index 04d2bc97f497..63816cc25dee 100644
> > > --- a/include/linux/socket.h
> > > +++ b/include/linux/socket.h
> > > @@ -310,6 +310,7 @@ struct ucred {
> > >                                           */
> > >
> > >  #define MSG_ZEROCOPY   0x4000000       /* Use user data in kernel path */
> > > +#define MSG_NETDMA     0x8000000
> > >  #define MSG_FASTOPEN   0x20000000      /* Send data in TCP SYN */
> > >  #define MSG_CMSG_CLOEXEC 0x40000000    /* Set close_on_exec for file
> > >                                            descriptor received through
> > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > index 261c28ccc8f6..340ce319edc9 100644
> > > --- a/net/ipv4/tcp.c
> > > +++ b/net/ipv4/tcp.c
> > > @@ -1214,6 +1214,14 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > >                         uarg->zerocopy = 0;
> > >         }
> > >
> > > +       if (flags & MSG_NETDMA && size && sock_flag(sk, SOCK_ZEROCOPY)) {
> > > +               zc = sk->sk_route_caps & NETIF_F_SG;
> > > +               if (!zc) {
> > > +                       err = -EFAULT;
> > > +                       goto out_err;
> > > +               }
> > > +       }
> > >
> >
> > Sorry, no, we can not allow adding yet another branch into TCP fast
> > path for yet another variant of zero copy.
>
> I'm not in disagreement with that statement, but the existing zerocopy
> work makes some assumptions that aren't suitable.  I take it that you'd
> rather have things folded together so the old/new code works together?

Exact.  Forcing users to use MSG_NETDMA, yet reusing SOCK_ZEROCOPY is silly.

SOCK_ZEROCOPY has been added to that user space and kernel would agree
on MSG_ZEROCOPY being not a nop (as it was on old kernels)

>
> Allocating an extra structure for every skbuff isn't ideal in my book.
>

We do not allocate a structure for every skbuff. Please look again.


>
> > Overall, I think your patch series desperately tries to add changes in
> > TCP stack, while there is yet no proof
> > that you have to use TCP transport between the peers.
>
> The goal is having a reliable transport without resorting to RDMA.

And why should it be TCP ?

Are you dealing with lost packets, retransmits, timers, and al  ?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 08/21] skbuff: add a zc_netgpu bitflag
  2020-07-27 15:24   ` [RFC PATCH v2 08/21] skbuff: add a zc_netgpu bitflag Eric Dumazet
@ 2020-07-27 16:59     ` Jonathan Lemon
  2020-07-27 17:08       ` Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Jonathan Lemon @ 2020-07-27 16:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 08:24:55AM -0700, Eric Dumazet wrote:
> On Mon, Jul 27, 2020 at 12:20 AM Jonathan Lemon
> <jonathan.lemon@gmail.com> wrote:
> >
> > This could likely be moved elsewhere.  The presence of the flag on
> > the skb indicates that one of the fragments may contain zerocopy
> > RX data, where the data is not accessible to the cpu.
> 
> Why do we need yet another flag in skb exactly ?
> 
> Please define what means "data not accessible to the cpu" ?
> 
> This kind of change is a red flag for me.

The architecture this is targeting is a ML cluster, where a 200Gbps NIC
is attached to a PCIe switch which also has a GPU card attached.  There
are several of these, and the link(s) to the host cpu (which has another
NIC attached) can't handle the incoming traffic.

So what we're doing here is transferring the data directly from the NIC
to the GPU via DMA.  The host never sees the data, but can control it
indirectly via the handles returned to userspace.

I'm not sure that a flag on the skb is the right location for this -
perhaps moving it into skb_shared() instead would be better?
-- 
Jonathan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-27  7:35   ` [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu Christoph Hellwig
@ 2020-07-27 17:00     ` Jonathan Lemon
  2020-07-27 18:24       ` Christoph Hellwig
  0 siblings, 1 reply; 29+ messages in thread
From: Jonathan Lemon @ 2020-07-27 17:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: netdev, kernel-team, robin.murphy, akpm, davem, kuba, willemb,
	edumazet, steffen.klassert, saeedm, maximmi, bjorn.topel,
	magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 09:35:09AM +0200, Christoph Hellwig wrote:
> Seriously?  If you only even considered this is something reasonable
> to do you should not be anywhere near Linux kernel development.
> 
> Just go away!

This isn't really a constructive comment.
-- 
Jonathan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 08/21] skbuff: add a zc_netgpu bitflag
  2020-07-27 16:59     ` Jonathan Lemon
@ 2020-07-27 17:08       ` Eric Dumazet
  2020-07-27 17:16         ` Jonathan Lemon
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2020-07-27 17:08 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 10:01 AM Jonathan Lemon
<jonathan.lemon@gmail.com> wrote:
>
> On Mon, Jul 27, 2020 at 08:24:55AM -0700, Eric Dumazet wrote:
> > On Mon, Jul 27, 2020 at 12:20 AM Jonathan Lemon
> > <jonathan.lemon@gmail.com> wrote:
> > >
> > > This could likely be moved elsewhere.  The presence of the flag on
> > > the skb indicates that one of the fragments may contain zerocopy
> > > RX data, where the data is not accessible to the cpu.
> >
> > Why do we need yet another flag in skb exactly ?
> >
> > Please define what means "data not accessible to the cpu" ?
> >
> > This kind of change is a red flag for me.
>
> The architecture this is targeting is a ML cluster, where a 200Gbps NIC
> is attached to a PCIe switch which also has a GPU card attached.  There
> are several of these, and the link(s) to the host cpu (which has another
> NIC attached) can't handle the incoming traffic.
>
> So what we're doing here is transferring the data directly from the NIC
> to the GPU via DMA.  The host never sees the data, but can control it
> indirectly via the handles returned to userspace.
>

This seems to need a page/memory attribute or something.

skb should not have this knowledge, unless you are planning to make
sure that everything accessing skb data is going to test this new flag
and fail if it is set ?


> I'm not sure that a flag on the skb is the right location for this -
> perhaps moving it into skb_shared() instead would be better?
> --
> Jonathan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 03/21] mm: Allow DMA mapping of pages which are not online
  2020-07-27  5:58   ` [RFC PATCH v2 03/21] mm: Allow DMA mapping of pages which are not online Christoph Hellwig
@ 2020-07-27 17:08     ` Jonathan Lemon
  0 siblings, 0 replies; 29+ messages in thread
From: Jonathan Lemon @ 2020-07-27 17:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: netdev, kernel-team, robin.murphy, akpm, davem, kuba, willemb,
	edumazet, steffen.klassert, saeedm, maximmi, bjorn.topel,
	magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 07:58:13AM +0200, Christoph Hellwig wrote:
> On Sun, Jul 26, 2020 at 10:28:28PM -0700, Jonathan Lemon wrote:
> > Change the system RAM check from 'valid' to 'online', so dummy
> > pages which refer to external DMA resources can be mapped.
> 
> NAK.  This looks completely bogus.  Maybe you do have a good reason
> somewhere (although I doubt it), but then you'd actualy need to both
> explain it here, and also actually make sure I get the whole damn series.

The entire patchset was sent out in one command - I have no control over
how the mail server ends up delivering things.

An alternative here would just allocate 'struct pages' from normal
system memory, instead of from the system resources.  I initially did it
this way so the page could be used in the normal fashion to locate the
DMA address, instead of having a separate lookup table.
-- 
Jonathan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 08/21] skbuff: add a zc_netgpu bitflag
  2020-07-27 17:08       ` Eric Dumazet
@ 2020-07-27 17:16         ` Jonathan Lemon
  0 siblings, 0 replies; 29+ messages in thread
From: Jonathan Lemon @ 2020-07-27 17:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 10:08:10AM -0700, Eric Dumazet wrote:
> On Mon, Jul 27, 2020 at 10:01 AM Jonathan Lemon
> <jonathan.lemon@gmail.com> wrote:
> >
> > On Mon, Jul 27, 2020 at 08:24:55AM -0700, Eric Dumazet wrote:
> > > On Mon, Jul 27, 2020 at 12:20 AM Jonathan Lemon
> > > <jonathan.lemon@gmail.com> wrote:
> > > >
> > > > This could likely be moved elsewhere.  The presence of the flag on
> > > > the skb indicates that one of the fragments may contain zerocopy
> > > > RX data, where the data is not accessible to the cpu.
> > >
> > > Why do we need yet another flag in skb exactly ?
> > >
> > > Please define what means "data not accessible to the cpu" ?
> > >
> > > This kind of change is a red flag for me.
> >
> > The architecture this is targeting is a ML cluster, where a 200Gbps NIC
> > is attached to a PCIe switch which also has a GPU card attached.  There
> > are several of these, and the link(s) to the host cpu (which has another
> > NIC attached) can't handle the incoming traffic.
> >
> > So what we're doing here is transferring the data directly from the NIC
> > to the GPU via DMA.  The host never sees the data, but can control it
> > indirectly via the handles returned to userspace.
> >
> 
> This seems to need a page/memory attribute or something.

'struct page' is even more constrained.  I opted to flag the skb since
there should not be mixed zc/non-zc pages in the same skb.  I'd be happy
to see other alternatives.


> skb should not have this knowledge, unless you are planning to make
> sure that everything accessing skb data is going to test this new flag
> and fail if it is set ?

That might be how things end up going - in the samw way skb_zcopy()
works.
--
Jonathan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 13/21] net/tcp: Pad TCP options out to a fixed size for netgpu
  2020-07-27 15:19   ` [RFC PATCH v2 13/21] net/tcp: Pad TCP options out to a fixed size " Eric Dumazet
@ 2020-07-27 17:20     ` Jonathan Lemon
  0 siblings, 0 replies; 29+ messages in thread
From: Jonathan Lemon @ 2020-07-27 17:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 08:19:24AM -0700, Eric Dumazet wrote:
> On Mon, Jul 27, 2020 at 12:51 AM Jonathan Lemon
> <jonathan.lemon@gmail.com> wrote:
> >
> > From: Jonathan Lemon <bsd@fb.com>
> >
> > The "header splitting" feature used by netgpu doesn't actually parse
> > the incoming packet header.  Instead, it splits the packet at a fixed
> > offset.  In order for this to work, the sender needs to send packets
> > with a fixed header size.
> >
> > Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> > ---
> >  net/ipv4/tcp_output.c | 20 ++++++++++++++++++++
> >  1 file changed, 20 insertions(+)
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index d8f16f6a9b02..e8a74d0f7ad2 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -438,6 +438,7 @@ struct tcp_out_options {
> >         u8 ws;                  /* window scale, 0 to disable */
> >         u8 num_sack_blocks;     /* number of SACK blocks to include */
> >         u8 hash_size;           /* bytes in hash_location */
> > +       u8 pad_size;            /* additional nops for padding */
> >         __u8 *hash_location;    /* temporary pointer, overloaded */
> >         __u32 tsval, tsecr;     /* need to include OPTION_TS */
> >         struct tcp_fastopen_cookie *fastopen_cookie;    /* Fast open cookie */
> > @@ -562,6 +563,17 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
> >         smc_options_write(ptr, &options);
> >
> >         mptcp_options_write(ptr, opts);
> > +
> > +#if IS_ENABLED(CONFIG_NETGPU)
> > +       /* pad out options */
> > +       if (opts->pad_size) {
> > +               int len = opts->pad_size;
> > +               u8 *p = (u8 *)ptr;
> > +
> > +               while (len--)
> > +                       *p++ = TCPOPT_NOP;
> > +       }
> > +#endif
> >  }
> >
> >  static void smc_set_option(const struct tcp_sock *tp,
> > @@ -826,6 +838,14 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
> >                         opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
> >         }
> >
> > +#if IS_ENABLED(CONFIG_NETGPU)
> > +       /* force padding */
> > +       if (size < 20) {
> > +               opts->pad_size = 20 - size;
> > +               size += opts->pad_size;
> > +       }
> > +#endif
> > +
> 
> This is obviously wrong, as any kernel compiled with CONFIG_NETGPU
> will fail all packetdrill tests suite.
> 
> Also the fixed 20 value is not pretty.

Would changing this into a sysctl be a suitable solution?  It really is
a temporary solution to handle hardware that doesn't support splitting,
and adding a sysctl seems so permanent.....  
-- 
Jonathan


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg()
  2020-07-27 16:09       ` Eric Dumazet
@ 2020-07-27 17:35         ` Jonathan Lemon
  2020-07-27 17:44           ` Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Jonathan Lemon @ 2020-07-27 17:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 09:09:48AM -0700, Eric Dumazet wrote:
> On Mon, Jul 27, 2020 at 8:56 AM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
> >
> > On Mon, Jul 27, 2020 at 08:19:43AM -0700, Eric Dumazet wrote:
> > > On Mon, Jul 27, 2020 at 12:51 AM Jonathan Lemon
> > > <jonathan.lemon@gmail.com> wrote:
> > > >
> > > > This flag indicates that the attached data is a zero-copy send,
> > > > and the pages should be retrieved from the netgpu module.  The
> > > > socket should should already have been attached to a netgpu queue.
> > > >
> > > > Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> > > > ---
> > > >  include/linux/socket.h | 1 +
> > > >  net/ipv4/tcp.c         | 8 ++++++++
> > > >  2 files changed, 9 insertions(+)
> > > >
> > > > diff --git a/include/linux/socket.h b/include/linux/socket.h
> > > > index 04d2bc97f497..63816cc25dee 100644
> > > > --- a/include/linux/socket.h
> > > > +++ b/include/linux/socket.h
> > > > @@ -310,6 +310,7 @@ struct ucred {
> > > >                                           */
> > > >
> > > >  #define MSG_ZEROCOPY   0x4000000       /* Use user data in kernel path */
> > > > +#define MSG_NETDMA     0x8000000
> > > >  #define MSG_FASTOPEN   0x20000000      /* Send data in TCP SYN */
> > > >  #define MSG_CMSG_CLOEXEC 0x40000000    /* Set close_on_exec for file
> > > >                                            descriptor received through
> > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > index 261c28ccc8f6..340ce319edc9 100644
> > > > --- a/net/ipv4/tcp.c
> > > > +++ b/net/ipv4/tcp.c
> > > > @@ -1214,6 +1214,14 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > > >                         uarg->zerocopy = 0;
> > > >         }
> > > >
> > > > +       if (flags & MSG_NETDMA && size && sock_flag(sk, SOCK_ZEROCOPY)) {
> > > > +               zc = sk->sk_route_caps & NETIF_F_SG;
> > > > +               if (!zc) {
> > > > +                       err = -EFAULT;
> > > > +                       goto out_err;
> > > > +               }
> > > > +       }
> > > >
> > >
> > > Sorry, no, we can not allow adding yet another branch into TCP fast
> > > path for yet another variant of zero copy.
> >
> > I'm not in disagreement with that statement, but the existing zerocopy
> > work makes some assumptions that aren't suitable.  I take it that you'd
> > rather have things folded together so the old/new code works together?
> 
> Exact.  Forcing users to use MSG_NETDMA, yet reusing SOCK_ZEROCOPY is silly.
> 
> SOCK_ZEROCOPY has been added to that user space and kernel would agree
> on MSG_ZEROCOPY being not a nop (as it was on old kernels)
> 
> >
> > Allocating an extra structure for every skbuff isn't ideal in my book.
> >
> 
> We do not allocate a structure for every skbuff. Please look again.

I'm looking here:

    uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));

Doesn't sock_zerocopy_realloc() allocate a new structure if the skb
doesn't have one already?


> > > Overall, I think your patch series desperately tries to add changes in
> > > TCP stack, while there is yet no proof
> > > that you have to use TCP transport between the peers.
> >
> > The goal is having a reliable transport without resorting to RDMA.
> 
> And why should it be TCP ?
> 
> Are you dealing with lost packets, retransmits, timers, and al  ?

Yes?  If there was a true lossless medium, RDMA would have taken over by
now.  Or are you suggesting that the transport protocol reliability
should be performed in userspace?  (not all the world is QUIC yet)
-- 
Jonathan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg()
  2020-07-27 17:35         ` Jonathan Lemon
@ 2020-07-27 17:44           ` Eric Dumazet
  2020-07-28  2:11             ` Jonathan Lemon
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2020-07-27 17:44 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 10:35 AM Jonathan Lemon
<jonathan.lemon@gmail.com> wrote:
>
> On Mon, Jul 27, 2020 at 09:09:48AM -0700, Eric Dumazet wrote:
> > On Mon, Jul 27, 2020 at 8:56 AM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
> > >
> > > On Mon, Jul 27, 2020 at 08:19:43AM -0700, Eric Dumazet wrote:
> > > > On Mon, Jul 27, 2020 at 12:51 AM Jonathan Lemon
> > > > <jonathan.lemon@gmail.com> wrote:
> > > > >
> > > > > This flag indicates that the attached data is a zero-copy send,
> > > > > and the pages should be retrieved from the netgpu module.  The
> > > > > socket should should already have been attached to a netgpu queue.
> > > > >
> > > > > Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> > > > > ---
> > > > >  include/linux/socket.h | 1 +
> > > > >  net/ipv4/tcp.c         | 8 ++++++++
> > > > >  2 files changed, 9 insertions(+)
> > > > >
> > > > > diff --git a/include/linux/socket.h b/include/linux/socket.h
> > > > > index 04d2bc97f497..63816cc25dee 100644
> > > > > --- a/include/linux/socket.h
> > > > > +++ b/include/linux/socket.h
> > > > > @@ -310,6 +310,7 @@ struct ucred {
> > > > >                                           */
> > > > >
> > > > >  #define MSG_ZEROCOPY   0x4000000       /* Use user data in kernel path */
> > > > > +#define MSG_NETDMA     0x8000000
> > > > >  #define MSG_FASTOPEN   0x20000000      /* Send data in TCP SYN */
> > > > >  #define MSG_CMSG_CLOEXEC 0x40000000    /* Set close_on_exec for file
> > > > >                                            descriptor received through
> > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > > index 261c28ccc8f6..340ce319edc9 100644
> > > > > --- a/net/ipv4/tcp.c
> > > > > +++ b/net/ipv4/tcp.c
> > > > > @@ -1214,6 +1214,14 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > > > >                         uarg->zerocopy = 0;
> > > > >         }
> > > > >
> > > > > +       if (flags & MSG_NETDMA && size && sock_flag(sk, SOCK_ZEROCOPY)) {
> > > > > +               zc = sk->sk_route_caps & NETIF_F_SG;
> > > > > +               if (!zc) {
> > > > > +                       err = -EFAULT;
> > > > > +                       goto out_err;
> > > > > +               }
> > > > > +       }
> > > > >
> > > >
> > > > Sorry, no, we can not allow adding yet another branch into TCP fast
> > > > path for yet another variant of zero copy.
> > >
> > > I'm not in disagreement with that statement, but the existing zerocopy
> > > work makes some assumptions that aren't suitable.  I take it that you'd
> > > rather have things folded together so the old/new code works together?
> >
> > Exact.  Forcing users to use MSG_NETDMA, yet reusing SOCK_ZEROCOPY is silly.
> >
> > SOCK_ZEROCOPY has been added to that user space and kernel would agree
> > on MSG_ZEROCOPY being not a nop (as it was on old kernels)
> >
> > >
> > > Allocating an extra structure for every skbuff isn't ideal in my book.
> > >
> >
> > We do not allocate a structure for every skbuff. Please look again.
>
> I'm looking here:
>
>     uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
>
> Doesn't sock_zerocopy_realloc() allocate a new structure if the skb
> doesn't have one already?
>
>
> > > > Overall, I think your patch series desperately tries to add changes in
> > > > TCP stack, while there is yet no proof
> > > > that you have to use TCP transport between the peers.
> > >
> > > The goal is having a reliable transport without resorting to RDMA.
> >
> > And why should it be TCP ?
> >
> > Are you dealing with lost packets, retransmits, timers, and al  ?
>
> Yes?  If there was a true lossless medium, RDMA would have taken over by
> now.  Or are you suggesting that the transport protocol reliability
> should be performed in userspace?  (not all the world is QUIC yet)
>

The thing is : this patch series is a monster thing adding stuff that
is going to impact 100% % of TCP flows,
even if not used in this NETDMA context.

So you need to convince us you are really desperate to get this in
upstream linux.

I have implemented TCP RX zero copy without adding a single line in
standard TCP code.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-27 17:00     ` Jonathan Lemon
@ 2020-07-27 18:24       ` Christoph Hellwig
  2020-07-28  1:48         ` Jonathan Lemon
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2020-07-27 18:24 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: Christoph Hellwig, netdev, kernel-team, robin.murphy, akpm,
	davem, kuba, willemb, edumazet, steffen.klassert, saeedm,
	maximmi, bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 10:00:03AM -0700, Jonathan Lemon wrote:
> On Mon, Jul 27, 2020 at 09:35:09AM +0200, Christoph Hellwig wrote:
> > Seriously?  If you only even considered this is something reasonable
> > to do you should not be anywhere near Linux kernel development.
> > 
> > Just go away!
> 
> This isn't really a constructive comment.

And this wasn't a constructive patch.  If you really think adding tons
of garbage to the kernel to support a proprietary driver you really
should not be here at all.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-27 18:24       ` Christoph Hellwig
@ 2020-07-28  1:48         ` Jonathan Lemon
  2020-07-28  6:47           ` Christoph Hellwig
  2020-07-28 18:19           ` Jason Gunthorpe
  0 siblings, 2 replies; 29+ messages in thread
From: Jonathan Lemon @ 2020-07-28  1:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: netdev, kernel-team, robin.murphy, akpm, davem, kuba, willemb,
	edumazet, steffen.klassert, saeedm, maximmi, bjorn.topel,
	magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 08:24:25PM +0200, Christoph Hellwig wrote:
> On Mon, Jul 27, 2020 at 10:00:03AM -0700, Jonathan Lemon wrote:
> > On Mon, Jul 27, 2020 at 09:35:09AM +0200, Christoph Hellwig wrote:
> > > Seriously?  If you only even considered this is something reasonable
> > > to do you should not be anywhere near Linux kernel development.
> > > 
> > > Just go away!
> > 
> > This isn't really a constructive comment.
> 
> And this wasn't a constructive patch.  If you really think adding tons
> of garbage to the kernel to support a proprietary driver you really
> should not be here at all.

This is not in support of a proprietary driver.  As the cover letter
notes, this is for data transfers between the NIC/GPU, while still
utilizing the kernel protocol stack and leaving the application in
control.

While the current GPU utilized is nvidia, there's nothing in the rest of
the patches specific to Nvidia - an Intel or AMD GPU interface could be
equally workable.

I think this is a better patch than all the various implementations of
the protocol stack in the form of RDMA, driver code and device firmware.

I'm aware that Nvidia code is maintained outside the tree, so this last
patch may better placed there; I included it here so reviewers can see
how things work.
-- 
Jonathan


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg()
  2020-07-27 17:44           ` Eric Dumazet
@ 2020-07-28  2:11             ` Jonathan Lemon
  2020-07-28  2:17               ` Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Jonathan Lemon @ 2020-07-28  2:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 10:44:59AM -0700, Eric Dumazet wrote:
> On Mon, Jul 27, 2020 at 10:35 AM Jonathan Lemon
> <jonathan.lemon@gmail.com> wrote:
> >
> > On Mon, Jul 27, 2020 at 09:09:48AM -0700, Eric Dumazet wrote:
> > > On Mon, Jul 27, 2020 at 8:56 AM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
> > > >
> > > > On Mon, Jul 27, 2020 at 08:19:43AM -0700, Eric Dumazet wrote:
> > > > > On Mon, Jul 27, 2020 at 12:51 AM Jonathan Lemon
> > > > > <jonathan.lemon@gmail.com> wrote:
> > > > > >
> > > > > > This flag indicates that the attached data is a zero-copy send,
> > > > > > and the pages should be retrieved from the netgpu module.  The
> > > > > > socket should should already have been attached to a netgpu queue.
> > > > > >
> > > > > > Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> > > > > > ---
> > > > > >  include/linux/socket.h | 1 +
> > > > > >  net/ipv4/tcp.c         | 8 ++++++++
> > > > > >  2 files changed, 9 insertions(+)
> > > > > >
> > > > > > diff --git a/include/linux/socket.h b/include/linux/socket.h
> > > > > > index 04d2bc97f497..63816cc25dee 100644
> > > > > > --- a/include/linux/socket.h
> > > > > > +++ b/include/linux/socket.h
> > > > > > @@ -310,6 +310,7 @@ struct ucred {
> > > > > >                                           */
> > > > > >
> > > > > >  #define MSG_ZEROCOPY   0x4000000       /* Use user data in kernel path */
> > > > > > +#define MSG_NETDMA     0x8000000
> > > > > >  #define MSG_FASTOPEN   0x20000000      /* Send data in TCP SYN */
> > > > > >  #define MSG_CMSG_CLOEXEC 0x40000000    /* Set close_on_exec for file
> > > > > >                                            descriptor received through
> > > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > > > index 261c28ccc8f6..340ce319edc9 100644
> > > > > > --- a/net/ipv4/tcp.c
> > > > > > +++ b/net/ipv4/tcp.c
> > > > > > @@ -1214,6 +1214,14 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > > > > >                         uarg->zerocopy = 0;
> > > > > >         }
> > > > > >
> > > > > > +       if (flags & MSG_NETDMA && size && sock_flag(sk, SOCK_ZEROCOPY)) {
> > > > > > +               zc = sk->sk_route_caps & NETIF_F_SG;
> > > > > > +               if (!zc) {
> > > > > > +                       err = -EFAULT;
> > > > > > +                       goto out_err;
> > > > > > +               }
> > > > > > +       }
> > > > > >
> > > > >
> > > > > Sorry, no, we can not allow adding yet another branch into TCP fast
> > > > > path for yet another variant of zero copy.
> > > >
> > > > I'm not in disagreement with that statement, but the existing zerocopy
> > > > work makes some assumptions that aren't suitable.  I take it that you'd
> > > > rather have things folded together so the old/new code works together?
> > >
> > > Exact.  Forcing users to use MSG_NETDMA, yet reusing SOCK_ZEROCOPY is silly.
> > >
> > > SOCK_ZEROCOPY has been added to that user space and kernel would agree
> > > on MSG_ZEROCOPY being not a nop (as it was on old kernels)
> > >
> > > >
> > > > Allocating an extra structure for every skbuff isn't ideal in my book.
> > > >
> > >
> > > We do not allocate a structure for every skbuff. Please look again.
> >
> > I'm looking here:
> >
> >     uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
> >
> > Doesn't sock_zerocopy_realloc() allocate a new structure if the skb
> > doesn't have one already?
> >
> >
> > > > > Overall, I think your patch series desperately tries to add changes in
> > > > > TCP stack, while there is yet no proof
> > > > > that you have to use TCP transport between the peers.
> > > >
> > > > The goal is having a reliable transport without resorting to RDMA.
> > >
> > > And why should it be TCP ?
> > >
> > > Are you dealing with lost packets, retransmits, timers, and al  ?
> >
> > Yes?  If there was a true lossless medium, RDMA would have taken over by
> > now.  Or are you suggesting that the transport protocol reliability
> > should be performed in userspace?  (not all the world is QUIC yet)
> >
> 
> The thing is : this patch series is a monster thing adding stuff that
> is going to impact 100% % of TCP flows,
> even if not used in this NETDMA context.
> 
> So you need to convince us you are really desperate to get this in
> upstream linux.
> 
> I have implemented TCP RX zero copy without adding a single line in
> standard TCP code.

That's a bit of an exaggeration, as I see skb_zcopy_*() calls scattered
around the normal TCP code path.  I also haven't changed the normal TCP
path either, other than doing some of the same things as skb_zcopy_*().
(ignoring the ugly moron about padding out the TCP header, which I'll
put under a static_branch_unlikely).

The thing is, the existing zero copy code is zero-copy to /host/ memory,
which is not the same thing as zero-copy to other memory areas.  
-- 
Jonathan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg()
  2020-07-28  2:11             ` Jonathan Lemon
@ 2020-07-28  2:17               ` Eric Dumazet
  2020-07-28  3:08                 ` Jonathan Lemon
  2020-07-28  6:50                 ` Christoph Hellwig
  0 siblings, 2 replies; 29+ messages in thread
From: Eric Dumazet @ 2020-07-28  2:17 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 7:11 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
>
> On Mon, Jul 27, 2020 at 10:44:59AM -0700, Eric Dumazet wrote:
> > On Mon, Jul 27, 2020 at 10:35 AM Jonathan Lemon
> > <jonathan.lemon@gmail.com> wrote:
> > >
> > > On Mon, Jul 27, 2020 at 09:09:48AM -0700, Eric Dumazet wrote:
> > > > On Mon, Jul 27, 2020 at 8:56 AM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
> > > > >
> > > > > On Mon, Jul 27, 2020 at 08:19:43AM -0700, Eric Dumazet wrote:
> > > > > > On Mon, Jul 27, 2020 at 12:51 AM Jonathan Lemon
> > > > > > <jonathan.lemon@gmail.com> wrote:
> > > > > > >
> > > > > > > This flag indicates that the attached data is a zero-copy send,
> > > > > > > and the pages should be retrieved from the netgpu module.  The
> > > > > > > socket should should already have been attached to a netgpu queue.
> > > > > > >
> > > > > > > Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> > > > > > > ---
> > > > > > >  include/linux/socket.h | 1 +
> > > > > > >  net/ipv4/tcp.c         | 8 ++++++++
> > > > > > >  2 files changed, 9 insertions(+)
> > > > > > >
> > > > > > > diff --git a/include/linux/socket.h b/include/linux/socket.h
> > > > > > > index 04d2bc97f497..63816cc25dee 100644
> > > > > > > --- a/include/linux/socket.h
> > > > > > > +++ b/include/linux/socket.h
> > > > > > > @@ -310,6 +310,7 @@ struct ucred {
> > > > > > >                                           */
> > > > > > >
> > > > > > >  #define MSG_ZEROCOPY   0x4000000       /* Use user data in kernel path */
> > > > > > > +#define MSG_NETDMA     0x8000000
> > > > > > >  #define MSG_FASTOPEN   0x20000000      /* Send data in TCP SYN */
> > > > > > >  #define MSG_CMSG_CLOEXEC 0x40000000    /* Set close_on_exec for file
> > > > > > >                                            descriptor received through
> > > > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > > > > index 261c28ccc8f6..340ce319edc9 100644
> > > > > > > --- a/net/ipv4/tcp.c
> > > > > > > +++ b/net/ipv4/tcp.c
> > > > > > > @@ -1214,6 +1214,14 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > > > > > >                         uarg->zerocopy = 0;
> > > > > > >         }
> > > > > > >
> > > > > > > +       if (flags & MSG_NETDMA && size && sock_flag(sk, SOCK_ZEROCOPY)) {
> > > > > > > +               zc = sk->sk_route_caps & NETIF_F_SG;
> > > > > > > +               if (!zc) {
> > > > > > > +                       err = -EFAULT;
> > > > > > > +                       goto out_err;
> > > > > > > +               }
> > > > > > > +       }
> > > > > > >
> > > > > >
> > > > > > Sorry, no, we can not allow adding yet another branch into TCP fast
> > > > > > path for yet another variant of zero copy.
> > > > >
> > > > > I'm not in disagreement with that statement, but the existing zerocopy
> > > > > work makes some assumptions that aren't suitable.  I take it that you'd
> > > > > rather have things folded together so the old/new code works together?
> > > >
> > > > Exact.  Forcing users to use MSG_NETDMA, yet reusing SOCK_ZEROCOPY is silly.
> > > >
> > > > SOCK_ZEROCOPY has been added to that user space and kernel would agree
> > > > on MSG_ZEROCOPY being not a nop (as it was on old kernels)
> > > >
> > > > >
> > > > > Allocating an extra structure for every skbuff isn't ideal in my book.
> > > > >
> > > >
> > > > We do not allocate a structure for every skbuff. Please look again.
> > >
> > > I'm looking here:
> > >
> > >     uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
> > >
> > > Doesn't sock_zerocopy_realloc() allocate a new structure if the skb
> > > doesn't have one already?
> > >
> > >
> > > > > > Overall, I think your patch series desperately tries to add changes in
> > > > > > TCP stack, while there is yet no proof
> > > > > > that you have to use TCP transport between the peers.
> > > > >
> > > > > The goal is having a reliable transport without resorting to RDMA.
> > > >
> > > > And why should it be TCP ?
> > > >
> > > > Are you dealing with lost packets, retransmits, timers, and al  ?
> > >
> > > Yes?  If there was a true lossless medium, RDMA would have taken over by
> > > now.  Or are you suggesting that the transport protocol reliability
> > > should be performed in userspace?  (not all the world is QUIC yet)
> > >
> >
> > The thing is : this patch series is a monster thing adding stuff that
> > is going to impact 100% % of TCP flows,
> > even if not used in this NETDMA context.
> >
> > So you need to convince us you are really desperate to get this in
> > upstream linux.
> >
> > I have implemented TCP RX zero copy without adding a single line in
> > standard TCP code.
>
> That's a bit of an exaggeration, as I see skb_zcopy_*() calls scattered
> around the normal TCP code path.  I also haven't changed the normal TCP
> path either, other than doing some of the same things as skb_zcopy_*().
> (ignoring the ugly moron about padding out the TCP header, which I'll
> put under a static_branch_unlikely).


You are mixing TX zerocopy and RX zero copy.  Quite different things.

My claim was about TCP RX zero copy (aka tcp_zerocopy_receive())


>
> The thing is, the existing zero copy code is zero-copy to /host/ memory,
> which is not the same thing as zero-copy to other memory areas.

You have to really explain what difference it makes, and why current
stuff can not be extended.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg()
  2020-07-28  2:17               ` Eric Dumazet
@ 2020-07-28  3:08                 ` Jonathan Lemon
  2020-07-28  6:50                 ` Christoph Hellwig
  1 sibling, 0 replies; 29+ messages in thread
From: Jonathan Lemon @ 2020-07-28  3:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 07:17:51PM -0700, Eric Dumazet wrote:
> On Mon, Jul 27, 2020 at 7:11 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
> >
> > On Mon, Jul 27, 2020 at 10:44:59AM -0700, Eric Dumazet wrote:
> > > On Mon, Jul 27, 2020 at 10:35 AM Jonathan Lemon
> > > <jonathan.lemon@gmail.com> wrote:
> > > >
> > > > On Mon, Jul 27, 2020 at 09:09:48AM -0700, Eric Dumazet wrote:
> > > > > On Mon, Jul 27, 2020 at 8:56 AM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
> > > > > >
> > > > > > On Mon, Jul 27, 2020 at 08:19:43AM -0700, Eric Dumazet wrote:
> > > > > > > On Mon, Jul 27, 2020 at 12:51 AM Jonathan Lemon
> > > > > > > <jonathan.lemon@gmail.com> wrote:
> > > > > > > >
> > > > > > > > This flag indicates that the attached data is a zero-copy send,
> > > > > > > > and the pages should be retrieved from the netgpu module.  The
> > > > > > > > socket should should already have been attached to a netgpu queue.
> > > > > > > >
> > > > > > > > Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> > > > > > > > ---
> > > > > > > >  include/linux/socket.h | 1 +
> > > > > > > >  net/ipv4/tcp.c         | 8 ++++++++
> > > > > > > >  2 files changed, 9 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/include/linux/socket.h b/include/linux/socket.h
> > > > > > > > index 04d2bc97f497..63816cc25dee 100644
> > > > > > > > --- a/include/linux/socket.h
> > > > > > > > +++ b/include/linux/socket.h
> > > > > > > > @@ -310,6 +310,7 @@ struct ucred {
> > > > > > > >                                           */
> > > > > > > >
> > > > > > > >  #define MSG_ZEROCOPY   0x4000000       /* Use user data in kernel path */
> > > > > > > > +#define MSG_NETDMA     0x8000000
> > > > > > > >  #define MSG_FASTOPEN   0x20000000      /* Send data in TCP SYN */
> > > > > > > >  #define MSG_CMSG_CLOEXEC 0x40000000    /* Set close_on_exec for file
> > > > > > > >                                            descriptor received through
> > > > > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > > > > > index 261c28ccc8f6..340ce319edc9 100644
> > > > > > > > --- a/net/ipv4/tcp.c
> > > > > > > > +++ b/net/ipv4/tcp.c
> > > > > > > > @@ -1214,6 +1214,14 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > > > > > > >                         uarg->zerocopy = 0;
> > > > > > > >         }
> > > > > > > >
> > > > > > > > +       if (flags & MSG_NETDMA && size && sock_flag(sk, SOCK_ZEROCOPY)) {
> > > > > > > > +               zc = sk->sk_route_caps & NETIF_F_SG;
> > > > > > > > +               if (!zc) {
> > > > > > > > +                       err = -EFAULT;
> > > > > > > > +                       goto out_err;
> > > > > > > > +               }
> > > > > > > > +       }
> > > > > > > >
> > > > > > >
> > > > > > > Sorry, no, we can not allow adding yet another branch into TCP fast
> > > > > > > path for yet another variant of zero copy.
> > > > > >
> > > > > > I'm not in disagreement with that statement, but the existing zerocopy
> > > > > > work makes some assumptions that aren't suitable.  I take it that you'd
> > > > > > rather have things folded together so the old/new code works together?
> > > > >
> > > > > Exact.  Forcing users to use MSG_NETDMA, yet reusing SOCK_ZEROCOPY is silly.
> > > > >
> > > > > SOCK_ZEROCOPY has been added to that user space and kernel would agree
> > > > > on MSG_ZEROCOPY being not a nop (as it was on old kernels)
> > > > >
> > > > > >
> > > > > > Allocating an extra structure for every skbuff isn't ideal in my book.
> > > > > >
> > > > >
> > > > > We do not allocate a structure for every skbuff. Please look again.
> > > >
> > > > I'm looking here:
> > > >
> > > >     uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
> > > >
> > > > Doesn't sock_zerocopy_realloc() allocate a new structure if the skb
> > > > doesn't have one already?
> > > >
> > > >
> > > > > > > Overall, I think your patch series desperately tries to add changes in
> > > > > > > TCP stack, while there is yet no proof
> > > > > > > that you have to use TCP transport between the peers.
> > > > > >
> > > > > > The goal is having a reliable transport without resorting to RDMA.
> > > > >
> > > > > And why should it be TCP ?
> > > > >
> > > > > Are you dealing with lost packets, retransmits, timers, and al  ?
> > > >
> > > > Yes?  If there was a true lossless medium, RDMA would have taken over by
> > > > now.  Or are you suggesting that the transport protocol reliability
> > > > should be performed in userspace?  (not all the world is QUIC yet)
> > > >
> > >
> > > The thing is : this patch series is a monster thing adding stuff that
> > > is going to impact 100% % of TCP flows,
> > > even if not used in this NETDMA context.
> > >
> > > So you need to convince us you are really desperate to get this in
> > > upstream linux.
> > >
> > > I have implemented TCP RX zero copy without adding a single line in
> > > standard TCP code.
> >
> > That's a bit of an exaggeration, as I see skb_zcopy_*() calls scattered
> > around the normal TCP code path.  I also haven't changed the normal TCP
> > path either, other than doing some of the same things as skb_zcopy_*().
> > (ignoring the ugly moron about padding out the TCP header, which I'll
> > put under a static_branch_unlikely).
> 
> You are mixing TX zerocopy and RX zero copy.  Quite different things.
> 
> My claim was about TCP RX zero copy (aka tcp_zerocopy_receive())

I understand that (as I'm implementing both sides).  My equivalent of 
tcp_zerocopy_receive is netgpu_recv_skb().

In my variant, I don't actually have a real page, it's just a
placeholder for GPU memory.  So the device driver must obtain the
destination page [dma address] from the netgpu module, and either
deliver it to the user, or return it back to netgpu.

The zc_netdma special bit is there to handle the case where the skb is
freed back to the system - suppose the socket is closed with buffers
sitting on the RX queue.  The existing zero-copy code does not need this
because the RX buffers came from system memory in the first place.


> > The thing is, the existing zero copy code is zero-copy to /host/ memory,
> > which is not the same thing as zero-copy to other memory areas.
> 
> You have to really explain what difference it makes, and why current
> stuff can not be extended.

For RX, the ability to return the 'pages' back to the originator.  For
TX, making sure the pages are not touched by the host, as this would be
an immediate kernel panic.  For example, skb_copy_ubufs() is verboten.

The netgpu and the existing zerocopy code can likely be merged with some
additional work, but they still have different requirements, and
different user APIs.
-- 
Jonathan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-28  1:48         ` Jonathan Lemon
@ 2020-07-28  6:47           ` Christoph Hellwig
  2020-07-28 16:05             ` Jonathan Lemon
  2020-07-28 18:19           ` Jason Gunthorpe
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2020-07-28  6:47 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: Christoph Hellwig, netdev, kernel-team, robin.murphy, akpm,
	davem, kuba, willemb, edumazet, steffen.klassert, saeedm,
	maximmi, bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 06:48:12PM -0700, Jonathan Lemon wrote:
> I'm aware that Nvidia code is maintained outside the tree, so this last
> patch may better placed there; I included it here so reviewers can see
> how things work.

Sorry dude, but with statements like this you really disqualify yourself
from kernel work.  Please really just go away and stop this crap.  It's
not just your attitude, but also the resulting crap code.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg()
  2020-07-28  2:17               ` Eric Dumazet
  2020-07-28  3:08                 ` Jonathan Lemon
@ 2020-07-28  6:50                 ` Christoph Hellwig
  1 sibling, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2020-07-28  6:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jonathan Lemon, netdev, kernel-team, Christoph Hellwig,
	Robin Murphy, Andrew Morton, David Miller, Jakub Kicinski,
	Willem de Bruijn, Steffen Klassert, Saeed Mahameed,
	Maxim Mikityanskiy, bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 07:17:51PM -0700, Eric Dumazet wrote:
> > The thing is, the existing zero copy code is zero-copy to /host/ memory,
> > which is not the same thing as zero-copy to other memory areas.
> 
> You have to really explain what difference it makes, and why current
> stuff can not be extended.

There basically is none.  You need to call a different dma mapping
routine and make sure to never access the device pages.  See
drivers/pci/p2pdma.c and its users for an example on how to do it
properly.  But the author wants all his crazy hacks to hook into his
proprietary crap, so everyone should be ignore this horrible series.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-28  6:47           ` Christoph Hellwig
@ 2020-07-28 16:05             ` Jonathan Lemon
  2020-07-28 16:10               ` Christoph Hellwig
  0 siblings, 1 reply; 29+ messages in thread
From: Jonathan Lemon @ 2020-07-28 16:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: netdev, kernel-team, robin.murphy, akpm, davem, kuba, willemb,
	edumazet, steffen.klassert, saeedm, maximmi, bjorn.topel,
	magnus.karlsson, borisp, david

On Tue, Jul 28, 2020 at 08:47:06AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 27, 2020 at 06:48:12PM -0700, Jonathan Lemon wrote:
> > I'm aware that Nvidia code is maintained outside the tree, so this last
> > patch may better placed there; I included it here so reviewers can see
> > how things work.
> 
> Sorry dude, but with statements like this you really disqualify yourself
> from kernel work.  Please really just go away and stop this crap.  It's
> not just your attitude, but also the resulting crap code.

The attitude appears to be on your part, as apparently you have no
technical comments to contribute here.
-- 
Jonathan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-28 16:05             ` Jonathan Lemon
@ 2020-07-28 16:10               ` Christoph Hellwig
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2020-07-28 16:10 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: Christoph Hellwig, netdev, kernel-team, robin.murphy, akpm,
	davem, kuba, willemb, edumazet, steffen.klassert, saeedm,
	maximmi, bjorn.topel, magnus.karlsson, borisp, david

On Tue, Jul 28, 2020 at 09:05:08AM -0700, Jonathan Lemon wrote:
> On Tue, Jul 28, 2020 at 08:47:06AM +0200, Christoph Hellwig wrote:
> > On Mon, Jul 27, 2020 at 06:48:12PM -0700, Jonathan Lemon wrote:
> > > I'm aware that Nvidia code is maintained outside the tree, so this last
> > > patch may better placed there; I included it here so reviewers can see
> > > how things work.
> > 
> > Sorry dude, but with statements like this you really disqualify yourself
> > from kernel work.  Please really just go away and stop this crap.  It's
> > not just your attitude, but also the resulting crap code.
> 
> The attitude appears to be on your part, as apparently you have no
> technical comments to contribute here.

You are submitting a series just to support a proprietary driver, and
that is disqualification enough on policy grounds.  And now just stop
the crap.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-28  1:48         ` Jonathan Lemon
  2020-07-28  6:47           ` Christoph Hellwig
@ 2020-07-28 18:19           ` Jason Gunthorpe
  2020-07-28 21:01             ` Jonathan Lemon
  1 sibling, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2020-07-28 18:19 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: Christoph Hellwig, netdev, kernel-team, robin.murphy, akpm,
	davem, kuba, willemb, edumazet, steffen.klassert, saeedm,
	maximmi, bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 06:48:12PM -0700, Jonathan Lemon wrote:

> While the current GPU utilized is nvidia, there's nothing in the rest of
> the patches specific to Nvidia - an Intel or AMD GPU interface could be
> equally workable.

I think that is very misleading.

It looks like this patch, and all the ugly MM stuff, is done the way
it is *specifically* to match the clunky nv_p2p interface that only
the NVIDIA driver exposes.

Any approach done in tree, where we can actually modify the GPU
driver, would do sane things like have the GPU driver itself create
the MEMORY_DEVICE_PCI_P2PDMA pages, use the P2P DMA API framework, use
dmabuf for the cross-driver attachment, etc, etc.

Of course none of this is possible with a proprietary driver. I could
give you such detailed feedback but it is completely wasted since you
can't actually use it or implement it.

Which is why the prohibition on building APIs to support out of tree
modules exists - we can't do a good job if our hands are tied by being
unable to change something.

This is really a textbook example of why this is the correct
philosophy.

If you are serious about advancing this then the initial patches in a
long road must be focused on building up the core kernel
infrastructure for P2P DMA to a point where netdev could consume
it. There has been a lot of different ideas thrown about on how to do
this over the years.

As you've seen, posting patches so tightly coupled to the NVIDIA GPU
implementation just makes people angry, I also advise against doing it
any further.

> I think this is a better patch than all the various implementations of
> the protocol stack in the form of RDMA, driver code and device firmware.

Oh? You mean "better" in the sense the header split offload in the NIC
is better liked than a full protocol running in the NIC?

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-28 18:19           ` Jason Gunthorpe
@ 2020-07-28 21:01             ` Jonathan Lemon
  2020-07-28 21:14               ` Christoph Hellwig
  2020-07-28 23:38               ` Jason Gunthorpe
  0 siblings, 2 replies; 29+ messages in thread
From: Jonathan Lemon @ 2020-07-28 21:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, netdev, kernel-team, robin.murphy, akpm,
	davem, kuba, willemb, edumazet, steffen.klassert, saeedm,
	maximmi, bjorn.topel, magnus.karlsson, borisp, david

On Tue, Jul 28, 2020 at 03:19:04PM -0300, Jason Gunthorpe wrote:
> On Mon, Jul 27, 2020 at 06:48:12PM -0700, Jonathan Lemon wrote:
> 
> > While the current GPU utilized is nvidia, there's nothing in the rest of
> > the patches specific to Nvidia - an Intel or AMD GPU interface could be
> > equally workable.
> 
> I think that is very misleading.
> 
> It looks like this patch, and all the ugly MM stuff, is done the way
> it is *specifically* to match the clunky nv_p2p interface that only
> the NVIDIA driver exposes.

For /this/ patch [21], this is quite true.  I'm forced to use the nv_p2p
API if I want to use the hardware that I have.  What's being overlooked
is that the host mem driver does not do this, nor would another GPU
if it used p2p_dma.  I'm just providing get_page, put_page, get_dma.


> Any approach done in tree, where we can actually modify the GPU
> driver, would do sane things like have the GPU driver itself create
> the MEMORY_DEVICE_PCI_P2PDMA pages, use the P2P DMA API framework, use
> dmabuf for the cross-driver attachment, etc, etc.

So why doesn't Nvidia implement the above in the driver?
Actually a serious question, not trolling here.


> If you are serious about advancing this then the initial patches in a
> long road must be focused on building up the core kernel
> infrastructure for P2P DMA to a point where netdev could consume
> it. There has been a lot of different ideas thrown about on how to do
> this over the years.

Yes, I'm serious about doing this work, and may not have seen or
remember all the various ideas I've seen over time.  The netstack
operates on pages - are you advocating replacing them with sglists?


> > I think this is a better patch than all the various implementations of
> > the protocol stack in the form of RDMA, driver code and device firmware.
> 
> Oh? You mean "better" in the sense the header split offload in the NIC
> is better liked than a full protocol running in the NIC?

Yes.  The NIC firmware should become simpler, not more complicated.
-- 
Jonathan


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-28 21:01             ` Jonathan Lemon
@ 2020-07-28 21:14               ` Christoph Hellwig
  2020-07-28 23:38               ` Jason Gunthorpe
  1 sibling, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2020-07-28 21:14 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: Jason Gunthorpe, Christoph Hellwig, netdev, kernel-team,
	robin.murphy, akpm, davem, kuba, willemb, edumazet,
	steffen.klassert, saeedm, maximmi, bjorn.topel, magnus.karlsson,
	borisp, david

On Tue, Jul 28, 2020 at 02:01:16PM -0700, Jonathan Lemon wrote:
> For /this/ patch [21], this is quite true.  I'm forced to use the nv_p2p
> API if I want to use the hardware that I have.  What's being overlooked
> is that the host mem driver does not do this, nor would another GPU
> if it used p2p_dma.  I'm just providing get_page, put_page, get_dma.

No, that is not overlooked.   What you overlooked is that your design
is bogus.  We don't do pointless plugings for something where there
is exactly one way to do, which is done in common infrastructure.

> 
> 
> > Any approach done in tree, where we can actually modify the GPU
> > driver, would do sane things like have the GPU driver itself create
> > the MEMORY_DEVICE_PCI_P2PDMA pages, use the P2P DMA API framework, use
> > dmabuf for the cross-driver attachment, etc, etc.
> 
> So why doesn't Nvidia implement the above in the driver?
> Actually a serious question, not trolling here.

No one knows, and most importantly it simply does not matter at all.

If you actually want to be productive contributor and not just troll
(as it absoutely seems) you'd just stop talking about a out of tree
driver that can't actually be distributed together with the kernel.

And better forget everything you ever knew about it, because it not
only is irrelevant but actually harmful.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-28 21:01             ` Jonathan Lemon
  2020-07-28 21:14               ` Christoph Hellwig
@ 2020-07-28 23:38               ` Jason Gunthorpe
  1 sibling, 0 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2020-07-28 23:38 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: Christoph Hellwig, netdev, kernel-team, robin.murphy, akpm,
	davem, kuba, willemb, edumazet, steffen.klassert, saeedm,
	maximmi, bjorn.topel, magnus.karlsson, borisp, david

On Tue, Jul 28, 2020 at 02:01:16PM -0700, Jonathan Lemon wrote:
> On Tue, Jul 28, 2020 at 03:19:04PM -0300, Jason Gunthorpe wrote:
> > On Mon, Jul 27, 2020 at 06:48:12PM -0700, Jonathan Lemon wrote:
> > 
> > > While the current GPU utilized is nvidia, there's nothing in the rest of
> > > the patches specific to Nvidia - an Intel or AMD GPU interface could be
> > > equally workable.
> > 
> > I think that is very misleading.
> > 
> > It looks like this patch, and all the ugly MM stuff, is done the way
> > it is *specifically* to match the clunky nv_p2p interface that only
> > the NVIDIA driver exposes.
> 
> For /this/ patch [21], this is quite true.  I'm forced to use the nv_p2p
> API if I want to use the hardware that I have.  What's being overlooked
> is that the host mem driver does not do this, nor would another GPU
> if it used p2p_dma.  I'm just providing get_page, put_page, get_dma.

Not really, the design copied the nv_p2p api design directly into
struct netgpu_functions and then aligned the rest of the parts to use
it too. Yes, other GPU drivers could also be squeezed into this API,
but if you'd never looked at the NVIDIA driver you'd never pick such a
design. It is inherently disconnected from the MM.

> > Any approach done in tree, where we can actually modify the GPU
> > driver, would do sane things like have the GPU driver itself create
> > the MEMORY_DEVICE_PCI_P2PDMA pages, use the P2P DMA API framework, use
> > dmabuf for the cross-driver attachment, etc, etc.
> 
> So why doesn't Nvidia implement the above in the driver?
> Actually a serious question, not trolling here.

A kernel mailing list is not appropriate place to discuss a
proprietary out of tree driver, take questions like that with your
support channel.

> > If you are serious about advancing this then the initial patches in a
> > long road must be focused on building up the core kernel
> > infrastructure for P2P DMA to a point where netdev could consume
> > it. There has been a lot of different ideas thrown about on how to do
> > this over the years.
> 
> Yes, I'm serious about doing this work, and may not have seen or
> remember all the various ideas I've seen over time.  The netstack
> operates on pages - are you advocating replacing them with sglists?

So far, the general expectation is that any pages would be ZONE_DEVICE
MEMORY_DEVICE_PCI_P2PDMA pages created by the PCI device's driver to
cover the device's BAR. These are __iomem pages so they can't be
intermixed in the kernel with system memory pages. That detail has
a been a large stumbling block in most cases.

Resolving this design issue removes most of the MM hackery in the
netgpu. Though, I have no idea if you can intermix ZONE_DEVICE pages
into skb's in the net stack.

From there, I'd expect the pages are mmaped into userspace in a VMA or
passed into a userspace dmabuf FD.

At this point, consumers, like the net stack should rely on some core
APIs to extract the pages and DMA maps from the user objects. Either
some new pin_user_pages() variant for VMAs or via the work that was
started on the dma_buf_map_attachment() for dmabuf.

From there it needs to handle everything carefully and then call over
to the pcip2p code to validate and dma map them. There are many
missing little details along this path.

Overall there should be nothing like 'netgpu'. This is all just some
special case of an existing user memory flow where the user pages
being touched are allowed to be P2P pages or system memory and the
flow knows how to deal with the difference.

More or less. All of these touch points need beefing up and additional
features.

> > > I think this is a better patch than all the various implementations of
> > > the protocol stack in the form of RDMA, driver code and device firmware.
> > 
> > Oh? You mean "better" in the sense the header split offload in the NIC
> > is better liked than a full protocol running in the NIC?
> 
> Yes.  The NIC firmware should become simpler, not more complicated.

Do you have any application benchmarks? The typical AI communication
pattern is very challenging and a state of the art RDMA implementation
gets incredible performance.

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC PATCH v2 13/21] net/tcp: Pad TCP options out to a fixed size for netgpu
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
@ 2020-07-27 22:44 ` Jonathan Lemon
  0 siblings, 0 replies; 29+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

From: Jonathan Lemon <bsd@fb.com>

The "header splitting" feature used by netgpu doesn't actually parse
the incoming packet header.  Instead, it splits the packet at a fixed
offset.  In order for this to work, the sender needs to send packets
with a fixed header size.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 net/ipv4/tcp_output.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d8f16f6a9b02..e8a74d0f7ad2 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -438,6 +438,7 @@ struct tcp_out_options {
 	u8 ws;			/* window scale, 0 to disable */
 	u8 num_sack_blocks;	/* number of SACK blocks to include */
 	u8 hash_size;		/* bytes in hash_location */
+	u8 pad_size;		/* additional nops for padding */
 	__u8 *hash_location;	/* temporary pointer, overloaded */
 	__u32 tsval, tsecr;	/* need to include OPTION_TS */
 	struct tcp_fastopen_cookie *fastopen_cookie;	/* Fast open cookie */
@@ -562,6 +563,17 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 	smc_options_write(ptr, &options);
 
 	mptcp_options_write(ptr, opts);
+
+#if IS_ENABLED(CONFIG_NETGPU)
+	/* pad out options */
+	if (opts->pad_size) {
+		int len = opts->pad_size;
+		u8 *p = (u8 *)ptr;
+
+		while (len--)
+			*p++ = TCPOPT_NOP;
+	}
+#endif
 }
 
 static void smc_set_option(const struct tcp_sock *tp,
@@ -826,6 +838,14 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
 			opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
 	}
 
+#if IS_ENABLED(CONFIG_NETGPU)
+	/* force padding */
+	if (size < 20) {
+		opts->pad_size = 20 - size;
+		size += opts->pad_size;
+	}
+#endif
+
 	return size;
 }
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2020-07-28 23:38 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20200727052846.4070247-1-jonathan.lemon@gmail.com>
     [not found] ` <20200727052846.4070247-4-jonathan.lemon@gmail.com>
2020-07-27  5:58   ` [RFC PATCH v2 03/21] mm: Allow DMA mapping of pages which are not online Christoph Hellwig
2020-07-27 17:08     ` Jonathan Lemon
     [not found] ` <20200727052846.4070247-22-jonathan.lemon@gmail.com>
2020-07-27  7:35   ` [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu Christoph Hellwig
2020-07-27 17:00     ` Jonathan Lemon
2020-07-27 18:24       ` Christoph Hellwig
2020-07-28  1:48         ` Jonathan Lemon
2020-07-28  6:47           ` Christoph Hellwig
2020-07-28 16:05             ` Jonathan Lemon
2020-07-28 16:10               ` Christoph Hellwig
2020-07-28 18:19           ` Jason Gunthorpe
2020-07-28 21:01             ` Jonathan Lemon
2020-07-28 21:14               ` Christoph Hellwig
2020-07-28 23:38               ` Jason Gunthorpe
     [not found] ` <20200727052846.4070247-14-jonathan.lemon@gmail.com>
2020-07-27 15:19   ` [RFC PATCH v2 13/21] net/tcp: Pad TCP options out to a fixed size " Eric Dumazet
2020-07-27 17:20     ` Jonathan Lemon
     [not found] ` <20200727052846.4070247-16-jonathan.lemon@gmail.com>
2020-07-27 15:19   ` [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg() Eric Dumazet
2020-07-27 15:55     ` Jonathan Lemon
2020-07-27 16:09       ` Eric Dumazet
2020-07-27 17:35         ` Jonathan Lemon
2020-07-27 17:44           ` Eric Dumazet
2020-07-28  2:11             ` Jonathan Lemon
2020-07-28  2:17               ` Eric Dumazet
2020-07-28  3:08                 ` Jonathan Lemon
2020-07-28  6:50                 ` Christoph Hellwig
     [not found] ` <20200727052846.4070247-9-jonathan.lemon@gmail.com>
2020-07-27 15:24   ` [RFC PATCH v2 08/21] skbuff: add a zc_netgpu bitflag Eric Dumazet
2020-07-27 16:59     ` Jonathan Lemon
2020-07-27 17:08       ` Eric Dumazet
2020-07-27 17:16         ` Jonathan Lemon
2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 13/21] net/tcp: Pad TCP options out to a fixed size for netgpu Jonathan Lemon

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.