Re: [RFC PATCH v2 03/21] mm: Allow DMA mapping of pages which are not online

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [RFC PATCH v2 03/21] mm: Allow DMA mapping of pages which are not online
       [not found] ` <20200727052846.4070247-4-jonathan.lemon@gmail.com>
@ 2020-07-27  5:58   ` Christoph Hellwig
  2020-07-27 17:08     ` Jonathan Lemon
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2020-07-27  5:58 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, kernel-team, hch, robin.murphy, akpm, davem, kuba,
	willemb, edumazet, steffen.klassert, saeedm, maximmi,
	bjorn.topel, magnus.karlsson, borisp, david

On Sun, Jul 26, 2020 at 10:28:28PM -0700, Jonathan Lemon wrote:
> Change the system RAM check from 'valid' to 'online', so dummy
> pages which refer to external DMA resources can be mapped.


NAK.  This looks completely bogus.  Maybe you do have a good reason
somewhere (although I doubt it), but then you'd actualy need to both
explain it here, and also actually make sure I get the whole damn series.

Until you fix up how you send these stupid partial cc lists I'll just
ignore this crap.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
       [not found] ` <20200727052846.4070247-22-jonathan.lemon@gmail.com>
@ 2020-07-27  7:35   ` Christoph Hellwig
  2020-07-27 17:00     ` Jonathan Lemon
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2020-07-27  7:35 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, kernel-team, hch, robin.murphy, akpm, davem, kuba,
	willemb, edumazet, steffen.klassert, saeedm, maximmi,
	bjorn.topel, magnus.karlsson, borisp, david

Seriously?  If you only even considered this is something reasonable
to do you should not be anywhere near Linux kernel development.

Just go away!

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 13/21] net/tcp: Pad TCP options out to a fixed size for netgpu
       [not found] ` <20200727052846.4070247-14-jonathan.lemon@gmail.com>
@ 2020-07-27 15:19   ` Eric Dumazet
  2020-07-27 17:20     ` Jonathan Lemon
  0 siblings, 1 reply; 33+ messages in thread
From: Eric Dumazet @ 2020-07-27 15:19 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 12:51 AM Jonathan Lemon
<jonathan.lemon@gmail.com> wrote:
>
> From: Jonathan Lemon <bsd@fb.com>
>
> The "header splitting" feature used by netgpu doesn't actually parse
> the incoming packet header.  Instead, it splits the packet at a fixed
> offset.  In order for this to work, the sender needs to send packets
> with a fixed header size.
>
> Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> ---
>  net/ipv4/tcp_output.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index d8f16f6a9b02..e8a74d0f7ad2 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -438,6 +438,7 @@ struct tcp_out_options {
>         u8 ws;                  /* window scale, 0 to disable */
>         u8 num_sack_blocks;     /* number of SACK blocks to include */
>         u8 hash_size;           /* bytes in hash_location */
> +       u8 pad_size;            /* additional nops for padding */
>         __u8 *hash_location;    /* temporary pointer, overloaded */
>         __u32 tsval, tsecr;     /* need to include OPTION_TS */
>         struct tcp_fastopen_cookie *fastopen_cookie;    /* Fast open cookie */
> @@ -562,6 +563,17 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
>         smc_options_write(ptr, &options);
>
>         mptcp_options_write(ptr, opts);
> +
> +#if IS_ENABLED(CONFIG_NETGPU)
> +       /* pad out options */
> +       if (opts->pad_size) {
> +               int len = opts->pad_size;
> +               u8 *p = (u8 *)ptr;
> +
> +               while (len--)
> +                       *p++ = TCPOPT_NOP;
> +       }
> +#endif
>  }
>
>  static void smc_set_option(const struct tcp_sock *tp,
> @@ -826,6 +838,14 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
>                         opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
>         }
>
> +#if IS_ENABLED(CONFIG_NETGPU)
> +       /* force padding */
> +       if (size < 20) {
> +               opts->pad_size = 20 - size;
> +               size += opts->pad_size;
> +       }
> +#endif
> +

This is obviously wrong, as any kernel compiled with CONFIG_NETGPU
will fail all packetdrill tests suite.

Also the fixed 20 value is not pretty.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg()
       [not found] ` <20200727052846.4070247-16-jonathan.lemon@gmail.com>
@ 2020-07-27 15:19   ` Eric Dumazet
  2020-07-27 15:55     ` Jonathan Lemon
  0 siblings, 1 reply; 33+ messages in thread
From: Eric Dumazet @ 2020-07-27 15:19 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 12:51 AM Jonathan Lemon
<jonathan.lemon@gmail.com> wrote:
>
> This flag indicates that the attached data is a zero-copy send,
> and the pages should be retrieved from the netgpu module.  The
> socket should should already have been attached to a netgpu queue.
>
> Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> ---
>  include/linux/socket.h | 1 +
>  net/ipv4/tcp.c         | 8 ++++++++
>  2 files changed, 9 insertions(+)
>
> diff --git a/include/linux/socket.h b/include/linux/socket.h
> index 04d2bc97f497..63816cc25dee 100644
> --- a/include/linux/socket.h
> +++ b/include/linux/socket.h
> @@ -310,6 +310,7 @@ struct ucred {
>                                           */
>
>  #define MSG_ZEROCOPY   0x4000000       /* Use user data in kernel path */
> +#define MSG_NETDMA     0x8000000
>  #define MSG_FASTOPEN   0x20000000      /* Send data in TCP SYN */
>  #define MSG_CMSG_CLOEXEC 0x40000000    /* Set close_on_exec for file
>                                            descriptor received through
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 261c28ccc8f6..340ce319edc9 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1214,6 +1214,14 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>                         uarg->zerocopy = 0;
>         }
>
> +       if (flags & MSG_NETDMA && size && sock_flag(sk, SOCK_ZEROCOPY)) {
> +               zc = sk->sk_route_caps & NETIF_F_SG;
> +               if (!zc) {
> +                       err = -EFAULT;
> +                       goto out_err;
> +               }
> +       }
>

Sorry, no, we can not allow adding yet another branch into TCP fast
path for yet another variant of zero copy.

Overall, I think your patch series desperately tries to add changes in
TCP stack, while there is yet no proof
that you have to use TCP transport between the peers.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 08/21] skbuff: add a zc_netgpu bitflag
       [not found] ` <20200727052846.4070247-9-jonathan.lemon@gmail.com>
@ 2020-07-27 15:24   ` Eric Dumazet
  2020-07-27 16:59     ` Jonathan Lemon
  0 siblings, 1 reply; 33+ messages in thread
From: Eric Dumazet @ 2020-07-27 15:24 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 12:20 AM Jonathan Lemon
<jonathan.lemon@gmail.com> wrote:
>
> This could likely be moved elsewhere.  The presence of the flag on
> the skb indicates that one of the fragments may contain zerocopy
> RX data, where the data is not accessible to the cpu.

Why do we need yet another flag in skb exactly ?

Please define what means "data not accessible to the cpu" ?

This kind of change is a red flag for me.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg()
  2020-07-27 15:19   ` [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg() Eric Dumazet
@ 2020-07-27 15:55     ` Jonathan Lemon
  2020-07-27 16:09       ` Eric Dumazet
  0 siblings, 1 reply; 33+ messages in thread
From: Jonathan Lemon @ 2020-07-27 15:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 08:19:43AM -0700, Eric Dumazet wrote:
> On Mon, Jul 27, 2020 at 12:51 AM Jonathan Lemon
> <jonathan.lemon@gmail.com> wrote:
> >
> > This flag indicates that the attached data is a zero-copy send,
> > and the pages should be retrieved from the netgpu module.  The
> > socket should should already have been attached to a netgpu queue.
> >
> > Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> > ---
> >  include/linux/socket.h | 1 +
> >  net/ipv4/tcp.c         | 8 ++++++++
> >  2 files changed, 9 insertions(+)
> >
> > diff --git a/include/linux/socket.h b/include/linux/socket.h
> > index 04d2bc97f497..63816cc25dee 100644
> > --- a/include/linux/socket.h
> > +++ b/include/linux/socket.h
> > @@ -310,6 +310,7 @@ struct ucred {
> >                                           */
> >
> >  #define MSG_ZEROCOPY   0x4000000       /* Use user data in kernel path */
> > +#define MSG_NETDMA     0x8000000
> >  #define MSG_FASTOPEN   0x20000000      /* Send data in TCP SYN */
> >  #define MSG_CMSG_CLOEXEC 0x40000000    /* Set close_on_exec for file
> >                                            descriptor received through
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 261c28ccc8f6..340ce319edc9 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -1214,6 +1214,14 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> >                         uarg->zerocopy = 0;
> >         }
> >
> > +       if (flags & MSG_NETDMA && size && sock_flag(sk, SOCK_ZEROCOPY)) {
> > +               zc = sk->sk_route_caps & NETIF_F_SG;
> > +               if (!zc) {
> > +                       err = -EFAULT;
> > +                       goto out_err;
> > +               }
> > +       }
> >
> 
> Sorry, no, we can not allow adding yet another branch into TCP fast
> path for yet another variant of zero copy.

I'm not in disagreement with that statement, but the existing zerocopy
work makes some assumptions that aren't suitable.  I take it that you'd
rather have things folded together so the old/new code works together?

Allocating an extra structure for every skbuff isn't ideal in my book.


> Overall, I think your patch series desperately tries to add changes in
> TCP stack, while there is yet no proof
> that you have to use TCP transport between the peers.

The goal is having a reliable transport without resorting to RDMA.
-- 
Jonathan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg()
  2020-07-27 15:55     ` Jonathan Lemon
@ 2020-07-27 16:09       ` Eric Dumazet
  2020-07-27 17:35         ` Jonathan Lemon
  0 siblings, 1 reply; 33+ messages in thread
From: Eric Dumazet @ 2020-07-27 16:09 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 8:56 AM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
>
> On Mon, Jul 27, 2020 at 08:19:43AM -0700, Eric Dumazet wrote:
> > On Mon, Jul 27, 2020 at 12:51 AM Jonathan Lemon
> > <jonathan.lemon@gmail.com> wrote:
> > >
> > > This flag indicates that the attached data is a zero-copy send,
> > > and the pages should be retrieved from the netgpu module.  The
> > > socket should should already have been attached to a netgpu queue.
> > >
> > > Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> > > ---
> > >  include/linux/socket.h | 1 +
> > >  net/ipv4/tcp.c         | 8 ++++++++
> > >  2 files changed, 9 insertions(+)
> > >
> > > diff --git a/include/linux/socket.h b/include/linux/socket.h
> > > index 04d2bc97f497..63816cc25dee 100644
> > > --- a/include/linux/socket.h
> > > +++ b/include/linux/socket.h
> > > @@ -310,6 +310,7 @@ struct ucred {
> > >                                           */
> > >
> > >  #define MSG_ZEROCOPY   0x4000000       /* Use user data in kernel path */
> > > +#define MSG_NETDMA     0x8000000
> > >  #define MSG_FASTOPEN   0x20000000      /* Send data in TCP SYN */
> > >  #define MSG_CMSG_CLOEXEC 0x40000000    /* Set close_on_exec for file
> > >                                            descriptor received through
> > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > index 261c28ccc8f6..340ce319edc9 100644
> > > --- a/net/ipv4/tcp.c
> > > +++ b/net/ipv4/tcp.c
> > > @@ -1214,6 +1214,14 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > >                         uarg->zerocopy = 0;
> > >         }
> > >
> > > +       if (flags & MSG_NETDMA && size && sock_flag(sk, SOCK_ZEROCOPY)) {
> > > +               zc = sk->sk_route_caps & NETIF_F_SG;
> > > +               if (!zc) {
> > > +                       err = -EFAULT;
> > > +                       goto out_err;
> > > +               }
> > > +       }
> > >
> >
> > Sorry, no, we can not allow adding yet another branch into TCP fast
> > path for yet another variant of zero copy.
>
> I'm not in disagreement with that statement, but the existing zerocopy
> work makes some assumptions that aren't suitable.  I take it that you'd
> rather have things folded together so the old/new code works together?

Exact.  Forcing users to use MSG_NETDMA, yet reusing SOCK_ZEROCOPY is silly.

SOCK_ZEROCOPY has been added to that user space and kernel would agree
on MSG_ZEROCOPY being not a nop (as it was on old kernels)

>
> Allocating an extra structure for every skbuff isn't ideal in my book.
>

We do not allocate a structure for every skbuff. Please look again.


>
> > Overall, I think your patch series desperately tries to add changes in
> > TCP stack, while there is yet no proof
> > that you have to use TCP transport between the peers.
>
> The goal is having a reliable transport without resorting to RDMA.

And why should it be TCP ?

Are you dealing with lost packets, retransmits, timers, and al  ?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 08/21] skbuff: add a zc_netgpu bitflag
  2020-07-27 15:24   ` [RFC PATCH v2 08/21] skbuff: add a zc_netgpu bitflag Eric Dumazet
@ 2020-07-27 16:59     ` Jonathan Lemon
  2020-07-27 17:08       ` Eric Dumazet
  0 siblings, 1 reply; 33+ messages in thread
From: Jonathan Lemon @ 2020-07-27 16:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 08:24:55AM -0700, Eric Dumazet wrote:
> On Mon, Jul 27, 2020 at 12:20 AM Jonathan Lemon
> <jonathan.lemon@gmail.com> wrote:
> >
> > This could likely be moved elsewhere.  The presence of the flag on
> > the skb indicates that one of the fragments may contain zerocopy
> > RX data, where the data is not accessible to the cpu.
> 
> Why do we need yet another flag in skb exactly ?
> 
> Please define what means "data not accessible to the cpu" ?
> 
> This kind of change is a red flag for me.

The architecture this is targeting is a ML cluster, where a 200Gbps NIC
is attached to a PCIe switch which also has a GPU card attached.  There
are several of these, and the link(s) to the host cpu (which has another
NIC attached) can't handle the incoming traffic.

So what we're doing here is transferring the data directly from the NIC
to the GPU via DMA.  The host never sees the data, but can control it
indirectly via the handles returned to userspace.

I'm not sure that a flag on the skb is the right location for this -
perhaps moving it into skb_shared() instead would be better?
-- 
Jonathan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-27  7:35   ` [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu Christoph Hellwig
@ 2020-07-27 17:00     ` Jonathan Lemon
  2020-07-27 18:24       ` Christoph Hellwig
  0 siblings, 1 reply; 33+ messages in thread
From: Jonathan Lemon @ 2020-07-27 17:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: netdev, kernel-team, robin.murphy, akpm, davem, kuba, willemb,
	edumazet, steffen.klassert, saeedm, maximmi, bjorn.topel,
	magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 09:35:09AM +0200, Christoph Hellwig wrote:
> Seriously?  If you only even considered this is something reasonable
> to do you should not be anywhere near Linux kernel development.
> 
> Just go away!

This isn't really a constructive comment.
-- 
Jonathan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 08/21] skbuff: add a zc_netgpu bitflag
  2020-07-27 16:59     ` Jonathan Lemon
@ 2020-07-27 17:08       ` Eric Dumazet
  2020-07-27 17:16         ` Jonathan Lemon
  0 siblings, 1 reply; 33+ messages in thread
From: Eric Dumazet @ 2020-07-27 17:08 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 10:01 AM Jonathan Lemon
<jonathan.lemon@gmail.com> wrote:
>
> On Mon, Jul 27, 2020 at 08:24:55AM -0700, Eric Dumazet wrote:
> > On Mon, Jul 27, 2020 at 12:20 AM Jonathan Lemon
> > <jonathan.lemon@gmail.com> wrote:
> > >
> > > This could likely be moved elsewhere.  The presence of the flag on
> > > the skb indicates that one of the fragments may contain zerocopy
> > > RX data, where the data is not accessible to the cpu.
> >
> > Why do we need yet another flag in skb exactly ?
> >
> > Please define what means "data not accessible to the cpu" ?
> >
> > This kind of change is a red flag for me.
>
> The architecture this is targeting is a ML cluster, where a 200Gbps NIC
> is attached to a PCIe switch which also has a GPU card attached.  There
> are several of these, and the link(s) to the host cpu (which has another
> NIC attached) can't handle the incoming traffic.
>
> So what we're doing here is transferring the data directly from the NIC
> to the GPU via DMA.  The host never sees the data, but can control it
> indirectly via the handles returned to userspace.
>

This seems to need a page/memory attribute or something.

skb should not have this knowledge, unless you are planning to make
sure that everything accessing skb data is going to test this new flag
and fail if it is set ?


> I'm not sure that a flag on the skb is the right location for this -
> perhaps moving it into skb_shared() instead would be better?
> --
> Jonathan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 03/21] mm: Allow DMA mapping of pages which are not online
  2020-07-27  5:58   ` [RFC PATCH v2 03/21] mm: Allow DMA mapping of pages which are not online Christoph Hellwig
@ 2020-07-27 17:08     ` Jonathan Lemon
  0 siblings, 0 replies; 33+ messages in thread
From: Jonathan Lemon @ 2020-07-27 17:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: netdev, kernel-team, robin.murphy, akpm, davem, kuba, willemb,
	edumazet, steffen.klassert, saeedm, maximmi, bjorn.topel,
	magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 07:58:13AM +0200, Christoph Hellwig wrote:
> On Sun, Jul 26, 2020 at 10:28:28PM -0700, Jonathan Lemon wrote:
> > Change the system RAM check from 'valid' to 'online', so dummy
> > pages which refer to external DMA resources can be mapped.
> 
> NAK.  This looks completely bogus.  Maybe you do have a good reason
> somewhere (although I doubt it), but then you'd actualy need to both
> explain it here, and also actually make sure I get the whole damn series.

The entire patchset was sent out in one command - I have no control over
how the mail server ends up delivering things.

An alternative here would just allocate 'struct pages' from normal
system memory, instead of from the system resources.  I initially did it
this way so the page could be used in the normal fashion to locate the
DMA address, instead of having a separate lookup table.
-- 
Jonathan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 08/21] skbuff: add a zc_netgpu bitflag
  2020-07-27 17:08       ` Eric Dumazet
@ 2020-07-27 17:16         ` Jonathan Lemon
  0 siblings, 0 replies; 33+ messages in thread
From: Jonathan Lemon @ 2020-07-27 17:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 10:08:10AM -0700, Eric Dumazet wrote:
> On Mon, Jul 27, 2020 at 10:01 AM Jonathan Lemon
> <jonathan.lemon@gmail.com> wrote:
> >
> > On Mon, Jul 27, 2020 at 08:24:55AM -0700, Eric Dumazet wrote:
> > > On Mon, Jul 27, 2020 at 12:20 AM Jonathan Lemon
> > > <jonathan.lemon@gmail.com> wrote:
> > > >
> > > > This could likely be moved elsewhere.  The presence of the flag on
> > > > the skb indicates that one of the fragments may contain zerocopy
> > > > RX data, where the data is not accessible to the cpu.
> > >
> > > Why do we need yet another flag in skb exactly ?
> > >
> > > Please define what means "data not accessible to the cpu" ?
> > >
> > > This kind of change is a red flag for me.
> >
> > The architecture this is targeting is a ML cluster, where a 200Gbps NIC
> > is attached to a PCIe switch which also has a GPU card attached.  There
> > are several of these, and the link(s) to the host cpu (which has another
> > NIC attached) can't handle the incoming traffic.
> >
> > So what we're doing here is transferring the data directly from the NIC
> > to the GPU via DMA.  The host never sees the data, but can control it
> > indirectly via the handles returned to userspace.
> >
> 
> This seems to need a page/memory attribute or something.

'struct page' is even more constrained.  I opted to flag the skb since
there should not be mixed zc/non-zc pages in the same skb.  I'd be happy
to see other alternatives.


> skb should not have this knowledge, unless you are planning to make
> sure that everything accessing skb data is going to test this new flag
> and fail if it is set ?

That might be how things end up going - in the samw way skb_zcopy()
works.
--
Jonathan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 13/21] net/tcp: Pad TCP options out to a fixed size for netgpu
  2020-07-27 15:19   ` [RFC PATCH v2 13/21] net/tcp: Pad TCP options out to a fixed size " Eric Dumazet
@ 2020-07-27 17:20     ` Jonathan Lemon
  0 siblings, 0 replies; 33+ messages in thread
From: Jonathan Lemon @ 2020-07-27 17:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 08:19:24AM -0700, Eric Dumazet wrote:
> On Mon, Jul 27, 2020 at 12:51 AM Jonathan Lemon
> <jonathan.lemon@gmail.com> wrote:
> >
> > From: Jonathan Lemon <bsd@fb.com>
> >
> > The "header splitting" feature used by netgpu doesn't actually parse
> > the incoming packet header.  Instead, it splits the packet at a fixed
> > offset.  In order for this to work, the sender needs to send packets
> > with a fixed header size.
> >
> > Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> > ---
> >  net/ipv4/tcp_output.c | 20 ++++++++++++++++++++
> >  1 file changed, 20 insertions(+)
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index d8f16f6a9b02..e8a74d0f7ad2 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -438,6 +438,7 @@ struct tcp_out_options {
> >         u8 ws;                  /* window scale, 0 to disable */
> >         u8 num_sack_blocks;     /* number of SACK blocks to include */
> >         u8 hash_size;           /* bytes in hash_location */
> > +       u8 pad_size;            /* additional nops for padding */
> >         __u8 *hash_location;    /* temporary pointer, overloaded */
> >         __u32 tsval, tsecr;     /* need to include OPTION_TS */
> >         struct tcp_fastopen_cookie *fastopen_cookie;    /* Fast open cookie */
> > @@ -562,6 +563,17 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
> >         smc_options_write(ptr, &options);
> >
> >         mptcp_options_write(ptr, opts);
> > +
> > +#if IS_ENABLED(CONFIG_NETGPU)
> > +       /* pad out options */
> > +       if (opts->pad_size) {
> > +               int len = opts->pad_size;
> > +               u8 *p = (u8 *)ptr;
> > +
> > +               while (len--)
> > +                       *p++ = TCPOPT_NOP;
> > +       }
> > +#endif
> >  }
> >
> >  static void smc_set_option(const struct tcp_sock *tp,
> > @@ -826,6 +838,14 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
> >                         opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
> >         }
> >
> > +#if IS_ENABLED(CONFIG_NETGPU)
> > +       /* force padding */
> > +       if (size < 20) {
> > +               opts->pad_size = 20 - size;
> > +               size += opts->pad_size;
> > +       }
> > +#endif
> > +
> 
> This is obviously wrong, as any kernel compiled with CONFIG_NETGPU
> will fail all packetdrill tests suite.
> 
> Also the fixed 20 value is not pretty.

Would changing this into a sysctl be a suitable solution?  It really is
a temporary solution to handle hardware that doesn't support splitting,
and adding a sysctl seems so permanent.....  
-- 
Jonathan


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg()
  2020-07-27 16:09       ` Eric Dumazet
@ 2020-07-27 17:35         ` Jonathan Lemon
  2020-07-27 17:44           ` Eric Dumazet
  0 siblings, 1 reply; 33+ messages in thread
From: Jonathan Lemon @ 2020-07-27 17:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 09:09:48AM -0700, Eric Dumazet wrote:
> On Mon, Jul 27, 2020 at 8:56 AM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
> >
> > On Mon, Jul 27, 2020 at 08:19:43AM -0700, Eric Dumazet wrote:
> > > On Mon, Jul 27, 2020 at 12:51 AM Jonathan Lemon
> > > <jonathan.lemon@gmail.com> wrote:
> > > >
> > > > This flag indicates that the attached data is a zero-copy send,
> > > > and the pages should be retrieved from the netgpu module.  The
> > > > socket should should already have been attached to a netgpu queue.
> > > >
> > > > Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> > > > ---
> > > >  include/linux/socket.h | 1 +
> > > >  net/ipv4/tcp.c         | 8 ++++++++
> > > >  2 files changed, 9 insertions(+)
> > > >
> > > > diff --git a/include/linux/socket.h b/include/linux/socket.h
> > > > index 04d2bc97f497..63816cc25dee 100644
> > > > --- a/include/linux/socket.h
> > > > +++ b/include/linux/socket.h
> > > > @@ -310,6 +310,7 @@ struct ucred {
> > > >                                           */
> > > >
> > > >  #define MSG_ZEROCOPY   0x4000000       /* Use user data in kernel path */
> > > > +#define MSG_NETDMA     0x8000000
> > > >  #define MSG_FASTOPEN   0x20000000      /* Send data in TCP SYN */
> > > >  #define MSG_CMSG_CLOEXEC 0x40000000    /* Set close_on_exec for file
> > > >                                            descriptor received through
> > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > index 261c28ccc8f6..340ce319edc9 100644
> > > > --- a/net/ipv4/tcp.c
> > > > +++ b/net/ipv4/tcp.c
> > > > @@ -1214,6 +1214,14 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > > >                         uarg->zerocopy = 0;
> > > >         }
> > > >
> > > > +       if (flags & MSG_NETDMA && size && sock_flag(sk, SOCK_ZEROCOPY)) {
> > > > +               zc = sk->sk_route_caps & NETIF_F_SG;
> > > > +               if (!zc) {
> > > > +                       err = -EFAULT;
> > > > +                       goto out_err;
> > > > +               }
> > > > +       }
> > > >
> > >
> > > Sorry, no, we can not allow adding yet another branch into TCP fast
> > > path for yet another variant of zero copy.
> >
> > I'm not in disagreement with that statement, but the existing zerocopy
> > work makes some assumptions that aren't suitable.  I take it that you'd
> > rather have things folded together so the old/new code works together?
> 
> Exact.  Forcing users to use MSG_NETDMA, yet reusing SOCK_ZEROCOPY is silly.
> 
> SOCK_ZEROCOPY has been added to that user space and kernel would agree
> on MSG_ZEROCOPY being not a nop (as it was on old kernels)
> 
> >
> > Allocating an extra structure for every skbuff isn't ideal in my book.
> >
> 
> We do not allocate a structure for every skbuff. Please look again.

I'm looking here:

    uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));

Doesn't sock_zerocopy_realloc() allocate a new structure if the skb
doesn't have one already?


> > > Overall, I think your patch series desperately tries to add changes in
> > > TCP stack, while there is yet no proof
> > > that you have to use TCP transport between the peers.
> >
> > The goal is having a reliable transport without resorting to RDMA.
> 
> And why should it be TCP ?
> 
> Are you dealing with lost packets, retransmits, timers, and al  ?

Yes?  If there was a true lossless medium, RDMA would have taken over by
now.  Or are you suggesting that the transport protocol reliability
should be performed in userspace?  (not all the world is QUIC yet)
-- 
Jonathan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg()
  2020-07-27 17:35         ` Jonathan Lemon
@ 2020-07-27 17:44           ` Eric Dumazet
  2020-07-28  2:11             ` Jonathan Lemon
  0 siblings, 1 reply; 33+ messages in thread
From: Eric Dumazet @ 2020-07-27 17:44 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 10:35 AM Jonathan Lemon
<jonathan.lemon@gmail.com> wrote:
>
> On Mon, Jul 27, 2020 at 09:09:48AM -0700, Eric Dumazet wrote:
> > On Mon, Jul 27, 2020 at 8:56 AM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
> > >
> > > On Mon, Jul 27, 2020 at 08:19:43AM -0700, Eric Dumazet wrote:
> > > > On Mon, Jul 27, 2020 at 12:51 AM Jonathan Lemon
> > > > <jonathan.lemon@gmail.com> wrote:
> > > > >
> > > > > This flag indicates that the attached data is a zero-copy send,
> > > > > and the pages should be retrieved from the netgpu module.  The
> > > > > socket should should already have been attached to a netgpu queue.
> > > > >
> > > > > Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> > > > > ---
> > > > >  include/linux/socket.h | 1 +
> > > > >  net/ipv4/tcp.c         | 8 ++++++++
> > > > >  2 files changed, 9 insertions(+)
> > > > >
> > > > > diff --git a/include/linux/socket.h b/include/linux/socket.h
> > > > > index 04d2bc97f497..63816cc25dee 100644
> > > > > --- a/include/linux/socket.h
> > > > > +++ b/include/linux/socket.h
> > > > > @@ -310,6 +310,7 @@ struct ucred {
> > > > >                                           */
> > > > >
> > > > >  #define MSG_ZEROCOPY   0x4000000       /* Use user data in kernel path */
> > > > > +#define MSG_NETDMA     0x8000000
> > > > >  #define MSG_FASTOPEN   0x20000000      /* Send data in TCP SYN */
> > > > >  #define MSG_CMSG_CLOEXEC 0x40000000    /* Set close_on_exec for file
> > > > >                                            descriptor received through
> > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > > index 261c28ccc8f6..340ce319edc9 100644
> > > > > --- a/net/ipv4/tcp.c
> > > > > +++ b/net/ipv4/tcp.c
> > > > > @@ -1214,6 +1214,14 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > > > >                         uarg->zerocopy = 0;
> > > > >         }
> > > > >
> > > > > +       if (flags & MSG_NETDMA && size && sock_flag(sk, SOCK_ZEROCOPY)) {
> > > > > +               zc = sk->sk_route_caps & NETIF_F_SG;
> > > > > +               if (!zc) {
> > > > > +                       err = -EFAULT;
> > > > > +                       goto out_err;
> > > > > +               }
> > > > > +       }
> > > > >
> > > >
> > > > Sorry, no, we can not allow adding yet another branch into TCP fast
> > > > path for yet another variant of zero copy.
> > >
> > > I'm not in disagreement with that statement, but the existing zerocopy
> > > work makes some assumptions that aren't suitable.  I take it that you'd
> > > rather have things folded together so the old/new code works together?
> >
> > Exact.  Forcing users to use MSG_NETDMA, yet reusing SOCK_ZEROCOPY is silly.
> >
> > SOCK_ZEROCOPY has been added to that user space and kernel would agree
> > on MSG_ZEROCOPY being not a nop (as it was on old kernels)
> >
> > >
> > > Allocating an extra structure for every skbuff isn't ideal in my book.
> > >
> >
> > We do not allocate a structure for every skbuff. Please look again.
>
> I'm looking here:
>
>     uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
>
> Doesn't sock_zerocopy_realloc() allocate a new structure if the skb
> doesn't have one already?
>
>
> > > > Overall, I think your patch series desperately tries to add changes in
> > > > TCP stack, while there is yet no proof
> > > > that you have to use TCP transport between the peers.
> > >
> > > The goal is having a reliable transport without resorting to RDMA.
> >
> > And why should it be TCP ?
> >
> > Are you dealing with lost packets, retransmits, timers, and al  ?
>
> Yes?  If there was a true lossless medium, RDMA would have taken over by
> now.  Or are you suggesting that the transport protocol reliability
> should be performed in userspace?  (not all the world is QUIC yet)
>

The thing is : this patch series is a monster thing adding stuff that
is going to impact 100% % of TCP flows,
even if not used in this NETDMA context.

So you need to convince us you are really desperate to get this in
upstream linux.

I have implemented TCP RX zero copy without adding a single line in
standard TCP code.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-27 17:00     ` Jonathan Lemon
@ 2020-07-27 18:24       ` Christoph Hellwig
  2020-07-28  1:48         ` Jonathan Lemon
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2020-07-27 18:24 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: Christoph Hellwig, netdev, kernel-team, robin.murphy, akpm,
	davem, kuba, willemb, edumazet, steffen.klassert, saeedm,
	maximmi, bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 10:00:03AM -0700, Jonathan Lemon wrote:
> On Mon, Jul 27, 2020 at 09:35:09AM +0200, Christoph Hellwig wrote:
> > Seriously?  If you only even considered this is something reasonable
> > to do you should not be anywhere near Linux kernel development.
> > 
> > Just go away!
> 
> This isn't really a constructive comment.

And this wasn't a constructive patch.  If you really think adding tons
of garbage to the kernel to support a proprietary driver you really
should not be here at all.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-27 18:24       ` Christoph Hellwig
@ 2020-07-28  1:48         ` Jonathan Lemon
  2020-07-28  6:47           ` Christoph Hellwig
  2020-07-28 18:19           ` Jason Gunthorpe
  0 siblings, 2 replies; 33+ messages in thread
From: Jonathan Lemon @ 2020-07-28  1:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: netdev, kernel-team, robin.murphy, akpm, davem, kuba, willemb,
	edumazet, steffen.klassert, saeedm, maximmi, bjorn.topel,
	magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 08:24:25PM +0200, Christoph Hellwig wrote:
> On Mon, Jul 27, 2020 at 10:00:03AM -0700, Jonathan Lemon wrote:
> > On Mon, Jul 27, 2020 at 09:35:09AM +0200, Christoph Hellwig wrote:
> > > Seriously?  If you only even considered this is something reasonable
> > > to do you should not be anywhere near Linux kernel development.
> > > 
> > > Just go away!
> > 
> > This isn't really a constructive comment.
> 
> And this wasn't a constructive patch.  If you really think adding tons
> of garbage to the kernel to support a proprietary driver you really
> should not be here at all.

This is not in support of a proprietary driver.  As the cover letter
notes, this is for data transfers between the NIC/GPU, while still
utilizing the kernel protocol stack and leaving the application in
control.

While the current GPU utilized is nvidia, there's nothing in the rest of
the patches specific to Nvidia - an Intel or AMD GPU interface could be
equally workable.

I think this is a better patch than all the various implementations of
the protocol stack in the form of RDMA, driver code and device firmware.

I'm aware that Nvidia code is maintained outside the tree, so this last
patch may better placed there; I included it here so reviewers can see
how things work.
-- 
Jonathan


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg()
  2020-07-27 17:44           ` Eric Dumazet
@ 2020-07-28  2:11             ` Jonathan Lemon
  2020-07-28  2:17               ` Eric Dumazet
  0 siblings, 1 reply; 33+ messages in thread
From: Jonathan Lemon @ 2020-07-28  2:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 10:44:59AM -0700, Eric Dumazet wrote:
> On Mon, Jul 27, 2020 at 10:35 AM Jonathan Lemon
> <jonathan.lemon@gmail.com> wrote:
> >
> > On Mon, Jul 27, 2020 at 09:09:48AM -0700, Eric Dumazet wrote:
> > > On Mon, Jul 27, 2020 at 8:56 AM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
> > > >
> > > > On Mon, Jul 27, 2020 at 08:19:43AM -0700, Eric Dumazet wrote:
> > > > > On Mon, Jul 27, 2020 at 12:51 AM Jonathan Lemon
> > > > > <jonathan.lemon@gmail.com> wrote:
> > > > > >
> > > > > > This flag indicates that the attached data is a zero-copy send,
> > > > > > and the pages should be retrieved from the netgpu module.  The
> > > > > > socket should should already have been attached to a netgpu queue.
> > > > > >
> > > > > > Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> > > > > > ---
> > > > > >  include/linux/socket.h | 1 +
> > > > > >  net/ipv4/tcp.c         | 8 ++++++++
> > > > > >  2 files changed, 9 insertions(+)
> > > > > >
> > > > > > diff --git a/include/linux/socket.h b/include/linux/socket.h
> > > > > > index 04d2bc97f497..63816cc25dee 100644
> > > > > > --- a/include/linux/socket.h
> > > > > > +++ b/include/linux/socket.h
> > > > > > @@ -310,6 +310,7 @@ struct ucred {
> > > > > >                                           */
> > > > > >
> > > > > >  #define MSG_ZEROCOPY   0x4000000       /* Use user data in kernel path */
> > > > > > +#define MSG_NETDMA     0x8000000
> > > > > >  #define MSG_FASTOPEN   0x20000000      /* Send data in TCP SYN */
> > > > > >  #define MSG_CMSG_CLOEXEC 0x40000000    /* Set close_on_exec for file
> > > > > >                                            descriptor received through
> > > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > > > index 261c28ccc8f6..340ce319edc9 100644
> > > > > > --- a/net/ipv4/tcp.c
> > > > > > +++ b/net/ipv4/tcp.c
> > > > > > @@ -1214,6 +1214,14 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > > > > >                         uarg->zerocopy = 0;
> > > > > >         }
> > > > > >
> > > > > > +       if (flags & MSG_NETDMA && size && sock_flag(sk, SOCK_ZEROCOPY)) {
> > > > > > +               zc = sk->sk_route_caps & NETIF_F_SG;
> > > > > > +               if (!zc) {
> > > > > > +                       err = -EFAULT;
> > > > > > +                       goto out_err;
> > > > > > +               }
> > > > > > +       }
> > > > > >
> > > > >
> > > > > Sorry, no, we can not allow adding yet another branch into TCP fast
> > > > > path for yet another variant of zero copy.
> > > >
> > > > I'm not in disagreement with that statement, but the existing zerocopy
> > > > work makes some assumptions that aren't suitable.  I take it that you'd
> > > > rather have things folded together so the old/new code works together?
> > >
> > > Exact.  Forcing users to use MSG_NETDMA, yet reusing SOCK_ZEROCOPY is silly.
> > >
> > > SOCK_ZEROCOPY has been added to that user space and kernel would agree
> > > on MSG_ZEROCOPY being not a nop (as it was on old kernels)
> > >
> > > >
> > > > Allocating an extra structure for every skbuff isn't ideal in my book.
> > > >
> > >
> > > We do not allocate a structure for every skbuff. Please look again.
> >
> > I'm looking here:
> >
> >     uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
> >
> > Doesn't sock_zerocopy_realloc() allocate a new structure if the skb
> > doesn't have one already?
> >
> >
> > > > > Overall, I think your patch series desperately tries to add changes in
> > > > > TCP stack, while there is yet no proof
> > > > > that you have to use TCP transport between the peers.
> > > >
> > > > The goal is having a reliable transport without resorting to RDMA.
> > >
> > > And why should it be TCP ?
> > >
> > > Are you dealing with lost packets, retransmits, timers, and al  ?
> >
> > Yes?  If there was a true lossless medium, RDMA would have taken over by
> > now.  Or are you suggesting that the transport protocol reliability
> > should be performed in userspace?  (not all the world is QUIC yet)
> >
> 
> The thing is : this patch series is a monster thing adding stuff that
> is going to impact 100% % of TCP flows,
> even if not used in this NETDMA context.
> 
> So you need to convince us you are really desperate to get this in
> upstream linux.
> 
> I have implemented TCP RX zero copy without adding a single line in
> standard TCP code.

That's a bit of an exaggeration, as I see skb_zcopy_*() calls scattered
around the normal TCP code path.  I also haven't changed the normal TCP
path either, other than doing some of the same things as skb_zcopy_*().
(ignoring the ugly moron about padding out the TCP header, which I'll
put under a static_branch_unlikely).

The thing is, the existing zero copy code is zero-copy to /host/ memory,
which is not the same thing as zero-copy to other memory areas.  
-- 
Jonathan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg()
  2020-07-28  2:11             ` Jonathan Lemon
@ 2020-07-28  2:17               ` Eric Dumazet
  2020-07-28  3:08                 ` Jonathan Lemon
  2020-07-28  6:50                 ` Christoph Hellwig
  0 siblings, 2 replies; 33+ messages in thread
From: Eric Dumazet @ 2020-07-28  2:17 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 7:11 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
>
> On Mon, Jul 27, 2020 at 10:44:59AM -0700, Eric Dumazet wrote:
> > On Mon, Jul 27, 2020 at 10:35 AM Jonathan Lemon
> > <jonathan.lemon@gmail.com> wrote:
> > >
> > > On Mon, Jul 27, 2020 at 09:09:48AM -0700, Eric Dumazet wrote:
> > > > On Mon, Jul 27, 2020 at 8:56 AM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
> > > > >
> > > > > On Mon, Jul 27, 2020 at 08:19:43AM -0700, Eric Dumazet wrote:
> > > > > > On Mon, Jul 27, 2020 at 12:51 AM Jonathan Lemon
> > > > > > <jonathan.lemon@gmail.com> wrote:
> > > > > > >
> > > > > > > This flag indicates that the attached data is a zero-copy send,
> > > > > > > and the pages should be retrieved from the netgpu module.  The
> > > > > > > socket should should already have been attached to a netgpu queue.
> > > > > > >
> > > > > > > Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> > > > > > > ---
> > > > > > >  include/linux/socket.h | 1 +
> > > > > > >  net/ipv4/tcp.c         | 8 ++++++++
> > > > > > >  2 files changed, 9 insertions(+)
> > > > > > >
> > > > > > > diff --git a/include/linux/socket.h b/include/linux/socket.h
> > > > > > > index 04d2bc97f497..63816cc25dee 100644
> > > > > > > --- a/include/linux/socket.h
> > > > > > > +++ b/include/linux/socket.h
> > > > > > > @@ -310,6 +310,7 @@ struct ucred {
> > > > > > >                                           */
> > > > > > >
> > > > > > >  #define MSG_ZEROCOPY   0x4000000       /* Use user data in kernel path */
> > > > > > > +#define MSG_NETDMA     0x8000000
> > > > > > >  #define MSG_FASTOPEN   0x20000000      /* Send data in TCP SYN */
> > > > > > >  #define MSG_CMSG_CLOEXEC 0x40000000    /* Set close_on_exec for file
> > > > > > >                                            descriptor received through
> > > > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > > > > index 261c28ccc8f6..340ce319edc9 100644
> > > > > > > --- a/net/ipv4/tcp.c
> > > > > > > +++ b/net/ipv4/tcp.c
> > > > > > > @@ -1214,6 +1214,14 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > > > > > >                         uarg->zerocopy = 0;
> > > > > > >         }
> > > > > > >
> > > > > > > +       if (flags & MSG_NETDMA && size && sock_flag(sk, SOCK_ZEROCOPY)) {
> > > > > > > +               zc = sk->sk_route_caps & NETIF_F_SG;
> > > > > > > +               if (!zc) {
> > > > > > > +                       err = -EFAULT;
> > > > > > > +                       goto out_err;
> > > > > > > +               }
> > > > > > > +       }
> > > > > > >
> > > > > >
> > > > > > Sorry, no, we can not allow adding yet another branch into TCP fast
> > > > > > path for yet another variant of zero copy.
> > > > >
> > > > > I'm not in disagreement with that statement, but the existing zerocopy
> > > > > work makes some assumptions that aren't suitable.  I take it that you'd
> > > > > rather have things folded together so the old/new code works together?
> > > >
> > > > Exact.  Forcing users to use MSG_NETDMA, yet reusing SOCK_ZEROCOPY is silly.
> > > >
> > > > SOCK_ZEROCOPY has been added to that user space and kernel would agree
> > > > on MSG_ZEROCOPY being not a nop (as it was on old kernels)
> > > >
> > > > >
> > > > > Allocating an extra structure for every skbuff isn't ideal in my book.
> > > > >
> > > >
> > > > We do not allocate a structure for every skbuff. Please look again.
> > >
> > > I'm looking here:
> > >
> > >     uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
> > >
> > > Doesn't sock_zerocopy_realloc() allocate a new structure if the skb
> > > doesn't have one already?
> > >
> > >
> > > > > > Overall, I think your patch series desperately tries to add changes in
> > > > > > TCP stack, while there is yet no proof
> > > > > > that you have to use TCP transport between the peers.
> > > > >
> > > > > The goal is having a reliable transport without resorting to RDMA.
> > > >
> > > > And why should it be TCP ?
> > > >
> > > > Are you dealing with lost packets, retransmits, timers, and al  ?
> > >
> > > Yes?  If there was a true lossless medium, RDMA would have taken over by
> > > now.  Or are you suggesting that the transport protocol reliability
> > > should be performed in userspace?  (not all the world is QUIC yet)
> > >
> >
> > The thing is : this patch series is a monster thing adding stuff that
> > is going to impact 100% % of TCP flows,
> > even if not used in this NETDMA context.
> >
> > So you need to convince us you are really desperate to get this in
> > upstream linux.
> >
> > I have implemented TCP RX zero copy without adding a single line in
> > standard TCP code.
>
> That's a bit of an exaggeration, as I see skb_zcopy_*() calls scattered
> around the normal TCP code path.  I also haven't changed the normal TCP
> path either, other than doing some of the same things as skb_zcopy_*().
> (ignoring the ugly moron about padding out the TCP header, which I'll
> put under a static_branch_unlikely).


You are mixing TX zerocopy and RX zero copy.  Quite different things.

My claim was about TCP RX zero copy (aka tcp_zerocopy_receive())


>
> The thing is, the existing zero copy code is zero-copy to /host/ memory,
> which is not the same thing as zero-copy to other memory areas.

You have to really explain what difference it makes, and why current
stuff can not be extended.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg()
  2020-07-28  2:17               ` Eric Dumazet
@ 2020-07-28  3:08                 ` Jonathan Lemon
  2020-07-28  6:50                 ` Christoph Hellwig
  1 sibling, 0 replies; 33+ messages in thread
From: Jonathan Lemon @ 2020-07-28  3:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, kernel-team, Christoph Hellwig, Robin Murphy,
	Andrew Morton, David Miller, Jakub Kicinski, Willem de Bruijn,
	Steffen Klassert, Saeed Mahameed, Maxim Mikityanskiy,
	bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 07:17:51PM -0700, Eric Dumazet wrote:
> On Mon, Jul 27, 2020 at 7:11 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
> >
> > On Mon, Jul 27, 2020 at 10:44:59AM -0700, Eric Dumazet wrote:
> > > On Mon, Jul 27, 2020 at 10:35 AM Jonathan Lemon
> > > <jonathan.lemon@gmail.com> wrote:
> > > >
> > > > On Mon, Jul 27, 2020 at 09:09:48AM -0700, Eric Dumazet wrote:
> > > > > On Mon, Jul 27, 2020 at 8:56 AM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
> > > > > >
> > > > > > On Mon, Jul 27, 2020 at 08:19:43AM -0700, Eric Dumazet wrote:
> > > > > > > On Mon, Jul 27, 2020 at 12:51 AM Jonathan Lemon
> > > > > > > <jonathan.lemon@gmail.com> wrote:
> > > > > > > >
> > > > > > > > This flag indicates that the attached data is a zero-copy send,
> > > > > > > > and the pages should be retrieved from the netgpu module.  The
> > > > > > > > socket should should already have been attached to a netgpu queue.
> > > > > > > >
> > > > > > > > Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> > > > > > > > ---
> > > > > > > >  include/linux/socket.h | 1 +
> > > > > > > >  net/ipv4/tcp.c         | 8 ++++++++
> > > > > > > >  2 files changed, 9 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/include/linux/socket.h b/include/linux/socket.h
> > > > > > > > index 04d2bc97f497..63816cc25dee 100644
> > > > > > > > --- a/include/linux/socket.h
> > > > > > > > +++ b/include/linux/socket.h
> > > > > > > > @@ -310,6 +310,7 @@ struct ucred {
> > > > > > > >                                           */
> > > > > > > >
> > > > > > > >  #define MSG_ZEROCOPY   0x4000000       /* Use user data in kernel path */
> > > > > > > > +#define MSG_NETDMA     0x8000000
> > > > > > > >  #define MSG_FASTOPEN   0x20000000      /* Send data in TCP SYN */
> > > > > > > >  #define MSG_CMSG_CLOEXEC 0x40000000    /* Set close_on_exec for file
> > > > > > > >                                            descriptor received through
> > > > > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > > > > > index 261c28ccc8f6..340ce319edc9 100644
> > > > > > > > --- a/net/ipv4/tcp.c
> > > > > > > > +++ b/net/ipv4/tcp.c
> > > > > > > > @@ -1214,6 +1214,14 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > > > > > > >                         uarg->zerocopy = 0;
> > > > > > > >         }
> > > > > > > >
> > > > > > > > +       if (flags & MSG_NETDMA && size && sock_flag(sk, SOCK_ZEROCOPY)) {
> > > > > > > > +               zc = sk->sk_route_caps & NETIF_F_SG;
> > > > > > > > +               if (!zc) {
> > > > > > > > +                       err = -EFAULT;
> > > > > > > > +                       goto out_err;
> > > > > > > > +               }
> > > > > > > > +       }
> > > > > > > >
> > > > > > >
> > > > > > > Sorry, no, we can not allow adding yet another branch into TCP fast
> > > > > > > path for yet another variant of zero copy.
> > > > > >
> > > > > > I'm not in disagreement with that statement, but the existing zerocopy
> > > > > > work makes some assumptions that aren't suitable.  I take it that you'd
> > > > > > rather have things folded together so the old/new code works together?
> > > > >
> > > > > Exact.  Forcing users to use MSG_NETDMA, yet reusing SOCK_ZEROCOPY is silly.
> > > > >
> > > > > SOCK_ZEROCOPY has been added to that user space and kernel would agree
> > > > > on MSG_ZEROCOPY being not a nop (as it was on old kernels)
> > > > >
> > > > > >
> > > > > > Allocating an extra structure for every skbuff isn't ideal in my book.
> > > > > >
> > > > >
> > > > > We do not allocate a structure for every skbuff. Please look again.
> > > >
> > > > I'm looking here:
> > > >
> > > >     uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
> > > >
> > > > Doesn't sock_zerocopy_realloc() allocate a new structure if the skb
> > > > doesn't have one already?
> > > >
> > > >
> > > > > > > Overall, I think your patch series desperately tries to add changes in
> > > > > > > TCP stack, while there is yet no proof
> > > > > > > that you have to use TCP transport between the peers.
> > > > > >
> > > > > > The goal is having a reliable transport without resorting to RDMA.
> > > > >
> > > > > And why should it be TCP ?
> > > > >
> > > > > Are you dealing with lost packets, retransmits, timers, and al  ?
> > > >
> > > > Yes?  If there was a true lossless medium, RDMA would have taken over by
> > > > now.  Or are you suggesting that the transport protocol reliability
> > > > should be performed in userspace?  (not all the world is QUIC yet)
> > > >
> > >
> > > The thing is : this patch series is a monster thing adding stuff that
> > > is going to impact 100% % of TCP flows,
> > > even if not used in this NETDMA context.
> > >
> > > So you need to convince us you are really desperate to get this in
> > > upstream linux.
> > >
> > > I have implemented TCP RX zero copy without adding a single line in
> > > standard TCP code.
> >
> > That's a bit of an exaggeration, as I see skb_zcopy_*() calls scattered
> > around the normal TCP code path.  I also haven't changed the normal TCP
> > path either, other than doing some of the same things as skb_zcopy_*().
> > (ignoring the ugly moron about padding out the TCP header, which I'll
> > put under a static_branch_unlikely).
> 
> You are mixing TX zerocopy and RX zero copy.  Quite different things.
> 
> My claim was about TCP RX zero copy (aka tcp_zerocopy_receive())

I understand that (as I'm implementing both sides).  My equivalent of 
tcp_zerocopy_receive is netgpu_recv_skb().

In my variant, I don't actually have a real page, it's just a
placeholder for GPU memory.  So the device driver must obtain the
destination page [dma address] from the netgpu module, and either
deliver it to the user, or return it back to netgpu.

The zc_netdma special bit is there to handle the case where the skb is
freed back to the system - suppose the socket is closed with buffers
sitting on the RX queue.  The existing zero-copy code does not need this
because the RX buffers came from system memory in the first place.


> > The thing is, the existing zero copy code is zero-copy to /host/ memory,
> > which is not the same thing as zero-copy to other memory areas.
> 
> You have to really explain what difference it makes, and why current
> stuff can not be extended.

For RX, the ability to return the 'pages' back to the originator.  For
TX, making sure the pages are not touched by the host, as this would be
an immediate kernel panic.  For example, skb_copy_ubufs() is verboten.

The netgpu and the existing zerocopy code can likely be merged with some
additional work, but they still have different requirements, and
different user APIs.
-- 
Jonathan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-28  1:48         ` Jonathan Lemon
@ 2020-07-28  6:47           ` Christoph Hellwig
  2020-07-28 16:05             ` Jonathan Lemon
  2020-07-28 18:19           ` Jason Gunthorpe
  1 sibling, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2020-07-28  6:47 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: Christoph Hellwig, netdev, kernel-team, robin.murphy, akpm,
	davem, kuba, willemb, edumazet, steffen.klassert, saeedm,
	maximmi, bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 06:48:12PM -0700, Jonathan Lemon wrote:
> I'm aware that Nvidia code is maintained outside the tree, so this last
> patch may better placed there; I included it here so reviewers can see
> how things work.

Sorry dude, but with statements like this you really disqualify yourself
from kernel work.  Please really just go away and stop this crap.  It's
not just your attitude, but also the resulting crap code.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg()
  2020-07-28  2:17               ` Eric Dumazet
  2020-07-28  3:08                 ` Jonathan Lemon
@ 2020-07-28  6:50                 ` Christoph Hellwig
  1 sibling, 0 replies; 33+ messages in thread
From: Christoph Hellwig @ 2020-07-28  6:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jonathan Lemon, netdev, kernel-team, Christoph Hellwig,
	Robin Murphy, Andrew Morton, David Miller, Jakub Kicinski,
	Willem de Bruijn, Steffen Klassert, Saeed Mahameed,
	Maxim Mikityanskiy, bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 07:17:51PM -0700, Eric Dumazet wrote:
> > The thing is, the existing zero copy code is zero-copy to /host/ memory,
> > which is not the same thing as zero-copy to other memory areas.
> 
> You have to really explain what difference it makes, and why current
> stuff can not be extended.

There basically is none.  You need to call a different dma mapping
routine and make sure to never access the device pages.  See
drivers/pci/p2pdma.c and its users for an example on how to do it
properly.  But the author wants all his crazy hacks to hook into his
proprietary crap, so everyone should be ignore this horrible series.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-28  6:47           ` Christoph Hellwig
@ 2020-07-28 16:05             ` Jonathan Lemon
  2020-07-28 16:10               ` Christoph Hellwig
  0 siblings, 1 reply; 33+ messages in thread
From: Jonathan Lemon @ 2020-07-28 16:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: netdev, kernel-team, robin.murphy, akpm, davem, kuba, willemb,
	edumazet, steffen.klassert, saeedm, maximmi, bjorn.topel,
	magnus.karlsson, borisp, david

On Tue, Jul 28, 2020 at 08:47:06AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 27, 2020 at 06:48:12PM -0700, Jonathan Lemon wrote:
> > I'm aware that Nvidia code is maintained outside the tree, so this last
> > patch may better placed there; I included it here so reviewers can see
> > how things work.
> 
> Sorry dude, but with statements like this you really disqualify yourself
> from kernel work.  Please really just go away and stop this crap.  It's
> not just your attitude, but also the resulting crap code.

The attitude appears to be on your part, as apparently you have no
technical comments to contribute here.
-- 
Jonathan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-28 16:05             ` Jonathan Lemon
@ 2020-07-28 16:10               ` Christoph Hellwig
  0 siblings, 0 replies; 33+ messages in thread
From: Christoph Hellwig @ 2020-07-28 16:10 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: Christoph Hellwig, netdev, kernel-team, robin.murphy, akpm,
	davem, kuba, willemb, edumazet, steffen.klassert, saeedm,
	maximmi, bjorn.topel, magnus.karlsson, borisp, david

On Tue, Jul 28, 2020 at 09:05:08AM -0700, Jonathan Lemon wrote:
> On Tue, Jul 28, 2020 at 08:47:06AM +0200, Christoph Hellwig wrote:
> > On Mon, Jul 27, 2020 at 06:48:12PM -0700, Jonathan Lemon wrote:
> > > I'm aware that Nvidia code is maintained outside the tree, so this last
> > > patch may better placed there; I included it here so reviewers can see
> > > how things work.
> > 
> > Sorry dude, but with statements like this you really disqualify yourself
> > from kernel work.  Please really just go away and stop this crap.  It's
> > not just your attitude, but also the resulting crap code.
> 
> The attitude appears to be on your part, as apparently you have no
> technical comments to contribute here.

You are submitting a series just to support a proprietary driver, and
that is disqualification enough on policy grounds.  And now just stop
the crap.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-28  1:48         ` Jonathan Lemon
  2020-07-28  6:47           ` Christoph Hellwig
@ 2020-07-28 18:19           ` Jason Gunthorpe
  2020-07-28 21:01             ` Jonathan Lemon
  1 sibling, 1 reply; 33+ messages in thread
From: Jason Gunthorpe @ 2020-07-28 18:19 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: Christoph Hellwig, netdev, kernel-team, robin.murphy, akpm,
	davem, kuba, willemb, edumazet, steffen.klassert, saeedm,
	maximmi, bjorn.topel, magnus.karlsson, borisp, david

On Mon, Jul 27, 2020 at 06:48:12PM -0700, Jonathan Lemon wrote:

> While the current GPU utilized is nvidia, there's nothing in the rest of
> the patches specific to Nvidia - an Intel or AMD GPU interface could be
> equally workable.

I think that is very misleading.

It looks like this patch, and all the ugly MM stuff, is done the way
it is *specifically* to match the clunky nv_p2p interface that only
the NVIDIA driver exposes.

Any approach done in tree, where we can actually modify the GPU
driver, would do sane things like have the GPU driver itself create
the MEMORY_DEVICE_PCI_P2PDMA pages, use the P2P DMA API framework, use
dmabuf for the cross-driver attachment, etc, etc.

Of course none of this is possible with a proprietary driver. I could
give you such detailed feedback but it is completely wasted since you
can't actually use it or implement it.

Which is why the prohibition on building APIs to support out of tree
modules exists - we can't do a good job if our hands are tied by being
unable to change something.

This is really a textbook example of why this is the correct
philosophy.

If you are serious about advancing this then the initial patches in a
long road must be focused on building up the core kernel
infrastructure for P2P DMA to a point where netdev could consume
it. There has been a lot of different ideas thrown about on how to do
this over the years.

As you've seen, posting patches so tightly coupled to the NVIDIA GPU
implementation just makes people angry, I also advise against doing it
any further.

> I think this is a better patch than all the various implementations of
> the protocol stack in the form of RDMA, driver code and device firmware.

Oh? You mean "better" in the sense the header split offload in the NIC
is better liked than a full protocol running in the NIC?

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-28 18:19           ` Jason Gunthorpe
@ 2020-07-28 21:01             ` Jonathan Lemon
  2020-07-28 21:14               ` Christoph Hellwig
  2020-07-28 23:38               ` Jason Gunthorpe
  0 siblings, 2 replies; 33+ messages in thread
From: Jonathan Lemon @ 2020-07-28 21:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, netdev, kernel-team, robin.murphy, akpm,
	davem, kuba, willemb, edumazet, steffen.klassert, saeedm,
	maximmi, bjorn.topel, magnus.karlsson, borisp, david

On Tue, Jul 28, 2020 at 03:19:04PM -0300, Jason Gunthorpe wrote:
> On Mon, Jul 27, 2020 at 06:48:12PM -0700, Jonathan Lemon wrote:
> 
> > While the current GPU utilized is nvidia, there's nothing in the rest of
> > the patches specific to Nvidia - an Intel or AMD GPU interface could be
> > equally workable.
> 
> I think that is very misleading.
> 
> It looks like this patch, and all the ugly MM stuff, is done the way
> it is *specifically* to match the clunky nv_p2p interface that only
> the NVIDIA driver exposes.

For /this/ patch [21], this is quite true.  I'm forced to use the nv_p2p
API if I want to use the hardware that I have.  What's being overlooked
is that the host mem driver does not do this, nor would another GPU
if it used p2p_dma.  I'm just providing get_page, put_page, get_dma.


> Any approach done in tree, where we can actually modify the GPU
> driver, would do sane things like have the GPU driver itself create
> the MEMORY_DEVICE_PCI_P2PDMA pages, use the P2P DMA API framework, use
> dmabuf for the cross-driver attachment, etc, etc.

So why doesn't Nvidia implement the above in the driver?
Actually a serious question, not trolling here.


> If you are serious about advancing this then the initial patches in a
> long road must be focused on building up the core kernel
> infrastructure for P2P DMA to a point where netdev could consume
> it. There has been a lot of different ideas thrown about on how to do
> this over the years.

Yes, I'm serious about doing this work, and may not have seen or
remember all the various ideas I've seen over time.  The netstack
operates on pages - are you advocating replacing them with sglists?


> > I think this is a better patch than all the various implementations of
> > the protocol stack in the form of RDMA, driver code and device firmware.
> 
> Oh? You mean "better" in the sense the header split offload in the NIC
> is better liked than a full protocol running in the NIC?

Yes.  The NIC firmware should become simpler, not more complicated.
-- 
Jonathan


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-28 21:01             ` Jonathan Lemon
@ 2020-07-28 21:14               ` Christoph Hellwig
  2020-07-28 23:38               ` Jason Gunthorpe
  1 sibling, 0 replies; 33+ messages in thread
From: Christoph Hellwig @ 2020-07-28 21:14 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: Jason Gunthorpe, Christoph Hellwig, netdev, kernel-team,
	robin.murphy, akpm, davem, kuba, willemb, edumazet,
	steffen.klassert, saeedm, maximmi, bjorn.topel, magnus.karlsson,
	borisp, david

On Tue, Jul 28, 2020 at 02:01:16PM -0700, Jonathan Lemon wrote:
> For /this/ patch [21], this is quite true.  I'm forced to use the nv_p2p
> API if I want to use the hardware that I have.  What's being overlooked
> is that the host mem driver does not do this, nor would another GPU
> if it used p2p_dma.  I'm just providing get_page, put_page, get_dma.

No, that is not overlooked.   What you overlooked is that your design
is bogus.  We don't do pointless plugings for something where there
is exactly one way to do, which is done in common infrastructure.

> 
> 
> > Any approach done in tree, where we can actually modify the GPU
> > driver, would do sane things like have the GPU driver itself create
> > the MEMORY_DEVICE_PCI_P2PDMA pages, use the P2P DMA API framework, use
> > dmabuf for the cross-driver attachment, etc, etc.
> 
> So why doesn't Nvidia implement the above in the driver?
> Actually a serious question, not trolling here.

No one knows, and most importantly it simply does not matter at all.

If you actually want to be productive contributor and not just troll
(as it absoutely seems) you'd just stop talking about a out of tree
driver that can't actually be distributed together with the kernel.

And better forget everything you ever knew about it, because it not
only is irrelevant but actually harmful.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-28 21:01             ` Jonathan Lemon
  2020-07-28 21:14               ` Christoph Hellwig
@ 2020-07-28 23:38               ` Jason Gunthorpe
  1 sibling, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2020-07-28 23:38 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: Christoph Hellwig, netdev, kernel-team, robin.murphy, akpm,
	davem, kuba, willemb, edumazet, steffen.klassert, saeedm,
	maximmi, bjorn.topel, magnus.karlsson, borisp, david

On Tue, Jul 28, 2020 at 02:01:16PM -0700, Jonathan Lemon wrote:
> On Tue, Jul 28, 2020 at 03:19:04PM -0300, Jason Gunthorpe wrote:
> > On Mon, Jul 27, 2020 at 06:48:12PM -0700, Jonathan Lemon wrote:
> > 
> > > While the current GPU utilized is nvidia, there's nothing in the rest of
> > > the patches specific to Nvidia - an Intel or AMD GPU interface could be
> > > equally workable.
> > 
> > I think that is very misleading.
> > 
> > It looks like this patch, and all the ugly MM stuff, is done the way
> > it is *specifically* to match the clunky nv_p2p interface that only
> > the NVIDIA driver exposes.
> 
> For /this/ patch [21], this is quite true.  I'm forced to use the nv_p2p
> API if I want to use the hardware that I have.  What's being overlooked
> is that the host mem driver does not do this, nor would another GPU
> if it used p2p_dma.  I'm just providing get_page, put_page, get_dma.

Not really, the design copied the nv_p2p api design directly into
struct netgpu_functions and then aligned the rest of the parts to use
it too. Yes, other GPU drivers could also be squeezed into this API,
but if you'd never looked at the NVIDIA driver you'd never pick such a
design. It is inherently disconnected from the MM.

> > Any approach done in tree, where we can actually modify the GPU
> > driver, would do sane things like have the GPU driver itself create
> > the MEMORY_DEVICE_PCI_P2PDMA pages, use the P2P DMA API framework, use
> > dmabuf for the cross-driver attachment, etc, etc.
> 
> So why doesn't Nvidia implement the above in the driver?
> Actually a serious question, not trolling here.

A kernel mailing list is not appropriate place to discuss a
proprietary out of tree driver, take questions like that with your
support channel.

> > If you are serious about advancing this then the initial patches in a
> > long road must be focused on building up the core kernel
> > infrastructure for P2P DMA to a point where netdev could consume
> > it. There has been a lot of different ideas thrown about on how to do
> > this over the years.
> 
> Yes, I'm serious about doing this work, and may not have seen or
> remember all the various ideas I've seen over time.  The netstack
> operates on pages - are you advocating replacing them with sglists?

So far, the general expectation is that any pages would be ZONE_DEVICE
MEMORY_DEVICE_PCI_P2PDMA pages created by the PCI device's driver to
cover the device's BAR. These are __iomem pages so they can't be
intermixed in the kernel with system memory pages. That detail has
a been a large stumbling block in most cases.

Resolving this design issue removes most of the MM hackery in the
netgpu. Though, I have no idea if you can intermix ZONE_DEVICE pages
into skb's in the net stack.

From there, I'd expect the pages are mmaped into userspace in a VMA or
passed into a userspace dmabuf FD.

At this point, consumers, like the net stack should rely on some core
APIs to extract the pages and DMA maps from the user objects. Either
some new pin_user_pages() variant for VMAs or via the work that was
started on the dma_buf_map_attachment() for dmabuf.

From there it needs to handle everything carefully and then call over
to the pcip2p code to validate and dma map them. There are many
missing little details along this path.

Overall there should be nothing like 'netgpu'. This is all just some
special case of an existing user memory flow where the user pages
being touched are allowed to be P2P pages or system memory and the
flow knows how to deal with the difference.

More or less. All of these touch points need beefing up and additional
features.

> > > I think this is a better patch than all the various implementations of
> > > the protocol stack in the form of RDMA, driver code and device firmware.
> > 
> > Oh? You mean "better" in the sense the header split offload in the NIC
> > is better liked than a full protocol running in the NIC?
> 
> Yes.  The NIC firmware should become simpler, not more complicated.

Do you have any application benchmarks? The typical AI communication
pattern is very challenging and a state of the art RDMA implementation
gets incredible performance.

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-28 17:27       ` Christoph Hellwig
@ 2020-07-28 18:47         ` Chris Mason
  0 siblings, 0 replies; 33+ messages in thread
From: Chris Mason @ 2020-07-28 18:47 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Greg KH, Jonathan Lemon, netdev, kernel-team

On 28 Jul 2020, at 13:27, Christoph Hellwig wrote:

> On Tue, Jul 28, 2020 at 01:18:48PM -0400, Chris Mason wrote:
>>> come after in the future.
>>
>> Jonathan, I think we need to do a better job talking about patches 
>> that are
>> just meant to enable possible users vs patches that we actually hope 
>> the
>> upstream kernel to take.  Obviously code that only supports out of 
>> tree
>> drivers isn???t a good fit for the upstream kernel.  From the point 
>> of view
>> of experimenting with these patches, GPUs benefit a lot from this
>> functionality so I think it does make sense to have the enabling 
>> patches
>> somewhere, just not in this series.
>
> Sorry, but his crap is built only for this use case, and that is what
> really pissed people off as it very much looks intentional.

No, we’ve had workloads asking for better zero copy solutions for 
ages.  The goal is to address both this specialized workload and the 
general case zero copy tx/rx.

-chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-28 17:18     ` Chris Mason
@ 2020-07-28 17:27       ` Christoph Hellwig
  2020-07-28 18:47         ` Chris Mason
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2020-07-28 17:27 UTC (permalink / raw)
  To: Chris Mason; +Cc: Greg KH, Jonathan Lemon, netdev, kernel-team

On Tue, Jul 28, 2020 at 01:18:48PM -0400, Chris Mason wrote:
> > come after in the future.
> 
> Jonathan, I think we need to do a better job talking about patches that are
> just meant to enable possible users vs patches that we actually hope the
> upstream kernel to take.  Obviously code that only supports out of tree
> drivers isn???t a good fit for the upstream kernel.  From the point of view
> of experimenting with these patches, GPUs benefit a lot from this
> functionality so I think it does make sense to have the enabling patches
> somewhere, just not in this series.

Sorry, but his crap is built only for this use case, and that is what
really pissed people off as it very much looks intentional.

> We???re finding it more common to have pcie switch hops between a [ GPU, NIC
> ] pair and the CPU, which gives a huge advantage to out of tree drivers or
> extensions that can DMA directly between the GPU/NIC without having to copy
> through the CPU.  I???d love to have an alternative built on TCP because
> that???s where we invest the vast majority of our tuning, security and
> interoperability testing.  It???s just more predictable overall.
> 
> This isn???t a new story, but if we can layer on APIs that enable this
> cleanly for in-tree drivers, we can work with the vendors to use better
> supported APIs and have a more stable kernel.  Obviously this is an RFC and
> there???s a long road ahead, but as long as the upstream kernel doesn???t
> provide an answer, out of tree drivers are going to fill in the weak spots.
> 
> Other possible use cases would include also include other GPUs or my
> favorite:
> 
> NVME <-> filesystem <-> NIC with io_uring driving the IO and without copies.

And we have all that working with the existing p2pdma infrastructure (at
least if you're using RDMA insted of badly reinventing it, but it could
be added to other users easily).

That infrastructure is EXPORT_SYMBOL_GPL as it should be for
infrastructure like that, and a lot of his crap just seems to be because
he's working around that.

So I really agree with Gred, this very much looks like a deliberate
trolling attempt.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-28 16:31   ` Greg KH
@ 2020-07-28 17:18     ` Chris Mason
  2020-07-28 17:27       ` Christoph Hellwig
  0 siblings, 1 reply; 33+ messages in thread
From: Chris Mason @ 2020-07-28 17:18 UTC (permalink / raw)
  To: Greg KH; +Cc: Jonathan Lemon, netdev, kernel-team

On 28 Jul 2020, at 12:31, Greg KH wrote:

> On Mon, Jul 27, 2020 at 03:44:44PM -0700, Jonathan Lemon wrote:
>> From: Jonathan Lemon <bsd@fb.com>
>>
>> This provides the interface between the netgpu core module and the
>> nvidia kernel driver.  This should be built as an external module,
>> pointing to the nvidia build.  For example:
>>
>> export NV_PACKAGE_DIR=/w/nvidia/NVIDIA-Linux-x86_64-440.64
>> make -C ${kdir} M=`pwd` O=obj $*
>
> Ok, now you are just trolling us.
>
> Nice job, I shouldn't have read the previous patches.
>
> Please, go get a lawyer to sign-off on this patch, with their 
> corporate
> email address on it.  That's the only way we could possibly consider
> something like this.
>
> Oh, and we need you to use your corporate email address too, as you 
> are
> not putting copyright notices on this code, we will need to know who 
> to
> come after in the future.

Jonathan, I think we need to do a better job talking about patches that 
are just meant to enable possible users vs patches that we actually hope 
the upstream kernel to take.  Obviously code that only supports out of 
tree drivers isn’t a good fit for the upstream kernel.  From the point 
of view of experimenting with these patches, GPUs benefit a lot from 
this functionality so I think it does make sense to have the enabling 
patches somewhere, just not in this series.

We’re finding it more common to have pcie switch hops between a [ GPU, 
NIC ] pair and the CPU, which gives a huge advantage to out of tree 
drivers or extensions that can DMA directly between the GPU/NIC without 
having to copy through the CPU.  I’d love to have an alternative built 
on TCP because that’s where we invest the vast majority of our tuning, 
security and interoperability testing.  It’s just more predictable 
overall.

This isn’t a new story, but if we can layer on APIs that enable this 
cleanly for in-tree drivers, we can work with the vendors to use better 
supported APIs and have a more stable kernel.  Obviously this is an RFC 
and there’s a long road ahead, but as long as the upstream kernel 
doesn’t provide an answer, out of tree drivers are going to fill in 
the weak spots.

Other possible use cases would include also include other GPUs or my 
favorite:

NVME <-> filesystem <-> NIC with io_uring driving the IO and without 
copies.

-chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-27 22:44 ` [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu Jonathan Lemon
@ 2020-07-28 16:31   ` Greg KH
  2020-07-28 17:18     ` Chris Mason
  0 siblings, 1 reply; 33+ messages in thread
From: Greg KH @ 2020-07-28 16:31 UTC (permalink / raw)
  To: Jonathan Lemon; +Cc: netdev, kernel-team

On Mon, Jul 27, 2020 at 03:44:44PM -0700, Jonathan Lemon wrote:
> From: Jonathan Lemon <bsd@fb.com>
> 
> This provides the interface between the netgpu core module and the
> nvidia kernel driver.  This should be built as an external module,
> pointing to the nvidia build.  For example:
> 
> export NV_PACKAGE_DIR=/w/nvidia/NVIDIA-Linux-x86_64-440.64
> make -C ${kdir} M=`pwd` O=obj $*

Ok, now you are just trolling us.

Nice job, I shouldn't have read the previous patches.

Please, go get a lawyer to sign-off on this patch, with their corporate
email address on it.  That's the only way we could possibly consider
something like this.

Oh, and we need you to use your corporate email address too, as you are
not putting copyright notices on this code, we will need to know who to
come after in the future.

greg k-h

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
@ 2020-07-27 22:44 ` Jonathan Lemon
  2020-07-28 16:31   ` Greg KH
  0 siblings, 1 reply; 33+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

From: Jonathan Lemon <bsd@fb.com>

This provides the interface between the netgpu core module and the
nvidia kernel driver.  This should be built as an external module,
pointing to the nvidia build.  For example:

export NV_PACKAGE_DIR=/w/nvidia/NVIDIA-Linux-x86_64-440.64
make -C ${kdir} M=`pwd` O=obj $*

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 drivers/misc/netgpu/nvidia/Kbuild        |   9 +
 drivers/misc/netgpu/nvidia/Kconfig       |  10 +
 drivers/misc/netgpu/nvidia/netgpu_cuda.c | 416 +++++++++++++++++++++++
 3 files changed, 435 insertions(+)
 create mode 100644 drivers/misc/netgpu/nvidia/Kbuild
 create mode 100644 drivers/misc/netgpu/nvidia/Kconfig
 create mode 100644 drivers/misc/netgpu/nvidia/netgpu_cuda.c

diff --git a/drivers/misc/netgpu/nvidia/Kbuild b/drivers/misc/netgpu/nvidia/Kbuild
new file mode 100644
index 000000000000..10a3b3156f30
--- /dev/null
+++ b/drivers/misc/netgpu/nvidia/Kbuild
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+nv_dir = $(NV_PACKAGE_DIR)/kernel
+
+KBUILD_EXTRA_SYMBOLS = $(nv_dir)/Module.symvers
+
+obj-m := netgpu_cuda.o
+
+ccflags-y += -I$(nv_dir)
diff --git a/drivers/misc/netgpu/nvidia/Kconfig b/drivers/misc/netgpu/nvidia/Kconfig
new file mode 100644
index 000000000000..6bb8be158943
--- /dev/null
+++ b/drivers/misc/netgpu/nvidia/Kconfig
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# NetGPU framework
+#
+
+config NETGPU_CUDA
+	tristate "Network/GPU driver for Nvidia"
+	depends on NETGPU && m
+	help
+	  Experimental Network / GPU driver for Nvidia
diff --git a/drivers/misc/netgpu/nvidia/netgpu_cuda.c b/drivers/misc/netgpu/nvidia/netgpu_cuda.c
new file mode 100644
index 000000000000..2cd93dab52ad
--- /dev/null
+++ b/drivers/misc/netgpu/nvidia/netgpu_cuda.c
@@ -0,0 +1,416 @@
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/uio.h>
+#include <linux/errno.h>
+#include <linux/netdevice.h>
+#include <linux/pci.h>
+#include <linux/memory.h>
+#include <linux/interval_tree.h>
+
+#include <net/netgpu.h>
+#include "../netgpu_priv.h"
+
+#include "nvidia/nv-p2p.h"
+
+/* nvidia GPU uses 64K pages */
+#define GPU_PAGE_SHIFT		16
+#define GPU_PAGE_SIZE		(1UL << GPU_PAGE_SHIFT)
+#define GPU_PAGE_MASK		(GPU_PAGE_SIZE - 1)
+
+struct netgpu_cuda_region {
+	struct netgpu_region r;				/* must be first */
+	struct rb_root_cached root;
+	struct nvidia_p2p_page_table *gpu_pgtbl;
+};
+
+struct netgpu_cuda_dmamap {
+	struct netgpu_dmamap map;			/* must be first */
+	unsigned pg_shift;
+	unsigned long pg_mask;
+	u64 *dma;
+	struct nvidia_p2p_dma_mapping *gpu_map;
+};
+
+/* page_range represents one contiguous GPU PA region */
+struct netgpu_page_range {
+	unsigned long pfn;
+	struct resource *res;
+	struct interval_tree_node va_node;
+};
+
+static int nvidia_pg_shift[] = {
+	[NVIDIA_P2P_PAGE_SIZE_4KB]   = 12,
+	[NVIDIA_P2P_PAGE_SIZE_64KB]  = 16,
+	[NVIDIA_P2P_PAGE_SIZE_128KB] = 17,
+};
+
+#define node2page_range(itn) \
+	container_of(itn, struct netgpu_page_range, va_node)
+
+#define region_remove_each(root, first, last, itn)			\
+	while ((itn = interval_tree_iter_first(root, first, last)) &&	\
+	       (interval_tree_remove(itn, root), 1))
+
+#define cuda_region_remove_each(r, itn)					\
+	region_remove_each(&cuda_region(r)->root, r->start,		\
+			   r->start + (r->nr_pages << PAGE_SHIFT) - 1,	\
+			   itn)
+
+static inline struct netgpu_cuda_region *
+cuda_region(struct netgpu_region *r)
+{
+	return (struct netgpu_cuda_region *)r;
+}
+
+static inline struct netgpu_cuda_dmamap *
+cuda_map(struct netgpu_dmamap *map)
+{
+	return (struct netgpu_cuda_dmamap *)map;
+}
+
+static inline struct netgpu_page_range *
+region_find(struct netgpu_region *r, unsigned long start, int count)
+{
+	struct interval_tree_node *itn;
+	unsigned long last;
+
+	last = start + count * PAGE_SIZE - 1;
+
+	itn = interval_tree_iter_first(&cuda_region(r)->root, start, last);
+	return itn ? node2page_range(itn) : 0;
+}
+
+static dma_addr_t
+netgpu_cuda_get_dma(struct netgpu_dmamap *map, unsigned long addr)
+{
+	unsigned long base, idx;
+
+	base = addr - map->start;
+	idx = base >> cuda_map(map)->pg_shift;
+	return cuda_map(map)->dma[idx] + (base & cuda_map(map)->pg_mask);
+}
+
+static int
+netgpu_cuda_get_page(struct netgpu_dmamap *map, unsigned long addr,
+		     struct page **page, dma_addr_t *dma)
+{
+	struct netgpu_page_range *pr;
+	unsigned long idx;
+
+	pr = region_find(map->r, addr, 1);
+	if (!pr)
+		return -EFAULT;
+	idx = (addr - pr->va_node.start) >> PAGE_SHIFT;
+
+	*page = pfn_to_page(pr->pfn + idx);
+	get_page(*page);
+	*dma = netgpu_cuda_get_dma(map, addr);
+
+	return 0;
+}
+
+static void
+region_get_pages(struct page **pages, unsigned long pfn, int n)
+{
+	struct page *p;
+	int i;
+
+	for (i = 0; i < n; i++) {
+		p = pfn_to_page(pfn + i);
+		get_page(p);
+		pages[i] = p;
+	}
+}
+
+static int
+netgpu_cuda_get_pages(struct netgpu_region *r, struct page **pages,
+		      unsigned long addr, int count)
+{
+	struct netgpu_page_range *pr;
+	unsigned long idx, end;
+	int n;
+
+	pr = region_find(r, addr, count);
+	if (!pr)
+		return -EFAULT;
+
+	idx = (addr - pr->va_node.start) >> PAGE_SHIFT;
+	end = (pr->va_node.last - pr->va_node.start) >> PAGE_SHIFT;
+	n = end - idx + 1;
+	n = min(count, n);
+
+	region_get_pages(pages, pr->pfn + idx, n);
+
+	return n;
+}
+
+static void
+netgpu_cuda_unmap_region(struct netgpu_dmamap *map)
+{
+	struct pci_dev *pdev;
+	int err;
+
+	pdev = cuda_map(map)->gpu_map->pci_dev;
+
+	err = nvidia_p2p_dma_unmap_pages(pdev, cuda_region(map->r)->gpu_pgtbl,
+					 cuda_map(map)->gpu_map);
+	if (err)
+		pr_err("nvidia_p2p_dma_unmap failed: %d\n", err);
+}
+
+static struct netgpu_dmamap *
+netgpu_cuda_map_region(struct netgpu_region *r, struct device *device)
+{
+	struct netgpu_cuda_region *cr = cuda_region(r);
+	struct nvidia_p2p_dma_mapping *gpu_map;
+	struct netgpu_dmamap *map;
+	struct pci_dev *pdev;
+	int err;
+
+	map = kmalloc(sizeof(struct netgpu_cuda_dmamap), GFP_KERNEL);
+	if (!map)
+		return ERR_PTR(-ENOMEM);
+
+	pdev = to_pci_dev(device);
+
+	/*
+	 * takes PA from pgtbl, performs mapping, saves mapping
+	 * dma_mapping holds dma mapped addresses, and pdev.
+	 * mem_info contains pgtbl and mapping list.  mapping is added to list.
+	 * rm_p2p_dma_map_pages() does the work.
+	 */
+	err = nvidia_p2p_dma_map_pages(pdev, cr->gpu_pgtbl, &gpu_map);
+	if (err) {
+		kfree(map);
+		return ERR_PTR(err);
+	}
+
+	cuda_map(map)->gpu_map = gpu_map;
+	cuda_map(map)->dma = gpu_map->dma_addresses;
+	cuda_map(map)->pg_shift = nvidia_pg_shift[gpu_map->page_size_type];
+	cuda_map(map)->pg_mask = (1UL << cuda_map(map)->pg_shift) - 1;
+
+	return map;
+}
+
+static struct resource *
+netgpu_add_pages(int nid, u64 start, u64 end)
+{
+	struct mhp_params params = { .pgprot = PAGE_KERNEL };
+
+	return add_memory_pages(nid, start, end - start, &params);
+}
+
+static void
+netgpu_free_pages(struct resource *res)
+{
+	release_memory_pages(res);
+}
+
+static void
+netgpu_free_page_range(struct netgpu_page_range *pr)
+{
+	unsigned long pfn, pfn_end;
+	struct page *page;
+
+	pfn_end = pr->pfn +
+		  ((pr->va_node.last + 1 - pr->va_node.start) >> PAGE_SHIFT);
+
+	/* XXX verify page count is 2! */
+	for (pfn = pr->pfn; pfn < pfn_end; pfn++) {
+		page = pfn_to_page(pfn);
+		set_page_count(page, 0);
+	}
+	netgpu_free_pages(pr->res);
+	kfree(pr);
+}
+
+static void
+netgpu_cuda_release_pages(struct netgpu_region *r)
+{
+	struct interval_tree_node *va_node;
+
+	cuda_region_remove_each(r, va_node)
+		netgpu_free_page_range(node2page_range(va_node));
+}
+
+static void
+netgpu_init_pages(u64 va, unsigned long pfn_start, unsigned long pfn_end)
+{
+	unsigned long pfn;
+	struct page *page;
+
+	for (pfn = pfn_start; pfn < pfn_end; pfn++) {
+		page = pfn_to_page(pfn);
+		mm_zero_struct_page(page);
+
+		set_page_count(page, 2);	/* matches host logic */
+		page->page_type = 7;		/* XXX differential flag */
+		__SetPageReserved(page);
+
+		SetPagePrivate(page);
+		set_page_private(page, va);
+		va += PAGE_SIZE;
+	}
+}
+
+static int
+netgpu_add_page_range(struct netgpu_region *r, u64 va, u64 start, u64 end)
+{
+	struct netgpu_page_range *pr;
+	struct resource *res;
+
+	pr = kmalloc(sizeof(*pr), GFP_KERNEL);
+	if (!pr)
+		return -ENOMEM;
+
+	res = netgpu_add_pages(numa_mem_id(), start, end);
+	if (IS_ERR(res)) {
+		kfree(pr);
+		return PTR_ERR(res);
+	}
+
+	pr->pfn = PHYS_PFN(start);
+	pr->va_node.start = va;
+	pr->va_node.last = va + (end - start) - 1;
+	pr->res = res;
+
+	netgpu_init_pages(va, PHYS_PFN(start), PHYS_PFN(end));
+
+	interval_tree_insert(&pr->va_node, &cuda_region(r)->root);
+
+	return 0;
+}
+
+static void
+netgpu_cuda_pgtbl_cb(void *data)
+{
+	struct netgpu_region *r = data;
+
+	/* This is required - nvidia gets unhappy if the page table is
+	 * freed from the page table callback.
+	 */
+	cuda_region(r)->gpu_pgtbl = NULL;
+	netgpu_detach_region(r);
+}
+
+static struct netgpu_region *
+netgpu_cuda_add_region(struct netgpu_mem *mem, const struct iovec *iov)
+{
+	struct nvidia_p2p_page_table *gpu_pgtbl = NULL;
+	u64 va, pa, len, start, end;
+	struct netgpu_region *r;
+	int err, i, gpu_pgsize;
+
+	err = -ENOMEM;
+	r = kzalloc(sizeof(struct netgpu_cuda_region), GFP_KERNEL);
+	if (!r)
+		return ERR_PTR(err);
+
+	start = (u64)iov->iov_base;
+	r->start = round_down(start, GPU_PAGE_SIZE);
+	len = round_up(start - r->start + iov->iov_len, GPU_PAGE_SIZE);
+	r->nr_pages = len >> PAGE_SHIFT;
+
+	r->mem = mem;
+	INIT_LIST_HEAD(&r->ctx_list);
+	INIT_LIST_HEAD(&r->dma_list);
+	spin_lock_init(&r->lock);
+
+	/*
+	 * allocates page table, sets gpu_uuid to owning gpu.
+	 * allocates page array, set PA for each page.
+	 * sets page_size (64K here)
+	 * rm_p2p_get_pages() does the actual work.
+	 */
+	err = nvidia_p2p_get_pages(0, 0, r->start, len, &gpu_pgtbl,
+				   netgpu_cuda_pgtbl_cb, r);
+	if (err)
+		goto out;
+
+	/* gpu pgtbl owns r, will free via netgpu_cuda_pgtbl_cb */
+	cuda_region(r)->gpu_pgtbl = gpu_pgtbl;
+
+	if (!NVIDIA_P2P_PAGE_TABLE_VERSION_COMPATIBLE(gpu_pgtbl)) {
+		pr_err("incompatible page table\n");
+		err = -EINVAL;
+		goto out;
+	}
+
+	gpu_pgsize = 1UL << nvidia_pg_shift[gpu_pgtbl->page_size];
+	if (r->nr_pages != gpu_pgtbl->entries * gpu_pgsize / PAGE_SIZE) {
+		pr_err("GPU page count %ld != host page count %ld\n",
+		       gpu_pgtbl->entries * gpu_pgsize / PAGE_SIZE,
+		       r->nr_pages);
+		err = -EINVAL;
+		goto out;
+	}
+
+	start = U64_MAX;
+	end = 0;
+
+	for (i = 0; i < gpu_pgtbl->entries; i++) {
+		pa = gpu_pgtbl->pages[i]->physical_address;
+		if (pa != end) {
+			if (end) {
+				err = netgpu_add_page_range(r, va, start, end);
+				if (err)
+					goto out;
+			}
+			start = pa;
+			va = r->start + i * gpu_pgsize;
+		}
+		end = pa + gpu_pgsize;
+	}
+	err = netgpu_add_page_range(r, va, start, end);
+	if (err)
+		goto out;
+
+	return r;
+
+out:
+	netgpu_cuda_release_pages(r);
+	if (gpu_pgtbl)
+		nvidia_p2p_put_pages(0, 0, r->start, gpu_pgtbl);
+	kfree(r);
+
+	return ERR_PTR(err);
+}
+
+static void
+netgpu_cuda_free_region(struct netgpu_mem *mem, struct netgpu_region *r)
+{
+	netgpu_cuda_release_pages(r);
+	if (cuda_region(r)->gpu_pgtbl)
+		nvidia_p2p_put_pages(0, 0, r->start, cuda_region(r)->gpu_pgtbl);
+	kfree(r);
+}
+
+struct netgpu_ops cuda_ops = {
+	.owner		= THIS_MODULE,
+	.memtype	= NETGPU_MEMTYPE_CUDA,
+	.add_region	= netgpu_cuda_add_region,
+	.free_region	= netgpu_cuda_free_region,
+	.map_region	= netgpu_cuda_map_region,
+	.unmap_region	= netgpu_cuda_unmap_region,
+	.get_dma	= netgpu_cuda_get_dma,
+	.get_page	= netgpu_cuda_get_page,
+	.get_pages	= netgpu_cuda_get_pages,
+};
+
+static int __init
+netgpu_cuda_init(void)
+{
+	return netgpu_register(&cuda_ops);
+}
+
+static void __exit
+netgpu_cuda_fini(void)
+{
+	netgpu_unregister(cuda_ops.memtype);
+}
+
+module_init(netgpu_cuda_init);
+module_exit(netgpu_cuda_fini);
+MODULE_LICENSE("GPL v2");
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2020-07-28 23:38 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20200727052846.4070247-1-jonathan.lemon@gmail.com>
     [not found] ` <20200727052846.4070247-4-jonathan.lemon@gmail.com>
2020-07-27  5:58   ` [RFC PATCH v2 03/21] mm: Allow DMA mapping of pages which are not online Christoph Hellwig
2020-07-27 17:08     ` Jonathan Lemon
     [not found] ` <20200727052846.4070247-22-jonathan.lemon@gmail.com>
2020-07-27  7:35   ` [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu Christoph Hellwig
2020-07-27 17:00     ` Jonathan Lemon
2020-07-27 18:24       ` Christoph Hellwig
2020-07-28  1:48         ` Jonathan Lemon
2020-07-28  6:47           ` Christoph Hellwig
2020-07-28 16:05             ` Jonathan Lemon
2020-07-28 16:10               ` Christoph Hellwig
2020-07-28 18:19           ` Jason Gunthorpe
2020-07-28 21:01             ` Jonathan Lemon
2020-07-28 21:14               ` Christoph Hellwig
2020-07-28 23:38               ` Jason Gunthorpe
     [not found] ` <20200727052846.4070247-14-jonathan.lemon@gmail.com>
2020-07-27 15:19   ` [RFC PATCH v2 13/21] net/tcp: Pad TCP options out to a fixed size " Eric Dumazet
2020-07-27 17:20     ` Jonathan Lemon
     [not found] ` <20200727052846.4070247-16-jonathan.lemon@gmail.com>
2020-07-27 15:19   ` [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg() Eric Dumazet
2020-07-27 15:55     ` Jonathan Lemon
2020-07-27 16:09       ` Eric Dumazet
2020-07-27 17:35         ` Jonathan Lemon
2020-07-27 17:44           ` Eric Dumazet
2020-07-28  2:11             ` Jonathan Lemon
2020-07-28  2:17               ` Eric Dumazet
2020-07-28  3:08                 ` Jonathan Lemon
2020-07-28  6:50                 ` Christoph Hellwig
     [not found] ` <20200727052846.4070247-9-jonathan.lemon@gmail.com>
2020-07-27 15:24   ` [RFC PATCH v2 08/21] skbuff: add a zc_netgpu bitflag Eric Dumazet
2020-07-27 16:59     ` Jonathan Lemon
2020-07-27 17:08       ` Eric Dumazet
2020-07-27 17:16         ` Jonathan Lemon
2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu Jonathan Lemon
2020-07-28 16:31   ` Greg KH
2020-07-28 17:18     ` Chris Mason
2020-07-28 17:27       ` Christoph Hellwig
2020-07-28 18:47         ` Chris Mason

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).