From: Jakub Kicinski <kuba@kernel.org>
To: David Ahern <dsahern@gmail.com>
Cc: Boris Pismenny <borispismenny@gmail.com>,
	Boris Pismenny <borisp@mellanox.com>,
	davem@davemloft.net, saeedm@nvidia.com, hch@lst.de,
	sagi@grimberg.me, axboe@fb.com, kbusch@kernel.org,
	viro@zeniv.linux.org.uk, edumazet@google.com,
	boris.pismenny@gmail.com, linux-nvme@lists.infradead.org,
	netdev@vger.kernel.org, benishay@nvidia.com, ogerlitz@nvidia.com,
	yorayz@nvidia.com, Ben Ben-Ishay <benishay@mellanox.com>,
	Or Gerlitz <ogerlitz@mellanox.com>,
	Yoray Zack <yorayz@mellanox.com>,
	Boris Pismenny <borisp@nvidia.com>
Subject: Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
Date: Fri, 11 Dec 2020 10:45:34 -0800
Message-ID: <20201211104445.30684242@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
In-Reply-To: <5cadcfa0-b992-f124-f006-51872c86b804@gmail.com>

On Thu, 10 Dec 2020 19:43:57 -0700 David Ahern wrote:
> On 12/10/20 7:01 PM, Jakub Kicinski wrote:
> > On Wed, 9 Dec 2020 21:26:05 -0700 David Ahern wrote:  
> >> Yes, TCP is a byte stream, so the packets could very well show up like this:
> >>
> >>  +--------------+---------+-----------+---------+--------+-----+
> >>  | data - seg 1 | PDU hdr | prev data | TCP hdr | IP hdr | eth |
> >>  +--------------+---------+-----------+---------+--------+-----+
> >>  +-----------------------------------+---------+--------+-----+
> >>  |     payload - seg 2               | TCP hdr | IP hdr | eth |
> >>  +-----------------------------------+---------+--------+-----+
> >>  +---------+-------------------------+---------+--------+-----+
> >>  | PDU hdr |    payload - seg 3      | TCP hdr | IP hdr | eth |
> >>  +---------+-------------------------+---------+--------+-----+
> >>
> >> If your hardware can extract the NVMe payload into a targeted SGL like
> >> you want in this set, then it has some logic for parsing headers and
> >> "snapping" an SGL to a new element. ie., it already knows 'prev data'
> >> goes with the in-progress PDU, sees more data, recognizes a new PDU
> >> header and a new payload. That means it already has to handle a
> >> 'snap-to-PDU' style argument where the end of the payload closes out an
> >> SGL element and the next PDU hdr starts in a new SGL element (i.e., 'prev
> >> data' closes out sgl[i], and the next PDU hdr starts sgl[i+1]). So in
> >> this case, you want 'snap-to-PDU' but that could just as easily be 'no
> >> snap at all', just a byte stream and filling an SGL after the protocol
> >> headers.  
> > 
> > This 'snap-to-PDU' requirement is something that I don't understand
> > with the current TCP zero copy. In case of, say, a storage application  
> 
> Current TCP zero-copy does not handle this, and it can't AFAIK. I
> believe it requires hardware-level support where an Rx queue is
> dedicated to a flow / socket and some degree of header and payload
> splitting (the header is consumed by the kernel stack and the payload
> goes to the socket owner's memory).

Yet, Google claims to use the RX ZC in production, and with CX3 Pro /
mlx4 NICs.

A simple workaround that comes to mind is to have the headers and
payloads on separate TCP streams. That doesn't seem too slick... but
neither does the 4k MSS, so maybe that's what Google does?
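
For reference, a rough user-space sketch of how the current RX ZC is
driven (loosely after tools/testing/selftests/net/tcp_mmap.c; details
are from memory, so treat the exact fields and flow as approximate).
It also shows why everything hinges on the payload landing page-aligned,
hence the 4k MSS:

/* Simplified TCP_ZEROCOPY_RECEIVE flow, error handling trimmed.
 * Payload pages can only be remapped when they are page-aligned in the
 * stream, which is why the MTU/MSS has to be arranged so a segment
 * carries exactly 4k of payload. */
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>

#define ZC_CHUNK (512 * 1024)

static ssize_t zc_read_once(int fd)
{
        struct tcp_zerocopy_receive zc;
        socklen_t zc_len = sizeof(zc);
        void *addr;

        /* Reserve a VMA backed by the socket: PROT_READ + MAP_SHARED only. */
        addr = mmap(NULL, ZC_CHUNK, PROT_READ, MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
                return -1;

        memset(&zc, 0, sizeof(zc));
        zc.address = (uint64_t)(unsigned long)addr;
        zc.length = ZC_CHUNK;

        /* On success the kernel has mapped zc.length bytes of payload at
         * addr; zc.recv_skip_hint bytes were not page-aligned and would
         * have to be read the normal way with recv(). */
        if (getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE,
                       &zc, &zc_len) < 0) {
                munmap(addr, ZC_CHUNK);
                return -1;
        }

        /* ... consume payload at addr ... */

        munmap(addr, ZC_CHUNK);
        return zc.length;
}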

> > which wants to send some headers (whatever RPC info, block number,
> > etc.) and then a 4k block of data - how does the RX side get just the
> > 4k block a into a page so it can zero copy it out to its storage device?
> > 
> > Per-connection state in the NIC, and FW parsing headers is one way,
> > but I wonder how this record split problem is best resolved generically.
> > Perhaps by passing hints in the headers somehow?
> > 
> > Sorry for the slight off-topic :)
> >   
> The hardware has to parse the incoming packets to find the usual
> ethernet/IP/TCP headers and the TCP payload offset. Then the hardware
> has to have some kind of ULP processor that knows how to parse the TCP
> byte stream at least well enough to find the PDU header and interpret
> it to get the PDU header length and payload length.

The big difference between normal headers and L7 headers is that the
former are at a ~constant offset, self-contained, and always complete,
while a PDU header can be split across segments.
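
For concreteness, the L7 header in question for nvme-tcp is tiny: the
8-byte common PDU header (roughly as in include/linux/nvme-tcp.h). Its
hlen/pdo/plen fields are exactly what a DDP engine needs, yet nothing
guarantees those 8 bytes arrive in a single segment:

/* NVMe/TCP common PDU header, roughly as in include/linux/nvme-tcp.h. */
#include <linux/types.h>

struct nvme_tcp_hdr {
        __u8    type;   /* PDU type (cmd, c2h_data, h2c_data, ...) */
        __u8    flags;
        __u8    hlen;   /* PDU header length */
        __u8    pdo;    /* payload data offset */
        __le32  plen;   /* total PDU length: header + data (+ digests) */
};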

Edwin Peer did an implementation of a TLS ULP for the NFP; it was
complex. Not to mention it's L7 protocol ossification.

To put it bluntly, maybe it's fine for smaller shops, but I'm guessing
it's going to be a hard sell to hyperscalers and to people who don't
like being locked in to HW.

> At that point you push the protocol headers (eth/ip/tcp) into one buffer
> for the kernel stack protocols and put the payload into another. The
> former would be some page owned by the OS and the latter owned by the
> process / socket (generically, in this case it is a kernel level
> socket). In addition, since the payload is spread across multiple
> packets, the hardware has to keep track of the TCP sequence number and
> its current place in the SGL where it is writing the payload, to keep
> the bytes contiguous and to detect out-of-order delivery.
> 
> If the ULP processor knows about PDU headers it knows when enough
> payload has been found to satisfy that PDU in which case it can tell the
> cursor to move on to the next SGL element (or separate SGL). That's what
> I meant by 'snap-to-PDU'.
> 
> Alternatively, if it is any random application with a byte stream not
> understood by hardware, the cursor just keeps moving along the SGL
> elements assigned it for this particular flow.
> 
> If you have a socket whose payload is getting offloaded to its own queue
> (which this set is effectively doing), you can create the queue with
> some attribute that says 'NVMe ULP', 'iscsi ULP', or 'just a byte
> stream', which controls the parsing, i.e. when you stop writing to one
> SGL element and move on to the next. Again, assuming hardware support
> for such attributes.
> 
> I don't work for Nvidia, so this is all supposition based on what the
> patches are doing.
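
To make the 'snap-to-PDU' cursor described above concrete, here is a
purely illustrative software model of it (every name is made up; this
is a thought experiment, not how the mlx5 hardware or these patches
actually implement it). The queue carries a ULP attribute as suggested
above, and a ULP parser would call ddp_new_pdu() whenever it has
decoded the next PDU header:

/* Hypothetical per-flow DDP cursor with a 'snap-to-PDU' policy. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum ddp_ulp { DDP_ULP_NONE, DDP_ULP_NVME_TCP, DDP_ULP_ISCSI };

struct ddp_sge {
        uint8_t         *addr;
        size_t          len;
        size_t          used;
};

struct ddp_cursor {
        enum ddp_ulp    ulp;            /* queue attribute: how to snap */
        uint32_t        next_seq;       /* next expected TCP sequence number */
        struct ddp_sge  *sgl;
        unsigned int    nr_sge;
        unsigned int    cur;            /* SGL element currently being filled */
        size_t          pdu_left;       /* payload bytes left in current PDU */
};

/* ULP parser found a new PDU header with this much payload following it. */
static void ddp_new_pdu(struct ddp_cursor *c, size_t payload_len)
{
        c->pdu_left = payload_len;
}

/* Place payload bytes of one in-order segment; returns bytes consumed
 * (0 means fall back to the software path, e.g. out-of-order data). */
static size_t ddp_place(struct ddp_cursor *c, uint32_t seq,
                        const uint8_t *data, size_t len)
{
        size_t done = 0;

        if (seq != c->next_seq)
                return 0;

        while (done < len && c->cur < c->nr_sge) {
                struct ddp_sge *sge = &c->sgl[c->cur];
                size_t chunk = len - done;

                if (c->ulp != DDP_ULP_NONE) {
                        if (!c->pdu_left)
                                break;          /* wait for next PDU header */
                        if (chunk > c->pdu_left)
                                chunk = c->pdu_left;
                }
                if (chunk > sge->len - sge->used)
                        chunk = sge->len - sge->used;

                /* stands in for the device's DMA write */
                memcpy(sge->addr + sge->used, data + done, chunk);
                sge->used += chunk;
                done += chunk;

                if (c->ulp != DDP_ULP_NONE) {
                        c->pdu_left -= chunk;
                        /* snap-to-PDU: end of payload closes this element,
                         * the next PDU header starts a fresh one. */
                        if (!c->pdu_left) {
                                c->cur++;
                                continue;
                        }
                }
                if (sge->used == sge->len)
                        c->cur++;
        }
        c->next_seq += done;
        return done;
}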

Ack, these patches are not exciting (to me), so I'm wondering if there
is a better way. The only reason a NIC would have to understand a ULP
for ZC is to parse out the header/message lengths. There's gotta be a
way to pass those in header options or such...
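
Purely as a thought experiment on the 'header options' idea (nothing
like this exists or is being proposed here; the names are invented for
illustration): the hint could be as small as an experimental TCP option
the sender emits on the segment where a new message starts, so the NIC
never needs to understand the ULP itself:

/* Hypothetical 'message boundary hint' TCP option -- invented purely
 * for illustration.  253/254 are the experimental option kinds
 * (RFC 4727 / RFC 6994); a real proposal would also need an ExID and
 * sender-side plumbing. */
#include <stdint.h>

#define TCPOPT_EXP_MSG_HINT     253

struct tcp_msg_hint_opt {
        uint8_t         kind;           /* TCPOPT_EXP_MSG_HINT */
        uint8_t         len;            /* total option length: 8 bytes */
        uint16_t        hdr_len;        /* ULP/PDU header bytes at this point */
        uint32_t        payload_len;    /* payload bytes following that header */
} __attribute__((packed));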

And, you know, if we figure something out, maybe we stand a chance of
avoiding four different zero-copy implementations (this, TCP, AF_XDP,
netgpu) :(


Thread overview: 53+ messages
2020-12-07 21:06 [PATCH v1 net-next 00/15] nvme-tcp receive offloads Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 01/15] iov_iter: Skip copy in memcpy_to_page if src==dst Boris Pismenny
2020-12-08  0:39   ` David Ahern
2020-12-08 14:30     ` Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload Boris Pismenny
2020-12-08  0:42   ` David Ahern
2020-12-08 14:36     ` Boris Pismenny
2020-12-09  0:38       ` David Ahern
2020-12-09  8:15         ` Boris Pismenny
2020-12-10  4:26           ` David Ahern
2020-12-11  2:01             ` Jakub Kicinski
2020-12-11  2:43               ` David Ahern
2020-12-11 18:45                 ` Jakub Kicinski [this message]
2020-12-11 18:58                   ` Eric Dumazet
2020-12-11 19:59                   ` David Ahern
2020-12-11 23:05                     ` Jonathan Lemon
2020-12-13 18:34                   ` Boris Pismenny
2020-12-13 18:21             ` Boris Pismenny
2020-12-15  5:19               ` David Ahern
2020-12-17 19:06                 ` Boris Pismenny
2020-12-18  0:44                   ` David Ahern
2020-12-09  0:57   ` David Ahern
2020-12-09  1:11     ` David Ahern
2020-12-09  8:28       ` Boris Pismenny
2020-12-09  8:25     ` Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 03/15] net: Introduce crc offload for tcp ddp ulp Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 04/15] net/tls: expose get_netdev_for_sock Boris Pismenny
2020-12-09  1:06   ` David Ahern
2020-12-09  7:41     ` Boris Pismenny
2020-12-10  3:39       ` David Ahern
2020-12-11 18:43         ` Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 05/15] nvme-tcp: Add DDP offload control path Boris Pismenny
2020-12-10 17:15   ` Shai Malin
2020-12-14  6:38     ` Boris Pismenny
2020-12-15 13:33       ` Shai Malin
2020-12-17 18:51         ` Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 06/15] nvme-tcp: Add DDP data-path Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 07/15] nvme-tcp : Recalculate crc in the end of the capsule Boris Pismenny
2020-12-15 14:07   ` Shai Malin
2020-12-07 21:06 ` [PATCH v1 net-next 08/15] nvme-tcp: Deal with netdevice DOWN events Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 09/15] net/mlx5: Header file changes for nvme-tcp offload Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 10/15] net/mlx5: Add 128B CQE for NVMEoTCP offload Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 11/15] net/mlx5e: TCP flow steering for nvme-tcp Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 12/15] net/mlx5e: NVMEoTCP DDP offload control path Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 13/15] net/mlx5e: NVMEoTCP, data-path for DDP offload Boris Pismenny
2020-12-18  0:57   ` David Ahern
2020-12-07 21:06 ` [PATCH v1 net-next 14/15] net/mlx5e: NVMEoTCP statistics Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 15/15] net/mlx5e: NVMEoTCP workaround CRC after resync Boris Pismenny
2021-01-14  1:27 ` [PATCH v1 net-next 00/15] nvme-tcp receive offloads Sagi Grimberg
2021-01-14  4:47   ` David Ahern
2021-01-14 19:21     ` Boris Pismenny
2021-01-14 19:17   ` Boris Pismenny
2021-01-14 21:07     ` Sagi Grimberg
