All of lore.kernel.org
 help / color / mirror / Atom feed
From: Adel Abouchaev <adel.abushaev@gmail.com>
To: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Xin Long <lucien.xin@gmail.com>, Jakub Kicinski <kuba@kernel.org>,
	davem <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>,
	Paolo Abeni <pabeni@redhat.com>, Jonathan Corbet <corbet@lwn.net>,
	David Ahern <dsahern@kernel.org>,
	shuah@kernel.org, imagedong@tencent.com,
	network dev <netdev@vger.kernel.org>,
	linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org
Subject: Re: [net-next v2 0/6] net: support QUIC crypto
Date: Tue, 27 Sep 2022 09:44:42 -0700	[thread overview]
Message-ID: <0c989c58-2aa6-eca6-2bb9-24b1ae71694a@gmail.com> (raw)
In-Reply-To: <CA+FuTSf_8MjF4jeUjEqDrOwqXzf485jX_GJyVP5kPUDzOFezkg@mail.gmail.com>


On 9/25/22 11:04 AM, Willem de Bruijn wrote:
>>> The patch seems to get the crypto_ctx by doing a connection hash table
>>> lookup in the sendmsg(), which is not good from the performance side.
>>> One QUIC connection can go over multiple UDP sockets, but I don't
>>> think one socket can be used by multiple QUIC connections. So why not
>>> save the ctx in the socket instead?
>> A single socket could have multiple connections originated from it,
>> having different destinations, if the socket is not connected. An
>> optimization could be made for connected sockets to cache the context
>> and save time on a lookup. The measurement of kernel operations timing
>> did not reveal a significant amount of time spent in this lookup due to
>> a relatively small number of connections per socket in general. A shared
>> table across multiple sockets might experience a different performance
>> grading.
> I'm late to this patch series, sorry. High quality implementation. I
> have a few design questions similar to Xin.
>
> If multiplexing, instead of looking up a connection by { address, port
> variable length connection ID }, perhaps return a connection table
> index on setsockopt and use that in sendmsg.


It was deliberate to not to return anything other than 0 from 
setsockopt() as defined in the spec for the function. Despite that it 
says "shall", the doc says that 0 is the only value for successful 
operation. This was the reason not to use setsockopt() for any 
bidirectional transfers of data and or status. A more sophisticated 
approach with netlink sockets would be more suitable for it. The second 
reason is the API asymmetry for Tx and Rx which will be introduced - the 
Rx will still need to match on the address, port and cid. The third 
reason is that in current implementations there are no more than a few 
connections per socket, which does not abuse the rhashtable that does a 
lookup, although it takes time to hash the key into a hash for a seek. 
The performance measurement ran against the runtime and did not flag 
this path as underperforming either, there were other parts that 
substantially add to the runtime, not the key lookup though.


>>> The patch is to reduce the copying operations between user space and
>>> the kernel. I might miss something in your user space code, but the
>>> msg to send is *already packed* into the Stream Frame in user space,
>>> what's the difference if you encrypt it in userspace and then
>>> sendmsg(udp_sk) with zero-copy to the kernel.
>> It is possible to do it this way. Zero-copy works best with packet sizes
>> starting at 32K and larger.  Anything less than that would consume the
>> improvements of zero-copy by zero-copy pre/post operations and needs to
>> align memory.
> Part of the cost of MSG_ZEROCOPY is in mapping and unmapping user
> pages. This series re-implements that with its own get_user_pages.
> That is duplicative non-trivial code. And it will incur the same cost.
> What this implementation saves is the (indeed non-trivial)
> asynchronous completion notification over the error queue.
>
> The cover letter gives some performance numbers against a userspace
> implementation that has to copy from user to kernel. It might be more
> even to compare against an implementation using MSG_ZEROCOPY and
> UDP_SEGMENT. A userspace crypto implementation may have other benefits
> compared to a kernel implementation, such as not having to convert to
> crypto API scatter-gather arrays and back to network structures.
>
> A few related points
>
> - The implementation support multiplexed connections, but only one
> crypto sendmsg can be outstanding at any time:
>
>    + /**
>    + * To synchronize concurrent sendmsg() requests through the same socket
>    + * and protect preallocated per-context memory.
>    + **/
>    + struct mutex sendmsg_mux;
>
> That is quite limiting for production workloads.

The use case that we have with MVFST library currently runs a single 
worker for a connection and has a single socket attached to it. QUIC 
allows simultaneous use of multiple connection IDs to swap them in 
runtime, and implementation would request only a handful of these. The 
MVFST batches writes into a block of about 8Kb and then uses GSO to send 
them all at once.

> - Crypto operations are also executed synchronously, using
> crypto_wait_req after each operationn. This limits throughput by using
> at most one core per UDP socket. And adds sendmsg latency (which may
> or may not be important to the application). Wireguard shows an
> example of how to parallelize software crypto across cores.
>
> - The implementation avoids dynamic allocation of cipher text pages by
> using a single ctx->cipher_page. This is protected by sendmsg_mux (see
> above). Is that safe when packets leave the protocol stack and are
> then held in a qdisc or when being processed by the NIC?
> quic_sendmsg_locked will return, but the cipher page is not free to
> reuse yet.
There is currently no use case that we have in hands that requires 
parallel transmission of data for the same connection. Multiple 
connections would have no issue running in parallel as each of them will 
have it's own preallocated cipher_page in the context.

There is a fragmentation further down the stack with 
ip_generic_getfrag() that eventually does copy_from_iter() and makea a 
copy of the data. This is executed as part of __ip_append_data() called 
from udp_sendmsg() in ipv4/udp.c. The assumption was that this is 
executed synchronously and the queues and NIC will see a mapping of a 
different memory area than the ciphertext in the pre-allocated page.

>
> - The real benefit of kernel QUIC will come from HW offload. Would it
> be better to avoid the complexity of an in-kernel software
> implementation and only focus on HW offload? Basically, pass the
> plaintext QUIC packets over a standard UDP socket and alongside in a
> cmsg pass either an index into a HW security association database or
> the immediate { key, iv } connection_info (for stateless sockets), to
> be encoded into the descriptor by the device driver.
Hardware usually targets a single ciphersuite such as AES-GCM-128/256, 
while QUIC also supports Chacha20-Poly1305 and AES-CCM. The generalized 
support for offload prompted implementation of these ciphers in kernel 
code. The kernel code could also engage if the future hardware has 
capacity caps preventing it from handling all requests in the hardware.
> - With such a simpler path, could we avoid introducing ULP and just
> have udp [gs]etsockopt CRYPTO_STATE. Where QUIC is the only defined
> state type yet.
>
> - Small aside: as the series introduces new APIs with non-trivial
> parsing in the kernel, it's good to run a fuzzer like syzkaller on it
> (if not having done so yet).
Agreed.
>> The other possible obstacle would be that eventual support
>> of QUIC encryption and decryption in hardware would integrate well with
>> this current approach.
>>> Didn't really understand the "GSO" you mentioned, as I don't see any
>>> code about kernel GSO, I guess it's just "Fragment size", right?
>>> BTW, it‘s not common to use "//" for the kernel annotation.
> minor point: fragment has meaning in IPv4. For GSO, prefer gso_size.
Sure, will change it to gso_size.
>
>> Once the payload arrives into the kernel, the GSO on the interface would
>> instruct L3/L4 stack on fragmentation. In this case, the plaintext QUIC
>> packets should be aligned on the GSO marks less the tag size that would
>> be added by encryption. For GSO size 1000, the QUIC packets in the batch
>> for transmission should all be 984 bytes long, except maybe the last
>> one. Once the tag is attached, the new size of 1000 will correctly split
>> the QUIC packets further down the stack for transmission in individual
>> IP/UDP packets. The code is also saving processing time by sending all
>> packets at once to UDP in a single call, when GSO is enabled.
>>> I'm not sure if it's worth adding a ULP layer over UDP for this QUIC
>>> TX only. Honestly, I'm more supporting doing a full QUIC stack in the
>>> kernel independently with socket APIs to use it:
>>> https://github.com/lxin/tls_hs.
>>>
>>> Thanks.

  reply	other threads:[~2022-09-27 16:44 UTC|newest]

Thread overview: 77+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <Adel Abouchaev <adel.abushaev@gmail.com>
2022-08-01 19:52 ` [RFC net-next 0/6] net: support QUIC crypto Adel Abouchaev
2022-08-01 19:52   ` [RFC net-next 1/6] net: Documentation on QUIC kernel Tx crypto Adel Abouchaev
2022-08-01 19:52   ` [RFC net-next 2/6] net: Define QUIC specific constants, control and data plane structures Adel Abouchaev
2022-08-01 19:52   ` [RFC net-next 3/6] net: Add UDP ULP operations, initialization and handling prototype functions Adel Abouchaev
2022-08-01 19:52   ` [RFC net-next 4/6] net: Implement QUIC offload functions Adel Abouchaev
2022-08-01 19:52   ` [RFC net-next 5/6] net: Add flow counters and Tx processing error counter Adel Abouchaev
2022-08-01 19:52   ` [RFC net-next 6/6] net: Add self tests for ULP operations, flow setup and crypto tests Adel Abouchaev
2022-08-05  3:37   ` [RFC net-next 0/6] net: support QUIC crypto Bagas Sanjaya
2022-08-03 16:40 ` Adel Abouchaev
2022-08-03 16:40   ` [RFC net-next 1/6] net: Documentation on QUIC kernel Tx crypto Adel Abouchaev
2022-08-03 18:23     ` Andrew Lunn
2022-08-03 18:51       ` Adel Abouchaev
2022-08-04 15:29         ` Andrew Lunn
2022-08-04 16:57           ` Adel Abouchaev
2022-08-04 17:00             ` Eric Dumazet
2022-08-04 18:09               ` Jakub Kicinski
2022-08-04 18:45                 ` Eric Dumazet
2022-08-04 13:57     ` Jonathan Corbet
2022-08-03 16:40   ` [RFC net-next 2/6] net: Define QUIC specific constants, control and data plane structures Adel Abouchaev
2022-08-03 16:40   ` [RFC net-next 3/6] net: Add UDP ULP operations, initialization and handling prototype functions Adel Abouchaev
2022-08-03 16:40   ` [RFC net-next 4/6] net: Implement QUIC offload functions Adel Abouchaev
2022-08-03 16:40   ` [RFC net-next 5/6] net: Add flow counters and Tx processing error counter Adel Abouchaev
2022-08-03 16:40   ` [RFC net-next 6/6] net: Add self tests for ULP operations, flow setup and crypto tests Adel Abouchaev
2022-08-06  0:11 ` [RFC net-next v2 0/6] net: support QUIC crypto Adel Abouchaev
2022-08-06  0:11   ` [RFC net-next v2 1/6] Documentation on QUIC kernel Tx crypto Adel Abouchaev
2022-08-06  3:05     ` Bagas Sanjaya
2022-08-08 19:05       ` Adel Abouchaev
2022-08-06  0:11   ` [RFC net-next v2 2/6] Define QUIC specific constants, control and data plane structures Adel Abouchaev
2022-08-06  0:11   ` [RFC net-next v2 3/6] Add UDP ULP operations, initialization and handling prototype functions Adel Abouchaev
2022-08-06  0:11   ` [RFC net-next v2 4/6] Implement QUIC offload functions Adel Abouchaev
2022-08-06  0:11   ` [RFC net-next v2 5/6] Add flow counters and Tx processing error counter Adel Abouchaev
2022-08-06  0:11   ` [RFC net-next v2 6/6] Add self tests for ULP operations, flow setup and crypto tests Adel Abouchaev
2022-08-16 18:11 ` [net-next 0/6] net: support QUIC crypto Adel Abouchaev
2022-08-16 18:11   ` [net-next 1/6] Documentation on QUIC kernel Tx crypto Adel Abouchaev
2022-08-16 18:11   ` [net-next 2/6] Define QUIC specific constants, control and data plane structures Adel Abouchaev
2022-08-16 18:11   ` [net-next 3/6] Add UDP ULP operations, initialization and handling prototype functions Adel Abouchaev
2022-08-16 18:11   ` [net-next 4/6] Implement QUIC offload functions Adel Abouchaev
2022-08-16 18:11   ` [net-next 5/6] Add flow counters and Tx processing error counter Adel Abouchaev
2022-08-16 18:11   ` [net-next 6/6] Add self tests for ULP operations, flow setup and crypto tests Adel Abouchaev
2022-08-17  8:09   ` [net-next 0/6] net: support QUIC crypto Bagas Sanjaya
2022-08-17 18:49     ` Adel Abouchaev
2022-08-17 20:09 ` [net-next v2 " Adel Abouchaev
2022-08-17 20:09   ` [net-next v2 1/6] Documentation on QUIC kernel Tx crypto Adel Abouchaev
2022-08-18  2:53     ` Bagas Sanjaya
2022-08-17 20:09   ` [net-next v2 2/6] Define QUIC specific constants, control and data plane structures Adel Abouchaev
2022-08-17 20:09   ` [net-next v2 3/6] Add UDP ULP operations, initialization and handling prototype functions Adel Abouchaev
2022-08-17 20:09   ` [net-next v2 4/6] Implement QUIC offload functions Adel Abouchaev
2022-08-17 20:09   ` [net-next v2 5/6] Add flow counters and Tx processing error counter Adel Abouchaev
2022-08-17 20:09   ` [net-next v2 6/6] Add self tests for ULP operations, flow setup and crypto tests Adel Abouchaev
2022-08-18  2:18   ` [net-next v2 0/6] net: support QUIC crypto Bagas Sanjaya
2022-08-24 18:29   ` Xin Long
2022-08-24 19:52     ` Matt Joras
2022-08-24 23:09     ` Adel Abouchaev
2022-09-25 18:04       ` Willem de Bruijn
2022-09-27 16:44         ` Adel Abouchaev [this message]
2022-09-27 17:12           ` Willem de Bruijn
2022-09-27 17:28             ` Adel Abouchaev
2022-08-24 18:43 ` [net-next] Fix reinitialization of TEST_PROGS in net self tests Adel Abouchaev
2022-08-24 20:12   ` Shuah Khan
2022-08-25 20:30   ` patchwork-bot+netdevbpf
2022-09-07  0:49 ` [net-next v3 0/6] net: support QUIC crypto Adel Abouchaev
2022-09-07  0:49   ` [net-next v3 1/6] net: Documentation on QUIC kernel Tx crypto Adel Abouchaev
2022-09-07  3:38     ` Bagas Sanjaya
2022-09-07 17:29       ` Adel Abouchaev
2022-09-07  0:49   ` [net-next v3 2/6] net: Define QUIC specific constants, control and data plane structures Adel Abouchaev
2022-09-07  0:49   ` [net-next v3 3/6] net: Add UDP ULP operations, initialization and handling prototype functions Adel Abouchaev
2022-09-07  0:49   ` [net-next v3 4/6] net: Implement QUIC offload functions Adel Abouchaev
2022-09-07  0:49   ` [net-next v3 5/6] net: Add flow counters and Tx processing error counter Adel Abouchaev
2022-09-07  0:49   ` [net-next v3 6/6] net: Add self tests for ULP operations, flow setup and crypto tests Adel Abouchaev
2022-09-09  0:12 ` [net-next v4 0/6] net: support QUIC crypto Adel Abouchaev
2022-09-09  0:12   ` [net-next v4 1/6] net: Documentation on QUIC kernel Tx crypto Adel Abouchaev
2022-09-09  1:40     ` Bagas Sanjaya
2022-09-09  0:12   ` [net-next v4 2/6] net: Define QUIC specific constants, control and data plane structures Adel Abouchaev
2022-09-09  0:12   ` [net-next v4 3/6] net: Add UDP ULP operations, initialization and handling prototype functions Adel Abouchaev
2022-09-09  0:12   ` [net-next v4 4/6] net: Implement QUIC offload functions Adel Abouchaev
2022-09-09  0:12   ` [net-next v4 5/6] net: Add flow counters and Tx processing error counter Adel Abouchaev
2022-09-09  0:12   ` [net-next v4 6/6] net: Add self tests for ULP operations, flow setup and crypto tests Adel Abouchaev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0c989c58-2aa6-eca6-2bb9-24b1ae71694a@gmail.com \
    --to=adel.abushaev@gmail.com \
    --cc=corbet@lwn.net \
    --cc=davem@davemloft.net \
    --cc=dsahern@kernel.org \
    --cc=edumazet@google.com \
    --cc=imagedong@tencent.com \
    --cc=kuba@kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=lucien.xin@gmail.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=shuah@kernel.org \
    --cc=willemdebruijn.kernel@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.