Date: Mon, 20 Mar 2023 15:45:13 -0400
From: "Michael S. Tsirkin"
To: Heng Qi
Cc: virtio-comment@lists.oasis-open.org, virtio-dev@lists.oasis-open.org, Parav Pandit, Jason Wang, Yuri Benditovich, Cornelia Huck, Xuan Zhuo
Message-ID: <20230320154456-mutt-send-email-mst@kernel.org>
In-Reply-To: <20230316131726.GA20524@h68b04307.sqa.eu95>
Subject: Re: [virtio-comment] Re: [virtio-dev] Re: [virtio-comment] Re: [virtio-dev] Re: [PATCH v9] virtio-net: support inner header hash

On Thu, Mar 16, 2023 at 09:17:26PM +0800, Heng Qi wrote:
> On Wed, Mar 15, 2023 at 10:57:40AM -0400,
> Michael S. Tsirkin wrote:
> > On Wed, Mar 15, 2023 at 08:55:45PM +0800, Heng Qi wrote:
> > >
> > > On 2023/3/15 7:58 PM, Michael S. Tsirkin wrote:
> > > > On Sat, Mar 11, 2023 at 11:23:08AM +0800, Heng Qi wrote:
> > > > >
> > > > > On 2023/3/10 3:36 AM, Michael S. Tsirkin wrote:
> > > > > > On Thu, Mar 09, 2023 at 12:55:02PM +0800, Heng Qi wrote:
> > > > > > > On 2023/3/8 10:39 PM, Michael S. Tsirkin wrote:
> > > > > > > > On Wed, Mar 01, 2023 at 10:56:31AM +0800, Heng Qi wrote:
> > > > > > > > > On 2023/2/28 7:16 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > On Sat, Feb 18, 2023 at 10:37:15PM +0800, Heng Qi wrote:
> > > > > > > > > > > If the tunnel is used to encapsulate the packets, the hash calculated
> > > > > > > > > > > using the outer header of the receive packets is always fixed for the
> > > > > > > > > > > same flow packets, i.e. they will be steered to the same receive queue.
> > > > > > > > > > Wait a second. How is this true? Does not everyone stick the
> > > > > > > > > > inner header hash in the outer source port to solve this?
> > > > > > > > > Yes, you are right. That's what we did before the inner header hash, but it
> > > > > > > > > has a performance penalty, which I'll explain below.
> > > > > > > > >
> > > > > > > > > > For example geneve spec says:
> > > > > > > > > >
> > > > > > > > > > it is necessary for entropy from encapsulated packets to be
> > > > > > > > > > exposed in the tunnel header. The most common technique for this is
> > > > > > > > > > to use the UDP source port
> > > > > > > > > The end point of the tunnel is called the gateway (with DPDK on top of it).
> > > > > > > > >
> > > > > > > > > 1. When there is no inner header hash, entropy can be inserted into the udp
> > > > > > > > > src port of the outer header of the tunnel,
> > > > > > > > > and then the tunnel packet is handed over to the host.
> > > > > > > > > The host needs to
> > > > > > > > > take out a part of the CPUs to parse the outer headers (but not drop them)
> > > > > > > > > to calculate the inner hash for the inner payloads,
> > > > > > > > > and then use the inner
> > > > > > > > > hash to forward them to another part of the CPUs that are responsible for
> > > > > > > > > processing.
> > > > > > > > I don't get this part. Leave inner hashes to the guest inside the
> > > > > > > > tunnel, why is your host doing this?
> > > > >
> > > > > Let's simplify some details and take a fresh look at two different
> > > > > scenarios: VXLAN and GENEVE (Scenario1) and GRE (Scenario2).
> > > > >
> > > > > 1. In Scenario1, we can improve the processing performance of the same flow
> > > > > by implementing inner symmetric hashing.
> > > > >
> > > > > This is because even though client1 and client2 communicate bidirectionally
> > > > > through the same flow, their data may pass
> > > > > through and be encapsulated by different tunnels, resulting in the same flow
> > > > > being hashed to different queues and processed by different CPUs.
> > > > >
> > > > > To ensure consistency and optimized processing, we need to parse out the
> > > > > inner header and compute a symmetric hash on it using a special rss key.
> > > > >
> > > > > Sorry for not mentioning the inner symmetric hash before, in order to
> > > > > prevent the introduction of more concepts, but it is indeed a kind of inner
> > > > > hash.
> > > > If parts of a flow go through different tunnels won't this cause
> > > > reordering at the network level? Why is it so important to prevent it at
> > > > the nic then? Or, since you are stressing symmetric hash, are you
> > > > talking about TX and RX side going through different tunnels?
> > >
> > > Yes, the directions client1->client2 and client2->client1 may go through
> > > different tunnels.
> > > Using inner symmetric hashing lets the same CPU process both
> > > directions of the same flow, which improves performance.
> >
> > Well sure but ... are you just doing forwarding or inner processing too?
>
> When there is an inner hash, there is no forwarding anymore.
>
> > If forwarding why do you care about matching TX and RX queues? If e2e
>
> In fact, we are just matching on the same rx queue. The network topology
> is roughly as follows. The processing host will receive the packets
> sent from client1 and client2 respectively, then make some action judgments,
> and return them to client2 and client1 respectively.
>
>   client1                    client2
>      |                          |
>      |       __________         |
>      +----->| tunnel |<--------+
>             |--------|
>               |    |
>               |    |
>               |    |
>               v    v
>        +-----------------+
>        | processing host |
>        +-----------------+
>
> Thanks.

monitoring host would be a better term

> > processing can't you just store the incoming hash in the flow and reuse
> > on TX? This is what Linux is doing...
> >
> > > > >
> > > > > 2. In Scenario2 with GRE, the lack of outer transport headers means that
> > > > > flows between multiple communication pairs encapsulated by the same tunnel
> > > > > will all be hashed to the same queue. To address this, we need to implement
> > > > > inner hashing to improve the performance of RSS. By parsing and calculating
> > > > > the inner hash, different flows can be hashed to different queues.
> > > > >
> > > > > Thanks.
> > > > >
> > > > Well 2 is at least inexact, there's flowID there. It's just 8 bit
> > >
> > > We use the most basic GRE header fields (not NVGRE), not even optional
> > > fields.
> > > There is also no flow id in the GRE header, should you be referring to
> > > NVGRE?
> > >
> > > Thanks.
> > >
> > > > so not sufficient if there are more than 512 queues. Still 512 queues
> > > > is quite a lot. Are you trying to solve for configurations with
> > > > more than 512 queues then?
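
For illustration, the inner symmetric hash discussed in this thread can be sketched in Python as follows. This is a minimal sketch, not the definition from the patch: it assumes a Toeplitz hash over the inner IPv4 addresses and ports, and the key (the 16-bit pattern 0x6d5a repeated) as well as the names SYMMETRIC_KEY, toeplitz_hash and inner_tuple are hypothetical choices for the example. Because the assumed key repeats every 16 bits, swapping source and destination leaves the hash unchanged, so both directions of a flow would select the same receive queue.

```python
import socket
import struct

# Assumption: a 40-byte key made of the 16-bit pattern 0x6d5a repeated.
# A key with a 16-bit period makes the Toeplitz hash symmetric for this layout.
SYMMETRIC_KEY = bytes([0x6D, 0x5A]) * 20


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """32-bit Toeplitz hash: XOR the key window at every set input bit."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i in range(len(data) * 8):
        if data[i // 8] & (0x80 >> (i % 8)):
            # 32-bit window of the key that starts at bit offset i
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result


def inner_tuple(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> bytes:
    """Hash input: inner src IP, dst IP, src port, dst port (network order)."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!HH", src_port, dst_port))


# client1 -> client2 and client2 -> client1 of the same inner flow
forward = toeplitz_hash(SYMMETRIC_KEY, inner_tuple("10.0.0.1", "10.0.0.2", 12345, 80))
reverse = toeplitz_hash(SYMMETRIC_KEY, inner_tuple("10.0.0.2", "10.0.0.1", 80, 12345))
assert forward == reverse
```

The symmetry comes only from the periodic key; with an arbitrary RSS key the two directions would generally hash differently.
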
> > > >
> > > > > > > Assuming that the same flow includes a unidirectional flow a->b, or a
> > > > > > > bidirectional flow a->b and b->a,
> > > > > > > such flow may be out of order when processed by the gateway (DPDK):
> > > > > > >
> > > > > > > 1. In unidirectional mode, if the same flow is switched to another gateway
> > > > > > >    for some reason, resulting in a different outer IP address,
> > > > > > >    then this flow may be processed by different CPUs after reaching the
> > > > > > >    host if there is no inner hash. So after the host receives the
> > > > > > >    flow, it first uses the forwarding CPUs to parse the inner hash, and then
> > > > > > >    uses the hash to ensure that the flow is processed by the
> > > > > > >    same CPU.
> > > > > > > 2. In bidirectional mode, the a->b flow may go to gateway 1, and the b->a flow may
> > > > > > >    go to gateway 2. In order to ensure that the same flow is
> > > > > > >    processed by the same CPU, we still need the forwarding CPUs to parse
> > > > > > >    the real inner hash (here, the hash key needs to be replaced with a symmetric
> > > > > > >    hash key).
> > > > > > Oh interesting. What are those gateways, how come there's expectation
> > > > > > that you can change their addresses and topology
> > > > > > completely seamlessly without any reordering whatsoever?
> > > > > > Isn't network topology change kind of guaranteed to change ordering
> > > > > > sometimes?
> > > > > >
> > > > > > > > > 1). During this process, the CPUs on the host are divided into two parts, one
> > > > > > > > > part is used as a forwarding node to parse the outer header,
> > > > > > > > > and the CPU utilization is low. Another part handles packets.
> > > > > > > > Some overhead is clearly involved in *sending* packets -
> > > > > > > > to calculate the hash and stick it in the port number.
> > > > > > > > This is, however, a separate problem and if you want to
> > > > > > > > solve it then my suggestion would be to teach the *transmit*
> > > > > > > > side about GRE offloads, so it can fill the source port in the card.
> > > > > > > >
> > > > > > > > > 2). The entropy of the source udp src port is not enough, that is, the queue
> > > > > > > > > is not widely distributed.
> > > > > > > > how isn't it enough? 16 bit is enough to cover all vqs ...
> > > > > > > A 5-tuple brings more entropy than a single port, doesn't it?
> > > > > > But you don't need more for RSS, the indirection table is not
> > > > > > that large.
> > > > > >
> > > > > > > In fact, the
> > > > > > > inner hash of the physical network card used by
> > > > > > > the business team is indeed better than the udp port number of the outer
> > > > > > > header we modify now, but they did not give me the data.
> > > > > > Admittedly, our hash value is 32 bit.
> > > > > >
> > > > > > > > > 2. When there is an inner header hash, the gateway will directly help parse
> > > > > > > > > the outer header, and use the inner 5 tuples to calculate the inner hash.
> > > > > > > > > The tunneled packet is then handed over to the host.
> > > > > > > > > 1) All the CPUs of the host are used to process data packets, and there is
> > > > > > > > > no need to use some CPUs to forward and parse the outer header.
> > > > > > > > You really have to parse the outer header anyway,
> > > > > > > > otherwise there's no tunneling.
> > > > > > > > Unless you want to teach virtio to implement tunneling
> > > > > > > > in hardware, which is something I'd find it easier to
> > > > > > > > get behind.
> > > > > > > There is no need to parse the outer header twice, because we use shared
> > > > > > > memory.
> > > > > > shared with what? you need the outer header to identify the tunnel.
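
To make the point about the indirection table concrete, here is a minimal Python sketch of RSS queue selection. It assumes a power-of-two indirection table indexed by the low bits of the 32-bit hash, as RSS implementations commonly do; the table size, the queue count and the names select_queue, NUM_QUEUES and TABLE are hypothetical. With a 128-entry table only 7 bits of the hash take part in queue selection, whatever the width of the hash itself.

```python
def select_queue(hash_value: int, indirection_table: list[int]) -> int:
    """Pick a receive queue from the low bits of the hash (assumed scheme)."""
    size = len(indirection_table)
    if size == 0 or size & (size - 1):
        raise ValueError("indirection table length must be a power of two")
    # Mask the hash down to an index into the table.
    return indirection_table[hash_value & (size - 1)]


NUM_QUEUES = 8                                     # hypothetical queue count
TABLE = [i % NUM_QUEUES for i in range(128)]       # hypothetical 128-entry table

# Two hashes that differ only in their upper 16 bits land on the same queue,
# because only the low 7 bits index a 128-entry table.
assert select_queue(0x12345678, TABLE) == select_queue(0xFFFF5678, TABLE)
```
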
> > > > > >
> > > > > > > > > 2) The entropy of the original quintuple is sufficient, and the queue is
> > > > > > > > > widely distributed.
> > > > > > > > It's exactly the same entropy, why would it be better? In fact you
> > > > > > > > are taking out the outer hash entropy making things worse.
> > > > > > > I don't get the point, why the entropy of the inner 5-tuple and the outer
> > > > > > > tunnel header is the same,
> > > > > > > multiple streams have the same outer header.
> > > > > > >
> > > > > > > Thanks.
> > > > > > well our hash is 32 bit. source port is just 16 bit.
> > > > > > so yes it's more entropy but RSS can't use more than 16 bit.
> > > > > > why do you need so many? you have more than 64k CPUs to offload to?
> > > > > >
> > > > > > > > > Thanks.
> > > > > > > > > > same goes for vxlan did not check further.
> > > > > > > > > >
> > > > > > > > > > so what is the problem? and which tunnel types actually suffer from the
> > > > > > > > > > problem?
> > > > > > > > > >
> > > > > > > > > This publicly archived list offers a means to provide input to the
> > > > > > > > > OASIS Virtual I/O Device (VIRTIO) TC.
> > > > > > > > >
> > > > > > > > > In order to verify user consent to the Feedback License terms and
> > > > > > > > > to minimize spam in the list archive, subscription is required
> > > > > > > > > before posting.
> > > > > > > > >
> > > > > > > > > Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> > > > > > > > > Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> > > > > > > > > List help: virtio-comment-help@lists.oasis-open.org
> > > > > > > > > List archive: https://lists.oasis-open.org/archives/virtio-comment/
> > > > > > > > > Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
> > > > > > > > > List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
> > > > > > > > > Committee: https://www.oasis-open.org/committees/virtio/
> > > > > > > > > Join OASIS: https://www.oasis-open.org/join/
> > > > > > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > > > > > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org