From: Heng Qi
Date: Thu, 30 Mar 2023 20:10:45 +0800
To: "Michael S. Tsirkin"
Cc: virtio-comment@lists.oasis-open.org, virtio-dev@lists.oasis-open.org,
    Parav Pandit, Jason Wang, Yuri Benditovich, Cornelia Huck, Xuan Zhuo
In-Reply-To: <20230320154456-mutt-send-email-mst@kernel.org>
Subject: Re: [virtio-comment] Re: [virtio-dev] Re: [virtio-comment] Re: [virtio-dev] Re: [PATCH v9] virtio-net: support inner header hash

On 2023/3/21 03:45, Michael S. Tsirkin wrote:
> On Thu, Mar 16, 2023 at 09:17:26PM +0800, Heng Qi wrote:
>> On Wed, Mar 15, 2023 at 10:57:40AM -0400, Michael S. Tsirkin wrote:
>>> On Wed, Mar 15, 2023 at 08:55:45PM +0800, Heng Qi wrote:
>>>>
>>>> On 2023/3/15 19:58, Michael S. Tsirkin wrote:
>>>>> On Sat, Mar 11, 2023 at 11:23:08AM +0800, Heng Qi wrote:
>>>>>>
>>>>>> On 2023/3/10 03:36, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Mar 09, 2023 at 12:55:02PM +0800, Heng Qi wrote:
>>>>>>>> On 2023/3/8 22:39, Michael S. Tsirkin wrote:
>>>>>>>>> On Wed, Mar 01, 2023 at 10:56:31AM +0800, Heng Qi wrote:
>>>>>>>>>> On 2023/2/28 19:16, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Sat, Feb 18, 2023 at 10:37:15PM +0800, Heng Qi wrote:
>>>>>>>>>>>> If a tunnel is used to encapsulate the packets, the hash calculated
>>>>>>>>>>>> using the outer header of the received packets is always the same for
>>>>>>>>>>>> packets of the same flow, i.e. they will all be steered to the same
>>>>>>>>>>>> receive queue.
>>>>>>>>>>> Wait a second. How is this true? Does not everyone stick the
>>>>>>>>>>> inner header hash in the outer source port to solve this?
>>>>>>>>>> Yes, you are right. That's what we did before the inner header hash, but it
>>>>>>>>>> has a performance penalty, which I'll explain below.
>>>>>>>>>>
>>>>>>>>>>> For example the geneve spec says:
>>>>>>>>>>>
>>>>>>>>>>>     it is necessary for entropy from encapsulated packets to be
>>>>>>>>>>>     exposed in the tunnel header. The most common technique for this is
>>>>>>>>>>>     to use the UDP source port
>>>>>>>>>> The endpoint of the tunnel is called the gateway (with DPDK running on top of it).
>>>>>>>>>>
>>>>>>>>>> 1. When there is no inner header hash, entropy can be inserted into the UDP
>>>>>>>>>> source port of the tunnel's outer header, and the tunneled packet is then
>>>>>>>>>> handed over to the host. The host has to dedicate a part of its CPUs to
>>>>>>>>>> parse the outer headers (without dropping them) and calculate the inner
>>>>>>>>>> hash over the inner payloads, and then use that inner hash to forward the
>>>>>>>>>> packets to the other part of the CPUs that is responsible for processing.
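To make the source-port technique above concrete, this is roughly what the
encapsulating gateway does today (only a sketch: the struct, the FNV-style mix
and the port range are placeholders for illustration, not the actual
gateway/DPDK code):

#include <stdint.h>

struct flow_tuple {
    uint32_t saddr, daddr;   /* inner IPv4 addresses */
    uint16_t sport, dport;   /* inner L4 ports */
    uint8_t  proto;          /* inner L4 protocol */
};

/* Toy 5-tuple hash standing in for whatever hash the encapsulator really uses. */
static uint32_t inner_flow_hash(const struct flow_tuple *t)
{
    uint32_t h = 2166136261u;                 /* FNV-1a offset basis */
    uint32_t words[3] = { t->saddr, t->daddr,
                          ((uint32_t)t->sport << 16) | t->dport };
    const uint8_t *p = (const uint8_t *)words;

    for (unsigned i = 0; i < sizeof(words); i++)
        h = (h ^ p[i]) * 16777619u;           /* FNV-1a prime */
    return (h ^ t->proto) * 16777619u;
}

/* Fold the inner hash into the outer UDP source port, keeping it in the
 * dynamic/ephemeral range as the encapsulation specs suggest. */
static uint16_t outer_udp_sport(const struct flow_tuple *inner)
{
    uint32_t h = inner_flow_hash(inner);

    return (uint16_t)(49152 + ((h ^ (h >> 16)) & 0x3FFF));
}

So the only per-flow entropy the outer header ends up carrying is whatever
survives this fold into the 16-bit source port.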
>>>>>>>>> I don't get this part. Leave inner hashes to the guest inside the
>>>>>>>>> tunnel, why is your host doing this?
>>>>>>
>>>>>> Let's simplify some details and take a fresh look at two different
>>>>>> scenarios: VXLAN and GENEVE (Scenario 1) and GRE (Scenario 2).
>>>>>>
>>>>>> 1. In Scenario 1, we can improve the processing performance of the same flow
>>>>>> by implementing inner symmetric hashing.
>>>>>>
>>>>>> This is because even though client1 and client2 communicate bidirectionally
>>>>>> over the same flow, their data may pass through and be encapsulated by
>>>>>> different tunnels, so the same flow can be hashed to different queues and
>>>>>> processed by different CPUs.
>>>>>>
>>>>>> To ensure consistency and optimized processing, we need to parse out the
>>>>>> inner header and compute a symmetric hash over it using a special RSS key.
>>>>>>
>>>>>> Sorry for not mentioning the inner symmetric hash before; I wanted to avoid
>>>>>> introducing more concepts, but it is indeed a kind of inner hash.
>>>>> If parts of a flow go through different tunnels won't this cause
>>>>> reordering at the network level? Why is it so important to prevent it at
>>>>> the nic then? Or, since you are stressing symmetric hash, are you
>>>>> talking about TX and RX side going through different tunnels?
>>>> Yes, the directions client1->client2 and client2->client1 may go through
>>>> different tunnels.
>>>> Using an inner symmetric hash lets the same CPU process both directions of
>>>> the same flow, which improves performance.
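To spell out what the symmetric inner hash buys us: the two endpoint summaries
are ordered before mixing, so swapping source and destination yields the same
value and therefore the same queue/CPU. A symmetric RSS key achieves this in
the device; the C below is only an illustration with made-up names and a
placeholder mixing step:

#include <stdint.h>

struct flow_tuple {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    uint8_t  proto;
};

/* Simple mixing step, placeholder for the real hash function. */
static uint32_t mix32(uint32_t h, uint32_t v)
{
    h ^= v;
    h *= 0x9E3779B1u;
    return h ^ (h >> 16);
}

static uint32_t symmetric_inner_hash(const struct flow_tuple *t)
{
    /* Summarize each endpoint, then hash (min, max) so that swapping
     * source and destination cannot change the result. */
    uint32_t a  = ((uint32_t)t->sport << 16) ^ t->saddr;
    uint32_t b  = ((uint32_t)t->dport << 16) ^ t->daddr;
    uint32_t lo = a < b ? a : b;
    uint32_t hi = a < b ? b : a;

    uint32_t h = mix32(0x12345678u, lo);
    h = mix32(h, hi);
    return mix32(h, t->proto);
}

With this property, client1->client2 and client2->client1 land on the same CPU
even when they arrive through different tunnels.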
>>> Well sure but ... are you just doing forwarding or inner processing too?
>> When there is an inner hash, there is no forwarding anymore.
>>
>>> If forwarding why do you care about matching TX and RX queues? If e2e
>> In fact, we are just matching on the same rx queue. The network topology
>> is roughly as follows. The processing host receives the packets sent by
>> client1 and client2 respectively, makes some action judgments, and returns
>> them to client2 and client1 respectively.
>>
>>   client1                   client2
>>      |                         |
>>      |       __________        |
>>      +------>|  tunnel  |<-----+
>>              |__________|
>>                  |    |
>>                  |    |
>>                  v    v
>>         +-----------------+
>>         | processing host |
>>         +-----------------+
>>
>> Thanks.
> monitoring host would be a better term

Sure. I'm sorry I didn't realize I had missed this until I checked my emails. :(

>
>>> processing can't you just store the incoming hash in the flow and reuse
>>> on TX? This is what Linux is doing...
>>>
>>>>>
>>>>>> 2. In Scenario 2 with GRE, the lack of outer transport headers means that
>>>>>> flows between multiple communication pairs encapsulated by the same tunnel
>>>>>> will all be hashed to the same queue. To address this, we need to implement
>>>>>> inner hashing to improve the performance of RSS. By parsing and calculating
>>>>>> the inner hash, different flows can be hashed to different queues.
>>>>>>
>>>>>> Thanks.
>>>>> Well 2 is at least inexact, there's flowID there. It's just 8 bit
>>>> We use the most basic GRE header fields (not NVGRE), not even the optional
>>>> fields.
>>>> There is also no flow id in the GRE header; could you be referring to
>>>> NVGRE?
>>>>
>>>> Thanks.
>>>>> so not sufficient if there are more than 512 queues. Still 512 queues
>>>>> is quite a lot. Are you trying to solve for configurations with
>>>>> more than 512 queues then?
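The GRE case is easy to see from what RSS has to work with: with plain GRE the
outer header contributes only (outer source IP, outer destination IP, IP
protocol 47) and no ports, so every inner flow carried by one tunnel produces
the same hash input; hashing the inner 5-tuple restores the spread. This is
illustration only, the tuple layout and hash below are placeholders rather
than anything the device specifies:

#include <stdint.h>

struct rss_input {
    uint32_t saddr, daddr;
    uint16_t sport, dport;   /* 0 when the header carries no transport ports */
};

static uint32_t toy_hash(const struct rss_input *in)
{
    uint32_t h = in->saddr ^ (in->daddr * 0x9E3779B1u);
    return h ^ ((uint32_t)in->sport << 16) ^ in->dport;
}

/* Outer GRE tuple: identical for every flow carried by the tunnel,
 * so the hash is identical too -> one RSS bucket, one receive queue. */
static uint32_t gre_outer_hash(uint32_t tun_src, uint32_t tun_dst)
{
    struct rss_input outer = { tun_src, tun_dst, 0, 0 };
    return toy_hash(&outer);
}

/* Inner 5-tuple: differs per flow, so hashing it spreads flows across
 * the queues again - which is what the inner header hash feature does. */
static uint32_t gre_inner_hash(const struct rss_input *inner)
{
    return toy_hash(inner);
}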
>>>>>>>> Assuming that the same flow includes a unidirectional flow a->b, or a
>>>>>>>> bidirectional flow a->b and b->a, such a flow may be processed out of
>>>>>>>> order by the gateway (DPDK):
>>>>>>>>
>>>>>>>> 1. In unidirectional mode, if the same flow is switched to another gateway
>>>>>>>>    for some reason, resulting in a different outer IP address, then this
>>>>>>>>    flow may be processed by different CPUs after reaching the host if there
>>>>>>>>    is no inner hash. So after the host receives the flow, it first uses the
>>>>>>>>    forwarding CPUs to parse the inner hash, and then uses that hash to
>>>>>>>>    ensure that the flow is processed by the same CPU.
>>>>>>>> 2. In bidirectional mode, the a->b flow may go to gateway 1 and the b->a
>>>>>>>>    flow may go to gateway 2. To ensure that the same flow is processed by
>>>>>>>>    the same CPU, we still need the forwarding CPUs to parse the real inner
>>>>>>>>    hash (here, the hash key needs to be replaced with a symmetric hash key).
>>>>>>> Oh interesting. What are those gateways, how come there's expectation
>>>>>>> that you can change their addresses and topology
>>>>>>> completely seamlessly without any reordering whatsoever?
>>>>>>> Isn't network topology change kind of guaranteed to change ordering
>>>>>>> sometimes?
>>>>>>>
>>>>>>>>>> 1) During this process, the CPUs of the host are divided into two parts:
>>>>>>>>>>    one part is used as a forwarding node to parse the outer header, and
>>>>>>>>>>    its CPU utilization is low. The other part handles the packets.
>>>>>>>>> Some overhead is clearly involved in *sending* packets -
>>>>>>>>> to calculate the hash and stick it in the port number.
>>>>>>>>> This is, however, a separate problem and if you want to
>>>>>>>>> solve it then my suggestion would be to teach the *transmit*
>>>>>>>>> side about GRE offloads, so it can fill the source port in the card.
>>>>>>>>>
>>>>>>>>>> 2) The entropy of the outer UDP source port is not enough, that is, the
>>>>>>>>>>    packets are not spread widely across the queues.
>>>>>>>>> how isn't it enough? 16 bit is enough to cover all vqs ...
>>>>>>>> A 5-tuple brings more entropy than a single port, doesn't it?
>>>>>>> But you don't need more for RSS, the indirection table is not
>>>>>>> that large.
>>>>>>>
>>>>>>>> In fact, the inner hash of the physical network card used by
>>>>>>>> the business team is indeed better than the UDP port number of the outer
>>>>>>>> header that we modify now, but they did not give me the data.
>>>>>>> Admittedly, our hash value is 32 bit.
>>>>>>>
>>>>>>>>>> 2. When there is an inner header hash, the gateway directly parses the
>>>>>>>>>> outer header and uses the inner 5-tuple to calculate the inner hash.
>>>>>>>>>> The tunneled packet is then handed over to the host.
>>>>>>>>>> 1) All of the host's CPUs are used to process data packets, and there is
>>>>>>>>>>    no need to use some CPUs to forward and parse the outer header.
>>>>>>>>> You really have to parse the outer header anyway,
>>>>>>>>> otherwise there's no tunneling.
>>>>>>>>> Unless you want to teach virtio to implement tunneling
>>>>>>>>> in hardware, which is something I'd find it easier to
>>>>>>>>> get behind.
>>>>>>>> There is no need to parse the outer header twice, because we use shared
>>>>>>>> memory.
>>>>>>> shared with what? you need the outer header to identify the tunnel.
>>>>>>>
>>>>>>>>>> 2) The entropy of the original 5-tuple is sufficient, and the packets are
>>>>>>>>>>    spread widely across the queues.
>>>>>>>>> It's exactly the same entropy, why would it be better? In fact you
>>>>>>>>> are taking out the outer hash entropy making things worse.
>>>>>>>> I don't get the point. Why would the entropy of the inner 5-tuple and the
>>>>>>>> outer tunnel header be the same? Multiple streams have the same outer
>>>>>>>> header.
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>> well our hash is 32 bit. source port is just 16 bit.
>>>>>>> so yes it's more entropy but RSS can't use more than 16 bit.
>>>>>>> why do you need so many? you have more than 64k CPUs to offload to?
>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>> same goes for vxlan did not check further.
>>>>>>>>>>>
>>>>>>>>>>> so what is the problem? and which tunnel types actually suffer from the
>>>>>>>>>>> problem?
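Coming back to the entropy numbers above: the reported hash is 32 bits wide,
but queue selection only ever uses the low bits of it to index the RSS
indirection table, so entropy beyond log2(table length) bits cannot spread
packets any further. A minimal sketch of that selection step, with a made-up
table length (not the device's actual value):

#include <stdint.h>

#define RSS_TABLE_ENTRIES 128   /* example indirection table length */

static uint16_t indirection_table[RSS_TABLE_ENTRIES]; /* filled by the driver */

static uint16_t select_rx_queue(uint32_t full_hash)
{
    /* With 128 entries, only 7 bits of the 32-bit hash decide the queue;
     * the full hash can still be reported to the driver for other uses. */
    return indirection_table[full_hash % RSS_TABLE_ENTRIES];
}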
This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/