To: "Xueming(Steven) Li", "dev@dpdk.org"
Cc: "jerinjacobk@gmail.com", NBU-Contact-Thomas Monjalon, Lior Margalit,
 Slava Ovsiienko, "konstantin.ananyev@intel.com", "ferruh.yigit@intel.com"
References: <20210727034204.20649-1-xuemingl@nvidia.com>
 <20210930145602.763969-1-xuemingl@nvidia.com>
 <8494d5f3-f134-e9d7-d782-dca9a9efaa03@oktetlabs.ru>
 <5584827eb502362297c71432991be7852438e94b.camel@nvidia.com>
 <2cfa1d1cf5b7fc01d361b04ec4fdf1500c89c50b.camel@nvidia.com>
From: Andrew Rybchenko
Organization: OKTET Labs
Message-ID: <6a61a2db-6a55-e857-52df-5fe97b5bd60e@oktetlabs.ru>
Date: Tue, 12 Oct 2021 11:48:33 +0300
In-Reply-To: <2cfa1d1cf5b7fc01d361b04ec4fdf1500c89c50b.camel@nvidia.com>
Subject: Re: [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue

On 10/12/21 9:37 AM, Xueming(Steven) Li wrote:
> On Mon, 2021-10-11 at 23:11 +0800, Xueming Li wrote:
>> On Mon, 2021-10-11 at 14:49 +0300, Andrew Rybchenko wrote:
>>> Hi Xueming,
>>>
>>> On 9/30/21 5:55 PM, Xueming Li wrote:
>>>> In the current DPDK framework, all Rx queues are pre-loaded with
>>>> mbufs for incoming packets. When the number of representors in a
>>>> switch domain scales out, the memory consumption becomes
>>>> significant. Furthermore, polling all ports leads to high cache
>>>> miss rates, high latency and low throughput.
>>>>
>>>> This patch introduces shared Rx queues. PF and representors with
>>>> the same configuration in the same switch domain can share an Rx
>>>> queue set by specifying the shared Rx queue offload flag and a
>>>> sharing group.
>>>>
>>>> All ports in a share group actually share one Rx queue and only
>>>> pre-load mbufs into that single Rx queue, so memory is saved.
>>>>
>>>> Polling any queue that uses the same shared Rx queue receives
>>>> packets from all member ports. The source port is identified by
>>>> mbuf->port.
>>>>
>>>> Multiple groups are supported via a group ID. The per-port queue
>>>> number in a share group should be identical, and queue indexes are
>>>> 1:1 mapped within the group. An example of polling two share
>>>> groups:
>>>>
>>>>   core  group  queue
>>>>      0      0      0
>>>>      1      0      1
>>>>      2      0      2
>>>>      3      0      3
>>>>      4      1      0
>>>>      5      1      1
>>>>      6      1      2
>>>>      7      1      3
>>>>
>>>> A shared Rx queue must be polled on a single thread or core. If
>>>> both PF0 and representor0 joined the same share group, pf0rxq0
>>>> cannot be polled on core1 while rep0rxq0 is polled on core2.
>>>> Actually, polling one port within a share group is sufficient,
>>>> since polling any port in the group returns packets for every port
>>>> in the group.
>>>
>>> I apologize that I jump into the review process that late.
>>
>> Appreciate the bold suggestion, never too late :)
>>
>>>
>>> Frankly speaking I doubt that it is the best design to solve
>>> the problem. Yes, I confirm that the problem exists, but I
>>> think there is a better and simpler way to solve it.
>>>
>>> The problem of the suggested solution is that it puts all
>>> the headache about consistency on the application and PMDs
>>> without any help from the ethdev layer to guarantee that
>>> consistency. As a result I believe there will be either
>>> missing/lost consistency checks or huge duplication in
>>> each PMD which supports the feature. Shared RxQs must be
>>> equally configured, including number of queues, offloads
>>> (taking device-level Rx offloads into account), RSS
>>> settings etc. So, applications must care about it and
>>> PMDs (or the ethdev layer) must check it.
>>
>> The name might be confusing, here is my understanding:
>> 1. The NIC shares the buffer supply HW queue between shared RxQs -
>>    for memory.
>> 2. The PMD polls one shared RxQ - for latency and performance.
>> 3. Most per-queue features like offloads and RSS are not impacted.
>>    That's why this is not mentioned. Some offloads might not be
>>    supported due to PMD or HW limitations; checks need to be added
>>    in PMDs case by case.
>> 4. Multiple groups are defined for service-level flexibility. For
>>    example, the PF's and VIP customers' load is distributed via
>>    dedicated queues and cores, while low-priority customers share
>>    one core with one shared queue. Multiple groups enable more
>>    combinations.
>> 5. One port could assign queues to different groups for polling
>>    flexibility. For example, with the first 4 queues in group 0 and
>>    the next 4 queues in group 1, and each group having other member
>>    ports with 4 queues, the port with 8 queues could be polled with
>>    8 cores without the non-shared rxq penalty; in other words, each
>>    core only polls one shared RxQ.
>>
>>>
>>> The advantage of the solution is that any device may
>>> create a group and subsequent devices join. Absence of a
>>> primary device is nice. But do we really need it?
>>> Will the design work if some representors are configured
>>> to use a shared RxQ, but some do not? Theoretically it
>>> is possible, but it could require extra non-trivial code
>>> on the fast path.
>>
>> If there are multiple groups, any device could be hot-unplugged.
>>
>> Mixed configuration is supported; the only difference is how to set
>> mbuf->port.
>> Since the group is per queue, mixed configuration is better to be
>> supported; I didn't see any difficulty here.
>>
>> A PMD could choose to support only group 0 with the same settings
>> for each rxq; that fits most scenarios.
>>
>>>
>>> Also looking at the first two patches I don't understand
>>> how an application will find out which devices may share
>>> RxQs. E.g. if we have two different NICs which support
>>> sharing, we can try to set up only one group 0, but
>>> finally will have two devices (not one) which must be
>>> polled.
>>>
>>> 1. We need an extra flag in dev_info->dev_capa,
>>> RTE_ETH_DEV_CAPA_RX_SHARE, to advertise that
>>> the device supports Rx sharing.
>>
>> dev_info->rx_queue_offload_capa could be used here, no?

It depends. But we definitely need a flag which says that the
rx_domain below makes sense. It could be either
RTE_ETH_DEV_CAPA_RX_SHARE or an Rx offload capability. The question is
whether it is really an offload. An offload is when something could be
done by HW/FW and the result is provided to SW. Maybe it is just
nit-picking...

Maybe we don't need an offload at all. Just have
RTE_ETH_DEV_CAPA_RXQ_SHARE and use a non-zero group ID as a flag that
an RxQ should be shared (zero - default, no sharing). The ethdev layer
may check consistency on its layer to ensure that the device
capability is reported if a non-zero group is specified on queue
setup.

>>
>>>
>>> 2. I think we need "rx_domain" in device info
>>> (which should be treated in boundaries of the
>>> switch_domain) if and only if
>>> RTE_ETH_DEV_CAPA_RX_SHARE is advertised.
>>> Otherwise the rx_domain value does not make sense.
>>
>> I see, this will give flexibility for different HW, will add it.
>>
>>>
>>> (1) and (2) will allow the application to find out which
>>> devices can share Rx.
>>>
>>> 3. The primary device (the representors' backing device) should
>>> advertise the shared RxQ offload. Enabling the offload
>>> tells the device to provide packets to all devices in
>>> the Rx domain with mbuf->port filled in appropriately.
>>> It also allows the app to identify the primary device in
>>> the Rx domain. When the application enables the offload,
>>> it must ensure that it does not treat the used port_id as
>>> an input port_id, but always checks mbuf->port for each
>>> packet.
>>>
>>> 4. A new Rx mode should be introduced for secondary
>>> devices. It should not allow configuring RSS, specifying
>>> any Rx offloads etc. ethdev must ensure it.
>>> It is an open question right now whether it should require
>>> the primary port_id to be provided. In theory representors
>>> have it. However, maybe it is nice for consistency
>>> to ensure that the application knows that it does.
>>> If shared Rx mode is specified for a device, the
>>> application does not need to set up RxQs, and attempts to
>>> do so should be discarded in ethdev.
>>> For consistency it is better to ensure that the number of
>>> queues matches.
>>
>> RSS and Rx offloads should be supported individually; the PMD needs
>> to check what is not supported.

Thinking a bit more about it, I agree that RSS settings could be
individual.

Offloads could be individual as well, but I'm not sure about all
offloads. E.g. Rx scatter, which is related to the Rx buffer size
(which is shared since the Rx mempool is shared) vs MTU. Maybe it is
acceptable. We just must define the rules for what should happen if
offloads contradict each other. It should be highlighted in the
description, including the driver callback, to ensure that PMD
maintainers are responsible for consistency checks.
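To illustrate the ethdev-layer part of the consistency checking I have
in mind, a rough sketch only: RTE_ETH_DEV_CAPA_RXQ_SHARE and a
share_group field in struct rte_eth_rxconf are the names proposed in
this thread, not existing API.

#include <errno.h>
#include <rte_ethdev.h>

/*
 * Sketch of a generic check in rte_eth_rx_queue_setup(): a non-zero
 * share group may only be requested on a device that reports the
 * proposed RTE_ETH_DEV_CAPA_RXQ_SHARE capability. Both the capability
 * bit and rx_conf->share_group are proposal names from this thread.
 */
static int
ethdev_check_rxq_share(uint16_t port_id, const struct rte_eth_rxconf *rx_conf)
{
	struct rte_eth_dev_info dev_info;
	int ret;

	/* Zero group means no sharing, nothing to check. */
	if (rx_conf == NULL || rx_conf->share_group == 0)
		return 0;

	ret = rte_eth_dev_info_get(port_id, &dev_info);
	if (ret != 0)
		return ret;

	/* Sharing requested on a device which does not support it. */
	if ((dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0)
		return -ENOTSUP;

	return 0;
}

With such a check done once in ethdev, PMDs would only have to
validate the offload/RSS combinations they cannot support.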
>>
>>> It is an interesting question what should happen if
>>> the primary device is reconfigured and shared Rx is
>>> disabled on reconfiguration.
>>
>> I feel better with no primary port/queue assumption in the
>> configuration: all members are treated equally, and each queue can
>> join or quit a share group. That's important to support multiple
>> groups.

I agree. The problem of many flexible solutions is the complexity of
supporting them. We'll see how it goes.

>>
>>>
>>> 5. If so, in theory the implementation of the Rx burst
>>> in the secondary could simply call Rx burst on the
>>> primary device.
>>>
>>> Andrew.
>
> Hi Andrew,
>
> I realized that we are talking about different things; this feature
> introduces 2 kinds of RxQ sharing:
> 1. Share the mempool to save memory
> 2. Share the polling to save latency
>
> What you suggested is to reuse all RxQ configuration, IIUC. Maybe we
> should break the flag into 3, so the application could learn the PMD
> capability and configure accordingly, what do you think?
>
> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POOL

Not sure that I understand. Just specify the same mempool on Rx queue
setup. Isn't it sufficient?

> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POLL

It implies pool sharing if I'm not mistaken. Of course, we can poll
many different HW queues in one poll, but it hardly makes sense to
care specially about it. IMHO RxQ sharing is sharing of the underlying
HW Rx queue.

> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_CFG //implies POOL and POLL

It is hardly a feature. Rather a possible limitation.

Andrew.
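P.S. To make the "share polling" side concrete, below is a minimal
sketch of the polling model from the cover letter: one core polls a
single queue of any member port of the share group and takes the
source port from mbuf->port rather than from the polled port_id.
handle_pkt() is a hypothetical application callback; the shared RxQ
setup itself is out of scope here.

#include <rte_common.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Hypothetical application callback; real processing would go here. */
static void
handle_pkt(uint16_t src_port, struct rte_mbuf *m)
{
	RTE_SET_USED(src_port);
	rte_pktmbuf_free(m);
}

/*
 * Poll one shared RxQ on a single core. Packets from all member ports
 * of the share group arrive here; the source port is taken from
 * mbuf->port, not from the port_id that was polled.
 */
static void
poll_shared_rxq(uint16_t any_member_port_id, uint16_t queue_id)
{
	struct rte_mbuf *pkts[BURST_SIZE];
	uint16_t nb_rx, i;

	nb_rx = rte_eth_rx_burst(any_member_port_id, queue_id,
				 pkts, BURST_SIZE);
	for (i = 0; i < nb_rx; i++)
		handle_pkt(pkts[i]->port, pkts[i]);
}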