linux-nvme.lists.infradead.org archive mirror
From: Sagi Grimberg <sagi@grimberg.me>
To: Aurelien Aptel <aaptel@nvidia.com>,
	linux-nvme@lists.infradead.org, netdev@vger.kernel.org,
	hch@lst.de, kbusch@kernel.org, axboe@fb.com,
	chaitanyak@nvidia.com, davem@davemloft.net, kuba@kernel.org
Cc: Boris Pismenny <borisp@nvidia.com>,
	aurelien.aptel@gmail.com, smalin@nvidia.com, malin1024@gmail.com,
	ogerlitz@nvidia.com, yorayz@nvidia.com, galshalom@nvidia.com,
	mgurtovoy@nvidia.com, edumazet@google.com, pabeni@redhat.com,
	dsahern@kernel.org, ast@kernel.org, jacob.e.keller@intel.com
Subject: Re: [PATCH v24 01/20] net: Introduce direct data placement tcp offload
Date: Fri, 3 May 2024 10:31:50 +0300	[thread overview]
Message-ID: <29655a73-5d4c-4773-a425-e16628b8ba7a@grimberg.me> (raw)
In-Reply-To: <253frv0r8yc.fsf@nvidia.com>



On 5/2/24 10:04, Aurelien Aptel wrote:
> Sagi Grimberg <sagi@grimberg.me> writes:
>> Well, you cannot rely on the fact that the application will be pinned to a
>> specific cpu core. That may be the case by accident, but you must not and
>> cannot assume it.
> Just to be clear, any CPU can read from the socket and benefit from the
> offload but there will be an extra cost if the queue CPU is different
> from the offload CPU. We use cfg->io_cpu as a hint.

Understood. It is usually the case, as io threads are not aligned to
the RSS steering rules (unless aRFS is used).

>
>> Even today, nvme-tcp has an option to run from an unbound wq context,
>> where queue->io_cpu is set to WORK_CPU_UNBOUND. What are you going to
>> do there?
> When the CPU is not bound to a specific core, we will most likely always
> have CPU misalignment and the extra cost that goes with it.

Yes, as done today.
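
For reference, this is roughly what the unbound case looks like today
(a simplified sketch, not the exact upstream code; locking and error
handling trimmed):

	/* sketch: io_cpu selection in nvme-tcp queue setup */
	if (wq_unbound)
		queue->io_cpu = WORK_CPU_UNBOUND;
	else
		nvme_tcp_set_queue_io_cpu(queue);	/* pick a bound cpu */

	/* rx/tx work is later queued with that hint */
	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);

So whatever affinity hint the offload takes, it has to tolerate io_cpu
being WORK_CPU_UNBOUND.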

>
> But when it is bound, which is still the default common case, we will
> benefit from the alignment. To not lose that benefit for the default
> most common case, we would like to keep cfg->io_cpu.

Well, this explanation is much more reasonable. An .affinity_hint
argument seems like a proper addition to the interface, and nvme-tcp
can set it to queue->io_cpu.
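
To be concrete, I mean something along these lines (just a sketch; the
struct and field names below are placeholders, not the series' exact
API):

	/* hypothetical: turn the hard cfg->io_cpu into a hint;
	 * names are placeholders for illustration only
	 */
	struct ulp_ddp_config {
		...
		int	affinity_hint;	/* preferred cpu for offload processing, -1 if none */
	};

	/* nvme-tcp side, when setting up the offload for a queue */
	cfg->affinity_hint = queue->io_cpu;

That keeps the benefit for the bound/default case without promising
anything when the queue runs unbound.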

>
> Could you clarify what the advantages are of running unbound queues,
> or of handling RX on a different cpu than the current io_cpu?

See the discussion related to the patch from Li Feng:
https://lore.kernel.org/lkml/20230413062339.2454616-1-fengli@smartx.com/

>
>> nvme-tcp may handle the rx side directly from .data_ready() in the
>> future; what will the offload do in that case?
> It is not clear to us what benefit handling rx in .data_ready()
> would bring. From our experiments, ->sk_data_ready() is called either
> from queue->io_cpu or sk->sk_incoming_cpu. Unless you enable aRFS,
> sk_incoming_cpu will be constant for the whole connection. Can you
> clarify what handling RX from data_ready() would provide?

It saves the context switch from softirq to a kthread, which can
reduce latency substantially for some workloads.
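
To illustrate the direction (pseudo-sketch only; the helper below does
not exist): instead of unconditionally kicking io_work, ->sk_data_ready()
could try to consume the socket inline and only fall back to the
workqueue when it can't:

	static void nvme_tcp_data_ready(struct sock *sk)
	{
		struct nvme_tcp_queue *queue = sk->sk_user_data;

		/* callback-lock handling and flag checks elided */
		if (queue && nvme_tcp_try_recv_inline(queue))	/* hypothetical helper */
			return;

		/* fall back to the io_work kthread, as done today */
		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
	}

Which is why the offload should not assume rx processing stays on one
particular kernel thread or cpu.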



Thread overview: 44+ messages
2024-04-04 12:36 [PATCH v24 00/20] nvme-tcp receive offloads Aurelien Aptel
2024-04-04 12:36 ` [PATCH v24 01/20] net: Introduce direct data placement tcp offload Aurelien Aptel
2024-04-21 11:47   ` Sagi Grimberg
2024-04-26  7:21     ` Aurelien Aptel
2024-04-28  8:15       ` Sagi Grimberg
2024-04-29 11:35         ` Aurelien Aptel
2024-04-30 11:54           ` Sagi Grimberg
2024-05-02  7:04             ` Aurelien Aptel
2024-05-03  7:31               ` Sagi Grimberg [this message]
2024-05-06 12:28                 ` Aurelien Aptel
2024-04-04 12:36 ` [PATCH v24 02/20] netlink: add new family to manage ULP_DDP enablement and stats Aurelien Aptel
2024-04-04 12:37 ` [PATCH v24 03/20] iov_iter: skip copy if src == dst for direct data placement Aurelien Aptel
2024-04-15 14:28   ` Max Gurtovoy
2024-04-16 20:30   ` David Laight
2024-04-18  8:22     ` Aurelien Aptel
2024-04-04 12:37 ` [PATCH v24 04/20] net/tls,core: export get_netdev_for_sock Aurelien Aptel
2024-04-21 11:45   ` Sagi Grimberg
2024-04-04 12:37 ` [PATCH v24 05/20] nvme-tcp: Add DDP offload control path Aurelien Aptel
2024-04-07 22:08   ` Sagi Grimberg
2024-04-10  6:31     ` Aurelien Aptel
2024-04-04 12:37 ` [PATCH v24 06/20] nvme-tcp: Add DDP data-path Aurelien Aptel
2024-04-07 22:08   ` Sagi Grimberg
2024-04-10  6:31     ` Aurelien Aptel
2024-04-04 12:37 ` [PATCH v24 07/20] nvme-tcp: RX DDGST offload Aurelien Aptel
2024-04-04 12:37 ` [PATCH v24 08/20] nvme-tcp: Deal with netdevice DOWN events Aurelien Aptel
2024-04-04 12:37 ` [PATCH v24 09/20] Documentation: add ULP DDP offload documentation Aurelien Aptel
2024-04-09  8:49   ` Bagas Sanjaya
2024-04-04 12:37 ` [PATCH v24 10/20] net/mlx5e: Rename from tls to transport static params Aurelien Aptel
2024-04-04 12:37 ` [PATCH v24 11/20] net/mlx5e: Refactor ico sq polling to get budget Aurelien Aptel
2024-04-04 12:37 ` [PATCH v24 12/20] net/mlx5: Add NVMEoTCP caps, HW bits, 128B CQE and enumerations Aurelien Aptel
2024-04-04 12:37 ` [PATCH v24 13/20] net/mlx5e: NVMEoTCP, offload initialization Aurelien Aptel
2024-04-04 12:37 ` [PATCH v24 14/20] net/mlx5e: TCP flow steering for nvme-tcp acceleration Aurelien Aptel
2024-04-04 12:37 ` [PATCH v24 15/20] net/mlx5e: NVMEoTCP, use KLM UMRs for buffer registration Aurelien Aptel
2024-04-04 12:37 ` [PATCH v24 16/20] net/mlx5e: NVMEoTCP, queue init/teardown Aurelien Aptel
2024-04-04 12:37 ` [PATCH v24 17/20] net/mlx5e: NVMEoTCP, ddp setup and resync Aurelien Aptel
2024-04-04 12:37 ` [PATCH v24 18/20] net/mlx5e: NVMEoTCP, async ddp invalidation Aurelien Aptel
2024-04-04 12:37 ` [PATCH v24 19/20] net/mlx5e: NVMEoTCP, data-path for DDP+DDGST offload Aurelien Aptel
2024-04-04 12:37 ` [PATCH v24 20/20] net/mlx5e: NVMEoTCP, statistics Aurelien Aptel
2024-04-06  5:45 ` [PATCH v24 00/20] nvme-tcp receive offloads Jakub Kicinski
2024-04-07 22:21   ` Sagi Grimberg
2024-04-09 22:35     ` Chaitanya Kulkarni
2024-04-09 22:59       ` Jakub Kicinski
2024-04-18  8:29         ` Chaitanya Kulkarni
2024-04-18 15:28           ` Jakub Kicinski
