From: John Meneghini <jmeneghi@redhat.com>
To: linux-nvme@lists.infradead.org, Sagi Grimberg <sagi@grimberg.me>
Subject: Re: [PATCH v2 2/3] nvme-tcp: support specifying the congestion-control
Date: Tue, 5 Apr 2022 12:50:24 -0400
Message-ID: <9b45bd0a-872c-7fe2-09b1-1bb54aeef2f2@redhat.com>
In-Reply-To: <9be1e68c-00aa-3547-9cb5-b3ca302e209b@redhat.com>

If you want things to slow down with NVMe, use the protocol's built-in flow control mechanism:
SQ flow control. This keeps commands out of the transport queue and avoids unwanted or
unexpected command timeouts.
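
As an illustration, here is a minimal host-side sketch (hypothetical types, not the
kernel's implementation) of how SQ flow control gates submissions: the controller
reports its current SQ Head Pointer (SQHD) in every completion, and the host stops
submitting when the tail would catch up with that head:

    #include <stdbool.h>
    #include <stdint.h>

    struct sq_state {
        uint16_t head;   /* last SQHD the controller reported */
        uint16_t tail;   /* next free submission queue slot */
        uint16_t qsize;  /* number of SQ entries */
    };

    /* Full when only one free slot remains (head == tail means empty). */
    static bool sq_full(const struct sq_state *sq)
    {
        return (uint16_t)((sq->tail + 1) % sq->qsize) == sq->head;
    }

    /* Hold the command back instead of queueing it in the transport,
     * where it would sit exposed to the per-command timeout. */
    static bool sq_try_submit(struct sq_state *sq)
    {
        if (sq_full(sq))
            return false;
        sq->tail = (sq->tail + 1) % sq->qsize;
        /* ... copy in the SQE and ring the doorbell here ... */
        return true;
    }

    /* On each completion, CQE Dword 2 bits 15:0 carry the new SQHD. */
    static void sq_complete(struct sq_state *sq, uint16_t sqhd)
    {
        sq->head = sqhd;
    }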

But this is another topic for discussion.

/John

On 4/5/22 12:48, John Meneghini wrote:
> 
> On 3/29/22 03:46, Sagi Grimberg wrote:
>>> In addition, distributed storage products like the following also have
>>> the above problem:
>>>
>>>      - The product consists of a cluster of servers.
>>>
>>>      - Each server serves clients via its front-end NIC
>>>       (WAN, high latency).
>>>
>>>      - All servers interact with each other over NVMe/TCP via the back-end NIC
>>>       (LAN, low latency, ECN-enabled, ideal for dctcp).
>>
>> Separate networks are still not application (nvme-tcp) specific and as
>> mentioned, we have a way to control that. IMO, this still does not
>> qualify as solid justification to add this to nvme-tcp.
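>>
>> (For reference, a sketch of the existing generic mechanism, which is
>> not nvme-specific: the congestion control algorithm can already be
>> selected per socket from userspace, or per route via
>> "ip route ... congctl dctcp":)
>>
>>     #include <netinet/in.h>
>>     #include <netinet/tcp.h>
>>     #include <string.h>
>>     #include <sys/socket.h>
>>
>>     /* Select dctcp for this socket only, leaving the system-wide
>>      * default (net.ipv4.tcp_congestion_control) untouched. */
>>     static int set_dctcp(int fd)
>>     {
>>         const char alg[] = "dctcp";
>>         return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
>>                           alg, strlen(alg));
>>     }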
>>
>> What do others think?
> 
> OK. I'll bite.
> 
> In my experience, adding any type of QoS control to a Storage Area Network causes problems
> because it increases the likelihood of ULP timeouts (command timeouts).
> 
> NAS protocols like NFS and CIFS have built-in assumptions about latency: they have long
> timeouts at the session layer and they trade latency for reliable delivery. SAN protocols
> like iSCSI and NVMe/TCP make no such trade-off. All block protocols have much shorter
> per-command timeouts while still expecting reliable delivery, so doing anything to the TCP
> connection that could increase latency risks causing command timeouts as a side effect. In
> NVMe we also have the Keep Alive timeout, which could be affected by TCP latency. It's for
> this reason that most SANs are deployed on LANs, not WANs. It's also why most cluster
> monitor mechanisms (components that maintain cluster-wide membership through heartbeats)
> use UDP, not TCP.
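> 
> To make the budget concrete, here is a minimal sketch (illustrative
> numbers only, matching the Linux nvme driver's io_timeout and Keep
> Alive defaults) of why any latency the transport adds comes straight
> out of the command-timeout and keep-alive windows:
> 
>     #include <stdbool.h>
>     #include <stdint.h>
> 
>     #define IO_TIMEOUT_MS 30000  /* nvme io_timeout default: 30s */
>     #define KATO_MS        5000  /* default Keep Alive interval: 5s */
> 
>     /* A command times out when its total round trip, including any
>      * delay a QoS policy adds inside the transport, exceeds the ULP
>      * (per-command) timeout. */
>     static bool command_within_budget(uint32_t device_ms, uint32_t queue_ms)
>     {
>         return (uint64_t)device_ms + queue_ms < IO_TIMEOUT_MS;
>     }
> 
>     /* A Keep Alive must reach the controller before KATO expires, so
>      * transport queueing delay is subtracted from that window too. */
>     static bool keep_alive_ok(uint32_t since_last_ms, uint32_t delay_ms)
>     {
>         return (uint64_t)since_last_ms + delay_ms < KATO_MS;
>     }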
> 
> With NVMe/TCP we want the connection layer to go as fast as possible, and I agree with Sagi
> that adding any kind of QoS mechanism to the transport is not desirable.
> 
> /John
> 



