From: Ron Yuan <ron.yuan@memblaze.com>
To: Sinan Kaya <okaya@codeaurora.org>, Bjorn Helgaas <helgaas@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>,
	Bo Chen <bo.chen@memblaze.com>,
	William Huang <william.huang@memblaze.com>,
	Fengming Wu <fengming.wu@memblaze.com>,
	Jason Jiang <jason.jiang@microsemi.com>,
	Radjendirane Codandaramane <radjendirane.codanda@microsemi.com>,
	"Ramyakanth Edupuganti" <Ramyakanth.Edupuganti@microsemi.com>,
	William Cheng <william.cheng@microsemi.com>,
	"Kim Helper (khelper)" <khelper@micron.com>,
	Linux PCI <linux-pci@vger.kernel.org>
Subject: RE: One Question About PCIe BUS Config Type with pcie_bus_safe or pcie_bus_perf On NVMe Device
Date: Tue, 23 Jan 2018 13:25:56 +0000
Message-ID: <SHAPR01MB173EFAB42A8838DC3D52FD1FEE30@SHAPR01MB173.CHNPR01.prod.partner.outlook.cn>
In-Reply-To: <09bd3f5c-4671-d9dd-fa39-4d7619ee5860@codeaurora.org>


Bjorn and Sinan,
Thanks for the discussion.

> PERFORMANCE mode reduces MRRS not because of a starvation issue, but 
> because reducing EP1's MRRS allows EP0 to use a larger MPS.
It looks like this case is about EP1 requesting data directly from EP0, with MRRS used to limit the size of the returned payload while the traffic from EP0 to the RC still uses 256B. That is reasonable.
However, how common is this in the real world? The kernel driver usually controls the whole system and can set MPS consistently across all EPs. Only in the rare case where different EPs have no other way to coordinate does lowering EP1's MRRS buy it extra safety. That behavior does not match the "perf" name, which is why I proposed earlier to move the MRRS clamping into "safe" mode to make the intent clear.
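
To make the proposal concrete, here is a minimal sketch of such a clamp
using the existing kernel helpers pcie_get_mps()/pcie_get_readrq()/
pcie_set_readrq(); the function itself is hypothetical, not current
kernel code:

#include <linux/pci.h>

/* Hypothetical "safe" mode clamp: never let a device ask for read
 * completions larger than its own negotiated MPS. */
static void clamp_mrrs_to_mps(struct pci_dev *dev)
{
	int mps = pcie_get_mps(dev);		/* negotiated MPS, in bytes */
	int mrrs = pcie_get_readrq(dev);	/* current MRRS, in bytes */

	if (mrrs > mps)
		pcie_set_readrq(dev, mps);
}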

In my situation, a data-center customer is using many NVMe SSDs. They care most about getting the best performance from each SSD, and there is no uncontrolled EP-to-EP access.

I will provide a full lspci log tomorrow. For now I am attaching a screenshot taken in "safe" mode for several devices. You can see that MPS is set to 256B everywhere while the default MRRS values differ. In "perf" mode, every MRRS would be clamped to 256B, which costs some read performance: for example, a 128KB transfer needs 512 read requests at MRRS=256B but only 32 at MRRS=4KB.
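
For reference, the MaxPayload/MaxReadReq values that lspci prints are
both encoded as 128 << n in the Device Control register. A quick
illustrative decoder (report_mps_mrrs() is a made-up name; the macros
are the real ones from pci_regs.h):

#include <linux/pci.h>

static void report_mps_mrrs(struct pci_dev *dev)
{
	u16 ctl;

	pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &ctl);
	pci_info(dev, "MPS=%dB MRRS=%dB\n",
		 128 << ((ctl & PCI_EXP_DEVCTL_PAYLOAD) >> 5),
		 128 << ((ctl & PCI_EXP_DEVCTL_READRQ) >> 12));
}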

Thanks
Ron

-----Original Message-----
From: Sinan Kaya [mailto:okaya@codeaurora.org] 
Sent: January 23, 2018 10:27
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: Ron Yuan <ron.yuan@memblaze.com>; Bjorn Helgaas <bhelgaas@google.com>; Bo Chen <bo.chen@memblaze.com>; William Huang <william.huang@memblaze.com>; Fengming Wu <fengming.wu@memblaze.com>; Jason Jiang <jason.jiang@microsemi.com>; Radjendirane Codandaramane <radjendirane.codanda@microsemi.com>; Ramyakanth Edupuganti <Ramyakanth.Edupuganti@microsemi.com>; William Cheng <william.cheng@microsemi.com>; Kim Helper (khelper) <khelper@micron.com>; Linux PCI <linux-pci@vger.kernel.org>
Subject: Re: One Question About PCIe BUS Config Type with pcie_bus_safe or pcie_bus_perf On NVMe Device

On 1/22/2018 7:16 PM, Bjorn Helgaas wrote:
> On Mon, Jan 22, 2018 at 06:24:22PM -0500, Sinan Kaya wrote:
>> On 1/22/2018 5:51 PM, Bjorn Helgaas wrote:
>>> On Mon, Jan 22, 2018 at 05:04:03PM -0500, Sinan Kaya wrote:
>>>> On 1/22/2018 4:36 PM, Bjorn Helgaas wrote:
>>
>>>>> Reducing MPS may be necessary if there are several devices in the 
>>>>> hierarchy and one requires a smaller MPS than the others.  That 
>>>>> obviously reduces the maximum read and write performance.
>>>>>
>>>>> Reducing the MRRS may be useful to prevent one device from hogging 
>>>>> a link, but of course, it reduces read performance for that device 
>>>>> because we need more read requests.
>>>>
>>>> Maybe a picture could help.
>>>>
>>>>                root (MPS=256)
>>>>                  |
>>>>          ------------------
>>>>         /                  \
>>>>    bridge0 (MPS=256)      bridge1 (MPS=128)
>>>>       /                       \
>>>>     EP0 (MPS=256)            EP1 (MPS=128)
>>>>
>>>> If I understood this right, the code allows the configuration above
>>>> in performance mode, so that MPS doesn't have to be uniform across
>>>> the tree.
>>>
>>> Yes.  In PERFORMANCE mode, we will set EP1's MRRS=128 and EP0's 
>>> MRRS=256, just as you show.
>>>
>>>> It just needs to be consistent between the root port and endpoints.
>>>
>>> No, it doesn't need to be consistent.  In PERFORMANCE mode, we'll 
>>> set the root's MPS=256 and EP1's MPS=128.
>>>
>>> (I'm not actually 100% convinced that the PERFORMANCE mode approach 
>>> of reducing MRRS is safe, necessary, and maintainable.  I suspect 
>>> that in many of the interesting cases, the device we care about is 
>>> the only one below a Root Port, and we can get the performance we 
>>> need by maximizing MPS and MRRS for that Root Port and its children, 
>>> independent of the rest of the system.)
>>
>> Maybe. I've been seeing more and more NVMe devices behind a switch
>> every day. That's why I'm concerned.
> 
> Are there a mix of devices that support large MPS and those that only 
> support a smaller MPS behind the same switch?

I guess we should understand Ron Yuan's environment first. Can we see the lspci -vvv output of your system?

We could maybe detect when slow and fast devices coexist and offer a suggestion, such as moving the card to another slot, as a best-effort solution.
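
Something along these lines, as a rough illustration (mpss_cb() and
mixed_mpss_below() are made-up names; dev->pcie_mpss is the supported-MPS
encoding, 128 << n bytes, that the kernel already caches per device):

#include <linux/pci.h>

static int mpss_cb(struct pci_dev *dev, void *data)
{
	u8 *minmax = data;	/* minmax[0] = min, minmax[1] = max */

	if (dev->pcie_mpss < minmax[0])
		minmax[0] = dev->pcie_mpss;
	if (dev->pcie_mpss > minmax[1])
		minmax[1] = dev->pcie_mpss;
	return 0;		/* keep walking the subtree */
}

static bool mixed_mpss_below(struct pci_dev *root_port)
{
	u8 minmax[2] = { 7, 0 };	/* sentinel min/max encodings */

	pci_walk_bus(root_port->subordinate, mpss_cb, minmax);
	return minmax[0] != minmax[1];
}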

> 
> I don't have any actual data about topologies of interest.  I'm just 
> guessing that high-performance devices will often have their own root 
> port, without being mixed in with low-speed devices.  It's OK to have 
> switches in the path; it's the mixing high-speed with low-speed 
> devices that causes the problems.

Agreed. Mine was just a general rant. I was curious if we could make it better.

> 
>>>> Why are we reducing MRRS in this case?
>>>
>>> We have to set EP1's MRRS=128 so it will never receive a completion 
>>> larger than 128 bytes.  If we set EP1's MRRS=256, it could receive 
>>> 256-byte TLPs, which it would treat as malformed.  (We also assume 
>>> no peer-to-peer DMA that targets EP1.)
>>
>> What if we were to keep the root port MPS at 128 and not touch the
>> BIOS-configured MRRS (4k)?
>>
>> Everybody should be happy, right?
> 
> No.  In your picture (which helps a lot, thank you!), if you set the 
> root's MPS=128, EP1 will be fine no matter what its MRRS is because 
> the entire path has MPS=128.
> 
> But if EP0's MRRS is larger than 128, it's in trouble because it may 
> issue a 256-byte DMA write, which the root port will treat as 
> malformed.  The root port, as the receiver of that Memory Write 
> Request, is required to check the TLP against the MPS in its Device 
> Control (not Device Capability) register (this is in PCIe r4.0, sec 
> 2.2.2).

Yeah, this breaks my theory.

> 
>> I know there is a rule to check the completions against MPS. 
> 
> Sec 2.2.2 says a receiver must check "data payloads".  I think that 
> includes both Completions and Memory Writes.
> 
>> The root port could generate read transactions that are a multiple
>> of 128 bytes.
> 
> If the root port generates a 256-byte Memory Read request to EP1, 
> that's fine because EP1 will only respond with 128-byte completions.
> If it sends that 256-byte Memory Read to EP0, we have a problem 
> because EP0 may generate a 256-byte completion, which will cause an 
> error if the root port has MPS=128.
> 
>> Is there any rule against checking incoming writes?
> 
> Sec 2.2.2 says a receiver *must* check the payload size of incoming 
> Memory Write requests.
> 


--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

[-- Attachment #2: D3559A46-A52F-407a-909A-3BCD16FFDC7B.PNG --]
[-- Type: image/png, Size: 63193 bytes --]
