On Thu, May 31, 2018 at 7:24 PM Harris, James R <james.r.harris@intel.com> wrote:

From: SPDK <spdk-bounces@lists.01.org> on behalf of Baruch Even <baruch@weka.io>

Reply-To: Storage Performance Development Kit <spdk@lists.01.org>

Date: Thursday, May 31, 2018 at 12:37 AM

To: Storage Performance Development Kit <spdk@lists.01.org>
Subject: Re: [SPDK] Handling of physical disk removals

[…]

I understand your point but my the main advantage of doing the extra effort is the ability to better integrate the system and avoid doing redundant work along the way. The proposed separation may also at times cause opposing things to be done that make no sense. For example, we found tha a large sequential IO works far better than several smaller ones and we also found that if we combine some unrelated IOs in our queue that will be nearly sequential with a scratch buffer in reads we improve overall performance. If we do that and down the stack you will break the IO at that exact junction than we just added wasteful reads. As such we need to have this combination logic know about the specific drive and specific IO that it needs to handle and the various constraints and it makes a lot of sense to do it in our level rather than at the SPDK level. For this reason I would to have a layer in SPDK that does the least amount of work and just handle the NVMe protocol, exposing the device/hw constraints and let the application above do the needed smarts and adapt as the landscape will change.

The main thing to remember is that I am writing a storage system, not a random application. As such I *need* to know the device characteristics in order to make the best use of it. I *cannot* just let spdk hide these details from me. I fully understand that there are other users that write an application and don't want to bother with the device details at this level and are fully content to get the first and major performance improvement by using spdk over the kernel driver but for my use case that is not enough.

Hi Baruch,

I’d like to understand this a bit more. It sounds like you’d to see a mode that completely ignores things like MDTS and PRP violations, or device quirks related to performance (i.e. 128KB stripe boundaries on a number of Intel NVMe SSDs). Maybe this mode is a lower-level API, maybe it’s a compile-time option that removes some of the checks in the driver, maybe something else. Is that accurate?

Yes. I want to be able to just send an IO request without most of the safety nets for which I will take responsibility. In addition it would be nice if I could forego the additional tracking that is done and embed what is needed for it in my own IO tracking. I simply want a bare ability to send IO commands and get the replies, even to the level of skipping the callback and just have a loop that returns the IO identifier. This could just be a low level part of the existing code that does the higher level tracking, splitting and verification that the IO is valid for the specific NVMe device.

Note that the SPDK nvme driver never splits an I/O arbitrarily – it always based on conforming to NVMe protocol or device-specific quirks that dramatically affect performance. If the MDTS for the device is 256KB, then submitting an I/O with a size larger than 256KB *must* be split. If an I/O spans a 128KB boundary on an SSD like the Intel P3700, the driver splits the I/O on that boundary to avoid a long delay (up to 1ms) for handling those types of I/O in SSDs that exhibit this type of striping phenomena. For scattered payloads, the driver will only split the I/O if the payload vectors would violate PRP otherwise. You said “[SPDK] will break the IO at that exact junction” – SPDK should only be breaking the I/O to meet one of these three cases. If you’re seeing something different, please advise.

What I said is that there are places where my code combines IOs and if I was not aware of the NVMe mandated splitting than the system is wasteful. I follow all the NVMe and device specific rules to avoid this waste. I have not seen SPDK do something that is incorrect.

In regards to your concerns on I/O splitting - I think step one is making sure the driver has APIs to expose any device-specific characteristics such as sectors per stripe. Step two is measuring the overhead of the splitting logic when splitting is not required. Based on those measurements, we can consider optimizations and/or evaluate a bypass mode. The bypass mode seems important to Weka.io – would you like to submit an RFC for what this bypass mode might look like in more detail?

What is needed for the RFC?

Baruch