Reply-To: Storage Performance Development Kit <spdk@lists.01.org>
Date: Thursday, May 31, 2018 at 12:37 AM
To: Storage Performance Development Kit <spdk@lists.01.org>
Subject: Re: [SPDK] Handling of physical disk removals
[…]
I understand your point, but the main advantage of the extra effort is the ability to better integrate the system and avoid redundant work along the way. The proposed separation may also at times cause opposing decisions that make no sense. For example, we found that a large sequential I/O works far better than several smaller ones, and we also found that if we combine some unrelated, nearly sequential reads in our queue into one read through a scratch buffer, we improve overall performance. If we do that and then, further down the stack, you break the I/O at that exact junction, we have just added wasteful reads. As such, this combination logic needs to know about the specific drive, the specific I/O it is handling, and the various constraints, and it makes a lot of sense to do it at our level rather than at the SPDK level. For this reason I would like to have a layer in SPDK that does the least amount of work possible: just handle the NVMe protocol, expose the device/hw constraints, and let the application above do the needed smarts and adapt as the landscape changes.
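To make the trade-off concrete, here is a minimal sketch of the merge decision described above. All names are illustrative, not SPDK or Weka.io code; it assumes the gap between two reads is wasted bandwidth that only pays off when it is small relative to the combined payload and the merged span stays within the device's max transfer size:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical request descriptor; fields are illustrative only. */
struct io_req {
    uint64_t lba;        /* starting LBA */
    uint32_t num_blocks; /* transfer length in blocks */
};

/*
 * Decide whether two nearly sequential reads should be merged into one
 * larger read through a scratch buffer.  Merge only when the gap between
 * them is small and the merged span does not exceed the device's maximum
 * transfer size (both expressed in blocks).
 */
static bool
should_merge(const struct io_req *a, const struct io_req *b,
             uint32_t max_gap_blocks, uint32_t mdts_blocks)
{
    uint64_t a_end = a->lba + a->num_blocks;

    if (b->lba < a_end) {
        return false;  /* overlapping or out-of-order; do not merge */
    }

    uint64_t gap  = b->lba - a_end;                  /* blocks read and thrown away */
    uint64_t span = (b->lba + b->num_blocks) - a->lba; /* size of the merged read */

    return gap <= max_gap_blocks && span <= mdts_blocks;
}
```

The point of the example is that `max_gap_blocks` and `mdts_blocks` are per-device tuning inputs, which is exactly why the layer doing the merging needs visibility into device constraints.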
The main thing to remember is that I am writing a storage system, not a random application. As such I *need* to know the device characteristics in order to make the best use of the device; I *cannot* just let spdk hide these details from me. I fully understand that there are other users who write an application, don't want to bother with device details at this level, and are fully content with the first and major performance improvement they get by using spdk over the kernel driver, but for my use case that is not enough.
Hi Baruch,
I’d like to understand this a bit more. It sounds like you’d like to see a mode that completely ignores things like MDTS and PRP violations, or device quirks related to performance (e.g. 128KB stripe boundaries on a number of Intel NVMe SSDs). Maybe this mode is a lower-level API, maybe it’s a compile-time option that removes some of the checks in the driver, maybe something else. Is that accurate?
Note that the SPDK nvme driver never splits an I/O arbitrarily – splits are always based on conforming to the NVMe protocol or on device-specific quirks that dramatically affect performance. If the MDTS for the device is 256KB, then an I/O larger than 256KB *must* be split. If an I/O spans a 128KB boundary on an SSD like the Intel P3700, the driver splits the I/O on that boundary to avoid a long delay (up to 1ms) in SSDs that exhibit this striping phenomenon. For scattered payloads, the driver will only split the I/O if the payload vectors would otherwise violate the PRP rules. You said “[SPDK] will break the IO at that exact junction” – SPDK should only be breaking an I/O to meet one of these three cases. If you’re seeing something different, please advise.
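As a rough illustration of the first two cases (this is a sketch, not the actual SPDK splitting code), the number of child I/Os a request becomes is fully determined by MDTS and, where the quirk applies, the stripe size:

```c
#include <stdint.h>

/*
 * Illustrative only: count how many child I/Os a request would become
 * when capped at MDTS and split on stripe boundaries.  All sizes are in
 * blocks; stripe_blocks == 0 means the device has no striping quirk.
 */
static uint32_t
count_child_ios(uint64_t lba, uint32_t num_blocks,
                uint32_t mdts_blocks, uint32_t stripe_blocks)
{
    uint32_t count = 0;

    while (num_blocks > 0) {
        uint32_t chunk = num_blocks;

        if (chunk > mdts_blocks) {
            chunk = mdts_blocks;           /* protocol limit: MDTS */
        }
        if (stripe_blocks != 0) {
            /* Quirk: never cross the next stripe boundary. */
            uint32_t to_boundary =
                stripe_blocks - (uint32_t)(lba % stripe_blocks);
            if (chunk > to_boundary) {
                chunk = to_boundary;
            }
        }
        lba += chunk;
        num_blocks -= chunk;
        count++;
    }
    return count;
}
```

An I/O that fits under MDTS and sits entirely within one stripe yields a count of 1, i.e. the driver passes it through untouched – which is the behavior being described above.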
In regards to your concerns on I/O splitting - I think step one is making sure the driver has APIs to expose any device-specific characteristics such as sectors per stripe. Step two is measuring the overhead of the splitting logic when splitting is not required. Based on those measurements, we can consider optimizations and/or evaluate a bypass mode. The bypass mode seems important to Weka.io – would you like to submit an RFC for what this bypass mode might look like in more detail?
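For step one, the shape of such a query might look something like the following. This is a hypothetical sketch of what an RFC could propose – the struct, field names, and function are all invented for illustration and are not the SPDK API:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical per-namespace I/O limits an application could query so it
 * can shape requests the driver will never need to split.
 */
struct dev_io_limits {
    uint32_t max_xfer_blocks; /* derived from MDTS */
    uint32_t stripe_blocks;   /* 0 if the device has no striping quirk */
    uint16_t max_sges;        /* scatter-gather entries before PRP splitting */
};

/*
 * Would the driver pass this request through without splitting it?
 * An upper layer can use this to batch I/O so splitting never triggers.
 */
static bool
fits_without_split(const struct dev_io_limits *lim,
                   uint64_t lba, uint32_t num_blocks, uint16_t num_sges)
{
    if (num_blocks > lim->max_xfer_blocks || num_sges > lim->max_sges) {
        return false;
    }
    if (lim->stripe_blocks != 0 &&
        (lba / lim->stripe_blocks) !=
        ((lba + num_blocks - 1) / lim->stripe_blocks)) {
        return false;  /* request would cross a stripe boundary */
    }
    return true;
}
```

With an API along these lines, the bypass mode in step two becomes a promise from the application ("I already checked") rather than a removal of safety the driver can't verify.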