* Re: [SPDK] Handling of physical disk removals
@ 2018-06-01  8:25 Baruch Even
  0 siblings, 0 replies; 17+ messages in thread
From: Baruch Even @ 2018-06-01  8:25 UTC (permalink / raw)
  To: spdk


On Thu, May 31, 2018 at 7:24 PM Harris, James R <james.r.harris(a)intel.com>
wrote:

>
>
>
>
> *From: *SPDK <spdk-bounces(a)lists.01.org> on behalf of Baruch Even <
> baruch(a)weka.io>
>
>
> *Reply-To: *Storage Performance Development Kit <spdk(a)lists.01.org>
>
> *Date: *Thursday, May 31, 2018 at 12:37 AM
>
>
> *To: *Storage Performance Development Kit <spdk(a)lists.01.org>
> *Subject: *Re: [SPDK] Handling of physical disk removals
>
>
>
> […]
>
>
>
> I understand your point, but the main advantage of doing the extra
> effort is the ability to better integrate the system and avoid doing
> redundant work along the way. The proposed separation may also at times
> cause opposing things to be done that make no sense. For example, we found
> that a large sequential IO works far better than several smaller ones, and
> we also found that if we combine some unrelated IOs in our queue that will
> be nearly sequential, with a scratch buffer in reads, we improve overall
> performance. If we do that and down the stack you break the IO at that
> exact junction, then we have just added wasteful reads. As such we need
> this combination logic to know about the specific drive and specific IO
> that it needs to handle and the various constraints, and it makes a lot of
> sense to do it at our level rather than at the SPDK level. For this reason
> I would like to have a layer in SPDK that does the least amount of work and
> just handles the NVMe protocol, exposing the device/hw constraints, and
> lets the application above do the needed smarts and adapt as the landscape
> changes.
>
> The main thing to remember is that I am writing a storage system, not a
> random application. As such I *need* to know the device characteristics in
> order to make the best use of it. I *cannot* just let spdk hide these
> details from me. I fully understand that there are other users that write
> an application and don't want to bother with the device details at this
> level and are fully content to get the first and major performance
> improvement by using spdk over the kernel driver but for my use case that
> is not enough.
>
>
>
> Hi Baruch,
>
> I’d like to understand this a bit more.  It sounds like you’d like to see a
> mode that completely ignores things like MDTS and PRP violations, or device
> quirks related to performance (i.e. 128KB stripe boundaries on a number of
> Intel NVMe SSDs).  Maybe this mode is a lower-level API, maybe it’s a
> compile-time option that removes some of the checks in the driver, maybe
> something else.  Is that accurate?
>
Yes. I want to be able to just send an IO request without most of the
safety nets, and I will take responsibility for that. In addition it would
be nice if I could forgo the additional tracking that is done and embed
what is needed for it in my own IO tracking. I simply want a bare ability
to send IO commands and get the replies, even to the level of skipping the
callback and just having a loop that returns the IO identifier. This could
just be a low-level part of the existing code that does the higher-level
tracking, splitting and verification that the IO is valid for the specific
NVMe device.
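
A rough sketch of the caller-managed tracking being described here, using
only the existing public API: cb_arg already lets the application embed its
own per-IO context, so nothing extra has to be allocated on the application
side. The struct and helper names are hypothetical, not SPDK APIs.

    #include <spdk/nvme.h>

    /* Hypothetical application-side per-IO context; the driver never looks
     * inside it, it just hands it back on completion via cb_arg. */
    struct my_io {
        uint64_t id;        /* the application's own IO identifier */
        void    *buf;
        /* intrusive links, state, etc. */
    };

    static void
    my_io_done(void *cb_arg, const struct spdk_nvme_cpl *cpl)
    {
        struct my_io *io = cb_arg;      /* no lookup, no allocation */

        if (spdk_nvme_cpl_is_error(cpl)) {
            /* record failure for io->id */
        }
        /* mark io->id complete */
    }

    static int
    my_submit_read(struct spdk_nvme_ns *ns, struct spdk_nvme_qpair *qpair,
                   struct my_io *io, uint64_t lba, uint32_t lba_count)
    {
        /* The driver still keeps its own internal tracker for the request;
         * the application context simply rides along in cb_arg. */
        return spdk_nvme_ns_cmd_read(ns, qpair, io->buf, lba, lba_count,
                                     my_io_done, io, 0);
    }

    /* completion loop, polled from the owning thread:
     *     spdk_nvme_qpair_process_completions(qpair, 0);
     */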


> Note that the SPDK nvme driver never splits an I/O arbitrarily – it is
> always based on conforming to the NVMe protocol or device-specific quirks
> that dramatically affect performance.  If the MDTS for the device is 256KB,
> then submitting an I/O with a size larger than 256KB *must* be split.  If an
> I/O spans a 128KB boundary on an SSD like the Intel P3700, the driver
> splits the I/O on that boundary to avoid a long delay (up to 1ms) for
> handling those types of I/O in SSDs that exhibit this type of striping
> phenomenon.  For scattered payloads, the driver will only split the I/O if
> the payload vectors would violate PRP otherwise.  You said “[SPDK] will
> break the IO at that exact junction” – SPDK should only be breaking the I/O
> to meet one of these three cases.  If you’re seeing something different,
> please advise.
>
What I said is that there are places where my code combines IOs, and if I
were not aware of the NVMe-mandated splitting then the system would be
wasteful. I follow all the NVMe and device-specific rules to avoid this
waste. I have not seen SPDK do anything that is incorrect.
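
As a sketch of the kind of device-limit check an application can make before
merging IOs, using spdk_nvme_ns_get_optimal_io_boundary() (named later in the
thread) together with the namespace accessors for sector size and max
transfer size; the helper itself is hypothetical:

    #include <stdbool.h>
    #include <spdk/nvme.h>

    /* Returns true if a read/write of lba/lba_count would be split by the
     * driver, so the application can avoid merging across that point. */
    static bool
    io_would_be_split(struct spdk_nvme_ns *ns, uint64_t lba, uint32_t lba_count)
    {
        uint32_t sector   = spdk_nvme_ns_get_sector_size(ns);
        uint32_t max_blks = spdk_nvme_ns_get_max_io_xfer_size(ns) / sector;
        uint32_t boundary = spdk_nvme_ns_get_optimal_io_boundary(ns); /* blocks, 0 = none */

        if (lba_count > max_blks) {
            return true;    /* exceeds the MDTS-derived transfer limit */
        }
        if (boundary != 0 &&
            (lba / boundary) != ((lba + lba_count - 1) / boundary)) {
            return true;    /* crosses a stripe boundary, e.g. 128KB on a P3700 */
        }
        /* PRP-driven splits depend on the payload layout, not just LBAs. */
        return false;
    }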

> In regards to your concerns on I/O splitting - I think step one is making
> sure the driver has APIs to expose any device-specific characteristics such
> as sectors per stripe.  Step two is measuring the overhead of the splitting
> logic when splitting is not required.  Based on those measurements, we can
> consider optimizations and/or evaluate a bypass mode.  The bypass mode
> seems important to Weka.io – would you like to submit an RFC for what this
> bypass mode might look like in more detail?
>
What is needed for the RFC?

Baruch
-- 



*Baruch Even, Software Developer  E  baruch(a)weka.io <liran(a)weka.io>
www.weka.io <http://www.weka.io>*


* Re: [SPDK] Handling of physical disk removals
@ 2018-06-03  8:35 Baruch Even
  0 siblings, 0 replies; 17+ messages in thread
From: Baruch Even @ 2018-06-03  8:35 UTC (permalink / raw)
  To: spdk


On Fri, Jun 1, 2018 at 7:54 PM Harris, James R <james.r.harris(a)intel.com>
wrote:

>
> […]
>
>
>
> In regards to your concerns on I/O splitting - I think step one is making
> sure the driver has APIs to expose any device-specific characteristics such
> as sectors per stripe.  Step two is measuring the overhead of the splitting
> logic when splitting is not required.  Based on those measurements, we can
> consider optimizations and/or evaluate a bypass mode.  The bypass mode
> seems important to Weka.io – would you like to submit an RFC for what this
> bypass mode might look like in more detail?
>
> What is needed for the RFC?
>
>
>
> It doesn’t need to be especially formal.  Basically, what your new API
> would look like and how it would impact the existing code at a reasonable
> level of detail.  Also, what tests you would plan to add to automate
> testing of the new API.  An RFC typically comes with an assumption that the
> author plans to carry the work forward if and when a consensus has been
> reached.  If that assumption is not correct, it is best to clarify that up
> front.
>
>
>
> Any kind of empirical performance data is critical.  I’m really interested
> in seeing it.  The existing driver can get >4M IO/s on a single Xeon core,
> so the extra complexity and maintenance burden of the new API has to show
> demonstrable improvements.  Hopefully there would be no changes to the
> existing API, but if there are, that’s another data point that would need
> to be considered.
>
>
>

That will take some time, as it will require me to review the code and at
least flesh out a basic implementation to be able to show what changes are
needed and any resulting performance impacts. Since I have other,
higher-priority topics to resolve at work at this time, that will not happen
soon, but I take from this that there is an openness to consider such a
change.

Baruch
-- 



*Baruch Even, Software Developer  E  baruch(a)weka.io <liran(a)weka.io>
www.weka.io <http://www.weka.io>*


* Re: [SPDK] Handling of physical disk removals
@ 2018-06-01 16:54 Harris, James R
  0 siblings, 0 replies; 17+ messages in thread
From: Harris, James R @ 2018-06-01 16:54 UTC (permalink / raw)
  To: spdk




From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Baruch Even <baruch(a)weka.io>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
Date: Friday, June 1, 2018 at 1:25 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Handling of physical disk removals

[…]

In regards to your concerns on I/O splitting - I think step one is making sure the driver has APIs to expose any device-specific characteristics such as sectors per stripe.  Step two is measuring the overhead of the splitting logic when splitting is not required.  Based on those measurements, we can consider optimizations and/or evaluate a bypass mode.  The bypass mode seems important to Weka.io – would you like to submit an RFC for what this bypass mode might look like in more detail?
What is needed for the RFC?

It doesn’t need to be especially formal.  Basically, what your new API would look like and how it would impact the existing code at a reasonable level of detail.  Also, what tests you would plan to add to automate testing of the new API.  An RFC typically comes with an assumption that the author plans to carry the work forward if and when a consensus has been reached.  If that assumption is not correct, it is best to clarify that up front.

Any kind of empirical performance data is critical.  I’m really interested in seeing it.  The existing driver can get >4M IO/s on a single Xeon core, so the extra complexity and maintenance burden of the new API has to show demonstrable improvements.  Hopefully there would be no changes to the existing API, but if there are, that’s another data point that would need to be considered.


Baruch
--
Baruch Even, Software Developer

E  baruch(a)weka.io<mailto:liran(a)weka.io>
www.weka.io<http://www.weka.io>


* Re: [SPDK] Handling of physical disk removals
@ 2018-06-01  8:27 Baruch Even
  0 siblings, 0 replies; 17+ messages in thread
From: Baruch Even @ 2018-06-01  8:27 UTC (permalink / raw)
  To: spdk


On Thu, May 31, 2018 at 10:02 PM Andrey Kuzmin <andrey.v.kuzmin(a)gmail.com>
wrote:

>
>
> On Thu, May 31, 2018, 19:24 Harris, James R <james.r.harris(a)intel.com>
> wrote:
>
>>
>>
>>
>>
>> *From: *SPDK <spdk-bounces(a)lists.01.org> on behalf of Baruch Even <
>> baruch(a)weka.io>
>> *Reply-To: *Storage Performance Development Kit <spdk(a)lists.01.org>
>> *Date: *Thursday, May 31, 2018 at 12:37 AM
>> *To: *Storage Performance Development Kit <spdk(a)lists.01.org>
>> *Subject: *Re: [SPDK] Handling of physical disk removals
>>
>>
>>
>> […]
>>
>>
>>
>> I understand your point, but the main advantage of doing the extra
>> effort is the ability to better integrate the system and avoid doing
>> redundant work along the way. The proposed separation may also at times
>> cause opposing things to be done that make no sense. For example, we found
>> that a large sequential IO works far better than several smaller ones, and
>> we also found that if we combine some unrelated IOs in our queue that will
>> be nearly sequential, with a scratch buffer in reads, we improve overall
>> performance. If we do that and down the stack you break the IO at that
>> exact junction, then we have just added wasteful reads. As such we need
>> this combination logic to know about the specific drive and specific IO
>> that it needs to handle and the various constraints, and it makes a lot of
>> sense to do it at our level rather than at the SPDK level. For this reason
>> I would like to have a layer in SPDK that does the least amount of work and
>> just handles the NVMe protocol, exposing the device/hw constraints, and
>> lets the application above do the needed smarts and adapt as the landscape
>> changes.
>>
>> The main thing to remember is that I am writing a storage system, not a
>> random application. As such I *need* to know the device characteristics in
>> order to make the best use of it. I *cannot* just let spdk hide these
>> details from me. I fully understand that there are other users that write
>> an application and don't want to bother with the device details at this
>> level and are fully content to get the first and major performance
>> improvement by using spdk over the kernel driver but for my use case that
>> is not enough.
>>
>>
>>
>> Hi Baruch,
>>
>> I’d like to understand this a bit more.  It sounds like you’d like to see a
>> mode that completely ignores things like MDTS and PRP violations, or device
>> quirks related to performance (i.e. 128KB stripe boundaries on a number of
>> Intel NVMe SSDs).  Maybe this mode is a lower-level API, maybe it’s a
>> compile-time option that removes some of the checks in the driver, maybe
>> something else.  Is that accurate?
>>
>> Note that the SPDK nvme driver never splits an I/O arbitrarily – it is
>> always based on conforming to the NVMe protocol or device-specific quirks
>> that dramatically affect performance.  If the MDTS for the device is 256KB,
>> then submitting an I/O with a size larger than 256KB *must* be split.  If
>> an I/O spans a 128KB boundary on an SSD like the Intel P3700, the driver
>> splits the I/O on that boundary to avoid a long delay (up to 1ms) for
>> handling those types of I/O in SSDs that exhibit this type of striping
>> phenomenon.  For scattered payloads, the driver will only split the I/O if
>> the payload vectors would violate PRP otherwise.  You said “[SPDK] will
>> break the IO at that exact junction” – SPDK should only be breaking the I/O
>> to meet one of these three cases.  If you’re seeing something different,
>> please advise.
>>
>> Adding an API to expose a namespace’s sectors per stripe is a good idea.
>> This is certainly something that should be added to give an application a
>> better idea of the device characteristics.  NOIOB was added to the spec to
>> cover this case recently, but older SSDs have vendor-specific mechanisms so
>> just examining the namespace’s identify data is not sufficient.
>>
>> In regards to your concerns on I/O splitting - I think step one is making
>> sure the driver has APIs to expose any device-specific characteristics such
>> as sectors per stripe.  Step two is measuring the overhead of the splitting
>> logic when splitting is not required.  Based on those measurements, we can
>> consider optimizations and/or evaluate a bypass mode.  The bypass mode
>> seems important to Weka.io – would you like to submit an RFC for what this
>> bypass mode might look like in more detail?
>>
>
> Adding spdk_nvme_{admin,nvm}_cmd_raw sounds like a good idea to me
> irrespective of whether a specific use case has already been presented.
>
> After all, SPDK is first and foremost about ultrafast user-space NVMe, so
> adding a low-level passthrough API which would let users do whatever I/O
> optimization they want at their own risk sounds pretty much mandatory.
>

The submission of the command is an important step; I also care about
polling and the tracking that is currently associated with it. I'm a big
fan of intrusive data structures, and embedding the minimal tracking needed
will also help performance, IMO.
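
For context, a sketch of what a raw submission can look like with the
raw-command entry point that already exists in the driver (referred to
elsewhere in this thread as spdk_nvme_ctrlr_io_cmd_raw(); the header spells
it spdk_nvme_ctrlr_cmd_io_raw()). The cdw10-12 layout follows the NVMe Read
command, and the 512-byte sector size is an assumption for illustration:

    #include <string.h>
    #include <spdk/nvme.h>

    static int
    raw_read(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_qpair *qpair,
             uint32_t nsid, void *buf, uint64_t lba, uint32_t lba_count,
             spdk_nvme_cmd_cb cb_fn, void *cb_arg)
    {
        struct spdk_nvme_cmd cmd;

        memset(&cmd, 0, sizeof(cmd));
        cmd.opc   = SPDK_NVME_OPC_READ;
        cmd.nsid  = nsid;
        cmd.cdw10 = (uint32_t)lba;          /* starting LBA, low 32 bits  */
        cmd.cdw11 = (uint32_t)(lba >> 32);  /* starting LBA, high 32 bits */
        cmd.cdw12 = lba_count - 1;          /* 0's-based number of blocks */

        /* The raw path still has the driver build the PRP for 'buf' and
         * track the request internally; what it skips is the splitting and
         * validation that spdk_nvme_ns_cmd_read() performs. */
        return spdk_nvme_ctrlr_cmd_io_raw(ctrlr, qpair, &cmd, buf,
                                          lba_count * 512, /* assumes 512B sectors */
                                          cb_fn, cb_arg);
    }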

Baruch
-- 



*Baruch Even, Software Developer  E  baruch(a)weka.io <liran(a)weka.io>
www.weka.io <http://www.weka.io>*


* Re: [SPDK] Handling of physical disk removals
@ 2018-05-31 19:01 Andrey Kuzmin
  0 siblings, 0 replies; 17+ messages in thread
From: Andrey Kuzmin @ 2018-05-31 19:01 UTC (permalink / raw)
  To: spdk


On Thu, May 31, 2018, 19:24 Harris, James R <james.r.harris(a)intel.com>
wrote:

>
>
>
>
> *From: *SPDK <spdk-bounces(a)lists.01.org> on behalf of Baruch Even <
> baruch(a)weka.io>
> *Reply-To: *Storage Performance Development Kit <spdk(a)lists.01.org>
> *Date: *Thursday, May 31, 2018 at 12:37 AM
> *To: *Storage Performance Development Kit <spdk(a)lists.01.org>
> *Subject: *Re: [SPDK] Handling of physical disk removals
>
>
>
> […]
>
>
>
> I understand your point, but the main advantage of doing the extra
> effort is the ability to better integrate the system and avoid doing
> redundant work along the way. The proposed separation may also at times
> cause opposing things to be done that make no sense. For example, we found
> that a large sequential IO works far better than several smaller ones, and
> we also found that if we combine some unrelated IOs in our queue that will
> be nearly sequential, with a scratch buffer in reads, we improve overall
> performance. If we do that and down the stack you break the IO at that
> exact junction, then we have just added wasteful reads. As such we need
> this combination logic to know about the specific drive and specific IO
> that it needs to handle and the various constraints, and it makes a lot of
> sense to do it at our level rather than at the SPDK level. For this reason
> I would like to have a layer in SPDK that does the least amount of work and
> just handles the NVMe protocol, exposing the device/hw constraints, and
> lets the application above do the needed smarts and adapt as the landscape
> changes.
>
> The main thing to remember is that I am writing a storage system, not a
> random application. As such I *need* to know the device characteristics in
> order to make the best use of it. I *cannot* just let spdk hide these
> details from me. I fully understand that there are other users that write
> an application and don't want to bother with the device details at this
> level and are fully content to get the first and major performance
> improvement by using spdk over the kernel driver but for my use case that
> is not enough.
>
>
>
> Hi Baruch,
>
> I’d like to understand this a bit more.  It sounds like you’d like to see a
> mode that completely ignores things like MDTS and PRP violations, or device
> quirks related to performance (i.e. 128KB stripe boundaries on a number of
> Intel NVMe SSDs).  Maybe this mode is a lower-level API, maybe it’s a
> compile-time option that removes some of the checks in the driver, maybe
> something else.  Is that accurate?
>
> Note that the SPDK nvme driver never splits an I/O arbitrarily – it is
> always based on conforming to the NVMe protocol or device-specific quirks
> that dramatically affect performance.  If the MDTS for the device is 256KB,
> then submitting an I/O with a size larger than 256KB *must* be split.  If an
> I/O spans a 128KB boundary on an SSD like the Intel P3700, the driver
> splits the I/O on that boundary to avoid a long delay (up to 1ms) for
> handling those types of I/O in SSDs that exhibit this type of striping
> phenomenon.  For scattered payloads, the driver will only split the I/O if
> the payload vectors would violate PRP otherwise.  You said “[SPDK] will
> break the IO at that exact junction” – SPDK should only be breaking the I/O
> to meet one of these three cases.  If you’re seeing something different,
> please advise.
>
> Adding an API to expose a namespace’s sectors per stripe is a good idea.
> This is certainly something that should be added to give an application a
> better idea of the device characteristics.  NOIOB was added to the spec to
> cover this case recently, but older SSDs have vendor-specific mechanisms so
> just examining the namespace’s identify data is not sufficient.
>
> In regards to your concerns on I/O splitting - I think step one is making
> sure the driver has APIs to expose any device-specific characteristics such
> as sectors per stripe.  Step two is measuring the overhead of the splitting
> logic when splitting is not required.  Based on those measurements, we can
> consider optimizations and/or evaluate a bypass mode.  The bypass mode
> seems important to Weka.io – would you like to submit an RFC for what this
> bypass mode might look like in more detail?
>

Adding spdk_nvme_{admin,nvm}_cmd_raw sounds like a good idea to me
irrespective of whether a specific use case has already been presented.

After all, SPDK is first and foremost about ultrafast user-space NVMe, so
adding a low-level passthrough API which would let users do whatever I/O
optimization they want at their own risk sounds pretty much mandatory.

Regards,
Andrey

> Thanks,
>
> -Jim
>
> > There is also the issue that we track our IOs, and SPDK, because it does
> > the extra work, has IO tracking of its own, which means there is an extra
> > memory allocation on the data path for no good reason from our point of
> > view.
>
> The memory that SPDK uses for this tracking is allocated when the queue
> pair is
> allocated, so while certainly an extra cache line is accessed, it isn't a
> memory
> allocation.
>
>
>
> That extra cache line access shows up in our perf reports. It's not the
> top line item so far but it is there. I would like to be able to eliminate
> that.
>
>
>
> > When I'm building my command in the queue I need to keep the list of
> > buffers in some data structure, currently I keep it in a vector and then
> > this gets translated to PRP/SGL when the command is submitted. I could
> > save that step if I could just maintain the PRP/SGL myself and just pass
> > the ready-made list to the command. This will save me some memory and
> > time.
>
> Building valid PRP/SGL data structures is challenging, but for those up to
> the
> task I'd be willing to consider an API that let the user provide them
> (we'll see
> if the rest of the community agrees). I'm always in favor of improvements
> for
> performance, but that seems like a lot of extra work to save what is
> probably
> just a handful of CPU instructions.
>
>
>
> I'm perfectly aware of the extra layers of complexity that I will add to
> my system and I'm also aware that there are only very few users who will
> want this complexity but it is something that I would like to be able to
> do. It's not the extra CPU instructions, it is the memory bandwidth and
> cache lines that need to be touched.
>
> Baruch
>
>
>
> --
>
>
> * Baruch Even, **Software Developer  *
>
> *E * baruch(a)weka.io <liran(a)weka.io>
> www.weka.io
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
>


* Re: [SPDK] Handling of physical disk removals
@ 2018-05-31 16:33 Verkamp, Daniel
  0 siblings, 0 replies; 17+ messages in thread
From: Verkamp, Daniel @ 2018-05-31 16:33 UTC (permalink / raw)
  To: spdk


Just to add to what Jim mentioned below, we do have an API to retrieve the “sectors per stripe” in a generic way: spdk_nvme_ns_get_optimal_io_boundary() returns this information (in blocks) based on either the recently-standardized NOIOB field or the Intel vendor-specific quirk that Jim mentioned below.

Thanks,
-- Daniel
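
A minimal usage sketch of that accessor (the return value is in blocks; 0
means the namespace reports no boundary), with 'ns' assumed to be an
already-attached namespace handle:

    #include <stdio.h>
    #include <spdk/nvme.h>

    static void
    print_io_boundary(struct spdk_nvme_ns *ns)
    {
        uint32_t boundary = spdk_nvme_ns_get_optimal_io_boundary(ns);

        if (boundary == 0) {
            printf("no optimal I/O boundary reported\n");
        } else {
            printf("optimal I/O boundary: %u blocks (%u bytes)\n",
                   boundary,
                   boundary * spdk_nvme_ns_get_sector_size(ns));
        }
    }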

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Thursday, May 31, 2018 9:24 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Handling of physical disk removals



From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Baruch Even <baruch(a)weka.io<mailto:baruch(a)weka.io>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Thursday, May 31, 2018 at 12:37 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Handling of physical disk removals

[…]

I understand your point, but the main advantage of doing the extra effort is the ability to better integrate the system and avoid doing redundant work along the way. The proposed separation may also at times cause opposing things to be done that make no sense. For example, we found that a large sequential IO works far better than several smaller ones, and we also found that if we combine some unrelated IOs in our queue that will be nearly sequential, with a scratch buffer in reads, we improve overall performance. If we do that and down the stack you break the IO at that exact junction, then we have just added wasteful reads. As such we need this combination logic to know about the specific drive and specific IO that it needs to handle and the various constraints, and it makes a lot of sense to do it at our level rather than at the SPDK level. For this reason I would like to have a layer in SPDK that does the least amount of work and just handles the NVMe protocol, exposing the device/hw constraints, and lets the application above do the needed smarts and adapt as the landscape changes.
The main thing to remember is that I am writing a storage system, not a random application. As such I *need* to know the device characteristics in order to make the best use of it. I *cannot* just let spdk hide these details from me. I fully understand that there are other users that write an application and don't want to bother with the device details at this level and are fully content to get the first and major performance improvement by using spdk over the kernel driver but for my use case that is not enough.

Hi Baruch,
I’d like to understand this a bit more.  It sounds like you’d like to see a mode that completely ignores things like MDTS and PRP violations, or device quirks related to performance (i.e. 128KB stripe boundaries on a number of Intel NVMe SSDs).  Maybe this mode is a lower-level API, maybe it’s a compile-time option that removes some of the checks in the driver, maybe something else.  Is that accurate?
Note that the SPDK nvme driver never splits an I/O arbitrarily – it is always based on conforming to the NVMe protocol or device-specific quirks that dramatically affect performance.  If the MDTS for the device is 256KB, then submitting an I/O with a size larger than 256KB *must* be split.  If an I/O spans a 128KB boundary on an SSD like the Intel P3700, the driver splits the I/O on that boundary to avoid a long delay (up to 1ms) for handling those types of I/O in SSDs that exhibit this type of striping phenomenon.  For scattered payloads, the driver will only split the I/O if the payload vectors would violate PRP otherwise.  You said “[SPDK] will break the IO at that exact junction” – SPDK should only be breaking the I/O to meet one of these three cases.  If you’re seeing something different, please advise.
Adding an API to expose a namespace’s sectors per stripe is a good idea.  This is certainly something that should be added to give an application a better idea of the device characteristics.  NOIOB was added to the spec to cover this case recently, but older SSDs have vendor-specific mechanisms so just examining the namespace’s identify data is not sufficient.
In regards to your concerns on I/O splitting - I think step one is making sure the driver has APIs to expose any device-specific characteristics such as sectors per stripe.  Step two is measuring the overhead of the splitting logic when splitting is not required.  Based on those measurements, we can consider optimizations and/or evaluate a bypass mode.  The bypass mode seems important to Weka.io – would you like to submit an RFC for what this bypass mode might look like in more detail?
Thanks,
-Jim
> There is also the issue that we track our IOs, and SPDK, because it does the
> extra work, has IO tracking of its own, which means there is an extra memory
> allocation on the data path for no good reason from our point of view.

The memory that SPDK uses for this tracking is allocated when the queue pair is
allocated, so while certainly an extra cache line is accessed, it isn't a memory
allocation.

That extra cache line access shows up in our perf reports. It's not the top line item so far but it is there. I would like to be able to eliminate that.

> When I'm building my command in the queue I need to keep the list of buffers
> in some data structure, currently I keep it in a vector and then this gets
> translated to PRP/SGL when the command is submitted. I could save that step if
> I could just maintain the PRP/SGL myself and just pass the ready-made list to
> the command. This will save me some memory and time.

Building valid PRP/SGL data structures is challenging, but for those up to the
task I'd be willing to consider an API that let the user provide them (we'll see
if the rest of the community agrees). I'm always in favor of improvements for
performance, but that seems like a lot of extra work to save what is probably
just a handful of CPU instructions.

I'm perfectly aware of the extra layers of complexity that I will add to my system and I'm also aware that there are only very few users who will want this complexity but it is something that I would like to be able to do. It's not the extra CPU instructions, it is the memory bandwidth and cache lines that need to be touched.
Baruch

--
Baruch Even, Software Developer
E  baruch(a)weka.io<mailto:liran(a)weka.io>
www.weka.io<http://www.weka.io>


* Re: [SPDK] Handling of physical disk removals
@ 2018-05-31 16:24 Harris, James R
  0 siblings, 0 replies; 17+ messages in thread
From: Harris, James R @ 2018-05-31 16:24 UTC (permalink / raw)
  To: spdk




From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Baruch Even <baruch(a)weka.io>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
Date: Thursday, May 31, 2018 at 12:37 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Handling of physical disk removals

[…]

I understand your point, but the main advantage of doing the extra effort is the ability to better integrate the system and avoid doing redundant work along the way. The proposed separation may also at times cause opposing things to be done that make no sense. For example, we found that a large sequential IO works far better than several smaller ones, and we also found that if we combine some unrelated IOs in our queue that will be nearly sequential, with a scratch buffer in reads, we improve overall performance. If we do that and down the stack you break the IO at that exact junction, then we have just added wasteful reads. As such we need this combination logic to know about the specific drive and specific IO that it needs to handle and the various constraints, and it makes a lot of sense to do it at our level rather than at the SPDK level. For this reason I would like to have a layer in SPDK that does the least amount of work and just handles the NVMe protocol, exposing the device/hw constraints, and lets the application above do the needed smarts and adapt as the landscape changes.
The main thing to remember is that I am writing a storage system, not a random application. As such I *need* to know the device characteristics in order to make the best use of it. I *cannot* just let spdk hide these details from me. I fully understand that there are other users that write an application and don't want to bother with the device details at this level and are fully content to get the first and major performance improvement by using spdk over the kernel driver but for my use case that is not enough.

Hi Baruch,
I’d like to understand this a bit more.  It sounds like you’d like to see a mode that completely ignores things like MDTS and PRP violations, or device quirks related to performance (i.e. 128KB stripe boundaries on a number of Intel NVMe SSDs).  Maybe this mode is a lower-level API, maybe it’s a compile-time option that removes some of the checks in the driver, maybe something else.  Is that accurate?
Note that the SPDK nvme driver never splits an I/O arbitrarily – it is always based on conforming to the NVMe protocol or device-specific quirks that dramatically affect performance.  If the MDTS for the device is 256KB, then submitting an I/O with a size larger than 256KB *must* be split.  If an I/O spans a 128KB boundary on an SSD like the Intel P3700, the driver splits the I/O on that boundary to avoid a long delay (up to 1ms) for handling those types of I/O in SSDs that exhibit this type of striping phenomenon.  For scattered payloads, the driver will only split the I/O if the payload vectors would violate PRP otherwise.  You said “[SPDK] will break the IO at that exact junction” – SPDK should only be breaking the I/O to meet one of these three cases.  If you’re seeing something different, please advise.
Adding an API to expose a namespace’s sectors per stripe is a good idea.  This is certainly something that should be added to give an application a better idea of the device characteristics.  NOIOB was added to the spec to cover this case recently, but older SSDs have vendor-specific mechanisms so just examining the namespace’s identify data is not sufficient.
In regards to your concerns on I/O splitting - I think step one is making sure the driver has APIs to expose any device-specific characteristics such as sectors per stripe.  Step two is measuring the overhead of the splitting logic when splitting is not required.  Based on those measurements, we can consider optimizations and/or evaluate a bypass mode.  The bypass mode seems important to Weka.io – would you like to submit an RFC for what this bypass mode might look like in more detail?
Thanks,
-Jim
> There is also the issue that we track our IOs, and SPDK, because it does the
> extra work, has IO tracking of its own, which means there is an extra memory
> allocation on the data path for no good reason from our point of view.

The memory that SPDK uses for this tracking is allocated when the queue pair is
allocated, so while certainly an extra cache line is accessed, it isn't a memory
allocation.

That extra cache line access shows up in our perf reports. It's not the top line item so far but it is there. I would like to be able to eliminate that.

> When I'm building my command in the queue I need to keep the list of buffers
> in some data structure, currently I keep it in a vector and then this gets
> translated to PRP/SGL when the command is submitted. I could save that step if
> I could just maintain the PRP/SGL myself and just pass the ready-made list to
> the command. This will save me some memory and time.

Building valid PRP/SGL data structures is challenging, but for those up to the
task I'd be willing to consider an API that let the user provide them (we'll see
if the rest of the community agrees). I'm always in favor of improvements for
performance, but that seems like a lot of extra work to save what is probably
just a handful of CPU instructions.

I'm perfectly aware of the extra layers of complexity that I will add to my system and I'm also aware that there are only very few users who will want this complexity but it is something that I would like to be able to do. It's not the extra CPU instructions, it is the memory bandwidth and cache lines that need to be touched.
Baruch

--
Baruch Even, Software Developer

E  baruch(a)weka.io<mailto:liran(a)weka.io>
www.weka.io<http://www.weka.io>


* Re: [SPDK] Handling of physical disk removals
@ 2018-05-31  7:54 Andrey Kuzmin
  0 siblings, 0 replies; 17+ messages in thread
From: Andrey Kuzmin @ 2018-05-31  7:54 UTC (permalink / raw)
  To: spdk


On Thu, May 31, 2018, 10:25 Baruch Even <baruch(a)weka.io> wrote:

>
> On Wed, May 30, 2018 at 8:49 PM Andrey Kuzmin <andrey.v.kuzmin(a)gmail.com>
> wrote:
>
>> On Wed, May 30, 2018 at 2:46 PM, Baruch Even <baruch(a)weka.io> wrote:
>>
>>> As for threading, the only thing I need to make things integrate better
>>> into my green-threads is to use a function call to do a context switch if
>>> needed (spdk_yield() of sorts), that is empty/no-op for most users and can
>>> be a compile time option (weak symbol? define?) to use some user-provided
>>> function. This will immediately integrate into any green thread system by
>>> switching to another user-thread and returning for a poll when other
>>> actions are taken. This way the posix-thread will not block and there is no
>>> special async logic that needs to happen.
>>>
>>> As for the low level api, my application already has its own queue for
>>> each device, and we split our requests as we need to and combine them as
>>> possible to improve performance. It took us by surprise to find that SPDK
>>> internally breaks out the requests to smaller chunks, queues them
>>> internally and completions may not really complete the full request. There
>>> was even a scenario where we had a deadlock because we overflowed the spdk
>>> queues. Once we found these out we made sure to expose to our application
>>> all the device constraints and break the IOs with the device limitations in
>>> mind and now all the extra work and checks that SPDK does are a waste of
>>> cpu cycles as far as we are concerned.
>>>
>>> There is also the issue that we track our IOs, and SPDK, because it does
>>> the extra work, has IO tracking of its own, which means there is an extra
>>> memory allocation on the data path for no good reason from our point of
>>> view.
>>>
>>
>> A small but mighty step in this direction would be to make
>> spdk_bdev_submit_io
>> <https://github.com/spdk/spdk/blob/d34bd0a60bace1b7e4022f12d0892f028b7f2377/lib/bdev/bdev.c#L911>
>> public. If the caller wishes to manage the setup and completion chaining
>> (in the stacked case, it is trivial to do as the caller has per I/O context
>> already allocated at this point), thus avoiding the overhead of the extra
>> I/O allocation in spdk_bdev_{read,write}*, that would be the least
>> resistance path. And the code would be much simpler, avoiding the
>> (often useless) io_type switch.
>>
>
> I am not using bdev at all so such an approach might be useful to others
> but not for me.
>

>
> When I'm building my command in the queue I need to keep the list of
>>> buffers in some data structure, currently I keep it in a vector and then
>>> this gets translated to PRP/SGL when the command is submitted. I could save
>>> that step if I could just maintain the PRP/SGL myself and just pass the
>>> ready-made list to the command. This will save me some memory and time.
>>>
>>
>> If your use case is limited to NVMe hardware, you might want to consider
>> NVMe passthrough which avoids the buffer setup overhead, and also gives you
>> complete control over completion status.
>>
>
> I'm not sure what NVMe passthrough you are referring to; I can only assume
> that you think I'm using bdev. I am in fact simply calling the spdk_nvme_*
> functions directly and working with that.
>

Passthrough in storage typically refers to submitting a well-formed
protocol command directly to the hardware. That sounds very much like what
you're looking for, judging by the discussion.

Regards,
Andrey

>
> Baruch
>
> --
>
>
>
> *Baruch Even, Software Developer  E  baruch(a)weka.io <liran(a)weka.io>
> www.weka.io <http://www.weka.io>*
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
>


* Re: [SPDK] Handling of physical disk removals
@ 2018-05-31  7:37 Baruch Even
  0 siblings, 0 replies; 17+ messages in thread
From: Baruch Even @ 2018-05-31  7:37 UTC (permalink / raw)
  To: spdk


On Thu, May 31, 2018 at 12:27 AM Walker, Benjamin <benjamin.walker(a)intel.com>
wrote:

> On Wed, 2018-05-30 at 14:46 +0300, Baruch Even wrote:
> > As for threading, the only thing I need to make things integrate better
> > into my green-threads is to use a function call to do a context switch if
> > needed (spdk_yield() of sorts), that is empty/no-op for most users and can
> > be a compile time option (weak symbol? define?) to use some user-provided
> > function. This will immediately integrate into any green thread system by
> > switching to another user-thread and returning for a poll when other
> > actions are taken. This way the posix-thread will not block and there is
> > no special async logic that needs to happen.
>
> Understood and we'd like to accommodate frameworks like you are using. See
> the
> ongoing thread titled SPDK Dynamic Threading Model. We're just getting
> started
> on that effort. In parallel, any time you see code that performs a
> blocking or
> excessively long operation, please do let us know (or help us remove it).
>

For now I have resorted to moving such function calls to a service thread;
it is a little less clean, but since this is not the direct data path it is
less critical. With the recent patch that extracted all of these wait loops
into a single function, I'm sure it will be easier to modify to enable my
use case.
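
A minimal sketch of the weak-symbol hook described in the paragraph quoted
above -- spdk_yield() is not an existing SPDK API, and wait_for_ready() is a
made-up stand-in for a driver wait loop:

    #include <stdbool.h>

    /* Weak default: a no-op for applications without green threads. An
     * application that uses user-space threads links in its own strong
     * spdk_yield() that switches to another fiber instead. */
    __attribute__((weak)) void
    spdk_yield(void)
    {
    }

    /* What a driver-internal wait loop could look like with the hook. */
    static void
    wait_for_ready(volatile bool *ready)
    {
        while (!*ready) {
            spdk_yield();   /* never blocks the posix thread */
            /* ... re-check controller/hardware state ... */
        }
    }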


> > As for the low level api, my application already has its own queue for
> > each device, and we split our requests as we need to and combine them as
> > possible to improve performance. It took us by surprise to find that SPDK
> > internally breaks out the requests to smaller chunks, queues them
> > internally and completions may not really complete the full request. There
> > was even a scenario where we had a deadlock because we overflowed the spdk
> > queues. Once we found these out we made sure to expose to our application
> > all the device constraints and break the IOs with the device limitations
> > in mind, and now all the extra work and checks that SPDK does are a waste
> > of cpu cycles as far as we are concerned.
>
> SPDK will indeed split I/O automatically based on device characteristics.
> Splitting is non-trivial because I/O may need to be split for a large
> number of
> reasons (max I/O size, device-internal striping, PRP/SGL restrictions,
> etc.).
> Getting this right is very difficult and I'd highly recommend that people
> avoid
> re-implementing it (see the top half of lib/nvme/nvme_ns_cmd.c). SPDK will
> also
> queue I/O sent beyond the available queue depth. These were both done to
> make
> code using the NVMe driver not have to deal with device-specific concerns
> (i.e.
> the splitting required on one device is different than the splitting on
> another,
> and the queue depth available on one device is different than the queue
> depth
> available on another).
>
> The splitting and queueing should all be entirely transparent to the user
> though. There shouldn't ever be a case where a user completion callback is
> called prior to all split fragments of the I/O completing - you should get
> your
> completion callback called just one time for each time you called to
> submit a
> new I/O at the public NVMe API level.
>

I understand your point, but the main advantage of doing the extra effort
is the ability to better integrate the system and avoid doing redundant
work along the way. The proposed separation may also at times cause
opposing things to be done that make no sense. For example, we found that
a large sequential IO works far better than several smaller ones, and we
also found that if we combine some unrelated IOs in our queue that will be
nearly sequential, with a scratch buffer in reads, we improve overall
performance. If we do that and down the stack you break the IO at that
exact junction, then we have just added wasteful reads. As such we need
this combination logic to know about the specific drive and specific IO
that it needs to handle and the various constraints, and it makes a lot of
sense to do it at our level rather than at the SPDK level. For this reason
I would like to have a layer in SPDK that does the least amount of work and
just handles the NVMe protocol, exposing the device/hw constraints, and
lets the application above do the needed smarts and adapt as the landscape
changes.

The main thing to remember is that I am writing a storage system, not a
random application. As such I *need* to know the device characteristics in
order to make the best use of it. I *cannot* just let spdk hide these
details from me. I fully understand that there are other users that write
an application and don't want to bother with the device details at this
level and are fully content to get the first and major performance
improvement by using spdk over the kernel driver but for my use case that
is not enough.


> > There is also the issue that we track our IOs, and SPDK, because it does
> > the extra work, has IO tracking of its own, which means there is an extra
> > memory allocation on the data path for no good reason from our point of
> > view.
>
> The memory that SPDK uses for this tracking is allocated when the queue
> pair is
> allocated, so while certainly an extra cache line is accessed, it isn't a
> memory
> allocation.
>

That extra cache line access shows up in our perf reports. It's not the top
line item so far but it is there. I would like to be able to eliminate
that.


> > When I'm building my command in the queue I need to keep the list of
> > buffers in some data structure, currently I keep it in a vector and then
> > this gets translated to PRP/SGL when the command is submitted. I could
> > save that step if I could just maintain the PRP/SGL myself and just pass
> > the ready-made list to the command. This will save me some memory and
> > time.
>
> Building valid PRP/SGL data structures is challenging, but for those up to
> the
> task I'd be willing to consider an API that let the user provide them
> (we'll see
> if the rest of the community agrees). I'm always in favor of improvements
> for
> performance, but that seems like a lot of extra work to save what is
> probably
> just a handful of CPU instructions.
>

I'm perfectly aware of the extra layers of complexity that I will add to my
system and I'm also aware that there are only very few users who will want
this complexity but it is something that I would like to be able to do.
It's not the extra CPU instructions, it is the memory bandwidth and cache
lines that need to be touched.

Baruch

-- 



*Baruch Even, Software Developer  E  baruch(a)weka.io <liran(a)weka.io>
www.weka.io <http://www.weka.io>*


* Re: [SPDK] Handling of physical disk removals
@ 2018-05-31  7:24 Baruch Even
  0 siblings, 0 replies; 17+ messages in thread
From: Baruch Even @ 2018-05-31  7:24 UTC (permalink / raw)
  To: spdk


On Wed, May 30, 2018 at 8:49 PM Andrey Kuzmin <andrey.v.kuzmin(a)gmail.com>
wrote:

> On Wed, May 30, 2018 at 2:46 PM, Baruch Even <baruch(a)weka.io> wrote:
>
>> As for threading, the only thing I need to make things integrate better
>> into my green-threads is to use a function call to do a context switch if
>> needed (spdk_yield() of sorts), that is empty/no-op for most users and can
>> be a compile time option (weak symbol? define?) to use some user-provided
>> function. This will immediately integrate into any green thread system by
>> switching to another user-thread and returning for a poll when other
>> actions are taken. This way the posix-thread will not block and there is no
>> special async logic that needs to happen.
>>
>> As for the low level api, my application already has its own queue for
>> each device, and we split our requests as we need to and combine them as
>> possible to improve performance. It took us by surprise to find that SPDK
>> internally breaks out the requests to smaller chunks, queues them
>> internally and completions may not really complete the full request. There
>> was even a scenario where we had a deadlock because we overflowed the spdk
>> queues. Once we found these out we made sure to expose to our application
>> all the device constraints and break the IOs with the device limitations in
>> mind and now all the extra work and checks that SPDK does are a waste of
>> cpu cycles as far as we are concerned.
>>
>> There is also the issue that we track our IOs, and SPDK, because it does
>> the extra work, has IO tracking of its own, which means there is an extra
>> memory allocation on the data path for no good reason from our point of
>> view.
>>
>
> A small but mighty step in this direction would be to make
> spdk_bdev_submit_io
> <https://github.com/spdk/spdk/blob/d34bd0a60bace1b7e4022f12d0892f028b7f2377/lib/bdev/bdev.c#L911>
> public. If the caller wishes to manage the setup and completion chaining
> (in the stacked case, it is trivial to do as the caller has per I/O context
> already allocated at this point), thus avoiding the overhead of the extra
> I/O allocation in spdk_bdev_{read,write}*, that would be the least
> resistance path. And the code would be much simpler, avoiding the
> (often useless) io_type switch.
>

I am not using bdev at all so such an approach might be useful to others
but not for me.


When I'm building my command in the queue I need to keep the list of
>> buffers in some data structure, currently I keep it in a vector and then
>> this gets translated to PRP/SGL when the command is submitted. I could save
>> that step if I could just maintain the PRP/SGL myself and just pass the
>> ready-made list to the command. This will save me some memory and time.
>>
>
> If your use case is limited to NVMe hardware, you might want to consider
> NVMe passthrough which avoids the buffer setup overhead, and also gives you
> complete control over completion status.
>

I'm not sure what NVMe passthrough you are referring to; I can only assume
that you think I'm using bdev. I am in fact simply calling the spdk_nvme_*
functions directly and working with that.

Baruch

-- 



*Baruch Even, Software Developer  E  baruch(a)weka.io <liran(a)weka.io>
www.weka.io <http://www.weka.io>*


* Re: [SPDK] Handling of physical disk removals
@ 2018-05-30 21:27 Walker, Benjamin
  0 siblings, 0 replies; 17+ messages in thread
From: Walker, Benjamin @ 2018-05-30 21:27 UTC (permalink / raw)
  To: spdk


On Wed, 2018-05-30 at 14:46 +0300, Baruch Even wrote:
> As for threading, the only thing I need to make things integrate better into
> my green-threads is to use a function call to do a context switch if needed
> (spdk_yield() of sorts), that is empty/no-op for most users and can be a
> compile time option (weak symbol? define?) to use some user-provided function.
> This will immediately integrate into any green thread system by switching to
> another user-thread and returning for a poll when other actions are taken.
> This way the posix-thread will not block and there is no special async logic
> that needs to happen.

Understood and we'd like to accommodate frameworks like you are using. See the
ongoing thread titled SPDK Dynamic Threading Model. We're just getting started
on that effort. In parallel, any time you see code that performs a blocking or
excessively long operation, please do let us know (or help us remove it).

> 
> As for the low level api, my application already has its own queue for each
> device, and we split our requests as we need to and combine them as possible
> to improve performance. It took us by surprise to find that SPDK internally
> breaks out the requests to smaller chunks, queues them internally and
> completions may not really complete the full request. There was even was
> scenario that we had a deadlock because we overflowed the spdk queues. Once we
> found these out we made sure to expose to our application all the device
> constraints and break the IOs with the device limitations in mind and now all
> the extra work and checks that SPDK does are a waste of cpu cycles as far as
> we are concerned.

SPDK will indeed split I/O automatically based on device characteristics.
Splitting is non-trivial because I/O may need to be split for a large number of
reasons (max I/O size, device-internal striping, PRP/SGL restrictions, etc.).
Getting this right is very difficult and I'd highly recommend that people avoid
re-implementing it (see the top half of lib/nvme/nvme_ns_cmd.c). SPDK will also
queue I/O sent beyond the available queue depth. These were both done to make
code using the NVMe driver not have to deal with device-specific concerns (i.e.
the splitting required on one device is different than the splitting on another,
and the queue depth available on one device is different than the queue depth
available on another).

The splitting and queueing should all be entirely transparent to the user
though. There shouldn't ever be a case where a user completion callback is
called prior to all split fragments of the I/O completing - you should get your
completion callback called just one time for each time you called to submit a
new I/O at the public NVMe API level.

> 
> There is also the issue that we track our IOs, and SPDK, because it does the
> extra work, has IO tracking of its own, which means there is an extra memory
> allocation on the data path for no good reason from our point of view.

The memory that SPDK uses for this tracking is allocated when the queue pair is
allocated, so while certainly an extra cache line is accessed, it isn't a memory
allocation.

> 
> When I'm building my command in the queue I need to keep the list of buffers
> in some data structure, currently I keep it in a vector and then this gets
> translated to PRP/SGL when the command is submitted. I could save that step if
> I could just maintain the PRP/SGL myself and just pass the ready-made list to
> the command. This will save me some memory and time.

Building valid PRP/SGL data structures is challenging, but for those up to the
task I'd be willing to consider an API that let the user provide them (we'll see
if the rest of the community agrees). I'm always in favor of improvements for
performance, but that seems like a lot of extra work to save what is probably
just a handful of CPU instructions.
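
For comparison, a sketch of how a scattered payload is handed to the driver
today: the application keeps its own buffer list and the driver walks it
through two callbacks to build the PRP/SGL itself. The buf_list structure is
hypothetical application code; a user-provided PRP/SGL API as discussed above
would bypass this translation step.

    #include <spdk/nvme.h>

    struct buf_list {
        struct { void *base; uint32_t len; } vec[16];   /* app-owned */
        int count;
        int cur;
    };

    static void
    reset_sgl(void *cb_arg, uint32_t offset)
    {
        struct buf_list *bl = cb_arg;

        (void)offset;   /* simplified: assumes the request starts at offset 0 */
        bl->cur = 0;
    }

    static int
    next_sge(void *cb_arg, void **address, uint32_t *length)
    {
        struct buf_list *bl = cb_arg;

        *address = bl->vec[bl->cur].base;
        *length  = bl->vec[bl->cur].len;
        bl->cur++;
        return 0;
    }

    static int
    submit_scattered_read(struct spdk_nvme_ns *ns, struct spdk_nvme_qpair *qpair,
                          struct buf_list *bl, uint64_t lba, uint32_t lba_count,
                          spdk_nvme_cmd_cb cb_fn)
    {
        /* the same cb_arg is passed to the SGL callbacks and to cb_fn */
        return spdk_nvme_ns_cmd_readv(ns, qpair, lba, lba_count,
                                      cb_fn, bl, 0, reset_sgl, next_sge);
    }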

> 
> All of this just means that spdk is wasting some cpu cycles on things we don't
> want it to do and after all the main reason to use spdk and dpdk is to eke out
> all the possible performance from the system (cpu/nvme devices) by a more
> integrated design. I can understand that other users of spdk do want the
> higher levels and do need the features provided but for us we have better
> places to do these actions.
> 
> Baruch
> 
> On Thu, May 24, 2018 at 7:44 PM Walker, Benjamin <benjamin.walker(a)intel.com>
> wrote:
> > Hi Baruch,
> >  
> > Regarding blocking threads – only the NVMe initialization path does that
> > today. We considered doing a fully asynchronous initialization path for NVMe
> > devices, but we received a lot of feedback that it makes it very difficult
> > to use. I’d personally be open to adding a second initialization path that
> > was entirely asynchronous and required the user to explicitly poll some
> > initialization context to advance the state machine until the devices come
> > online. This clearly makes much more sense in the context of hot plug since
> > the device initialization may occur on a thread that is also processing I/O.
> > Contributions are always very welcome!
> >  
> > I do expect to see some fairly major movement this year regarding
> > abstractions around threading models and I’d love to define some abstraction
> > that allows SPDK to nicely integrate into applications using green threads
> > or coroutines without tying SPDK to any specific implementation. This is
> > definitely at the front of mind currently (as is dynamic memory management –
> > the other major hurdle for integrating with existing applications). I think
> > over the coming months there are going to be several lively discussions on
> > the mailing list and in IRC as we try to sort all of this out.
> >  
> > Regarding a lower level API – we have some lower level interfaces, although
> > not as low-level as you’re talking about. See spdk_nvme_ctrlr_io_cmd_raw(),
> > for example, which lets you build an arbitrary NVMe command and send it. We
> > haven’t exposed any knobs to control internal queueing or to allow a user to
> > build their own SGL/PRP. Can you outline what the use cases for those are?
> > Why would you want to deviate from what SPDK does by default today? We’re
> > always amenable to providing more control, if that control has value.
> >  
> > Thanks,
> > Ben
> >  
> > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Baruch Even
> > Sent: Thursday, May 24, 2018 1:17 AM
> > 
> > To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > Subject: Re: [SPDK] Handling of physical disk removals
> >  
> > Hi,
> > 
> > I logged the issue as https://github.com/spdk/spdk/issues/310
> > 
> > I do call detach but I do not use the empty probe call, instead I have
> > another part of my application monitor the same uevent information and
> > trigger a larger process to remove the drive. I also have this triggered by
> > the timeout callback. I had to remove the timeout callback or I'd get
> > bombarded with the callback until it actually happened, looks like the
> > timeout callback assumes that the reset call will happen from inside it and
> > complete before it returns but that doesn't work very well with our
> > application.
> > 
> > I have an issue with the threading model you are using as we are using user-
> > space-threads (aka green threads or coroutines) and spdk will block the
> > entire application for no good reason. It would have been nice  if there was
> > a yield callback I can provide you so you won't block the main thread. For
> > now I need to resort to send the spdk admin calls to a thread to be done
> > there since spdk may block.
> > 
> > If I am on a ranting roll I also would have preferred a lower level
> > interface than currently provided that doesn't implement internal queuing
> > and internal allocations so much and doesn't hide all the nvme details (prp,
> > sgl), though by that stage I'm probably better of with unvme instead of
> > spdk. If they had support uio_pci_generic in addition to vfio-pci it would
> > probably be a better direction for me.
> >  
> > Baruch
> >  
> > On Wed, May 23, 2018 at 8:18 PM Harris, James R <james.r.harris(a)intel.com>
> > wrote:
> > > Hi Baruch,
> > >  
> > > Thanks for raising this issue – there are absolutely changes that SPDK
> > > needs to make here.
> > >  
> > > Can you describe your code path a bit more?  Or let me try to guess and
> > > you can tell me where I’m wrong.
> > >  
> > > 1)       You’re using spdk_nvme_probe() with a NULL trid and a remove_cb
> > > handler to detect physically removed devices.
> > > 2)       In your remove_cb handler, you call spdk_nvme_detach().
> > >  
> > > The SPDK bdev nvme module doesn’t call spdk_nvme_detach() in its remove_cb
> > > which is why the SPDK automated tests don’t run into this issue.  But I
> > > don’t this is correct – we should be calling spdk_nvme_detach() at some
> > > point to clean up any allocated resources.  It needs to make sure any
> > > associate IO channels are freed up first (to avoid racing between the
> > > remove callback and different threads submitting IO to that removed
> > > controller).
> > >  
> > > Could you file this as a bug in github?  Please add any additional details
> > > on how you’re hitting this issue if it’s different than what I’ve guessed
> > > above.
> > >  
> > > https://github.com/spdk/spdk/issues
> > >  
> > > Thanks,
> > >  
> > > Jim
> > >  
> > >  
> > > From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Baruch Even <baruch(a)we
> > > ka.io>
> > > Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > > Date: Wednesday, May 23, 2018 at 1:44 AM
> > > To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > > Subject: [SPDK] Handling of physical disk removals
> > >  
> > > Hi,
> > > 
> > > I'm using spdk for local nvme through the nvme interface, I find that
> > > physical disk removals are not handled properly for my use case and wonder
> > > if others see it that way as well and if there is an intention to fix
> > > this.
> > > 
> > > Our system uses long running processes that control one or more disks at a
> > > time, if a disk fails it may drop completely from the pcie bus and it will
> > > also look like that if the disk is physically removed (say a technician
> > > mistakes the disk that he should replace).
> > > 
> > > The problem that I see is that spdk doesnt consider a device completely
> > > disappearing from the bus and will try to release the io qpair by sending
> > > the delete io sq and delete io cq commands, both of these will never get
> > > an answer (the device is not on the pcie device anymore) and there is no
> > > timeout logic in that code path. This means two things, the process will
> > > halt forever and there is an effective memory leak which currently means
> > > that we need to restart the process. Now, our system is resilient enough
> > > that restarting the process is not a big deal but it is a very messy way
> > > to go about handlign a physical drive removal.
> > >  
> > > Have others seen this behavior? Does it bother others?
> > > 
> > > For my own use I put a timeout in there of a few seconds and that solves
> > > it for me.
> > >  
> > > Baruch Even
> > >  
> > > --
> > > 
> > > Baruch Even, Software Developer 
> > > E  baruch(a)weka.io 
> > > www.weka.io
> > > _______________________________________________
> > > SPDK mailing list
> > > SPDK(a)lists.01.org
> > > https://lists.01.org/mailman/listinfo/spdk
> > 
> > --
> > 
> > Baruch Even, Software Developer 
> > E  baruch(a)weka.io 
> > www.weka.io
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://lists.01.org/mailman/listinfo/spdk
> 
> -- 
> 
> Baruch Even, Software Developer  
> E  baruch(a)weka.io 
> www.weka.io
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [SPDK] Handling of physical disk removals
@ 2018-05-30 17:49 Andrey Kuzmin
  0 siblings, 0 replies; 17+ messages in thread
From: Andrey Kuzmin @ 2018-05-30 17:49 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 11172 bytes --]

On Wed, May 30, 2018 at 2:46 PM, Baruch Even <baruch(a)weka.io> wrote:

> As for threading, the only thing I need to make things integrate better
> into my green-threads is to use a function call to do a context switch if
> needed (spdk_yield() of sorts), that is empty/no-op for most users and can
> be a compile time option (weak symbol? define?) to use some user-provided
> function. This will immediately integrate into any green thread system by
> switching to another user-thread and returning for a poll when other
> actions are taken. This way the posix-thread will not block and there is no
> special async logic that needs to happen.
>
> As for the low level api, my application already has its own queue for
> each device, and we split our requests as we need to and combine them as
> possible to improve performance. It took us by surprise to find that SPDK
> internally breaks out the requests to smaller chunks, queues them
> internally and completions may not really complete the full request. There
> was even a scenario where we had a deadlock because we overflowed the spdk
> queues. Once we found these out we made sure to expose to our application
> all the device constraints and break the IOs with the device limitations in
> mind and now all the extra work and checks that SPDK does are a waste of
> cpu cycles as far as we are concerned.
>
> There is also the issue that we track our IOs, and SPDK, because it does the
> extra work, has IO tracking of its own; that means an extra memory
> allocation on the data path for no good reason from our point of view.
>

A small but mighty step in this direction would be to make
spdk_bdev_submit_io
<https://github.com/spdk/spdk/blob/d34bd0a60bace1b7e4022f12d0892f028b7f2377/lib/bdev/bdev.c#L911>
public. If the caller wishes to manage the setup and completion chaining
(in the stacked case this is trivial to do, as the caller already has a per
I/O context allocated at this point), that would be the path of least
resistance, avoiding the overhead of the extra I/O allocation in
spdk_bdev_{read,write}*. And the code would be much simpler, avoiding the
(often useless) io_type switch.


>
> When I'm building my command in the queue I need to keep the list of
> buffers in some data structure; currently I keep it in a vector and then
> this gets translated to a PRP/SGL when the command is submitted. I could
> skip that step if I could maintain the PRP/SGL myself and pass the
> ready-made list to the command. This would save me some memory and time.
>

If your use case is limited to NVMe hardware, you might want to consider
NVMe passthrough which avoids the buffer setup overhead, and also gives you
complete control over completion status.
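
A rough sketch of that route for a plain read, assuming an already
initialized controller/qpair and a DMA-able buffer (this is the raw command
interface referenced elsewhere in the thread; in nvme.h the name is
spdk_nvme_ctrlr_cmd_io_raw()). Field packing follows the NVMe spec
directly, with no SPDK helper in between:

#include "spdk/nvme.h"

static int
passthru_read(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_qpair *qpair,
              uint32_t nsid, void *buf, uint64_t lba, uint32_t lba_count,
              uint32_t sector_size, spdk_nvme_cmd_cb cb_fn, void *cb_arg)
{
    struct spdk_nvme_cmd cmd = {};

    cmd.opc = SPDK_NVME_OPC_READ;
    cmd.nsid = nsid;
    cmd.cdw10 = (uint32_t)lba;          /* starting LBA, low dword */
    cmd.cdw11 = (uint32_t)(lba >> 32);  /* starting LBA, high dword */
    cmd.cdw12 = lba_count - 1;          /* number of blocks, 0's based */

    return spdk_nvme_ctrlr_cmd_io_raw(ctrlr, qpair, &cmd, buf,
                                      lba_count * sector_size,
                                      cb_fn, cb_arg);
}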

Regards,
Andrey


>
> All of this just means that spdk is wasting some cpu cycles on things we
> don't want it to do and after all the main reason to use spdk and dpdk is
> to eke out all the possible performance from the system (cpu/nvme devices)
> by a more integrated design. I can understand that other users of spdk do
> want the higher levels and do need the features provided but for us we have
> better places to do these actions.
>
> Baruch
>
> On Thu, May 24, 2018 at 7:44 PM Walker, Benjamin <
> benjamin.walker(a)intel.com> wrote:
>
>> Hi Baruch,
>>
>>
>>
>> Regarding blocking threads – only the NVMe initialization path does that
>> today. We considered doing a fully asynchronous initialization path for
>> NVMe devices, but we received a lot of feedback that it makes it very
>> difficult to use. I’d personally be open to adding a second initialization
>> path that was entirely asynchronous and required the user to explicitly
>> poll some initialization context to advance the state machine until the
>> devices come online. This clearly makes much more sense in the context of
>> hot plug since the device initialization may occur on a thread that is also
>> processing I/O. Contributions are always very welcome!
>>
>>
>>
>> I do expect to see some fairly major movement this year regarding
>> abstractions around threading models and I’d love to define some
>> abstraction that allows SPDK to nicely integrate into applications using
>> green threads or coroutines without tying SPDK to any specific
>> implementation. This is definitely at the front of mind currently (as is
>> dynamic memory management – the other major hurdle for integrating with
>> existing applications). I think over the coming months there are going to
>> be several lively discussions on the mailing list and in IRC as we try to
>> sort all of this out.
>>
>>
>>
>> Regarding a lower level API – we have some lower level interfaces,
>> although not as low-level as you’re talking about. See
>> spdk_nvme_ctrlr_io_cmd_raw()
>> <http://www.spdk.io/doc/nvme_8h.html#a1e3def668122e76abbfb74305f118291>,
>> for example, which lets you build an arbitrary NVMe command and send it. We
>> haven’t exposed any knobs to control internal queueing or to allow a user
>> to build their own SGL/PRP. Can you outline what the use cases for those
>> are? Why would you want to deviate from what SPDK does by default today?
>> We’re always amenable to providing more control, if that control has value.
>>
>>
>>
>> Thanks,
>>
>> Ben
>>
>>
>>
>> *From:* SPDK [mailto:spdk-bounces(a)lists.01.org] *On Behalf Of *Baruch
>> Even
>> *Sent:* Thursday, May 24, 2018 1:17 AM
>>
>>
>> *To:* Storage Performance Development Kit <spdk(a)lists.01.org>
>>
>> *Subject:* Re: [SPDK] Handling of physical disk removals
>>
>>
>>
>> Hi,
>>
>> I logged the issue as https://github.com/spdk/spdk/issues/310
>>
>> I do call detach but I do not use the empty probe call, instead I have
>> another part of my application monitor the same uevent information and
>> trigger a larger process to remove the drive. I also have this triggered by
>> the timeout callback. I had to remove the timeout callback or I'd get
>> bombarded with the callback until it actually happened, looks like the
>> timeout callback assumes that the reset call will happen from inside it and
>> complete before it returns but that doesn't work very well with our
>> application.
>>
>> I have an issue with the threading model you are using as we are using
>> user-space-threads (aka green threads or coroutines) and spdk will block
>> the entire application for no good reason. It would have been nice if there
>> was a yield callback I can provide you so you won't block the main thread.
>> For now I need to resort to send the spdk admin calls to a thread to be
>> done there since spdk may block.
>>
>> If I am on a ranting roll I also would have preferred a lower level
>> interface than currently provided that doesn't implement internal queuing
>> and internal allocations so much and doesn't hide all the nvme details
>> (prp, sgl), though by that stage I'm probably better of with unvme instead
>> of spdk. If they had support uio_pci_generic in addition to vfio-pci it
>> would probably be a better direction for me.
>>
>>
>>
>> Baruch
>>
>>
>>
>> On Wed, May 23, 2018 at 8:18 PM Harris, James R <james.r.harris(a)intel.com>
>> wrote:
>>
>> Hi Baruch,
>>
>>
>>
>> Thanks for raising this issue – there are absolutely changes that SPDK
>> needs to make here.
>>
>>
>>
>> Can you describe your code path a bit more?  Or let me try to guess and
>> you can tell me where I’m wrong.
>>
>>
>>
>> 1)       You’re using spdk_nvme_probe() with a NULL trid and a remove_cb
>> handler to detect physically removed devices.
>>
>> 2)       In your remove_cb handler, you call spdk_nvme_detach().
>>
>>
>>
>> The SPDK bdev nvme module doesn’t call spdk_nvme_detach() in its
>> remove_cb which is why the SPDK automated tests don’t run into this issue.
>> But I don’t this is correct – we should be calling spdk_nvme_detach() at
>> some point to clean up any allocated resources.  It needs to make sure any
>> associate IO channels are freed up first (to avoid racing between the
>> remove callback and different threads submitting IO to that removed
>> controller).
>>
>>
>>
>> Could you file this as a bug in github?  Please add any additional
>> details on how you’re hitting this issue if it’s different than what I’ve
>> guessed above.
>>
>>
>>
>> https://github.com/spdk/spdk/issues
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Jim
>>
>>
>>
>>
>>
>> *From: *SPDK <spdk-bounces(a)lists.01.org> on behalf of Baruch Even <
>> baruch(a)weka.io>
>> *Reply-To: *Storage Performance Development Kit <spdk(a)lists.01.org>
>> *Date: *Wednesday, May 23, 2018 at 1:44 AM
>> *To: *Storage Performance Development Kit <spdk(a)lists.01.org>
>> *Subject: *[SPDK] Handling of physical disk removals
>>
>>
>>
>> Hi,
>>
>> I'm using spdk for local nvme through the nvme interface, I find that
>> physical disk removals are not handled properly for my use case and wonder
>> if others see it that way as well and if there is an intention to fix this.
>>
>> Our system uses long running processes that control one or more disks at
>> a time, if a disk fails it may drop completely from the pcie bus and it
>> will also look like that if the disk is physically removed (say a
>> technician mistakes the disk that he should replace).
>>
>> The problem that I see is that spdk doesnt consider a device completely
>> disappearing from the bus and will try to release the io qpair by sending
>> the delete io sq and delete io cq commands, both of these will never get an
>> answer (the device is not on the pcie device anymore) and there is no
>> timeout logic in that code path. This means two things, the process will
>> halt forever and there is an effective memory leak which currently means
>> that we need to restart the process. Now, our system is resilient enough
>> that restarting the process is not a big deal but it is a very messy way to
>> go about handlign a physical drive removal.
>>
>>
>>
>> Have others seen this behavior? Does it bother others?
>>
>> For my own use I put a timeout in there of a few seconds and that solves
>> it for me.
>>
>>
>>
>> Baruch Even
>>
>>
>>
>> --
>>
>> [image: image001.jpg]
>> * Baruch Even, **Software Developer  *
>>
>> *E * baruch(a)weka.io <liran(a)weka.io>
>> www.weka.io
>>
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
>>
>> --
>>
>> [image: ~WRD000.jpg]
>> * Baruch Even, **Software Developer  *
>>
>> *E * baruch(a)weka.io <liran(a)weka.io>
>> www.weka.io
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
>>
> --
>
>
>
> *Baruch Even, Software Developer  E  baruch(a)weka.io <liran(a)weka.io>
> www.weka.io <http://www.weka.io>*
>
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
>
>

[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 26783 bytes --]

[-- Attachment #3: image001.jpg --]
[-- Type: image/jpeg, Size: 476 bytes --]

[-- Attachment #4: WRD000.jpg --]
[-- Type: image/jpeg, Size: 823 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [SPDK] Handling of physical disk removals
@ 2018-05-30 11:46 Baruch Even
  0 siblings, 0 replies; 17+ messages in thread
From: Baruch Even @ 2018-05-30 11:46 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 9843 bytes --]

As for threading, the only thing I need to make things integrate better
into my green threads is a function call that does a context switch if
needed (a spdk_yield() of sorts); it is an empty no-op for most users and
can be a compile-time option (weak symbol? define?) to use some
user-provided function. This would immediately integrate with any green
thread system by switching to another user thread and coming back to poll
when other actions are taken. This way the posix thread will not block and
no special async logic needs to happen.
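
A hypothetical sketch of that hook (none of this exists in SPDK today): the
library ships a no-op weak default and calls it from its wait loops, and an
application using green threads overrides it with a strong symbol at link
time. my_fiber_yield() stands in for whatever the scheduler provides:

/* in the SPDK library */
void __attribute__((weak))
spdk_yield(void)
{
    /* no-op: ordinary users keep today's busy-wait behaviour */
}

/* in the application, in a separate object file */
extern void my_fiber_yield(void);   /* placeholder for the app's scheduler */

void
spdk_yield(void)
{
    /* switch to another user-space thread; come back later to poll again */
    my_fiber_yield();
}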

As for the low level api, my application already has its own queue for each
device, and we split our requests as we need to and combine them where
possible to improve performance. It took us by surprise to find that SPDK
internally breaks out the requests to smaller chunks, queues them
internally and completions may not really complete the full request. There
was even was scenario that we had a deadlock because we overflowed the spdk
queues. Once we found these out we made sure to expose to our application
all the device constraints and break the IOs with the device limitations in
mind and now all the extra work and checks that SPDK does are a waste of
cpu cycles as far as we are concerned.
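
The constraints themselves can be queried from SPDK - a small sketch,
assuming the ns/ctrlr helpers and the io_qpair opts structure available in
current releases:

#include <stdio.h>
#include "spdk/nvme.h"

static void
dump_limits(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_ns *ns)
{
    struct spdk_nvme_io_qpair_opts opts;

    spdk_nvme_ctrlr_get_default_io_qpair_opts(ctrlr, &opts, sizeof(opts));

    printf("max transfer size: %u bytes\n",
           spdk_nvme_ns_get_max_io_xfer_size(ns));
    printf("sector size:       %u bytes\n",
           spdk_nvme_ns_get_sector_size(ns));
    printf("io queue size:     %u entries, %u request objects\n",
           opts.io_queue_size, opts.io_queue_requests);
}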

There is also the issue that we track our IOs, and SPDK, because it does
the extra work, has IO tracking of its own; that means an extra memory
allocation on the data path for no good reason from our point of view.

When I'm building my command in the queue I need to keep the list of
buffers in some data structure; currently I keep it in a vector and then
this gets translated to a PRP/SGL when the command is submitted. I could
skip that step if I could maintain the PRP/SGL myself and pass the
ready-made list to the command. This would save me some memory and time.

All of this just means that spdk is wasting some cpu cycles on things we
don't want it to do and after all the main reason to use spdk and dpdk is
to eke out all the possible performance from the system (cpu/nvme devices)
by a more integrated design. I can understand that other users of spdk do
want the higher levels and do need the features provided but for us we have
better places to do these actions.

Baruch

On Thu, May 24, 2018 at 7:44 PM Walker, Benjamin <benjamin.walker(a)intel.com>
wrote:

> Hi Baruch,
>
>
>
> Regarding blocking threads – only the NVMe initialization path does that
> today. We considered doing a fully asynchronous initialization path for
> NVMe devices, but we received a lot of feedback that it makes it very
> difficult to use. I’d personally be open to adding a second initialization
> path that was entirely asynchronous and required the user to explicitly
> poll some initialization context to advance the state machine until the
> devices come online. This clearly makes much more sense in the context of
> hot plug since the device initialization may occur on a thread that is also
> processing I/O. Contributions are always very welcome!
>
>
>
> I do expect to see some fairly major movement this year regarding
> abstractions around threading models and I’d love to define some
> abstraction that allows SPDK to nicely integrate into applications using
> green threads or coroutines without tying SPDK to any specific
> implementation. This is definitely at the front of mind currently (as is
> dynamic memory management – the other major hurdle for integrating with
> existing applications). I think over the coming months there are going to
> be several lively discussions on the mailing list and in IRC as we try to
> sort all of this out.
>
>
>
> Regarding a lower level API – we have some lower level interfaces,
> although not as low-level as you’re talking about. See
> spdk_nvme_ctrlr_io_cmd_raw()
> <http://www.spdk.io/doc/nvme_8h.html#a1e3def668122e76abbfb74305f118291>,
> for example, which lets you build an arbitrary NVMe command and send it. We
> haven’t exposed any knobs to control internal queueing or to allow a user
> to build their own SGL/PRP. Can you outline what the use cases for those
> are? Why would you want to deviate from what SPDK does by default today?
> We’re always amenable to providing more control, if that control has value.
>
>
>
> Thanks,
>
> Ben
>
>
>
> *From:* SPDK [mailto:spdk-bounces(a)lists.01.org] *On Behalf Of *Baruch Even
> *Sent:* Thursday, May 24, 2018 1:17 AM
>
>
> *To:* Storage Performance Development Kit <spdk(a)lists.01.org>
>
> *Subject:* Re: [SPDK] Handling of physical disk removals
>
>
>
> Hi,
>
> I logged the issue as https://github.com/spdk/spdk/issues/310
>
> I do call detach but I do not use the empty probe call, instead I have
> another part of my application monitor the same uevent information and
> trigger a larger process to remove the drive. I also have this triggered by
> the timeout callback. I had to remove the timeout callback or I'd get
> bombarded with the callback until it actually happened, looks like the
> timeout callback assumes that the reset call will happen from inside it and
> complete before it returns but that doesn't work very well with our
> application.
>
> I have an issue with the threading model you are using as we are using
> user-space-threads (aka green threads or coroutines) and spdk will block
> the entire application for no good reason. It would have been nice if there
> was a yield callback I can provide you so you won't block the main thread.
> For now I need to resort to send the spdk admin calls to a thread to be
> done there since spdk may block.
>
> If I am on a ranting roll I also would have preferred a lower level
> interface than currently provided that doesn't implement internal queuing
> and internal allocations so much and doesn't hide all the nvme details
> (prp, sgl), though by that stage I'm probably better of with unvme instead
> of spdk. If they had support uio_pci_generic in addition to vfio-pci it
> would probably be a better direction for me.
>
>
>
> Baruch
>
>
>
> On Wed, May 23, 2018 at 8:18 PM Harris, James R <james.r.harris(a)intel.com>
> wrote:
>
> Hi Baruch,
>
>
>
> Thanks for raising this issue – there are absolutely changes that SPDK
> needs to make here.
>
>
>
> Can you describe your code path a bit more?  Or let me try to guess and
> you can tell me where I’m wrong.
>
>
>
> 1)       You’re using spdk_nvme_probe() with a NULL trid and a remove_cb
> handler to detect physically removed devices.
>
> 2)       In your remove_cb handler, you call spdk_nvme_detach().
>
>
>
> The SPDK bdev nvme module doesn’t call spdk_nvme_detach() in its remove_cb
> which is why the SPDK automated tests don’t run into this issue.  But I
> don’t this is correct – we should be calling spdk_nvme_detach() at some
> point to clean up any allocated resources.  It needs to make sure any
> associate IO channels are freed up first (to avoid racing between the
> remove callback and different threads submitting IO to that removed
> controller).
>
>
>
> Could you file this as a bug in github?  Please add any additional details
> on how you’re hitting this issue if it’s different than what I’ve guessed
> above.
>
>
>
> https://github.com/spdk/spdk/issues
>
>
>
> Thanks,
>
>
>
> Jim
>
>
>
>
>
> *From: *SPDK <spdk-bounces(a)lists.01.org> on behalf of Baruch Even <
> baruch(a)weka.io>
> *Reply-To: *Storage Performance Development Kit <spdk(a)lists.01.org>
> *Date: *Wednesday, May 23, 2018 at 1:44 AM
> *To: *Storage Performance Development Kit <spdk(a)lists.01.org>
> *Subject: *[SPDK] Handling of physical disk removals
>
>
>
> Hi,
>
> I'm using spdk for local nvme through the nvme interface, I find that
> physical disk removals are not handled properly for my use case and wonder
> if others see it that way as well and if there is an intention to fix this.
>
> Our system uses long running processes that control one or more disks at a
> time, if a disk fails it may drop completely from the pcie bus and it will
> also look like that if the disk is physically removed (say a technician
> mistakes the disk that he should replace).
>
> The problem that I see is that spdk doesnt consider a device completely
> disappearing from the bus and will try to release the io qpair by sending
> the delete io sq and delete io cq commands, both of these will never get an
> answer (the device is not on the pcie device anymore) and there is no
> timeout logic in that code path. This means two things, the process will
> halt forever and there is an effective memory leak which currently means
> that we need to restart the process. Now, our system is resilient enough
> that restarting the process is not a big deal but it is a very messy way to
> go about handlign a physical drive removal.
>
>
>
> Have others seen this behavior? Does it bother others?
>
> For my own use I put a timeout in there of a few seconds and that solves
> it for me.
>
>
>
> Baruch Even
>
>
>
> --
>
> [image: image001.jpg]
> * Baruch Even, **Software Developer  *
>
> *E * baruch(a)weka.io <liran(a)weka.io>
> www.weka.io
>
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
>
> --
>
> [image: ~WRD000.jpg]
> * Baruch Even, **Software Developer  *
>
> *E * baruch(a)weka.io <liran(a)weka.io>
> www.weka.io
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
>
-- 



*Baruch Even, Software Developer  E  baruch(a)weka.io <liran(a)weka.io>
www.weka.io <http://www.weka.io>*

[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 23934 bytes --]

[-- Attachment #3: WRD000.jpg --]
[-- Type: image/jpeg, Size: 823 bytes --]

[-- Attachment #4: image001.jpg --]
[-- Type: image/jpeg, Size: 476 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [SPDK] Handling of physical disk removals
@ 2018-05-24 16:44 Walker, Benjamin
  0 siblings, 0 replies; 17+ messages in thread
From: Walker, Benjamin @ 2018-05-24 16:44 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 6880 bytes --]

Hi Baruch,

Regarding blocking threads – only the NVMe initialization path does that today. We considered doing a fully asynchronous initialization path for NVMe devices, but we received a lot of feedback that it makes it very difficult to use. I’d personally be open to adding a second initialization path that was entirely asynchronous and required the user to explicitly poll some initialization context to advance the state machine until the devices come online. This clearly makes much more sense in the context of hot plug since the device initialization may occur on a thread that is also processing I/O. Contributions are always very welcome!

I do expect to see some fairly major movement this year regarding abstractions around threading models and I’d love to define some abstraction that allows SPDK to nicely integrate into applications using green threads or coroutines without tying SPDK to any specific implementation. This is definitely at the front of mind currently (as is dynamic memory management – the other major hurdle for integrating with existing applications). I think over the coming months there are going to be several lively discussions on the mailing list and in IRC as we try to sort all of this out.

Regarding a lower level API – we have some lower level interfaces, although not as low-level as you’re talking about. See spdk_nvme_ctrlr_io_cmd_raw()<http://www.spdk.io/doc/nvme_8h.html#a1e3def668122e76abbfb74305f118291>, for example, which lets you build an arbitrary NVMe command and send it. We haven’t exposed any knobs to control internal queueing or to allow a user to build their own SGL/PRP. Can you outline what the use cases for those are? Why would you want to deviate from what SPDK does by default today? We’re always amenable to providing more control, if that control has value.

Thanks,
Ben

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Baruch Even
Sent: Thursday, May 24, 2018 1:17 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Handling of physical disk removals

Hi,
I logged the issue as https://github.com/spdk/spdk/issues/310
I do call detach but I do not use the empty probe call, instead I have another part of my application monitor the same uevent information and trigger a larger process to remove the drive. I also have this triggered by the timeout callback. I had to remove the timeout callback or I'd get bombarded with the callback until it actually happened, looks like the timeout callback assumes that the reset call will happen from inside it and complete before it returns but that doesn't work very well with our application.
I have an issue with the threading model you are using as we are using user-space-threads (aka green threads or coroutines) and spdk will block the entire application for no good reason. It would have been nice if there was a yield callback I can provide you so you won't block the main thread. For now I need to resort to send the spdk admin calls to a thread to be done there since spdk may block.
If I am on a ranting roll I also would have preferred a lower level interface than currently provided that doesn't implement internal queuing and internal allocations so much and doesn't hide all the nvme details (prp, sgl), though by that stage I'm probably better of with unvme instead of spdk. If they had support uio_pci_generic in addition to vfio-pci it would probably be a better direction for me.

Baruch

On Wed, May 23, 2018 at 8:18 PM Harris, James R <james.r.harris(a)intel.com<mailto:james.r.harris(a)intel.com>> wrote:
Hi Baruch,

Thanks for raising this issue – there are absolutely changes that SPDK needs to make here.

Can you describe your code path a bit more?  Or let me try to guess and you can tell me where I’m wrong.


1)       You’re using spdk_nvme_probe() with a NULL trid and a remove_cb handler to detect physically removed devices.

2)       In your remove_cb handler, you call spdk_nvme_detach().

The SPDK bdev nvme module doesn’t call spdk_nvme_detach() in its remove_cb which is why the SPDK automated tests don’t run into this issue.  But I don’t this is correct – we should be calling spdk_nvme_detach() at some point to clean up any allocated resources.  It needs to make sure any associate IO channels are freed up first (to avoid racing between the remove callback and different threads submitting IO to that removed controller).

Could you file this as a bug in github?  Please add any additional details on how you’re hitting this issue if it’s different than what I’ve guessed above.

https://github.com/spdk/spdk/issues

Thanks,

Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Baruch Even <baruch(a)weka.io<mailto:baruch(a)weka.io>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Wednesday, May 23, 2018 at 1:44 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: [SPDK] Handling of physical disk removals

Hi,
I'm using spdk for local nvme through the nvme interface, I find that physical disk removals are not handled properly for my use case and wonder if others see it that way as well and if there is an intention to fix this.
Our system uses long running processes that control one or more disks at a time, if a disk fails it may drop completely from the pcie bus and it will also look like that if the disk is physically removed (say a technician mistakes the disk that he should replace).
The problem that I see is that spdk doesnt consider a device completely disappearing from the bus and will try to release the io qpair by sending the delete io sq and delete io cq commands, both of these will never get an answer (the device is not on the pcie device anymore) and there is no timeout logic in that code path. This means two things, the process will halt forever and there is an effective memory leak which currently means that we need to restart the process. Now, our system is resilient enough that restarting the process is not a big deal but it is a very messy way to go about handlign a physical drive removal.

Have others seen this behavior? Does it bother others?
For my own use I put a timeout in there of a few seconds and that solves it for me.

Baruch Even

--
Baruch Even, Software Developer
E  baruch(a)weka.io<mailto:liran(a)weka.io>
www.weka.io<http://www.weka.io>
_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org<mailto:SPDK(a)lists.01.org>
https://lists.01.org/mailman/listinfo/spdk
--
Baruch Even, Software Developer
E  baruch(a)weka.io<mailto:liran(a)weka.io>
www.weka.io<http://www.weka.io>

[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 20578 bytes --]

[-- Attachment #3: WRD000.jpg --]
[-- Type: image/jpeg, Size: 823 bytes --]

[-- Attachment #4: image001.jpg --]
[-- Type: image/jpeg, Size: 476 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [SPDK] Handling of physical disk removals
@ 2018-05-24  8:16 Baruch Even
  0 siblings, 0 replies; 17+ messages in thread
From: Baruch Even @ 2018-05-24  8:16 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 4822 bytes --]

Hi,

I logged the issue as https://github.com/spdk/spdk/issues/310

I do call detach, but I do not use the empty probe call; instead I have
another part of my application monitor the same uevent information and
trigger a larger process to remove the drive. I also have this triggered by
the timeout callback. I had to remove the timeout callback or I'd get
bombarded with the callback until the removal actually happened; it looks
like the timeout callback assumes that the reset call will happen from
inside it and complete before it returns, but that doesn't work very well
with our application.

I have an issue with the threading model you are using, as we are using
user-space threads (aka green threads or coroutines) and spdk will block
the entire application for no good reason. It would have been nice if there
were a yield callback I could provide so you won't block the main thread.
For now I have to resort to sending the spdk admin calls to a separate
thread, since spdk may block.
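
A minimal sketch of that workaround (the admin_queue helpers here are
illustrative application code, not SPDK APIs): blocking SPDK admin
operations get handed to a dedicated pthread so the user-space threads are
never stalled.

#include <pthread.h>
#include <stdbool.h>

struct admin_job {
    void (*fn)(void *arg);   /* e.g. a wrapper around spdk_nvme_probe() */
    void  *arg;
};

/* hypothetical MPSC queue owned by the application */
struct admin_queue;
bool admin_queue_pop(struct admin_queue *q, struct admin_job *job);

static void *
admin_thread(void *ctx)
{
    struct admin_queue *q = ctx;
    struct admin_job job;

    while (admin_queue_pop(q, &job)) {  /* blocks until work arrives */
        job.fn(job.arg);                /* may block inside SPDK; fine here */
    }
    return NULL;
}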

If I am on a ranting roll, I also would have preferred a lower level
interface than the one currently provided, one that doesn't implement so
much internal queuing and internal allocation and doesn't hide all the nvme
details (prp, sgl), though at that stage I'm probably better off with unvme
instead of spdk. If it supported uio_pci_generic in addition to vfio-pci it
would probably be a better direction for me.

Baruch

On Wed, May 23, 2018 at 8:18 PM Harris, James R <james.r.harris(a)intel.com>
wrote:

> Hi Baruch,
>
>
>
> Thanks for raising this issue – there are absolutely changes that SPDK
> needs to make here.
>
>
>
> Can you describe your code path a bit more?  Or let me try to guess and
> you can tell me where I’m wrong.
>
>
>
> 1)       You’re using spdk_nvme_probe() with a NULL trid and a remove_cb
> handler to detect physically removed devices.
>
> 2)       In your remove_cb handler, you call spdk_nvme_detach().
>
>
>
> The SPDK bdev nvme module doesn’t call spdk_nvme_detach() in its remove_cb
> which is why the SPDK automated tests don’t run into this issue.  But I
> don’t this is correct – we should be calling spdk_nvme_detach() at some
> point to clean up any allocated resources.  It needs to make sure any
> associate IO channels are freed up first (to avoid racing between the
> remove callback and different threads submitting IO to that removed
> controller).
>
>
>
> Could you file this as a bug in github?  Please add any additional details
> on how you’re hitting this issue if it’s different than what I’ve guessed
> above.
>
>
>
> https://github.com/spdk/spdk/issues
>
>
>
> Thanks,
>
>
>
> Jim
>
>
>
>
>
> *From: *SPDK <spdk-bounces(a)lists.01.org> on behalf of Baruch Even <
> baruch(a)weka.io>
> *Reply-To: *Storage Performance Development Kit <spdk(a)lists.01.org>
> *Date: *Wednesday, May 23, 2018 at 1:44 AM
> *To: *Storage Performance Development Kit <spdk(a)lists.01.org>
> *Subject: *[SPDK] Handling of physical disk removals
>
>
>
> Hi,
>
> I'm using spdk for local nvme through the nvme interface, I find that
> physical disk removals are not handled properly for my use case and wonder
> if others see it that way as well and if there is an intention to fix this.
>
> Our system uses long running processes that control one or more disks at a
> time, if a disk fails it may drop completely from the pcie bus and it will
> also look like that if the disk is physically removed (say a technician
> mistakes the disk that he should replace).
>
> The problem that I see is that spdk doesnt consider a device completely
> disappearing from the bus and will try to release the io qpair by sending
> the delete io sq and delete io cq commands, both of these will never get an
> answer (the device is not on the pcie device anymore) and there is no
> timeout logic in that code path. This means two things, the process will
> halt forever and there is an effective memory leak which currently means
> that we need to restart the process. Now, our system is resilient enough
> that restarting the process is not a big deal but it is a very messy way to
> go about handlign a physical drive removal.
>
>
>
> Have others seen this behavior? Does it bother others?
>
> For my own use I put a timeout in there of a few seconds and that solves
> it for me.
>
>
>
> Baruch Even
>
>
>
> --
>
>
> * Baruch Even, **Software Developer  *
>
> *E * baruch(a)weka.io <liran(a)weka.io>
> www.weka.io
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
>
-- 



*Baruch Even, Software Developer  E  baruch(a)weka.io <liran(a)weka.io>
www.weka.io <http://www.weka.io>*

[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 11700 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [SPDK] Handling of physical disk removals
@ 2018-05-23 17:18 Harris, James R
  0 siblings, 0 replies; 17+ messages in thread
From: Harris, James R @ 2018-05-23 17:18 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 2946 bytes --]

Hi Baruch,

Thanks for raising this issue – there are absolutely changes that SPDK needs to make here.

Can you describe your code path a bit more?  Or let me try to guess and you can tell me where I’m wrong.


1)       You’re using spdk_nvme_probe() with a NULL trid and a remove_cb handler to detect physically removed devices.

2)       In your remove_cb handler, you call spdk_nvme_detach().

The SPDK bdev nvme module doesn’t call spdk_nvme_detach() in its remove_cb which is why the SPDK automated tests don’t run into this issue.  But I don’t think this is correct – we should be calling spdk_nvme_detach() at some point to clean up any allocated resources.  It needs to make sure any associated IO channels are freed up first (to avoid racing between the remove callback and different threads submitting IO to that removed controller).
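
An illustrative sketch only (not the bdev module’s code) of a remove_cb that respects that ordering; the disk lookup and the cross-thread qpair teardown are hypothetical application pieces:

#include <stdbool.h>
#include "spdk/nvme.h"

struct my_disk {
    volatile bool removed;
    /* ... application bookkeeping ... */
};

/* hypothetical application helpers */
struct my_disk *my_lookup_disk(void *cb_ctx, struct spdk_nvme_ctrlr *ctrlr);
void my_wait_for_io_threads(struct my_disk *disk);

static void
my_remove_cb(void *cb_ctx, struct spdk_nvme_ctrlr *ctrlr)
{
    struct my_disk *disk = my_lookup_disk(cb_ctx, ctrlr);

    disk->removed = true;            /* stop submitting new I/O */

    /* Each I/O thread frees its own qpair with
     * spdk_nvme_ctrlr_free_io_qpair(); only after the last one is gone: */
    my_wait_for_io_threads(disk);

    spdk_nvme_detach(ctrlr);         /* now release controller resources */
}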

Could you file this as a bug in github?  Please add any additional details on how you’re hitting this issue if it’s different than what I’ve guessed above.

https://github.com/spdk/spdk/issues

Thanks,

Jim


From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Baruch Even <baruch(a)weka.io>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
Date: Wednesday, May 23, 2018 at 1:44 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: [SPDK] Handling of physical disk removals

Hi,
I'm using spdk for local nvme through the nvme interface, I find that physical disk removals are not handled properly for my use case and wonder if others see it that way as well and if there is an intention to fix this.
Our system uses long running processes that control one or more disks at a time, if a disk fails it may drop completely from the pcie bus and it will also look like that if the disk is physically removed (say a technician mistakes the disk that he should replace).
The problem that I see is that spdk doesnt consider a device completely disappearing from the bus and will try to release the io qpair by sending the delete io sq and delete io cq commands, both of these will never get an answer (the device is not on the pcie device anymore) and there is no timeout logic in that code path. This means two things, the process will halt forever and there is an effective memory leak which currently means that we need to restart the process. Now, our system is resilient enough that restarting the process is not a big deal but it is a very messy way to go about handlign a physical drive removal.

Have others seen this behavior? Does it bother others?
For my own use I put a timeout in there of a few seconds and that solves it for me.

Baruch Even

--
Baruch Even, Software Developer

E  baruch(a)weka.io<mailto:liran(a)weka.io>
www.weka.io<http://www.weka.io>

[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 11409 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [SPDK] Handling of physical disk removals
@ 2018-05-23  8:44 Baruch Even
  0 siblings, 0 replies; 17+ messages in thread
From: Baruch Even @ 2018-05-23  8:44 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 1426 bytes --]

Hi,

I'm using spdk for local nvme through the nvme interface. I find that
physical disk removals are not handled properly for my use case and wonder
if others see it that way as well, and if there is an intention to fix this.

Our system uses long-running processes that control one or more disks at a
time. If a disk fails it may drop completely from the pcie bus, and it will
look the same if the disk is physically removed (say a technician mistakes
which disk he should replace).

The problem that I see is that spdk doesn't consider a device completely
disappearing from the bus and will try to release the io qpair by sending
the delete io sq and delete io cq commands; both of these will never get an
answer (the device is not on the pcie bus anymore) and there is no timeout
logic in that code path. This means two things: the process will halt
forever, and there is an effective memory leak, which currently means that
we need to restart the process. Now, our system is resilient enough that
restarting the process is not a big deal, but it is a very messy way to go
about handling a physical drive removal.

Have others seen this behavior? Does it bother others?

For my own use I put a timeout in there of a few seconds and that solves it
for me.

Baruch Even

-- 



*Baruch Even, Software Developer  E  baruch(a)weka.io <liran(a)weka.io>
www.weka.io <http://www.weka.io>*

[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 2775 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2018-06-03  8:35 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-01  8:25 [SPDK] Handling of physical disk removals Baruch Even
  -- strict thread matches above, loose matches on Subject: below --
2018-06-03  8:35 Baruch Even
2018-06-01 16:54 Harris, James R
2018-06-01  8:27 Baruch Even
2018-05-31 19:01 Andrey Kuzmin
2018-05-31 16:33 Verkamp, Daniel
2018-05-31 16:24 Harris, James R
2018-05-31  7:54 Andrey Kuzmin
2018-05-31  7:37 Baruch Even
2018-05-31  7:24 Baruch Even
2018-05-30 21:27 Walker, Benjamin
2018-05-30 17:49 Andrey Kuzmin
2018-05-30 11:46 Baruch Even
2018-05-24 16:44 Walker, Benjamin
2018-05-24  8:16 Baruch Even
2018-05-23 17:18 Harris, James R
2018-05-23  8:44 Baruch Even
