* Re: [SPDK] FW: sharing of single NVMe device between spdk userspace and native kernel driver
@ 2016-07-15 20:10 Walker, Benjamin
From: Walker, Benjamin @ 2016-07-15 20:10 UTC (permalink / raw)
  To: spdk


On Thu, 2016-07-14 at 21:34 +0000, Raj (Rajinikanth) Pandurangan wrote:
> Hi Ben,
> 
> Yes, I too agree that one of the most important requirements is to have a filesystem with SPDK.
> 
> What are the known challenges in developing a filesystem with SPDK? Are the interfaces/APIs
> provided by SPDK good enough?

SPDK provides only block-level access to storage devices today. At a minimum, a "filesystem" on top
of SPDK would need to provide a mechanism to dynamically allocate discontiguous physical blocks and
present them as a contiguous space that can be written to or read from in some unit (4k? 1 byte?).
I've been choosing to call that a "blob" to differentiate it from a file in the Unix sense, and I've
been calling the whole thing a "blobstore" as opposed to a filesystem. The blobstore would ensure
blobs are persistent and rediscoverable across reboots.
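
To make that shape concrete, here is a hypothetical C interface for such a blobstore. Nothing
like this exists in SPDK today; every name below is invented purely to illustrate the minimum
surface described above.

/* Hypothetical blobstore interface -- illustrative only, not an SPDK API.
 * A blob is a named unit of dynamically allocated device blocks presented
 * as one contiguous address space, persistent across reboots. */
#include <stdint.h>

struct blobstore;   /* on-device block allocator plus metadata */
struct blob;        /* one allocated unit, rediscoverable by name */

/* Load (or format) the store that lives on a block device. */
struct blobstore *blobstore_load(const char *bdev_name);

/* Create or reopen a blob by name; physical blocks are allocated lazily
 * and need not be contiguous on the device. */
struct blob *blob_open(struct blobstore *bs, const char *name);
int blob_resize(struct blob *b, uint64_t num_blocks);

/* I/O in whole blocks; whether byte-granular access should also be offered
 * is exactly the open question raised above (4k? 1 byte?). */
int blob_read(struct blob *b, void *buf,
              uint64_t offset_blocks, uint64_t num_blocks);
int blob_write(struct blob *b, const void *buf,
               uint64_t offset_blocks, uint64_t num_blocks);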

> 
> It would be good to list the known challenges on the mailing list, so that the community can try
> to address and discuss them.

Beyond the very basic requirements above, I think the additional requirements depend on the
application that is using it. Some applications can tolerate only being allowed to write and read in
sector-size chunks, for instance, which is important if the blobstore wishes to implement zero copy.
Other applications need finer granularity. Many databases don't need directories either - they can
live with a flat namespace in which to place their blobs. I think file-level permissions aren't
needed either.

Functional requirements aside, in order to get the best performance possible, the blobstore would
need to be asynchronous, lockless, and operate in polled mode. That's a real challenge due to shared
metadata, although I have a number of ideas in this area.
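
As a rough illustration of what "asynchronous, lockless, and polled" means for the I/O path, here
is a sketch built on the hypothetical interface above. All names are again invented; the point is
only the submit-then-poll pattern, where completions are reaped by the submitting thread rather
than delivered by interrupts or by other threads holding locks.

/* Hypothetical asynchronous, polled I/O path -- invented names, same caveat
 * as above. Nothing blocks, no locks are taken, and completions are reaped
 * by the submitting thread calling a poll function. */
#include <stdbool.h>
#include <stdint.h>

struct blobstore;
struct blob;

typedef void (*blob_io_cb)(void *cb_arg, int status);

/* Returns immediately; 'cb' fires later from blobstore_poll(). */
int blob_write_async(struct blob *b, const void *buf,
                     uint64_t offset_blocks, uint64_t num_blocks,
                     blob_io_cb cb, void *cb_arg);

/* Drive completions from the application's own poll/reactor loop. */
int blobstore_poll(struct blobstore *bs);

static void write_done(void *cb_arg, int status)
{
    (void)status;
    *(bool *)cb_arg = true;    /* real code would kick off the next step here */
}

static void example(struct blobstore *bs, struct blob *b, const void *buf)
{
    bool done = false;

    blob_write_async(b, buf, 0, 8, write_done, &done);
    while (!done)
        blobstore_poll(bs);    /* poll instead of sleeping on an interrupt */
}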

> 
> Thanks,
> 
> 
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
> Sent: Thursday, July 14, 2016 1:18 PM
> To: spdk(a)lists.01.org
> Subject: Re: [SPDK] FW: sharing of single NVMe device between spdk userspace and native kernel
> driver
> 
> On Thu, 2016-07-14 at 11:59 -0700, txcy uio wrote:
> > > Thank you Ben for the detailed reply.
> > A filesystem which can make use of SPDK is precisely the requirement. 
> > Everything else is a way to get around that. In my specific use case I 
> > wish to have a single nvme device which will have a rootfs as well. So 
> > such a filesystem will need to handle that as well (probably I am being too ambitious here).
> 
> It won't ever be possible to use SPDK as the driver for your boot device. SR-IOV would let you
> share the device between the kernel (for booting) and your application, if and/or when that
> exists.
> 
> > The only other "filesystem" that I am aware of is Ceph's bluefs which 
> > is very minimal and specific to Rocksdb backend.
> 
> This is the only one that I'm aware of currently as well, and it has a number of features that
> make it not particularly suitable for use with SPDK (even though it does work with SPDK). The
> biggest problems are around synchronous I/O operations and lack of memory pre-registration,
> forcing copies on every I/O.
> 
> >  On a side note if I had more than one nvme device on a system , do 
> > all the nvme devices need to be unbound from the kernel driver?
> 
> Each NVMe device is independent. You can use some NVMe devices with SPDK and others with the
> kernel at the same time with no conflict. Our setup scripts do either bind or unbind all of them
> at once, but that's just for convenience.
> 
> > 
> > --Tyc
> >  
> > > On 7/14/16, 10:55 AM, "SPDK on behalf of Walker, Benjamin" 
> > > <spdk-bounces(a)lists.01.org on behalf of benjamin.walker(a)intel.com> wrote:
> > > 
> > > > 
> > > > 
> > > > On Wed, 2016-07-13 at 12:59 -0700, txcy uio wrote:
> > > > > Hello Ben
> > > > > 
> > > > > I have a use case where I want to attach one namespace of a nvme 
> > > > > device to spdk driver and
> > > use the
> > > > > other namespace as a kernel block device to create a regular 
> > > > > filesystem. Current
> > > implementation of
> > > > > spdk requires the device to be unbound completely from the native 
> > > > > kernel driver. I was
> > > wondering
> > > > > if this is at all possible and if yes can this be accomplished 
> > > > > with the current spdk implementation?
> > > > 
> > > > Your request is one we get every few days or so, and it is a perfectly reasonable thing to
> > > > ask.
> > > I
> > > > haven't written down my standard response on the mailing list yet, 
> > > > so I'm going to take this opportunity to lay out our position for all to see and discuss.
> > > > 
> > > > From a purely technical standpoint, it is impossible to both load 
> > > > the SPDK driver as it exists
> > > today
> > > > and the kernel driver against the same PCI device. The registers 
> > > > exposed by the PCI device
> > > contain
> > > > global state and so there can only be a single "owner". There is an 
> > > > established hardware
> > > mechanism
> > > > for creating multiple virtual PCI devices from a single physical 
> > > > device that each can load
> > > their
> > > > own driver called SR-IOV. This is typically used by NICs today and 
> > > > I'm not aware of any NVMe
> > > SSDs
> > > > that support it currently. SR-IOV is the right solution for sharing 
> > > > the device like you outline
> > > in
> > > > the long term, though.
> > > > 
> > > > In the short term, it would be technically possible to create some 
> > > > kernel patches that add
> > > entries
> > > > to sysfs or provide ioctls that allow a user space process to claim 
> > > > an NVMe hardware queue for
> > > a
> > > > device that the kernel is managing. You could then run the SPDK 
> > > > driver's I/O path against that queue. Unfortunately, there are two 
> > > > insurmountable issues with this strategy. First, NVMe
> > > hardware
> > > > queues can write to any namespace on the device. Therefore, you 
> > > > couldn't enforce that the queue
> > > can
> > > > only write to the namespace you are intending. You couldn't even 
> > > > enforce that the queue is only
> > > used
> > > > for reads - you basically just have to trust the application to only do reasonable things.
> > > Second,
> > > > the device is owned by the kernel and therefore is not in an IOMMU 
> > > > protection domain with this strategy. The device can directly 
> > > > access the DMA engine, and with a small amount of work, you
> > > could
> > > > hijack that DMA engine to copy data to wherever you wanted on the 
> > > > system. For these two
> > > reasons,
> > > > patches of this nature would never be accepted into the mainline 
> > > > kernel. The SPDK team can't be
> > > in
> > > > the business of supporting patches that have been rejected by the kernel community.
> > > > 
> > > > Clearly, lots of people have requested to share a device between 
> > > > the kernel and SPDK, so I've
> > > been
> > > > trying to uncover all of the reasons they may want to do that. So 
> > > > far, in every case, it boils
> > > down
> > > > to not having a filesystem for use with SPDK. I'm hoping to steer 
> > > > the community to solve the
> > > problem
> > > > of not having a filesystem rather than trying to share the device. 
> > > > I'm not advocating for
> > > writing a
> > > > (mostly) POSIX compliant filesystem, but I do think there is a 
> > > > small core of functionality that
> > > most
> > > > databases or storage applications all require. These are things 
> > > > like allocating blocks into
> > > some
> > > > unit (I've been calling it a blob) that has a name and is 
> > > > persistent and rediscoverable across reboots. Writing this layer 
> > > > requires some serious thought - SPDK is fast in no small part
> > > because it
> > > > is purely asynchronous, polled, and lockless - so this layer would 
> > > > need to preserve those characteristics.
> > > > 
> > > > Sorry for the very long response, but I wanted to document my 
> > > > current thoughts on the mailing
> > > list
> > > > for all to see.
> > > > 
> > > > > 
> > > > > --Tyc
> > > > > 


* Re: [SPDK] FW: sharing of single NVMe device between spdk userspace and native kernel driver
@ 2016-07-15 21:07 Andrey Kuzmin
From: Andrey Kuzmin @ 2016-07-15 21:07 UTC (permalink / raw)
  To: spdk


On Fri, Jul 15, 2016, 23:10 Walker, Benjamin <benjamin.walker(a)intel.com>
wrote:

> On Thu, 2016-07-14 at 21:34 +0000, Raj (Rajinikanth) Pandurangan wrote:
> > Hi Ben,
> >
> > Yes, I too agree that one of the most important requirements is to have a
> > filesystem with SPDK.
> >
> > What are the known challenges in developing a filesystem with SPDK? Are
> > the interfaces/APIs provided by SPDK good enough?
>
> SPDK provides only block-level access to storage devices today.


I don't think SPDK provides even that today, as it's very much (naturally)
NVMe-centric, exposing a wealth of details that no block device provides. In
a kernel I/O stack, SPDK would be a protocol driver, with the block layer
above it still to be added.
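
To illustrate the missing piece, here is a sketch of what a minimal block-layer abstraction above
the NVMe protocol driver could look like. The names are invented for illustration and are not part
of SPDK; the idea is simply a generic read/write interface that hides namespaces and queue pairs
from whatever sits above it.

/* Sketch of the missing block layer: a generic block-device abstraction that
 * could sit above the NVMe protocol driver and hide queue pairs, namespaces
 * and other NVMe-specific detail. Invented names, not an SPDK interface. */
#include <stdint.h>

typedef void (*bdev_io_cb)(void *cb_arg, int status);

struct bdev_ops {
    uint64_t (*get_num_blocks)(void *driver_ctx);
    uint32_t (*get_block_size)(void *driver_ctx);
    int (*read)(void *driver_ctx, void *buf, uint64_t lba, uint32_t lba_count,
                bdev_io_cb cb, void *cb_arg);
    int (*write)(void *driver_ctx, const void *buf, uint64_t lba,
                 uint32_t lba_count, bdev_io_cb cb, void *cb_arg);
};

struct bdev {
    const struct bdev_ops *ops;  /* NVMe-backed today; other backends later */
    void *driver_ctx;            /* e.g. a namespace plus an I/O queue pair */
};

static inline int bdev_read(struct bdev *dev, void *buf, uint64_t lba,
                            uint32_t lba_count, bdev_io_cb cb, void *cb_arg)
{
    return dev->ops->read(dev->driver_ctx, buf, lba, lba_count, cb, cb_arg);
}

A blobstore or filesystem written against an interface like this would not need to know it is
talking to NVMe at all.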

> At a minimum, a "filesystem" on top
> of SPDK would need to provide a mechanism to dynamically allocate
> discontiguous physical blocks and
> present them as a contiguous space that can be written to or read from in
> some unit (4k? 1 byte?).
> I've been choosing to call that a "blob" to differentiate it from a file
> in the Unix sense, and I've
> been calling the whole thing a "blobstore" as opposed to a filesystem. The
> blobstore would ensure
> blobs are persistent and rediscoverable across reboots.
>

It sounds very much like a key-value store.
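
To make the comparison concrete, a key-value facade would be a thin wrapper over the hypothetical
blob interface sketched earlier, with the blob name as the key and the blob contents as the value
(invented names, shown only to illustrate the resemblance):

/* Key-value facade over the blob interface sketched earlier -- invented
 * names. Key = blob name, value = blob contents. */
#include <stdint.h>

struct blobstore;

int kv_put(struct blobstore *bs, const char *key,
           const void *val, uint64_t len);     /* open + resize + write a blob */
int kv_get(struct blobstore *bs, const char *key,
           void *val, uint64_t max_len);       /* open + read a blob */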

Regards,
Andrey

>
> > It would be good to list the known challenges on the mailing list, so that
> > the community can try to address and discuss them.
>
> Beyond the very basic requirements above, I think the additional
> requirements depend on the
> application that is using it. Some applications can tolerate only being
> allowed to write and read in
> sector-size chunks, for instance, which is important if the blobstore
> wishes to implement zero copy.
> Other applications need finer granularity. Many databases don't need
> directories either - they can
> live with a flat namespace in which to place their blobs. I think
> file-level permissions aren't
> needed either.
>
> Functional requirements aside, in order to get the best performance
> possible, the blobstore would need to be asynchronous, lockless, and operate
> in polled mode. That's a real challenge due to shared metadata, although I
> have a number of ideas in this area.
>
> >
> > Thanks,
> >
> >
> > -----Original Message-----
> > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker,
> Benjamin
> > Sent: Thursday, July 14, 2016 1:18 PM
> > To: spdk(a)lists.01.org
> > Subject: Re: [SPDK] FW: sharing of single NVMe device between spdk
> userspace and native kernel
> > driver
> >
> > On Thu, 2016-07-14 at 11:59 -0700, txcy uio wrote:
> > > > Thank you Ben for the detailed reply.
> > > A filesystem which can make use of SPDK is precisely the requirement.
> > > Everything else is a way to get around that. In my specific use case I
> > > wish to have a single nvme device which will have a rootfs as well. So
> > > such a filesystem will need to handle that as well (probably I am
> being too ambitious here).
> >
> > It won't ever be possible to use SPDK as the driver for your boot
> device. SR-IOV would let you
> > share the device between the kernel (for booting) and your application,
> if and/or when that
> > exists.
> >
> > > The only other "filesystem" that I am aware of is Ceph's bluefs which
> > > is very minimal and specific to Rocksdb backend.
> >
> > This is the only one that I'm aware of currently as well, and it has a
> number of features that
> > make it not particularly suitable for use with SPDK (even though it does
> work with SPDK). The
> > biggest problems are around synchronous I/O operations and lack of
> memory pre-registration,
> > forcing copies on every I/O.
> >
> > >  On a side note if I had more than one nvme device on a system , do
> > > all the nvme devices need to be unbound from the kernel driver?
> >
> > Each NVMe device is independent. You can use some NVMe devices with SPDK
> and others with the
> > kernel at the same time with no conflict. Our setup scripts do either
> bind or unbind all of them
> > at once, but that's just for convenience.
> >
> > >
> > > --Tyc
> > >
> > > > On 7/14/16, 10:55 AM, "SPDK on behalf of Walker, Benjamin"
> > > > <spdk-bounces(a)lists.01.org on behalf of benjamin.walker(a)intel.com>
> wrote:
> > > >
> > > > >
> > > > >
> > > > > On Wed, 2016-07-13 at 12:59 -0700, txcy uio wrote:
> > > > > > Hello Ben
> > > > > >
> > > > > > I have a use case where I want to attach one namespace of a nvme
> > > > > > device to spdk driver and
> > > > use the
> > > > > > other namespace as a kernel block device to create a regular
> > > > > > filesystem. Current
> > > > implementation of
> > > > > > spdk requires the device to be unbound completely from the
> native
> > > > > > kernel driver. I was
> > > > wondering
> > > > > > if this is at all possible and if yes can this be accomplished
> > > > > > with the current spdk implementation?
> > > > >
> > > > > Your request is one we get every few days or so, and it is a
> perfectly reasonable thing to
> > > > > ask.
> > > > I
> > > > > haven't written down my standard response on the mailing list yet,
> > > > > so I'm going to take this opportunity to lay out our position for
> all to see and discuss.
> > > > >
> > > > > From a purely technical standpoint, it is impossible to both load
> > > > > the SPDK driver as it exists
> > > > today
> > > > > and the kernel driver against the same PCI device. The registers
> > > > > exposed by the PCI device
> > > > contain
> > > > > global state and so there can only be a single "owner". There is
> an
> > > > > established hardware
> > > > mechanism
> > > > > for creating multiple virtual PCI devices from a single physical
> > > > > device that each can load
> > > > their
> > > > > own driver called SR-IOV. This is typically used by NICs today and
> > > > > I'm not aware of any NVMe
> > > > SSDs
> > > > > that support it currently. SR-IOV is the right solution for
> sharing
> > > > > the device like you outline
> > > > in
> > > > > the long term, though.
> > > > >
> > > > > In the short term, it would be technically possible to create some
> > > > > kernel patches that add
> > > > entries
> > > > > to sysfs or provide ioctls that allow a user space process to
> claim
> > > > > an NVMe hardware queue for
> > > > a
> > > > > device that the kernel is managing. You could then run the SPDK
> > > > > driver's I/O path against that queue. Unfortunately, there are two
> > > > > insurmountable issues with this strategy. First, NVMe
> > > > hardware
> > > > > queues can write to any namespace on the device. Therefore, you
> > > > > couldn't enforce that the queue
> > > > can
> > > > > only write to the namespace you are intending. You couldn't even
> > > > > enforce that the queue is only
> > > > used
> > > > > for reads - you basically just have to trust the application to
> only do reasonable things.
> > > > Second,
> > > > > the device is owned by the kernel and therefore is not in an IOMMU
> > > > > protection domain with this strategy. The device can directly
> > > > > access the DMA engine, and with a small amount of work, you
> > > > could
> > > > > hijack that DMA engine to copy data to wherever you wanted on the
> > > > > system. For these two
> > > > reasons,
> > > > > patches of this nature would never be accepted into the mainline
> > > > > kernel. The SPDK team can't be
> > > > in
> > > > > the business of supporting patches that have been rejected by the
> kernel community.
> > > > >
> > > > > Clearly, lots of people have requested to share a device between
> > > > > the kernel and SPDK, so I've
> > > > been
> > > > > trying to uncover all of the reasons they may want to do that. So
> > > > > far, in every case, it boils
> > > > down
> > > > > to not having a filesystem for use with SPDK. I'm hoping to steer
> > > > > the community to solve the
> > > > problem
> > > > > of not having a filesystem rather than trying to share the device.
> > > > > I'm not advocating for
> > > > writing a
> > > > > (mostly) POSIX compliant filesystem, but I do think there is a
> > > > > small core of functionality that
> > > > most
> > > > > databases or storage applications all require. These are things
> > > > > like allocating blocks into
> > > > some
> > > > > unit (I've been calling it a blob) that has a name and is
> > > > > persistent and rediscoverable across reboots. Writing this layer
> > > > > requires some serious thought - SPDK is fast in no small part
> > > > because it
> > > > > is purely asynchronous, polled, and lockless - so this layer would
> > > > > need to preserve those characteristics.
> > > > >
> > > > > Sorry for the very long response, but I wanted to document my
> > > > > current thoughts on the mailing
> > > > list
> > > > > for all to see.
> > > > >
> > > > > >
> > > > > > --Tyc
> > > > > >
-- 

Regards,
Andrey



* Re: [SPDK] FW: sharing of single NVMe device between spdk userspace and native kernel driver
@ 2016-07-14 21:34 Raj Pandurangan
From: Raj Pandurangan @ 2016-07-14 21:34 UTC (permalink / raw)
  To: spdk


Hi Ben,

Yes, I too agree that one of the most important requirements is to have a filesystem with SPDK.

What are the known challenges in developing a filesystem with SPDK? Are the interfaces/APIs provided by SPDK good enough?

It would be good to list the known challenges on the mailing list, so that the community can try to address and discuss them.

Thanks,


-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
Sent: Thursday, July 14, 2016 1:18 PM
To: spdk(a)lists.01.org
Subject: Re: [SPDK] FW: sharing of single NVMe device between spdk userspace and native kernel driver

On Thu, 2016-07-14 at 11:59 -0700, txcy uio wrote:
> > Thank you Ben for the detailed reply.
> A filesystem which can make use of SPDK is precisely the requirement. 
> Everything else is a way to get around that. In my specific use case I 
> wish to have a single nvme device which will have a rootfs as well. So 
> such a filesystem will need to handle that as well (probably I am being too ambitious here).

It won't ever be possible to use SPDK as the driver for your boot device. SR-IOV would let you share the device between the kernel (for booting) and your application, if and/or when that exists.

> The only other "filesystem" that I am aware of is Ceph's bluefs which 
> is very minimal and specific to Rocksdb backend.

This is the only one that I'm aware of currently as well, and it has a number of features that make it not particularly suitable for use with SPDK (even though it does work with SPDK). The biggest problems are around synchronous I/O operations and lack of memory pre-registration, forcing copies on every I/O.

>  On a side note if I had more than one nvme device on a system , do 
> all the nvme devices need to be unbound from the kernel driver?

Each NVMe device is independent. You can use some NVMe devices with SPDK and others with the kernel at the same time with no conflict. Our setup scripts do either bind or unbind all of them at once, but that's just for convenience.

> 
> --Tyc
>  
> > On 7/14/16, 10:55 AM, "SPDK on behalf of Walker, Benjamin" 
> > <spdk-bounces(a)lists.01.org on behalf of benjamin.walker(a)intel.com> wrote:
> > 
> > >
> > >
> > >On Wed, 2016-07-13 at 12:59 -0700, txcy uio wrote:
> > >> Hello Ben
> > >>
> > >> I have a use case where I want to attach one namespace of a nvme 
> > >> device to spdk driver and
> > use the
> > >> other namespace as a kernel block device to create a regular 
> > >> filesystem. Current
> > implementation of
> > >> spdk requires the device to be unbound completely from the native 
> > >> kernel driver. I was
> > wondering
> > >> if this is at all possible and if yes can this be accomplished 
> > >> with the current spdk implementation?
> > >
> > >Your request is one we get every few days or so, and it is a perfectly reasonable thing to ask.
> > I
> > >haven't written down my standard response on the mailing list yet, 
> > >so I'm going to take this opportunity to lay out our position for all to see and discuss.
> > >
> > >From a purely technical standpoint, it is impossible to both load 
> > >the SPDK driver as it exists
> > today
> > >and the kernel driver against the same PCI device. The registers 
> > >exposed by the PCI device
> > contain
> > >global state and so there can only be a single "owner". There is an 
> > >established hardware
> > mechanism
> > >for creating multiple virtual PCI devices from a single physical 
> > >device that each can load
> > their
> > >own driver called SR-IOV. This is typically used by NICs today and 
> > >I'm not aware of any NVMe
> > SSDs
> > >that support it currently. SR-IOV is the right solution for sharing 
> > >the device like you outline
> > in
> > >the long term, though.
> > >
> > >In the short term, it would be technically possible to create some 
> > >kernel patches that add
> > entries
> > >to sysfs or provide ioctls that allow a user space process to claim 
> > >an NVMe hardware queue for
> > a
> > >device that the kernel is managing. You could then run the SPDK 
> > >driver's I/O path against that queue. Unfortunately, there are two 
> > >insurmountable issues with this strategy. First, NVMe
> > hardware
> > >queues can write to any namespace on the device. Therefore, you 
> > >couldn't enforce that the queue
> > can
> > >only write to the namespace you are intending. You couldn't even 
> > >enforce that the queue is only
> > used
> > >for reads - you basically just have to trust the application to only do reasonable things.
> > Second,
> > >the device is owned by the kernel and therefore is not in an IOMMU 
> > >protection domain with this strategy. The device can directly 
> > >access the DMA engine, and with a small amount of work, you
> > could
> > >hijack that DMA engine to copy data to wherever you wanted on the 
> > >system. For these two
> > reasons,
> > >patches of this nature would never be accepted into the mainline 
> > >kernel. The SPDK team can't be
> > in
> > >the business of supporting patches that have been rejected by the kernel community.
> > >
> > >Clearly, lots of people have requested to share a device between 
> > >the kernel and SPDK, so I've
> > been
> > >trying to uncover all of the reasons they may want to do that. So 
> > >far, in every case, it boils
> > down
> > >to not having a filesystem for use with SPDK. I'm hoping to steer 
> > >the community to solve the
> > problem
> > >of not having a filesystem rather than trying to share the device. 
> > >I'm not advocating for
> > writing a
> > >(mostly) POSIX compliant filesystem, but I do think there is a 
> > >small core of functionality that
> > most
> > >databases or storage applications all require. These are things 
> > >like allocating blocks into
> > some
> > >unit (I've been calling it a blob) that has a name and is 
> > >persistent and rediscoverable across reboots. Writing this layer 
> > >requires some serious thought - SPDK is fast in no small part
> > because it
> > >is purely asynchronous, polled, and lockless - so this layer would 
> > >need to preserve those characteristics.
> > >
> > >Sorry for the very long response, but I wanted to document my 
> > >current thoughts on the mailing
> > list
> > >for all to see.
> > >
> > >>
> > >> --Tyc
> > >>


* Re: [SPDK] FW: sharing of single NVMe device between spdk userspace and native kernel driver
@ 2016-07-14 20:18 Walker, Benjamin
From: Walker, Benjamin @ 2016-07-14 20:18 UTC (permalink / raw)
  To: spdk


On Thu, 2016-07-14 at 11:59 -0700, txcy uio wrote:
> > Thank you Ben for the detailed reply.
> A filesystem which can make use of SPDK is precisely the requirement. Everything else is a way to
> get around that. In my specific use case I wish to have a single nvme device which will have a
> rootfs as well. So such a filesystem will need to handle that as well (probably I am being too
> ambitious here). 

It won't ever be possible to use SPDK as the driver for your boot device. SR-IOV would let you share
the device between the kernel (for booting) and your application, if and/or when that exists.
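
For illustration, once a device did support SR-IOV, enabling virtual functions would go through
the standard PCI sysfs interface. The PCI address a caller would pass in is hypothetical here,
and since no NVMe SSD is known to support SR-IOV yet, this shows what the split would look like
rather than something that works today.

/* Illustration of the SR-IOV sharing model via the standard PCI sysfs
 * interface. Each VF created here shows up as its own PCI device, so the
 * kernel nvme driver could keep the physical function (for the boot disk)
 * while SPDK binds one or more virtual functions. */
#include <stdio.h>

int enable_sriov_vfs(const char *pf_pci_addr, int num_vfs)
{
    char path[128];
    FILE *f;

    /* e.g. /sys/bus/pci/devices/0000:81:00.0/sriov_numvfs */
    snprintf(path, sizeof(path),
             "/sys/bus/pci/devices/%s/sriov_numvfs", pf_pci_addr);
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    fprintf(f, "%d\n", num_vfs);   /* kernel creates num_vfs virtual functions */
    fclose(f);
    return 0;
}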

> The only other "filesystem" that I am aware of is Ceph's bluefs which is very minimal and specific
> to Rocksdb backend. 

This is the only one that I'm aware of currently as well, and it has a number of features that make
it not particularly suitable for use with SPDK (even though it does work with SPDK). The biggest
problems are around synchronous I/O operations and lack of memory pre-registration, forcing copies
on every I/O.
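
To show why missing pre-registration forces a copy, here is a minimal sketch. dma_zalloc() and
dev_write() are placeholders, not real SPDK or BlueFS calls: dev_write() stands for a submission
path that requires DMA-safe (pinned) memory, and dma_zalloc() for whatever pinned allocator the
stack provides.

/* Placeholders, not real SPDK or BlueFS calls. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

void *dma_zalloc(size_t size, size_t align);
int dev_write(const void *dma_buf, uint64_t offset, uint64_t length);

/* No pre-registration: the caller's buffer is ordinary pageable memory, so
 * every write is staged through a bounce buffer -- an allocation and a
 * memcpy on each I/O. This is the copy being pointed at above. */
int write_with_copy(const void *user_buf, uint64_t offset, uint64_t length)
{
    void *bounce = dma_zalloc(length, 4096);

    if (bounce == NULL)
        return -1;
    memcpy(bounce, user_buf, length);
    return dev_write(bounce, offset, length);  /* bounce freed on completion */
}

/* Pre-registration: allocate DMA-safe buffers once at startup, let the
 * application build its data directly in them, and submit with no copy. */
int write_zero_copy(const void *preregistered_buf, uint64_t offset,
                    uint64_t length)
{
    return dev_write(preregistered_buf, offset, length);
}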

>  On a side note if I had more than one nvme device on a system , do all the nvme devices need to
> be unbound from the kernel driver? 

Each NVMe device is independent. You can use some NVMe devices with SPDK and others with the kernel
at the same time with no conflict. Our setup scripts do either bind or unbind all of them at once,
but that's just for convenience.
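
As a sketch of that per-device independence, the probe/attach callback pattern used by
spdk_nvme_probe() lets an application claim only the controllers it wants among those it can see.
The prototypes below are simplified stand-ins rather than exact SPDK signatures (which have
changed across releases), and in practice which devices SPDK can enumerate at all is still decided
by driver binding, which is what the setup scripts control.

/* Per-device selection, sketched around the probe/attach callback pattern.
 * Prototypes are simplified stand-ins, not exact SPDK signatures. Returning
 * false from the probe callback tells the driver to skip that controller,
 * so a process can claim some devices and leave the rest alone. */
#include <stdbool.h>
#include <string.h>

struct nvme_ctrlr;   /* opaque controller handle */

typedef bool (*probe_cb_t)(void *cb_ctx, const char *pci_addr);
typedef void (*attach_cb_t)(void *cb_ctx, const char *pci_addr,
                            struct nvme_ctrlr *ctrlr);

int nvme_probe(void *cb_ctx, probe_cb_t probe_cb, attach_cb_t attach_cb);

/* Claim only the controller at the PCI address we were given. */
static bool probe_cb(void *cb_ctx, const char *pci_addr)
{
    const char *wanted = cb_ctx;           /* e.g. "0000:81:00.0" */

    return strcmp(pci_addr, wanted) == 0;
}

static void attach_cb(void *cb_ctx, const char *pci_addr,
                      struct nvme_ctrlr *ctrlr)
{
    /* This controller now belongs to us: allocate queue pairs, start I/O. */
    (void)cb_ctx; (void)pci_addr; (void)ctrlr;
}

static int claim_one_device(const char *pci_addr)
{
    return nvme_probe((void *)pci_addr, probe_cb, attach_cb);
}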

> 
> --Tyc
>  
> > On 7/14/16, 10:55 AM, "SPDK on behalf of Walker, Benjamin" <spdk-bounces(a)lists.01.org on behalf
> > of benjamin.walker(a)intel.com> wrote:
> > 
> > >
> > >
> > >On Wed, 2016-07-13 at 12:59 -0700, txcy uio wrote:
> > >> Hello Ben
> > >>
> > >> I have a use case where I want to attach one namespace of a nvme device to spdk driver and
> > use the
> > >> other namespace as a kernel block device to create a regular filesystem. Current
> > implementation of
> > >> spdk requires the device to be unbound completely from the native kernel driver. I was
> > wondering
> > >> if this is at all possible and if yes can this be accomplished with the current spdk
> > >> implementation?
> > >
> > >Your request is one we get every few days or so, and it is a perfectly reasonable thing to ask.
> > I
> > >haven't written down my standard response on the mailing list yet, so I'm going to take this
> > >opportunity to lay out our position for all to see and discuss.
> > >
> > >From a purely technical standpoint, it is impossible to both load the SPDK driver as it exists
> > today
> > >and the kernel driver against the same PCI device. The registers exposed by the PCI device
> > contain
> > >global state and so there can only be a single "owner". There is an established hardware
> > mechanism
> > >for creating multiple virtual PCI devices from a single physical device that each can load
> > their
> > >own driver called SR-IOV. This is typically used by NICs today and I'm not aware of any NVMe
> > SSDs
> > >that support it currently. SR-IOV is the right solution for sharing the device like you outline
> > in
> > >the long term, though.
> > >
> > >In the short term, it would be technically possible to create some kernel patches that add
> > entries
> > >to sysfs or provide ioctls that allow a user space process to claim an NVMe hardware queue for
> > a
> > >device that the kernel is managing. You could then run the SPDK driver's I/O path against that
> > >queue. Unfortunately, there are two insurmountable issues with this strategy. First, NVMe
> > hardware
> > >queues can write to any namespace on the device. Therefore, you couldn't enforce that the queue
> > can
> > >only write to the namespace you are intending. You couldn't even enforce that the queue is only
> > used
> > >for reads - you basically just have to trust the application to only do reasonable things.
> > Second,
> > >the device is owned by the kernel and therefore is not in an IOMMU protection domain with this
> > >strategy. The device can directly access the DMA engine, and with a small amount of work, you
> > could
> > >hijack that DMA engine to copy data to wherever you wanted on the system. For these two
> > reasons,
> > >patches of this nature would never be accepted into the mainline kernel. The SPDK team can't be
> > in
> > >the business of supporting patches that have been rejected by the kernel community.
> > >
> > >Clearly, lots of people have requested to share a device between the kernel and SPDK, so I've
> > been
> > >trying to uncover all of the reasons they may want to do that. So far, in every case, it boils
> > down
> > >to not having a filesystem for use with SPDK. I'm hoping to steer the community to solve the
> > problem
> > >of not having a filesystem rather than trying to share the device. I'm not advocating for
> > writing a
> > >(mostly) POSIX compliant filesystem, but I do think there is a small core of functionality that
> > most
> > >databases or storage applications all require. These are things like allocating blocks into
> > some
> > >unit (I've been calling it a blob) that has a name and is persistent and rediscoverable across
> > >reboots. Writing this layer requires some serious thought - SPDK is fast in no small part
> > because it
> > >is purely asynchronous, polled, and lockless - so this layer would need to preserve those
> > >characteristics.
> > >
> > >Sorry for the very long response, but I wanted to document my current thoughts on the mailing
> > list
> > >for all to see.
> > >
> > >>
> > >> --Tyc
> > >>


* Re: [SPDK] FW: sharing of single NVMe device between spdk userspace and native kernel driver
@ 2016-07-14 18:59 txcy uio
From: txcy uio @ 2016-07-14 18:59 UTC (permalink / raw)
  To: spdk


>
> Thank you Ben for the detailed reply.

A filesystem which can make use of SPDK is precisely the requirement.
Everything else is a way to get around that. In my specific use case I wish
to have a single NVMe device which will also hold a rootfs, so such a
filesystem will need to handle that as well (probably I am being too
ambitious here). The only other "filesystem" that I am aware of is Ceph's
BlueFS, which is very minimal and specific to the RocksDB backend. On a side
note, if I have more than one NVMe device on a system, do all of the NVMe
devices need to be unbound from the kernel driver?

--Tyc


> On 7/14/16, 10:55 AM, "SPDK on behalf of Walker, Benjamin" <
> spdk-bounces(a)lists.01.org on behalf of benjamin.walker(a)intel.com> wrote:
>
> >
> >
> >On Wed, 2016-07-13 at 12:59 -0700, txcy uio wrote:
> >> Hello Ben
> >>
> >> I have a use case where I want to attach one namespace of a nvme device
> to spdk driver and use the
> >> other namespace as a kernel block device to create a regular
> filesystem. Current implementation of
> >> spdk requires the device to be unbound completely from the native
> kernel driver. I was wondering
> >> if this is at all possible and if yes can this be accomplished with the
> current spdk
> >> implementation?
> >
> >Your request is one we get every few days or so, and it is a perfectly
> reasonable thing to ask. I
> >haven't written down my standard response on the mailing list yet, so I'm
> going to take this
> >opportunity to lay out our position for all to see and discuss.
> >
> >From a purely technical standpoint, it is impossible to both load the
> SPDK driver as it exists today
> >and the kernel driver against the same PCI device. The registers exposed
> by the PCI device contain
> >global state and so there can only be a single "owner". There is an
> established hardware mechanism
> >for creating multiple virtual PCI devices from a single physical device
> that each can load their
> >own driver called SR-IOV. This is typically used by NICs today and I'm
> not aware of any NVMe SSDs
> >that support it currently. SR-IOV is the right solution for sharing the
> device like you outline in
> >the long term, though.
> >
> >In the short term, it would be technically possible to create some kernel
> patches that add entries
> >to sysfs or provide ioctls that allow a user space process to claim an
> NVMe hardware queue for a
> >device that the kernel is managing. You could then run the SPDK driver's
> I/O path against that
> >queue. Unfortunately, there are two insurmountable issues with this
> strategy. First, NVMe hardware
> >queues can write to any namespace on the device. Therefore, you couldn't
> enforce that the queue can
> >only write to the namespace you are intending. You couldn't even enforce
> that the queue is only used
> >for reads - you basically just have to trust the application to only do
> reasonable things. Second,
> >the device is owned by the kernel and therefore is not in an IOMMU
> protection domain with this
> >strategy. The device can directly access the DMA engine, and with a small
> amount of work, you could
> >hijack that DMA engine to copy data to wherever you wanted on the system.
> For these two reasons,
> >patches of this nature would never be accepted into the mainline kernel.
> The SPDK team can't be in
> >the business of supporting patches that have been rejected by the kernel
> community.
> >
> >Clearly, lots of people have requested to share a device between the
> kernel and SPDK, so I've been
> >trying to uncover all of the reasons they may want to do that. So far, in
> every case, it boils down
> >to not having a filesystem for use with SPDK. I'm hoping to steer the
> community to solve the problem
> >of not having a filesystem rather than trying to share the device. I'm
> not advocating for writing a
> >(mostly) POSIX compliant filesystem, but I do think there is a small core
> of functionality that most
> >databases or storage applications all require. These are things like
> allocating blocks into some
> >unit (I've been calling it a blob) that has a name and is persistent and
> rediscoverable across
> >reboots. Writing this layer requires some serious thought - SPDK is fast
> in no small part because it
> >is purely asynchronous, polled, and lockless - so this layer would need
> to preserve those
> >characteristics.
> >
> >Sorry for the very long response, but I wanted to document my current
> thoughts on the mailing list
> >for all to see.
> >
> >>
> >> --Tyc
> >>


