* Re: [SPDK] SPDK Blob Store Fundamentals
From: Walker, Benjamin @ 2017-03-29 20:40 UTC
  To: spdk


On Wed, 2017-03-29 at 19:06 +0000, George Kondiles wrote:
> Hello,
> 
> I am attempting to use the SPDK blob store to implement a basic NVMe-
> based flat file store. I understand that this is a new addition to
> the SPDK that is under active development and that
> documentation/examples of usage are sparse. But this is a great new
> addition to the SPDK that I've been tracking and so I'm eager to
> begin using it.

I'm glad you're using it! Note that this is not yet part of an
official release. Further, the API we're going to release as part of
SPDK 17.03 is not the API I envision the blobstore having when all is
said and done. I just want to set expectations correctly: I'm going
to change the API quite a bit, and not everything in it currently
makes sense. I also reserve the right to change the on-disk format
for at least a few more months. Feedback of any kind is very much
welcome.

> 
> With that being said, I've been scouring through its usage in the
> bdev component, as well as the test cases in an attempt to glean how
> I might integrate it into my code base (specifically, I am already
> successfully using the SPDK to interact with NVMe devices) but have a
> few high-level questions that I hope are easy to answer.
> 
> 1) In the most basic usage, it seems IO channels should be 1-to-1
> with threads. It looks like I must start a thread,
> call spdk_allocate_thread(), then spdk_get_io_channel() to get the
> spdk_io_channel instance created and associated with that thread.

You'll need to call spdk_allocate_thread as each new thread that uses
the blobstore starts up (unless you are using our event framework
from lib/event, which does that for you). If you want the blobstore
to talk to the bdev layer, call spdk_get_io_channel and pass it the
bdev as the io_device parameter. There is a full example of how to do
this in lib/blob/bdev/blob_bdev.c; for this first version, I highly
recommend that you just follow that example.
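
For reference, the shape of that setup is roughly the following. This
is a sketch only - these signatures are still in flux, so check the
headers in your tree - and error handling is omitted:

	/* once, as each I/O thread starts up */
	spdk_allocate_thread();

	/* wrap the bdev so the blobstore can drive it (blob_bdev.c) */
	struct spdk_bs_dev *bs_dev = spdk_bdev_create_bs_dev(bdev);

	/* per-thread channel; the io_device here is the bdev itself */
	struct spdk_io_channel *ch = spdk_get_io_channel(bdev);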

If you want the blobstore to talk directly to the NVMe driver,
however, I haven't written an example showing how just yet. I think
the easiest way to implement spdk_bs_dev::create_channel directly on
the NVMe driver is to have it call spdk_nvme_ctrlr_alloc_io_qpair and
then return (and cast) the queue pair as an spdk_io_channel object.
That's cheating a bit, but I think it will work out. I'll try to
write up an example demonstrating the best way to do this in the next
week or two. There are some other challenges here, such as deciding
who polls each queue pair for completions, that the bdev layer just
solves for you.
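
Concretely, I'm picturing something along these lines. This is an
untested sketch, and my_nvme_bs_dev is a hypothetical struct of yours
that embeds spdk_bs_dev and holds the controller pointer:

	static struct spdk_io_channel *
	nvme_blob_create_channel(struct spdk_bs_dev *dev)
	{
		struct my_nvme_bs_dev *ndev = (struct my_nvme_bs_dev *)dev;
		struct spdk_nvme_qpair *qpair;

		/* default priority; one qpair per calling thread */
		qpair = spdk_nvme_ctrlr_alloc_io_qpair(ndev->ctrlr, 0);

		/* the "cheat": hand the qpair back as a channel */
		return (struct spdk_io_channel *)qpair;
	}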

> 
> Since spdk_bs_dev.create_channel is synchronous, it looks like I must
> block the create_channel() call while the above is happening in the
> new IO thread. Is this a reasonable approach, or am I misinterpreting
> how IO channels are intended to work?

The spdk_bs_dev::create_channel function will only be called on the
thread that will be using that channel. That thread should already
have been set up with spdk_allocate_thread when it started, so you
can just call spdk_get_io_channel from within the create_channel
callback. See lib/blob/bdev/blob_bdev.c for an example that you can
probably use outright.
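
From memory, that callback is essentially just:

	static struct spdk_io_channel *
	bdev_blob_create_channel(struct spdk_bs_dev *dev)
	{
		struct blob_bdev *blob_bdev = (struct blob_bdev *)dev;

		/* the calling thread already ran spdk_allocate_thread,
		 * so this returns that thread's channel for the bdev */
		return spdk_get_io_channel(blob_bdev->bdev);
	}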

> 
> 2) I've already got a set of IO threads for executing asynchronous
> NVMe operations (e.g. spdk_nvme_ns_cmd_read(...)) against one or more
> devices. These IO threads each own a set of NVMe queue pairs, and
> have queuing mechanisms allowing for the submission of work to be
> performed against a specific device. Given this, I am interpreting an
> IO channel to essentially be an additional "outer" queue of pending
> blob-IO operations that are processed by an additional, dedicated
> thread. A call to spdk_bs_dev.read() or .write() would find the
> correct IO channel thread, enqueue an "outer" blob op, and the
> channel IO thread would then enqueue one or more lower-level NVMe IO
> operations on the "inner" queue. Does this interpretation match the
> intended usage? Am I missing something?

I think you're on the right track here. Our spdk_io_channel structure
is just a software construct for tracking per-thread contexts up and
down the I/O stack; the bottom of that stack is typically an NVMe
queue pair. It's a powerful idea, but one we haven't done a great job
of explaining yet. It's also a dramatic departure from the concepts
present in POSIX, so it will be unfamiliar to most people.
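
If it helps, the general pattern is: register an io_device once, then
have each thread ask for its own channel; the create callback runs in
the asking thread's context and allocates that thread's resources.
Sketch only - the exact signatures have shifted between releases, and
my_dev/my_ctx are hypothetical placeholders:

	static int
	my_channel_create(void *io_device, void *ctx_buf)
	{
		/* e.g. allocate an NVMe qpair for this thread */
		return 0;
	}

	static void
	my_channel_destroy(void *io_device, void *ctx_buf)
	{
		/* release those per-thread resources */
	}

	/* once, at init time */
	spdk_io_device_register(my_dev, my_channel_create,
				my_channel_destroy, sizeof(struct my_ctx));

	/* on each thread that does I/O */
	struct spdk_io_channel *ch = spdk_get_io_channel(my_dev);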

> 
> 3) spdk_bs_dev.unmap() appears to correspond to dealloc/TRIM. Is this
> correct?

Yes. SATA calls it TRIM, NVMe calls it deallocate, and SCSI calls it
UNMAP. Maybe I should call it dealloc, because that's actually the
most descriptive term and we're very NVMe-centric - though of the
three terms, I'm sure it's the least used.
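
On NVMe, that maps to a Dataset Management command with the
deallocate attribute. A minimal single-range sketch (you'd translate
blobstore pages into LBAs first; done_cb/cb_arg are your completion
callback and context):

	struct spdk_nvme_dsm_range range = {
		.starting_lba = lba,
		.length = lba_count, /* in blocks */
	};

	spdk_nvme_ns_cmd_dataset_management(ns, qpair,
					    SPDK_NVME_DSM_ATTR_DEALLOCATE,
					    &range, 1, done_cb, cb_arg);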

> 
> 4) I've read through the docs at http://www.spdk.io/doc/blob.html and
> understand at a high level how things are being stored on disk, but
> there are references to the caching of metadata. My current workload
> will likely generate on the order of 100K to 1M blobs of sizes
> ranging from 512KB to 32MB, each with a couple of small attributes.
> Is there any way to estimate the total size (in memory) of the cache?
> Also, are any metadata modifications O(n) in the number of blobs?

Blob metadata is cached, but only while a blob is open; if you close
the blob, all of that memory is released. I don't have exact counts
(and they are very much subject to change), but you can expect maybe
~128B per open blob. There are a few operations (e.g. opening a blob)
that are currently O(N) where N is the number of OPEN blobs - only
because I haven't had a chance to implement a better algorithm yet.
There aren't any operations that are O(N) in the total number of
blobs. In general, blobs are entirely independent of one another:
each has its own blocks for metadata and data, and the location of
that metadata can be determined entirely from the blobid with no
shared data structure. That's the real key to this design - with the
exception of a bit mask that requires central coordination for a
brief, synchronous period during a few rare metadata operations
(create, sync, delete), every operation on the blobstore can happen
entirely in parallel with no locks.
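
To put rough numbers on your workload: at ~128B per open blob, even
keeping all 1M blobs open at once would be on the order of 128MB of
cached metadata, and a smaller open working set costs proportionally
less. Again, that per-blob figure is an estimate and subject to
change.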

> 
> Thanks in advance for any help or insight anyone can provide. Any
> assistance is greatly appreciated.
> 
> - George Kondiles


* Re: [SPDK] SPDK Blob Store Fundamentals
From: Marushak, Nathan @ 2017-03-29 19:54 UTC
  To: spdk


Hi George,

Until someone from the team responds to your questions, I'll take a
moment to mention that our upcoming SPDK Summit, April 19th and 20th
at the Hyatt in Santa Clara, will include a one-hour deep dive
dedicated to this topic. Over the two days we'll cover just about
every inch of SPDK, with additional sessions on the Intel Intelligent
Storage Acceleration Library and Intel's Cache Acceleration Software,
plus storage companies talking about their use of SPDK. If you're
interested in attending, follow the link below.

https://goo.gl/XkS7Xx

Thanks,
Nate


* [SPDK] SPDK Blob Store Fundamentals
From: George Kondiles @ 2017-03-29 19:06 UTC
  To: spdk


Hello,

I am attempting to use the SPDK blob store to implement a basic NVMe-based flat file store. I understand that this is a new addition to the SPDK that is under active development and that documentation/examples of usage are sparse. But this is a great new addition to the SPDK that I've been tracking and so I'm eager to begin using it.

With that being said, I've been scouring through its usage in the bdev component, as well as the test cases in an attempt to glean how I might integrate it into my code base (specifically, I am already successfully using the SPDK to interact with NVMe devices) but have a few high-level questions that I hope are easy to answer.

1) In the most basic usage, it seems IO channels should be 1-to-1 with threads. It looks like I must start a thread, call spdk_allocate_thread(), then spdk_get_io_channel() to get the spdk_io_channel instance created and associated with that thread.

Since spdk_bs_dev.create_channel is synchronous, it looks like I must block the create_channel() call while the above is happening in the new IO thread. Is this a reasonable approach, or am I misinterpreting how IO channels are intended to work?

2) I've already got a set of IO threads for executing asynchronous NVMe operations (e.g. spdk_nvme_ns_cmd_read(...)) against one or more devices. These IO threads each own a set of NVMe queue pairs, and have queuing mechanisms allowing for the submission of work to be performed against a specific device. Given this, I am interpreting an IO channel to essentially be an additional "outer" queue of pending blob-IO operations that are processed by an additional, dedicated thread. A call to spdk_bs_dev.read() or .write() would find the correct IO channel thread, enqueue an "outer" blob op, and the channel IO thread would then enqueue one or more lower-level NVMe IO operations on the "inner" queue. Does this interpretation match the intended usage? Am I missing something?

3) spdk_bs_dev.unmap() appears to correspond to dealloc/TRIM. Is this correct?

4) I've read through the docs at http://www.spdk.io/doc/blob.html and understand at a high level how things are being stored on disk, but there are references to the caching of metadata. My current workload will likely generate on the order of 100K to 1M blobs of sizes ranging from 512KB to 32MB, each with a couple of small attributes. Is there any way to estimate the total size (in memory) of the cache? Also, are any metadata modifications O(n) in the number of blobs?

Thanks in advance for any help or insight anyone can provide. Any assistance is greatly appreciated.

- George Kondiles

