* Re: [SPDK] SPDK Blob Store Fundamentals
From: Walker, Benjamin @ 2017-03-29 20:40 UTC
  To: spdk


On Wed, 2017-03-29 at 19:06 +0000, George Kondiles wrote:
> Hello,
> 
> I am attempting to use the SPDK blob store to implement a basic NVMe-
> based flat file store. I understand that this is a new addition to
> the SPDK that is under active development and that
> documentation/examples of usage are sparse. But this is a great new
> addition to the SPDK that I've been tracking and so I'm eager to
> begin using it.

I'm glad you're using it! Note that this is not yet part of an
official release. Further, the API we're going to release as part of
SPDK 17.03 is not the API I envision the blobstore having when all is
said and done. I just want to set expectations correctly: I'm going
to change the API quite a bit, and not everything in it currently
makes sense. I also reserve the right to change the on-disk format
for at least a few more months. Feedback of any kind is very much
welcome.

> 
> With that being said, I've been scouring through its usage in the
> bdev component, as well as the test cases in an attempt to glean how
> I might integrate it into my code base (specifically, I am already
> successfully using the SPDK to interact with NVMe devices) but have a
> few high-level questions that I hope are easy to answer.
> 
> 1) In the most basic usage, it seems IO channels should be 1-to-1
> with threads. It looks like I must start a thread,
> call spdk_allocate_thread(), then spdk_get_io_channel() to get the
> spdk_io_channel instance created and associated with that thread.

You'll need to call spdk_allocate_thread as each new thread that uses
the blobstore starts up (unless you are using our event framework
from lib/event, which does that for you). If you want the blobstore
to talk to the bdev layer, call spdk_get_io_channel and pass it the
bdev as the io_device parameter. There is a full example of how to do
this in lib/blob/bdev/blob_bdev.c; for this first version, I highly
recommend that you just follow that example.
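
For reference, the shape of that setup is roughly the following. This
is a sketch only - these signatures are still in flux, so check the
headers in your tree - and error handling is omitted:

	/* once, as each I/O thread starts up */
	spdk_allocate_thread();

	/* wrap the bdev so the blobstore can drive it (blob_bdev.c) */
	struct spdk_bs_dev *bs_dev = spdk_bdev_create_bs_dev(bdev);

	/* per-thread channel; the io_device here is the bdev itself */
	struct spdk_io_channel *ch = spdk_get_io_channel(bdev);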

If you want the blobstore to talk directly to the NVMe driver,
however, I haven't written an example showing how just yet. I think
the easiest way to implement spdk_bs_dev::create_channel directly on
the NVMe driver is to have it call spdk_nvme_ctrlr_alloc_io_qpair and
then return (and cast) the queue pair as an spdk_io_channel object.
That's cheating a bit, but I think it will work out. I'll try to
write up an example demonstrating the best way to do this in the next
week or two. There are some other challenges here, such as deciding
who polls each queue pair for completions, that the bdev layer just
solves for you.
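
Concretely, I'm picturing something along these lines. This is an
untested sketch, and my_nvme_bs_dev is a hypothetical struct of yours
that embeds spdk_bs_dev and holds the controller pointer:

	static struct spdk_io_channel *
	nvme_blob_create_channel(struct spdk_bs_dev *dev)
	{
		struct my_nvme_bs_dev *ndev = (struct my_nvme_bs_dev *)dev;
		struct spdk_nvme_qpair *qpair;

		/* default priority; one qpair per calling thread */
		qpair = spdk_nvme_ctrlr_alloc_io_qpair(ndev->ctrlr, 0);

		/* the "cheat": hand the qpair back as a channel */
		return (struct spdk_io_channel *)qpair;
	}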

> 
> Since spdk_bs_dev.create_channel is synchronous, it looks like I must
> block the create_channel() call while the above is happening in the
> new IO thread. Is this a reasonable approach, or am I misinterpreting
> how IO channels are intended to work?

The spdk_bs_dev::create_channel function will only be called on the
thread that will be using that channel. That thread should already
have been set up with spdk_allocate_thread when it started, so you
can just call spdk_get_io_channel from within the create_channel
callback. See lib/blob/bdev/blob_bdev.c for an example that you can
probably use outright.
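
From memory, that callback is essentially just:

	static struct spdk_io_channel *
	bdev_blob_create_channel(struct spdk_bs_dev *dev)
	{
		struct blob_bdev *blob_bdev = (struct blob_bdev *)dev;

		/* the calling thread already ran spdk_allocate_thread,
		 * so this returns that thread's channel for the bdev */
		return spdk_get_io_channel(blob_bdev->bdev);
	}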

> 
> 2) I've already got a set of IO threads for executing asynchronous
> NVMe operations (e.g. spdk_nvme_ns_cmd_read(...)) against one or more
> devices. These IO threads each own a set of NVMe queue pairs, and
> have queuing mechanisms allowing for the submission of work to be
> performed against a specific device. Given this, I am interpreting an
> IO channel to essentially be an additional "outer" queue of pending
> blob-IO operations that are processed by an additional, dedicated
> thread. A call to spdk_bs_dev.read() or .write() would find the
> correct IO channel thread, enqueue an "outer" blob op, and the
> channel IO thread would then enqueue one or more lower-level NVMe IO
> operations on the "inner" queue. Does this interpretation match the
> intended usage? Am I missing something?

I think you're on the right track here. Our spdk_io_channel structure
is just a software construct for tracking per-thread contexts up and
down the I/O stack; the bottom of that stack is typically an NVMe
queue pair. It's a powerful idea, but one we haven't done a great job
of explaining yet. It's also a dramatic departure from the concepts
present in POSIX, so it will be unfamiliar to most people.
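
If it helps, the general pattern is: register an io_device once, then
have each thread ask for its own channel; the create callback runs in
the asking thread's context and allocates that thread's resources.
Sketch only - the exact signatures have shifted between releases, and
my_dev/my_ctx are hypothetical placeholders:

	static int
	my_channel_create(void *io_device, void *ctx_buf)
	{
		/* e.g. allocate an NVMe qpair for this thread */
		return 0;
	}

	static void
	my_channel_destroy(void *io_device, void *ctx_buf)
	{
		/* release those per-thread resources */
	}

	/* once, at init time */
	spdk_io_device_register(my_dev, my_channel_create,
				my_channel_destroy, sizeof(struct my_ctx));

	/* on each thread that does I/O */
	struct spdk_io_channel *ch = spdk_get_io_channel(my_dev);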

> 
> 3) spdk_bs_dev.unmap() appears to correspond to dealloc/TRIM. Is this
> correct?

Yes. SATA calls it TRIM, NVMe calls it deallocate, and SCSI calls it
UNMAP. Maybe I should call it dealloc, because that's actually the
most descriptive term and we're very NVMe-centric - though of the
three terms, I'm sure it's the least used.
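
On NVMe, that maps to a Dataset Management command with the
deallocate attribute. A minimal single-range sketch (you'd translate
blobstore pages into LBAs first; done_cb/cb_arg are your completion
callback and context):

	struct spdk_nvme_dsm_range range = {
		.starting_lba = lba,
		.length = lba_count, /* in blocks */
	};

	spdk_nvme_ns_cmd_dataset_management(ns, qpair,
					    SPDK_NVME_DSM_ATTR_DEALLOCATE,
					    &range, 1, done_cb, cb_arg);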

> 
> 4) I've read through the docs at http://www.spdk.io/doc/blob.html and
> understand at a high level how things are being stored on disk, but
> there are references to the caching of metadata. My current workload
> will likely generate on the order of 100K to 1M blobs of sizes
> ranging from 512KB to 32MB, each with a couple of small attributes.
> Is there any way to estimate the total size (in memory) of the cache?
> Also, are any metadata modifications O(n) in the number of blobs?

Blob metadata is cached, but only while a blob is open; if you close
the blob, all of that memory is released. I don't have exact counts
(and they are very much subject to change), but you can expect maybe
~128B per open blob. There are a few operations (e.g. opening a blob)
that are currently O(N) where N is the number of OPEN blobs - only
because I haven't had a chance to implement a better algorithm yet.
There aren't any operations that are O(N) in the total number of
blobs. In general, blobs are entirely independent of one another:
each has its own blocks for metadata and data, and the location of
that metadata can be determined entirely from the blobid with no
shared data structure. That's the real key to this design - with the
exception of a bit mask that requires central coordination for a
brief, synchronous period during a few rare metadata operations
(create, sync, delete), every operation on the blobstore can happen
entirely in parallel with no locks.
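
To put rough numbers on your workload: at ~128B per open blob, even
keeping all 1M blobs open at once would be on the order of 128MB of
cached metadata, and a smaller open working set costs proportionally
less. Again, that per-blob figure is an estimate and subject to
change.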

> 
> Thanks in advance for any help or insight anyone can provide. Any
> assistance is greatly appreciated.
> 
> - George Kondiles


* Re: [SPDK] SPDK Blob Store Fundamentals
From: Marushak, Nathan @ 2017-03-29 19:54 UTC
  To: spdk


Hi George,

Until someone from the team responds to your questions, I'll take a
moment to mention that our upcoming SPDK Summit, April 19th and 20th
at the Hyatt in Santa Clara, will include a one-hour deep dive
dedicated to this topic. Over the two days we'll cover just about
every inch of SPDK, with additional sessions on the Intel Intelligent
Storage Acceleration Library and Intel's Cache Acceleration Software,
plus storage companies talking about their use of SPDK. If you're
interested in attending, follow the link below.

https://goo.gl/XkS7Xx

Thanks,
Nate


* [SPDK] SPDK Blob Store Fundamentals
From: George Kondiles @ 2017-03-29 19:06 UTC
  To: spdk


Hello,

I am attempting to use the SPDK blob store to implement a basic NVMe-based flat file store. I understand that this is a new addition to the SPDK that is under active development and that documentation/examples of usage are sparse. But this is a great new addition to the SPDK that I've been tracking and so I'm eager to begin using it.

With that being said, I've been scouring through its usage in the bdev component, as well as the test cases in an attempt to glean how I might integrate it into my code base (specifically, I am already successfully using the SPDK to interact with NVMe devices) but have a few high-level questions that I hope are easy to answer.

1) In the most basic usage, it seems IO channels should be 1-to-1 with threads. It looks like I must start a thread, call spdk_allocate_thread(), then spdk_get_io_channel() to get the spdk_io_channel instance created and associated with that thread.

Since spdk_bs_dev.create_channel is synchronous, it looks like I must block the create_channel() call while the above is happening in the new IO thread. Is this a reasonable approach, or am I misinterpreting how IO channels are intended to work?

2) I've already got a set of IO threads for executing asynchronous NVMe operations (e.g. spdk_nvme_ns_cmd_read(...)) against one or more devices. These IO threads each own a set of NVMe queue pairs, and have queuing mechanisms allowing for the submission of work to be performed against a specific device. Given this, I am interpreting an IO channel to essentially be an additional "outer" queue of pending blob-IO operations that are processed by an additional, dedicated thread. A call to spdk_bs_dev.read() or .write() would find the correct IO channel thread, enqueue an "outer" blob op, and the channel IO thread would then enqueue one or more lower-level NVMe IO operations on the "inner" queue. Does this interpretation match the intended usage? Am I missing something?

3) spdk_bs_dev.unmap() appears to correspond to dealloc/TRIM. Is this correct?

4) I've read through the docs at http://www.spdk.io/doc/blob.html and understand at a high level how things are being stored on disk, but there are references to the caching of metadata. My current workload will likely generate on the order of 100K to 1M blobs of sizes ranging from 512KB to 32MB, each with a couple of small attributes. Is there any way to estimate the total size (in memory) of the cache? Also, are any metadata modifications O(n) in the number of blobs?

Thanks in advance for any help or insight anyone can provide. Any assistance is greatly appreciated.

- George Kondiles

