* NewStore update
@ 2015-02-19 23:50 Sage Weil
  2015-02-20 10:01 ` Haomai Wang
  2015-02-21 15:50 ` Christoph Hellwig
  0 siblings, 2 replies; 8+ messages in thread
From: Sage Weil @ 2015-02-19 23:50 UTC (permalink / raw)
  To: ceph-devel

Hi everyone,

We talked a bit about the proposed "KeyFile" backend a couple months back.  
I've started putting together a basic implementation and wanted to give 
people an update on what things are currently looking like.  We're 
calling it NewStore for now unless/until someone comes up with a better 
name (KeyFileStore is way too confusing). (*)

You can peruse the incomplete code at

	https://github.com/liewegas/ceph/tree/wip-newstore/src/os/newstore

This is a bit of a brain dump.  Please ask questions if anything isn't 
clear.  Also keep in mind I'm still at the stage where I'm trying to get 
it into a semi-working state as quickly as possible so the implementation 
is pretty rough.

Basic design:

We use a KeyValueDB (leveldb, rocksdb, ...) for all of our metadata.  
Object data is stored in files with simple names (%d) in a simple 
directory structure (one level deep, default 1M files per dir).  The main 
piece of metadata we store is a mapping from object name (ghobject_t) to 
onode_t, which looks like this:

 struct onode_t {
   uint64_t size;                       ///< object size
   map<string, bufferptr> attrs;        ///< attrs
   map<uint64_t, fragment_t> data_map;  ///< data (offset to fragment mapping)

i.e., it holds what we used to rely on inode xattrs for.  Here, we'll 
only lean on the file system for file data and its block management.

fragment_t looks like

 struct fragment_t {
   uint32_t offset;   ///< offset in file to first byte of this fragment
   uint32_t length;   ///< length of fragment/extent
   fid_t fid;         ///< file backing this fragment

and fid_t is

 struct fid_t {
   uint32_t fset, fno;   // identify the file name: fragments/%d/%d

To start we'll keep the mapping pretty simple (just one fragment_t) but 
later we can go for varying degrees of complexity.
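
To make this concrete, here is a minimal sketch (not the actual NewStore
code) of how a logical object offset could be resolved through the
data_map to a file path and an in-file offset.  The fragments/%d/%d path
comes from the fid_t comment above; the function and helper names are
illustrative only.

 #include <cstdint>
 #include <cstdio>
 #include <map>
 #include <string>

 struct fid_t      { uint32_t fset = 0, fno = 0; };
 struct fragment_t { uint32_t offset = 0, length = 0; fid_t fid; };
 struct onode_t {
   uint64_t size = 0;
   std::map<uint64_t, fragment_t> data_map;   // object offset -> fragment
 };

 // Resolve a logical object offset to (file path, offset within that file).
 static bool resolve(const onode_t& o, uint64_t obj_off,
                     std::string* path, uint64_t* file_off)
 {
   if (obj_off >= o.size || o.data_map.empty())
     return false;
   auto p = o.data_map.upper_bound(obj_off);  // first fragment past obj_off
   if (p == o.data_map.begin())
     return false;
   --p;                                       // fragment covering obj_off
   const fragment_t& f = p->second;
   uint64_t delta = obj_off - p->first;
   if (delta >= f.length)
     return false;                            // falls into a hole
   char name[64];
   snprintf(name, sizeof(name), "fragments/%u/%u",
            (unsigned)f.fid.fset, (unsigned)f.fid.fno);
   *path = name;
   *file_off = f.offset + delta;
   return true;
 }

With a single fragment_t per object this degenerates to "open
fragments/fset/fno and read at fragment offset + delta", which is the
simple case we'll start with.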

We lean on the kvdb for our transactions.

If we are creating new objects, we write data into a new file/fid, 
[aio_]fsync, and then commit the transaction.

If we are doing an overwrite, we include a write-ahead log (wal) 
item in our transaction, and then apply it afterwards.  For example, a 4k 
overwrite would include whatever metadata changes are needed, and a wal 
item that says "then overwrite this 4k in this fid with this data".  i.e., 
the worst case is more or less what FileStore is doing now with its 
journal, except here we're using the kvdb (and its journal) for that.  On 
restart we can queue up and apply any unapplied wal items.
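
As a rough sketch of the two orderings just described, here is what the
commit paths could look like against a stand-in for the kvdb (a plain map
with an "atomic" batch commit).  The key names and helpers are invented
for illustration and are not the real KeyValueDB API; short writes and
most error handling are ignored for brevity.

 #include <fcntl.h>
 #include <unistd.h>
 #include <cstddef>
 #include <cstdint>
 #include <map>
 #include <string>

 // Stand-in for the kvdb: a batch of updates applied "atomically".
 using Batch = std::map<std::string, std::string>;
 static std::map<std::string, std::string> g_db;
 static void db_commit(const Batch& b) { for (const auto& kv : b) g_db[kv.first] = kv.second; }
 static void db_erase(const std::string& k) { g_db.erase(k); }

 // New object: file data is durable *before* the metadata commit, so a
 // crash leaves either no object or a complete one.
 static int create_object(const std::string& path, const std::string& onode_key,
                          const std::string& onode_val, const char* data, size_t len)
 {
   int fd = ::open(path.c_str(), O_CREAT | O_WRONLY | O_TRUNC, 0644);
   if (fd < 0) return -1;
   if (::pwrite(fd, data, len, 0) < 0 || ::fsync(fd) < 0) { ::close(fd); return -1; }
   ::close(fd);
   Batch t;
   t[onode_key] = onode_val;            // onode now points at the new fid
   db_commit(t);                        // the kvdb commit is the commit point
   return 0;
 }

 // Overwrite: the intent (wal item) commits with the metadata; the file is
 // updated afterwards, and leftover wal keys are replayed on restart.
 static int overwrite(const std::string& path, const std::string& onode_key,
                      const std::string& onode_val, uint64_t off,
                      const char* data, size_t len)
 {
   Batch t;
   t[onode_key] = onode_val;
   t["wal/" + onode_key] = std::to_string(off) + ":" + std::string(data, len);
   db_commit(t);                        // durable intent
   int fd = ::open(path.c_str(), O_WRONLY);
   if (fd < 0) return -1;               // wal item remains; replay will retry
   ::pwrite(fd, data, len, off);
   ::fsync(fd);
   ::close(fd);
   db_erase("wal/" + onode_key);        // in the real store this delete would
   return 0;                            // be another kvdb transaction
 }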

An alternative approach here that we discussed a bit yesterday would be to 
write the small overwrites into the kvdb adjacent to the onode.  Actually 
writing them back to the file could be deferred until later, maybe when 
there are many small writes to be done together.

But right now the write behavior is very simple and handles just three 
cases (plus a possible fourth later; a rough sketch follows the list):

	https://github.com/liewegas/ceph/blob/wip-newstore/src/os/newstore/NewStore.cc#L1339

1. New object: create a new file and write there.

2. Append: append to an existing fid.  We store the size in the onode so 
we can be a bit sloppy: in the failure case (where we write some extra 
data to the file but don't commit the onode) we just ignore any trailing 
file data.

3. Anything else: generate a WAL item.

4. Maybe later, for some small [over]writes, we instead put the new data 
next to the onode.
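
Schematically, the case selection looks something like this (the real
logic lives in the _do_write path linked above; the names and exact
conditions here are illustrative):

 #include <cstdint>

 enum class WriteCase { NewFile, Append, WalOverwrite };

 // Pick a write strategy from the onode state; purely illustrative.
 static WriteCase classify_write(bool object_exists, uint64_t onode_size,
                                 uint64_t write_offset)
 {
   if (!object_exists)
     return WriteCase::NewFile;       // case 1: new fid, write, fsync, commit
   if (write_offset >= onode_size)
     return WriteCase::Append;        // case 2: append to the existing fid
   return WriteCase::WalOverwrite;    // case 3: commit a wal item, apply later
   // case 4 (future): small overwrites could be stored next to the onode
 }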

There is no omap yet.  I think we should do basically what DBObjectMap did 
(with a layer of indirection to allow clone etc), but we need to rejigger 
it so that the initial pointer into that structure is embedded in the 
onode.  We may want to do some other optimization to avoid extra 
indirection in the common case.  Leaving this for later, though...

We are designing for the case where the workload is already sharded across 
collections.  Each collection gets an in-memory Collection, which has its 
own RWLock and its own onode_map (SharedLRU cache).  A split will 
basically amount to registering the new collection in the kvdb and 
clearing the in-memory onode cache.
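
A sketch of what that per-collection state might look like; the SharedLRU
cache is approximated here with a plain map and the Onode contents are
elided, so treat the names as placeholders.

 #include <cstdint>
 #include <memory>
 #include <shared_mutex>
 #include <string>
 #include <unordered_map>

 struct Onode { uint64_t size = 0; /* decoded onode_t, in-flight state, ... */ };

 struct Collection {
   std::shared_mutex lock;     // the per-collection RWLock
   std::unordered_map<std::string, std::shared_ptr<Onode>> onode_map;  // cache stand-in

   // After a split, the new collection is registered in the kvdb; in memory
   // we only need to drop cached onodes so they are re-looked-up afterwards.
   void clear_cache_after_split() {
     std::unique_lock<std::shared_mutex> l(lock);
     onode_map.clear();
   }
 };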

There is a TransContext structure that is used to track the progress of a 
transaction.  It'll list which fd's need to get synced pre-commit, which 
onodes need to get written back in the transaction, and any WAL items to 
include and queue up after the transaction commits.  Right now the 
queue_transaction path does most of the work synchronously just to get 
things working.  Looking ahead I think what it needs to do is (sketched 
after these lists):

 - assemble the transaction
 - start any aio writes (we could use O_DIRECT here if the new hints 
include WONTNEED?)
 - start any aio fsync's
 - queue kvdb transaction
 - fire onreadable[_sync] notifications (I suspect we'll want to do this 
unconditionally; maybe we avoid using them entirely?)

On transaction commit,
 - fire commit notifications
 - queue WAL operations to a finisher
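
Roughly, the bookkeeping could look like the sketch below (field names are
illustrative, not the actual TransContext definition); the trailing
comment restates the intended asynchronous flow.

 #include <cstdint>
 #include <string>
 #include <utility>
 #include <vector>

 struct WalItem {
   std::string fid_path;   // file to patch after the kv commit
   uint64_t offset = 0;
   std::string data;
 };

 struct TransContext {
   std::vector<int> fds_to_sync;      // fsync/aio_fsync these before kv commit
   std::vector<std::pair<std::string, std::string>> onodes;  // key -> encoded onode
   std::vector<WalItem> wal;          // applied only after the kv commit
   bool kv_committed = false;         // readers may need to wait for wal drain
 };

 // Intended queue_transaction flow (today much of this is synchronous):
 //   1. assemble the kv transaction (onodes + wal items)
 //   2. start aio writes (possibly O_DIRECT when the hints say WONTNEED)
 //   3. start aio fsyncs for fds_to_sync
 //   4. queue the kvdb transaction
 //   5. fire onreadable[_sync] notifications
 //   6. on commit: fire commit notifications, hand wal items to a finisher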

The WAL ops will be linked to the TransContext so that if you want to do a 
read on the onode you can block until it completes.  If we keep the 
(currently simple) locking then we can use the Collection rwlock to block 
new writes while we wait for previous ones to apply.  Or we can get more 
granular with the read vs write locks, but I'm not sure it'll be any use 
until we make major changes in the OSD (like dispatching parallel reads 
within a PG).

Clone is annoying; if the FS doesn't support it natively (anything not 
btrfs) I think we should just do a sync read and then write for 
simplicity.

A few other thoughts:

- For a fast kvdb, we may want to do the transaction commit synchronously.  
For disk backends I think we'll want it async, though, to avoid blocking 
the caller.

- The fid_t has an inode number stashed in it.  The idea is to use 
open_by_handle to avoid traversing the (shallow) directory and go straight 
to the inode.  On XFS this means we traverse the inode btree to verify it 
is in fact a valid ino, which isn't totally ideal but probably what we 
have to live with.  Note that open_by_handle will work on any other 
(NFS-exportable) filesystem as well, so this is in no way XFS-specific. 
This isn't implemented yet, but when it is, we'll probably want to verify 
we got the right file by putting some id in an xattr; that way you could 
safely copy the whole thing to another filesystem and it could gracefully 
fall back to opening using the file names.
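
For illustration, a sketch of the open_by_handle idea using the Linux
name_to_handle_at/open_by_handle_at calls, with the xattr check and path
fallback described above.  The xattr name and the helper names are
assumptions, and error handling is trimmed (open_by_handle_at also needs
CAP_DAC_READ_SEARCH).

 #define _GNU_SOURCE
 #include <fcntl.h>
 #include <sys/xattr.h>
 #include <unistd.h>
 #include <string>
 #include <vector>

 // At create time: remember the handle bytes next to the fid.
 static std::vector<char> get_handle(int dirfd, const char* name)
 {
   std::vector<char> buf(sizeof(struct file_handle) + MAX_HANDLE_SZ);
   auto* fh = reinterpret_cast<struct file_handle*>(buf.data());
   fh->handle_bytes = MAX_HANDLE_SZ;
   int mount_id;
   if (name_to_handle_at(dirfd, name, fh, &mount_id, 0) < 0)
     buf.clear();                           // fall back to path-based opens
   return buf;
 }

 // At open time: try the handle, verify the stamped id, else use the path.
 // mount_fd is any fd inside the mounted fs (e.g. the fragments dir).
 static int open_fid(int mount_fd, const std::vector<char>& handle,
                     int dirfd, const char* name, const std::string& expect_id)
 {
   int fd = -1;
   if (!handle.empty()) {
     auto* fh = reinterpret_cast<const struct file_handle*>(handle.data());
     fd = open_by_handle_at(mount_fd,
                            const_cast<struct file_handle*>(fh), O_RDWR);
   }
   if (fd >= 0) {
     char idbuf[64] = {0};
     ssize_t n = fgetxattr(fd, "user.newstore.fid", idbuf, sizeof(idbuf) - 1);
     if (n < 0 || expect_id != idbuf) {     // recycled/foreign inode
       close(fd);
       fd = -1;
     }
   }
   if (fd < 0)
     fd = openat(dirfd, name, O_RDWR);      // plain path-based open
   return fd;
 }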

- I think we could build a variation on this implementation on top of an 
NVMe device instead of a file system.  It could pretty trivially lay out 
writes as a linear sweep across the virtual address space.  If the NVMe 
address space is big enough, maybe we could even avoid thinking about 
reusing addresses for deleted objects?  We'd just send a discard and then 
forget about it.  Not sure if the address space is really that big, 
though...  If not, we'd need to make a simple allocator (blah).
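
For what it's worth, the "linear sweep" idea amounts to a trivial bump
allocator over the device address space; something like this toy sketch,
where the discard call is a placeholder rather than a real NVMe interface.

 #include <cstdint>
 #include <stdexcept>

 class LinearAllocator {
   uint64_t next_ = 0;
   uint64_t end_;
   static constexpr uint64_t kAlign = 4096;
 public:
   explicit LinearAllocator(uint64_t device_bytes) : end_(device_bytes) {}

   // Hand out the next aligned extent; freed space is never reused.
   uint64_t alloc(uint64_t len) {
     uint64_t off = next_;
     next_ += (len + kAlign - 1) & ~(kAlign - 1);
     if (next_ > end_)
       throw std::runtime_error("address space exhausted");  // the "blah" case
     return off;
   }

   // Delete = tell the device it can drop the blocks; nothing is recycled.
   void free(uint64_t off, uint64_t len) {
     issue_discard(off, len);     // stand-in for an NVMe deallocate/TRIM
   }
 private:
   void issue_discard(uint64_t, uint64_t) {}  // placeholder
 };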

sage


* This follows in the Messenger's naming footsteps, which went like this: 
MPIMessenger, NewMessenger, NewerMessenger, SimpleMessenger (which ended 
up being anything but simple).


* Re: NewStore update
  2015-02-19 23:50 NewStore update Sage Weil
@ 2015-02-20 10:01 ` Haomai Wang
  2015-02-20 15:00   ` Sage Weil
  2015-02-21 15:50 ` Christoph Hellwig
  1 sibling, 1 reply; 8+ messages in thread
From: Haomai Wang @ 2015-02-20 10:01 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

So cool!

A few notes:

1. What about a sync thread in NewStore?
2. Could we consider skipping WAL for large overwrites (backfill, RGW)?
3. Sorry, what does [aio_]fsync mean?


-- 
Best Regards,

Wheat


* Re: NewStore update
  2015-02-20 10:01 ` Haomai Wang
@ 2015-02-20 15:00   ` Sage Weil
  2015-02-20 16:16     ` Haomai Wang
  2015-02-20 16:35     ` Mark Nelson
  0 siblings, 2 replies; 8+ messages in thread
From: Sage Weil @ 2015-02-20 15:00 UTC (permalink / raw)
  To: Haomai Wang; +Cc: ceph-devel

On Fri, 20 Feb 2015, Haomai Wang wrote:
> So cool!
> 
> A little notes:
> 
> 1. What about a sync thread in NewStore?

My thought right now is that there will be a WAL thread and (maybe) a 
transaction commit completion thread.  What do you mean by sync thread?

One thing I want to avoid is the current 'op' thread in FileStore.  
Instead of queueing a transaction we will start all of the aio operations 
synchronously.  This has the nice (?) side-effect that if there is memory 
backpressure it will block at submit time so we don't need to do our own 
throttling.  (...though we may want to do it ourselves later anyway.)

> 2. Could we consider skipping WAL for large overwrites (backfill, RGW)?

We do (or will)... if there is a truncate to 0 it doesn't need to do WAL 
at all.  The onode stores the size so we'll ignore any stray bytes after 
that in the file; that lets us do the truncate async after the txn 
commits.  (Slightly sloppy but the space leakage window is so small I 
don't think it's worth worrying about.)

> 3. Sorry, what does [aio_]fsync mean?

aio_fsync is just an fsync that's submitted as an aio operation.  It'll 
make fsync fit into the same bucket as the aio writes we queue up, and it 
also means that if/when the experimental batched fsync stuff goes into XFS 
we'll take advantage of it (lots of fsyncs will be merged into a single 
XFS transaction and be much more efficient).
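
For reference, with libaio an fsync can be queued exactly like the data
writes.  A minimal sketch (assumes linking against libaio; the fd and
io_context are set up elsewhere, and the iocb must stay alive until the
completion is reaped with io_getevents):

 #include <libaio.h>

 // Submit an async fsync on the same io_context as the data writes.
 static int submit_aio_fsync(io_context_t ctx, struct iocb* cb,
                             int fd, void* cookie)
 {
   io_prep_fsync(cb, fd);          // fsync expressed as an aio operation
   cb->data = cookie;              // e.g. a pointer back to the TransContext
   struct iocb* cbs[1] = { cb };
   return io_submit(ctx, 1, cbs);  // >0 on success (number of iocbs queued)
 }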

sage




* Re: NewStore update
  2015-02-20 15:00   ` Sage Weil
@ 2015-02-20 16:16     ` Haomai Wang
  2015-02-20 16:35     ` Mark Nelson
  1 sibling, 0 replies; 8+ messages in thread
From: Haomai Wang @ 2015-02-20 16:16 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

OK, I just looked through part of the code and see it now.

It looks like we sync metadata each time we do WAL, and the do_transaction
work happens ahead of the WAL items.  Could that cause higher latency than
before?  The latency of do_transactions can't simply be ignored in some
latency-sensitive cases, and it may trigger a lookup (get_onode).




-- 
Best Regards,

Wheat


* Re: NewStore update
  2015-02-20 15:00   ` Sage Weil
  2015-02-20 16:16     ` Haomai Wang
@ 2015-02-20 16:35     ` Mark Nelson
  1 sibling, 0 replies; 8+ messages in thread
From: Mark Nelson @ 2015-02-20 16:35 UTC (permalink / raw)
  To: Sage Weil, Haomai Wang; +Cc: ceph-devel



On 02/20/2015 09:00 AM, Sage Weil wrote:
> On Fri, 20 Feb 2015, Haomai Wang wrote:
>> So cool!
>>
>> A little notes:
>>
>> 1. What about a sync thread in NewStore?
>
> My thought right now is that there will be a WAL thread and (maybe) a
> transaction commit completion thread.  What do you mean by sync thread?
>
> One thing I want to avoid is the current 'op' thread in FileStore.
> Instead of queueing a transaction we will start all of the aio operations
> synchronously.  This has the nice (?) side-effect that if there is memory
> backpressure it will block at submit time so we don't need to do our own
> throttling.  (...though we may want to do it ourselves later anyway.)
>
>> 2. Could we consider skipping WAL for large overwrites (backfill, RGW)?
>
> We do (or will)... if there is a truncate to 0 it doesn't need to do WAL
> at all.  The onode stores the size so we'll ignore any stray bytes after
> that in the file; that lets us do the truncate async after the txn
> commits.  (Slightly sloppy but the space leakage window is so small I
> don't think it's worth worrying about.)
>
>> 3. Sorry, what does [aio_]fsync mean?
>
> aio_fsync is just an fsync that's submitted as an aio operation.  It'll
> make fsync fit into the same bucket as the aio writes we queue up, and it
> also means that if/when the experimental batched fsync stuff goes into XFS
> we'll take advantage of it (lots of fsyncs will be merged into a single
> XFS transaction and be much more efficient).

Looks like I need to reacquaint myself with aio.c again and figure out 
why it was breaking.  :)



* Re: NewStore update
  2015-02-19 23:50 NewStore update Sage Weil
  2015-02-20 10:01 ` Haomai Wang
@ 2015-02-21 15:50 ` Christoph Hellwig
  2015-02-21 17:53   ` Sage Weil
  1 sibling, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2015-02-21 15:50 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Thu, Feb 19, 2015 at 03:50:45PM -0800, Sage Weil wrote:
>  - assemble the transaction
>  - start any aio writes (we could use O_DIRECT here if the new hints 
> include WONTNEED?)

Note that kernel aio is only async if you specify O_DIRECT; otherwise
io_submit will simply block.


* Re: NewStore update
  2015-02-21 15:50 ` Christoph Hellwig
@ 2015-02-21 17:53   ` Sage Weil
  2015-02-22 15:51     ` Christoph Hellwig
  0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2015-02-21 17:53 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: ceph-devel

On Sat, 21 Feb 2015, Christoph Hellwig wrote:
> On Thu, Feb 19, 2015 at 03:50:45PM -0800, Sage Weil wrote:
> >  - assemble the transaction
> >  - start any aio writes (we could use O_DIRECT here if the new hints 
> > include WONTNEED?)
> 
> Note that kernel aio is only async if you specify O_DIRECT; otherwise
> io_submit will simply block.

Ah, thanks. I guess in the buffered case though we won't block normally 
anyway (unless we've hit the bdi dirty threshold).  So it's probably 
either aio direct or buffered write + aio fsync, depending on the cache 
hints?
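
For the "aio direct" branch, the buffer, length, and offset all need to be
aligned to the device's logical block size.  A rough sketch with libaio
(4K alignment assumed; the fd is opened with O_DIRECT elsewhere; error
handling and completion/cleanup are omitted):

 #include <libaio.h>
 #include <sys/types.h>
 #include <cstdlib>
 #include <cstring>

 // Queue one O_DIRECT write; the iocb and buffer must live until the
 // completion is reaped with io_getevents (freeing them is not shown).
 // Note: zero-padding the tail is only safe for new/appended data; an
 // overwrite would need a read-modify-write of the trailing block.
 static int submit_direct_write(io_context_t ctx, int fd,
                                const void* src, size_t len, off_t off)
 {
   const size_t align = 4096;                   // assumed logical block size
   size_t padded = (len + align - 1) & ~(align - 1);
   void* buf = nullptr;
   if (posix_memalign(&buf, align, padded) != 0)
     return -1;
   memset(buf, 0, padded);
   memcpy(buf, src, len);                       // zero-padded to a full block
   struct iocb* cb = new iocb;
   io_prep_pwrite(cb, fd, buf, padded, off);    // off must also be aligned
   cb->data = buf;                              // stash for the completion path
   struct iocb* cbs[1] = { cb };
   return io_submit(ctx, 1, cbs);
 }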

sage


* Re: NewStore update
  2015-02-21 17:53   ` Sage Weil
@ 2015-02-22 15:51     ` Christoph Hellwig
  0 siblings, 0 replies; 8+ messages in thread
From: Christoph Hellwig @ 2015-02-22 15:51 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Sat, Feb 21, 2015 at 09:53:45AM -0800, Sage Weil wrote:
> Ah, thanks. I guess in the buffered case though we won't block normally 
> anyway (unless we've hit the bdi dirty threshold).  So it's probably 
> either aio direct or buffered write + aio fsync, depending on the cache 
> hints?

Buffered I/O will also block on:

 - acquiring i_mutex (do you plan on having parallel writers to the same
   file?)
 - reading in the page for read-modify-write cycles
 - waiting for writeback to finish for a previous write to the page

In addition to all the other ways even O_DIRECT aio can block (most
importantly block allocation).

I have a hacked prototype to do non-blocking writes, similar to the
non-blocking reads we've been discussing on fsdevel for the last half
year.

