* newstore direction
@ 2015-10-19 19:49 Sage Weil
  2015-10-19 20:22 ` Robert LeBlanc
                   ` (7 more replies)
  0 siblings, 8 replies; 71+ messages in thread
From: Sage Weil @ 2015-10-19 19:49 UTC (permalink / raw)
  To: ceph-devel

The current design is based on two simple ideas:

 1) a key/value interface is a better way to manage all of our internal 
metadata (object metadata, attrs, layout, collection membership, 
write-ahead logging, overlay data, etc.)

 2) a file system is well suited for storing object data (as files).

So far 1 is working out well, but I'm questioning the wisdom of #2.  A few 
things:

 - We currently write the data to the file, fsync, then commit the kv 
transaction.  That's at least 3 IOs: one for the data, one for the fs 
journal, one for the kv txn to commit (at least once my rocksdb changes 
land... the kv commit is currently 2-3).  So two people are managing 
metadata, here: the fs managing the file metadata (with its own 
journal) and the kv backend (with its journal).

 - On read we have to open files by name, which means traversing the fs 
namespace.  Newstore tries to keep it as flat and simple as possible, but 
at a minimum it is a couple btree lookups.  We'd love to use open by 
handle (which would reduce this to 1 btree traversal), but running 
the daemon as ceph and not root makes that hard...

 - ...and file systems insist on updating mtime on writes, even when it is 
an overwrite with no allocation changes.  (We don't care about mtime.)  
O_NOCMTIME patches exist but it is hard to get these past the kernel 
brainfreeze.

 - XFS is (probably) never going to give us data checksums, which we 
want desperately.

But what's the alternative?  My thought is to just bite the bullet and 
consume a raw block device directly.  Write an allocator, hopefully keep 
it pretty simple, and manage it in the kv store along with all of our 
other metadata.
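
Roughly, the write path I have in mind would look something like the 
sketch below (names and structures here are made up for illustration; 
this is not actual newstore code):

  // write_path_sketch.cc -- illustrative only; all names are hypothetical.
  #include <fcntl.h>
  #include <unistd.h>
  #include <cstdint>
  #include <map>
  #include <string>

  // Stand-in for the kv backend: one atomic batch of key/value puts.
  struct KVTransaction {
    std::map<std::string, std::string> puts;
    void set(const std::string& k, const std::string& v) { puts[k] = v; }
  };

  struct Extent { uint64_t offset, length; };

  // New object write: 1 IO for the data, 1 IO for the kv commit.
  void write_new_object(int block_fd, uint64_t free_offset,
                        const std::string& oid, const std::string& data,
                        KVTransaction& txn) {
    ssize_t r = pwrite(block_fd, data.data(), data.size(), free_offset);
    (void)r;                                           // IO #1: the data
    Extent e{free_offset, data.size()};
    txn.set("object." + oid + ".extents",
            std::string(reinterpret_cast<char*>(&e), sizeof(e)));
    // ...caller submits txn to the kv store: IO #2 (the kv journal commit)
  }

  // Small overwrite: 1 IO up front (kv WAL); block update applied async.
  void overwrite_small(const std::string& oid, uint64_t obj_off,
                       const std::string& data, KVTransaction& txn) {
    txn.set("wal." + oid + "." + std::to_string(obj_off), data);
    // A background thread later replays the WAL entry onto the block
    // device and removes the key -- off the commit latency path.
  }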

Wins:

 - 2 IOs for most: one to write the data to unused space in the block 
device, one to commit our transaction (vs 4+ before).  For overwrites, 
we'd have one IO to do our write-ahead log (kv journal), then do 
the overwrite async (vs 4+ before).

 - No concern about mtime getting in the way

 - Faster reads (no fs lookup)

 - Similarly sized metadata for most objects.  If we assume most objects 
are not fragmented, then the metadata to store the block offsets is about 
the same size as the metadata to store the filenames we have now. 
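
To make the size comparison concrete: an unfragmented object needs 
roughly one (offset, length) pair, which is smaller than the escaped 
path strings we key files by today.  A back-of-the-envelope sketch 
(the filename below is made up; sizes are illustrative, not measured):

  #include <cstdio>
  #include <cstdint>
  #include <string>

  int main() {
    struct Extent { uint64_t offset, length; };
    size_t extent_bytes = sizeof(Extent);   // 16 bytes per extent

    // Roughly the kind of flattened object name we key files by today
    // (hash, pool, name, snap, etc. escaped into one string).
    std::string fname = "2ac74b2e.7.head.rbd_data.1234abcd.0000000000000042";
    printf("extent record: %zu bytes, filename: %zu bytes\n",
           extent_bytes, fname.size());
    return 0;
  }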

Problems:

 - We have to size the kv backend storage (probably still an XFS 
partition) vs the block storage.  Maybe we do this anyway (put metadata on 
SSD!) so it won't matter.  But what happens when we are storing gobs of 
rgw index data or cephfs metadata?  Suddenly we are pulling storage out of 
a different pool and those aren't currently fungible.

 - We have to write and maintain an allocator.  I'm still optimistic this 
can be reasonably simple, especially for the flash case (where 
fragmentation isn't such an issue as long as our blocks are reasonably 
sized).  For disk we may need to be moderately clever (a toy sketch of 
what I mean follows below this list).

 - We'll need a fsck to ensure our internal metadata is consistent.  The 
good news is it'll just need to validate what we have stored in the kv 
store.
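
To show what "reasonably simple" might mean, here is a toy first-fit 
extent allocator over an in-memory free map (illustrative only -- the 
real thing would need to persist its state through the kv store, take 
allocation hints, and be smarter about fragmentation on disk):

  #include <cstdint>
  #include <iterator>
  #include <map>

  class SimpleAllocator {
    std::map<uint64_t, uint64_t> free_;   // offset -> length of free extent
   public:
    explicit SimpleAllocator(uint64_t device_size) { free_[0] = device_size; }

    // First-fit allocate; returns true and sets *offset on success.
    bool allocate(uint64_t want, uint64_t* offset) {
      for (auto it = free_.begin(); it != free_.end(); ++it) {
        if (it->second < want) continue;
        *offset = it->first;
        uint64_t remaining = it->second - want;
        uint64_t new_off = it->first + want;
        free_.erase(it);
        if (remaining) free_[new_off] = remaining;
        return true;
      }
      return false;   // nothing big enough; caller would need to cope
    }

    // Return an extent to the free map, merging with its neighbours.
    void release(uint64_t offset, uint64_t length) {
      auto next = free_.lower_bound(offset);
      if (next != free_.begin()) {
        auto prev = std::prev(next);
        if (prev->first + prev->second == offset) {   // merge left
          offset = prev->first;
          length += prev->second;
          free_.erase(prev);
        }
      }
      if (next != free_.end() && offset + length == next->first) {
        length += next->second;                        // merge right
        free_.erase(next);
      }
      free_[offset] = length;
    }
  };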

Other thoughts:

 - We might want to consider whether dm-thin or bcache or other block 
layers might help us with elasticity of file vs block areas.

 - Rocksdb can push colder data to a second directory, so we could have a 
fast ssd primary area (for wal and most metadata) and a second hdd 
directory for stuff it has to push off.  Then have a conservative amount 
of file space on the hdd.  If our block fills up, use the existing file 
mechanism to put data there too.  (But then we have to maintain both the 
current kv + file approach and not go all-in on kv + block.)
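
For reference, rocksdb already exposes the tiering knob via 
Options::db_paths (plus wal_dir for the log); the paths and sizes below 
are made up, and the exact placement policy depends on the rocksdb 
version, but the shape is roughly:

  #include <rocksdb/db.h>
  #include <rocksdb/options.h>

  int main() {
    rocksdb::Options opts;
    opts.create_if_missing = true;
    opts.wal_dir = "/ssd/kv/wal";                       // wal on flash
    // Fill the ssd path up to ~20GB; colder, larger levels spill over
    // onto the hdd path.
    opts.db_paths.push_back(rocksdb::DbPath("/ssd/kv", 20ull << 30));
    opts.db_paths.push_back(rocksdb::DbPath("/hdd/kv", 2000ull << 30));
    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(opts, "/ssd/kv", &db);
    if (!s.ok()) return 1;
    delete db;
    return 0;
  }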

Thoughts?
sage


* Re: newstore direction
  2015-10-19 19:49 newstore direction Sage Weil
@ 2015-10-19 20:22 ` Robert LeBlanc
  2015-10-19 20:30 ` Somnath Roy
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 71+ messages in thread
From: Robert LeBlanc @ 2015-10-19 20:22 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

I think there is a lot that can be gained by Ceph managing a raw block
device. As I mentioned on ceph-users, I've given this some thought and
a lot of optimizations could be done that are conducive to storing
objects. I hadn't thought, however, of bypassing VFS altogether by
opening the raw device directly, but this would make things simpler as
you don't have to program things for VFS that don't make sense.

Some of my thoughts were to employ a hashing algorithm for inode
lookup (CRUSH-like). Is there a good use case for listing a directory?
We may need to keep a list for deletion, but there may be a better way
to handle this. Is there a need to do snapshots at the block layer if
operations can be atomic? Is there a real advantage to an allocation
unit as small as 4K, or does it make sense to use something like
512K?
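
Just to illustrate the sort of CRUSH-like lookup I mean: hash the
object id straight to a fixed-size slot on the raw device, so reads
need no directory or btree walk at all (toy example with a made-up
512K slot size, not a real design -- collisions would need chaining or
cuckoo-style displacement):

  #include <cstdint>
  #include <functional>
  #include <string>

  static const uint64_t kSlotSize = 512 * 1024;   // 512K allocation unit

  // Map an object id to a slot offset on the raw device.
  uint64_t slot_offset(const std::string& oid, uint64_t num_slots) {
    uint64_t h = std::hash<std::string>{}(oid);
    return (h % num_slots) * kSlotSize;
  }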

I'm interested in how this might pan out.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Oct 19, 2015 at 1:49 PM, Sage Weil <sweil@redhat.com> wrote:
> The current design is based on two simple ideas:
>
>  1) a key/value interface is better way to manage all of our internal
> metadata (object metadata, attrs, layout, collection membership,
> write-ahead logging, overlay data, etc.)
>
>  2) a file system is well suited for storage object data (as files).
>
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
> things:
>
>  - We currently write the data to the file, fsync, then commit the kv
> transaction.  That's at least 3 IOs: one for the data, one for the fs
> journal, one for the kv txn to commit (at least once my rocksdb changes
> land... the kv commit is currently 2-3).  So two people are managing
> metadata, here: the fs managing the file metadata (with its own
> journal) and the kv backend (with its journal).
>
>  - On read we have to open files by name, which means traversing the fs
> namespace.  Newstore tries to keep it as flat and simple as possible, but
> at a minimum it is a couple btree lookups.  We'd love to use open by
> handle (which would reduce this to 1 btree traversal), but running
> the daemon as ceph and not root makes that hard...
>
>  - ...and file systems insist on updating mtime on writes, even when it is
> a overwrite with no allocation changes.  (We don't care about mtime.)
> O_NOCMTIME patches exist but it is hard to get these past the kernel
> brainfreeze.
>
>  - XFS is (probably) never going going to give us data checksums, which we
> want desperately.
>
> But what's the alternative?  My thought is to just bite the bullet and
> consume a raw block device directly.  Write an allocator, hopefully keep
> it pretty simple, and manage it in kv store along with all of our other
> metadata.
>
> Wins:
>
>  - 2 IOs for most: one to write the data to unused space in the block
> device, one to commit our transaction (vs 4+ before).  For overwrites,
> we'd have one io to do our write-ahead log (kv journal), then do
> the overwrite async (vs 4+ before).
>
>  - No concern about mtime getting in the way
>
>  - Faster reads (no fs lookup)
>
>  - Similarly sized metadata for most objects.  If we assume most objects
> are not fragmented, then the metadata to store the block offsets is about
> the same size as the metadata to store the filenames we have now.
>
> Problems:
>
>  - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage.  Maybe we do this anyway (put metadata on
> SSD!) so it won't matter.  But what happens when we are storing gobs of
> rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
> a different pool and those aren't currently fungible.
>
>  - We have to write and maintain an allocator.  I'm still optimistic this
> can be reasonbly simple, especially for the flash case (where
> fragmentation isn't such an issue as long as our blocks are reasonbly
> sized).  For disk we may beed to be moderately clever.
>
>  - We'll need a fsck to ensure our internal metadata is consistent.  The
> good news is it'll just need to validate what we have stored in the kv
> store.
>
> Other thoughts:
>
>  - We might want to consider whether dm-thin or bcache or other block
> layers might help us with elasticity of file vs block areas.
>
>  - Rocksdb can push colder data to a second directory, so we could have a
> fast ssd primary area (for wal and most metadata) and a second hdd
> directory for stuff it has to push off.  Then have a conservative amount
> of file space on the hdd.  If our block fills up, use the existing file
> mechanism to put data there too.  (But then we have to maintain both the
> current kv + file approach and not go all-in on kv + block.)
>
> Thoughts?
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* RE: newstore direction
  2015-10-19 19:49 newstore direction Sage Weil
  2015-10-19 20:22 ` Robert LeBlanc
@ 2015-10-19 20:30 ` Somnath Roy
  2015-10-19 20:54   ` Sage Weil
  2015-10-19 21:18 ` Wido den Hollander
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 71+ messages in thread
From: Somnath Roy @ 2015-10-19 20:30 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

Sage,
I fully support that.  If we want to saturate SSDs, we need to get rid of this filesystem overhead (which I am in the process of measuring).
Also, it would be good if we can eliminate the dependency on the k/v DBs (for storing allocators and so on). The reason is the unknown write amplification they cause.

Thanks & Regards
Somnath


-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
Sent: Monday, October 19, 2015 12:49 PM
To: ceph-devel@vger.kernel.org
Subject: newstore direction

The current design is based on two simple ideas:

 1) a key/value interface is better way to manage all of our internal metadata (object metadata, attrs, layout, collection membership, write-ahead logging, overlay data, etc.)

 2) a file system is well suited for storage object data (as files).

So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
things:

 - We currently write the data to the file, fsync, then commit the kv transaction.  That's at least 3 IOs: one for the data, one for the fs journal, one for the kv txn to commit (at least once my rocksdb changes land... the kv commit is currently 2-3).  So two people are managing metadata, here: the fs managing the file metadata (with its own
journal) and the kv backend (with its journal).

 - On read we have to open files by name, which means traversing the fs namespace.  Newstore tries to keep it as flat and simple as possible, but at a minimum it is a couple btree lookups.  We'd love to use open by handle (which would reduce this to 1 btree traversal), but running the daemon as ceph and not root makes that hard...

 - ...and file systems insist on updating mtime on writes, even when it is a overwrite with no allocation changes.  (We don't care about mtime.) O_NOCMTIME patches exist but it is hard to get these past the kernel brainfreeze.

 - XFS is (probably) never going going to give us data checksums, which we want desperately.

But what's the alternative?  My thought is to just bite the bullet and consume a raw block device directly.  Write an allocator, hopefully keep it pretty simple, and manage it in kv store along with all of our other metadata.

Wins:

 - 2 IOs for most: one to write the data to unused space in the block device, one to commit our transaction (vs 4+ before).  For overwrites, we'd have one io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ before).

 - No concern about mtime getting in the way

 - Faster reads (no fs lookup)

 - Similarly sized metadata for most objects.  If we assume most objects are not fragmented, then the metadata to store the block offsets is about the same size as the metadata to store the filenames we have now.

Problems:

 - We have to size the kv backend storage (probably still an XFS
partition) vs the block storage.  Maybe we do this anyway (put metadata on
SSD!) so it won't matter.  But what happens when we are storing gobs of rgw index data or cephfs metadata?  Suddenly we are pulling storage out of a different pool and those aren't currently fungible.

 - We have to write and maintain an allocator.  I'm still optimistic this can be reasonbly simple, especially for the flash case (where fragmentation isn't such an issue as long as our blocks are reasonbly sized).  For disk we may beed to be moderately clever.

 - We'll need a fsck to ensure our internal metadata is consistent.  The good news is it'll just need to validate what we have stored in the kv store.

Other thoughts:

 - We might want to consider whether dm-thin or bcache or other block layers might help us with elasticity of file vs block areas.

 - Rocksdb can push colder data to a second directory, so we could have a fast ssd primary area (for wal and most metadata) and a second hdd directory for stuff it has to push off.  Then have a conservative amount of file space on the hdd.  If our block fills up, use the existing file mechanism to put data there too.  (But then we have to maintain both the current kv + file approach and not go all-in on kv + block.)

Thoughts?
sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html




* RE: newstore direction
  2015-10-19 20:30 ` Somnath Roy
@ 2015-10-19 20:54   ` Sage Weil
  2015-10-19 22:21     ` James (Fei) Liu-SSI
  0 siblings, 1 reply; 71+ messages in thread
From: Sage Weil @ 2015-10-19 20:54 UTC (permalink / raw)
  To: Somnath Roy; +Cc: ceph-devel

On Mon, 19 Oct 2015, Somnath Roy wrote:
> Sage,
> I fully support that.  If we want to saturate SSDs , we need to get rid 
> of this filesystem overhead (which I am in process of measuring). Also, 
> it will be good if we can eliminate the dependency on the k/v dbs (for 
> storing allocators and all). The reason is the unknown write amps they 
> causes.

My hope is to keep this behind the KeyValueDB interface (and/or change it 
as appropriate) so that other backends can be easily swapped in (e.g. a 
btree-based one for high-end flash).
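
That is, something with roughly this shape -- a simplified sketch of
the abstraction, not the actual KeyValueDB header -- so a rocksdb-,
leveldb-, or btree-backed implementation can sit behind the same calls:

  #include <memory>
  #include <string>

  // Simplified sketch of a swappable kv backend interface.
  struct KVBackend {
    struct Transaction {
      virtual ~Transaction() {}
      virtual void set(const std::string& prefix, const std::string& key,
                       const std::string& value) = 0;
      virtual void rmkey(const std::string& prefix,
                         const std::string& key) = 0;
    };
    virtual ~KVBackend() {}
    virtual std::shared_ptr<Transaction> get_transaction() = 0;
    virtual int submit_transaction_sync(std::shared_ptr<Transaction> t) = 0;
    virtual int get(const std::string& prefix, const std::string& key,
                    std::string* value) = 0;
  };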

sage


> 
> Thanks & Regards
> Somnath
> 
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, October 19, 2015 12:49 PM
> To: ceph-devel@vger.kernel.org
> Subject: newstore direction
> 
> The current design is based on two simple ideas:
> 
>  1) a key/value interface is better way to manage all of our internal metadata (object metadata, attrs, layout, collection membership, write-ahead logging, overlay data, etc.)
> 
>  2) a file system is well suited for storage object data (as files).
> 
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
> things:
> 
>  - We currently write the data to the file, fsync, then commit the kv transaction.  That's at least 3 IOs: one for the data, one for the fs journal, one for the kv txn to commit (at least once my rocksdb changes land... the kv commit is currently 2-3).  So two people are managing metadata, here: the fs managing the file metadata (with its own
> journal) and the kv backend (with its journal).
> 
>  - On read we have to open files by name, which means traversing the fs namespace.  Newstore tries to keep it as flat and simple as possible, but at a minimum it is a couple btree lookups.  We'd love to use open by handle (which would reduce this to 1 btree traversal), but running the daemon as ceph and not root makes that hard...
> 
>  - ...and file systems insist on updating mtime on writes, even when it is a overwrite with no allocation changes.  (We don't care about mtime.) O_NOCMTIME patches exist but it is hard to get these past the kernel brainfreeze.
> 
>  - XFS is (probably) never going going to give us data checksums, which we want desperately.
> 
> But what's the alternative?  My thought is to just bite the bullet and consume a raw block device directly.  Write an allocator, hopefully keep it pretty simple, and manage it in kv store along with all of our other metadata.
> 
> Wins:
> 
>  - 2 IOs for most: one to write the data to unused space in the block device, one to commit our transaction (vs 4+ before).  For overwrites, we'd have one io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ before).
> 
>  - No concern about mtime getting in the way
> 
>  - Faster reads (no fs lookup)
> 
>  - Similarly sized metadata for most objects.  If we assume most objects are not fragmented, then the metadata to store the block offsets is about the same size as the metadata to store the filenames we have now.
> 
> Problems:
> 
>  - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage.  Maybe we do this anyway (put metadata on
> SSD!) so it won't matter.  But what happens when we are storing gobs of rgw index data or cephfs metadata?  Suddenly we are pulling storage out of a different pool and those aren't currently fungible.
> 
>  - We have to write and maintain an allocator.  I'm still optimistic this can be reasonbly simple, especially for the flash case (where fragmentation isn't such an issue as long as our blocks are reasonbly sized).  For disk we may beed to be moderately clever.
> 
>  - We'll need a fsck to ensure our internal metadata is consistent.  The good news is it'll just need to validate what we have stored in the kv store.
> 
> Other thoughts:
> 
>  - We might want to consider whether dm-thin or bcache or other block layers might help us with elasticity of file vs block areas.
> 
>  - Rocksdb can push colder data to a second directory, so we could have a fast ssd primary area (for wal and most metadata) and a second hdd directory for stuff it has to push off.  Then have a conservative amount of file space on the hdd.  If our block fills up, use the existing file mechanism to put data there too.  (But then we have to maintain both the current kv + file approach and not go all-in on kv + block.)
> 
> Thoughts?
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> ________________________________
> 
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


* Re: newstore direction
  2015-10-19 19:49 newstore direction Sage Weil
  2015-10-19 20:22 ` Robert LeBlanc
  2015-10-19 20:30 ` Somnath Roy
@ 2015-10-19 21:18 ` Wido den Hollander
  2015-10-19 22:40 ` Varada Kari
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 71+ messages in thread
From: Wido den Hollander @ 2015-10-19 21:18 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

On 10/19/2015 09:49 PM, Sage Weil wrote:
> The current design is based on two simple ideas:
> 
>  1) a key/value interface is better way to manage all of our internal 
> metadata (object metadata, attrs, layout, collection membership, 
> write-ahead logging, overlay data, etc.)
> 
>  2) a file system is well suited for storage object data (as files).
> 
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few 
> things:
> 
>  - We currently write the data to the file, fsync, then commit the kv 
> transaction.  That's at least 3 IOs: one for the data, one for the fs 
> journal, one for the kv txn to commit (at least once my rocksdb changes 
> land... the kv commit is currently 2-3).  So two people are managing 
> metadata, here: the fs managing the file metadata (with its own 
> journal) and the kv backend (with its journal).
> 
>  - On read we have to open files by name, which means traversing the fs 
> namespace.  Newstore tries to keep it as flat and simple as possible, but 
> at a minimum it is a couple btree lookups.  We'd love to use open by 
> handle (which would reduce this to 1 btree traversal), but running 
> the daemon as ceph and not root makes that hard...
> 
>  - ...and file systems insist on updating mtime on writes, even when it is 
> a overwrite with no allocation changes.  (We don't care about mtime.)  
> O_NOCMTIME patches exist but it is hard to get these past the kernel 
> brainfreeze.
> 
>  - XFS is (probably) never going going to give us data checksums, which we 
> want desperately.
> 
> But what's the alternative?  My thought is to just bite the bullet and 
> consume a raw block device directly.  Write an allocator, hopefully keep 
> it pretty simple, and manage it in kv store along with all of our other 
> metadata.
> 
> Wins:
> 
>  - 2 IOs for most: one to write the data to unused space in the block 
> device, one to commit our transaction (vs 4+ before).  For overwrites, 
> we'd have one io to do our write-ahead log (kv journal), then do 
> the overwrite async (vs 4+ before).
> 
>  - No concern about mtime getting in the way
> 
>  - Faster reads (no fs lookup)
> 
>  - Similarly sized metadata for most objects.  If we assume most objects 
> are not fragmented, then the metadata to store the block offsets is about 
> the same size as the metadata to store the filenames we have now. 
> 
> Problems:
> 
>  - We have to size the kv backend storage (probably still an XFS 
> partition) vs the block storage.  Maybe we do this anyway (put metadata on 
> SSD!) so it won't matter.  But what happens when we are storing gobs of 
> rgw index data or cephfs metadata?  Suddenly we are pulling storage out of 
> a different pool and those aren't currently fungible.
> 
>  - We have to write and maintain an allocator.  I'm still optimistic this 
> can be reasonbly simple, especially for the flash case (where 
> fragmentation isn't such an issue as long as our blocks are reasonbly 
> sized).  For disk we may beed to be moderately clever.
> 
>  - We'll need a fsck to ensure our internal metadata is consistent.  The 
> good news is it'll just need to validate what we have stored in the kv 
> store.
> 
> Other thoughts:
> 
>  - We might want to consider whether dm-thin or bcache or other block 
> layers might help us with elasticity of file vs block areas.
> 

I've been using bcache for a while now in production and that helped a lot.

Intel SSDs with GPT. First few partitions as Journals and then one big
partition for bcache.

/dev/bcache0    2.8T  264G  2.5T  10% /var/lib/ceph/osd/ceph-60
/dev/bcache1    2.8T  317G  2.5T  12% /var/lib/ceph/osd/ceph-61
/dev/bcache2    2.8T  303G  2.5T  11% /var/lib/ceph/osd/ceph-62
/dev/bcache3    2.8T  316G  2.5T  12% /var/lib/ceph/osd/ceph-63
/dev/bcache4    2.8T  167G  2.6T   6% /var/lib/ceph/osd/ceph-64
/dev/bcache5    2.8T  295G  2.5T  11% /var/lib/ceph/osd/ceph-65

The bcache maintainers also presented bcachefs:
https://lkml.org/lkml/2015/8/21/22

"checksumming, compression: currently only zlib is supported for
compression, and for checksumming there's crc32c and a 64 bit checksum."

Wouldn't that be something that could be leveraged? Consuming a raw
block device seems like re-inventing the wheel to me. I might be wrong
though.

I have no idea how stable bcachefs is, but it might be worth looking into.

>  - Rocksdb can push colder data to a second directory, so we could have a 
> fast ssd primary area (for wal and most metadata) and a second hdd 
> directory for stuff it has to push off.  Then have a conservative amount 
> of file space on the hdd.  If our block fills up, use the existing file 
> mechanism to put data there too.  (But then we have to maintain both the 
> current kv + file approach and not go all-in on kv + block.)
> 
> Thoughts?
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


* RE: newstore direction
  2015-10-19 20:54   ` Sage Weil
@ 2015-10-19 22:21     ` James (Fei) Liu-SSI
  2015-10-20  2:24       ` Chen, Xiaoxi
                         ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: James (Fei) Liu-SSI @ 2015-10-19 22:21 UTC (permalink / raw)
  To: Sage Weil, Somnath Roy; +Cc: ceph-devel

Hi Sage and Somnath,
  In my humble opinion, there is another, more aggressive solution than a raw-block-device-based key/value store as the backend for the objectstore: a new key/value SSD device with transaction support would be ideal to solve these issues. First of all, it is a raw SSD device. Secondly, it provides a key/value interface directly from the SSD. Thirdly, it can provide transaction support, so consistency will be guaranteed by the hardware device. It pretty much satisfies all of the objectstore's needs without any extra overhead, since there is no extra layer between the device and the objectstore.
   Either way, I strongly support having CEPH's own data format instead of relying on a filesystem.
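
For the sake of discussion, the device interface I am imagining would
look roughly like this (entirely hypothetical -- not any particular
vendor's API):

  #include <string>
  #include <vector>

  // Hypothetical transactional key/value SSD interface.
  struct KVSsd {
    struct Op { std::string key; std::string value; bool is_delete; };

    // All ops in the batch become durable atomically, or none do;
    // the device firmware guarantees consistency across power loss.
    virtual int submit_transaction(const std::vector<Op>& batch) = 0;
    virtual int get(const std::string& key, std::string* value) = 0;
    virtual ~KVSsd() {}
  };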

  Regards,
  James

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
Sent: Monday, October 19, 2015 1:55 PM
To: Somnath Roy
Cc: ceph-devel@vger.kernel.org
Subject: RE: newstore direction

On Mon, 19 Oct 2015, Somnath Roy wrote:
> Sage,
> I fully support that.  If we want to saturate SSDs , we need to get 
> rid of this filesystem overhead (which I am in process of measuring). 
> Also, it will be good if we can eliminate the dependency on the k/v 
> dbs (for storing allocators and all). The reason is the unknown write 
> amps they causes.

My hope is to keep behing the KeyValueDB interface (and/more change it as
appropriate) so that other backends can be easily swapped in (e.g. a btree-based one for high-end flash).

sage


> 
> Thanks & Regards
> Somnath
> 
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, October 19, 2015 12:49 PM
> To: ceph-devel@vger.kernel.org
> Subject: newstore direction
> 
> The current design is based on two simple ideas:
> 
>  1) a key/value interface is better way to manage all of our internal 
> metadata (object metadata, attrs, layout, collection membership, 
> write-ahead logging, overlay data, etc.)
> 
>  2) a file system is well suited for storage object data (as files).
> 
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A 
> few
> things:
> 
>  - We currently write the data to the file, fsync, then commit the kv 
> transaction.  That's at least 3 IOs: one for the data, one for the fs 
> journal, one for the kv txn to commit (at least once my rocksdb 
> changes land... the kv commit is currently 2-3).  So two people are 
> managing metadata, here: the fs managing the file metadata (with its 
> own
> journal) and the kv backend (with its journal).
> 
>  - On read we have to open files by name, which means traversing the fs namespace.  Newstore tries to keep it as flat and simple as possible, but at a minimum it is a couple btree lookups.  We'd love to use open by handle (which would reduce this to 1 btree traversal), but running the daemon as ceph and not root makes that hard...
> 
>  - ...and file systems insist on updating mtime on writes, even when it is a overwrite with no allocation changes.  (We don't care about mtime.) O_NOCMTIME patches exist but it is hard to get these past the kernel brainfreeze.
> 
>  - XFS is (probably) never going going to give us data checksums, which we want desperately.
> 
> But what's the alternative?  My thought is to just bite the bullet and consume a raw block device directly.  Write an allocator, hopefully keep it pretty simple, and manage it in kv store along with all of our other metadata.
> 
> Wins:
> 
>  - 2 IOs for most: one to write the data to unused space in the block device, one to commit our transaction (vs 4+ before).  For overwrites, we'd have one io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ before).
> 
>  - No concern about mtime getting in the way
> 
>  - Faster reads (no fs lookup)
> 
>  - Similarly sized metadata for most objects.  If we assume most objects are not fragmented, then the metadata to store the block offsets is about the same size as the metadata to store the filenames we have now.
> 
> Problems:
> 
>  - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage.  Maybe we do this anyway (put 
> metadata on
> SSD!) so it won't matter.  But what happens when we are storing gobs of rgw index data or cephfs metadata?  Suddenly we are pulling storage out of a different pool and those aren't currently fungible.
> 
>  - We have to write and maintain an allocator.  I'm still optimistic this can be reasonbly simple, especially for the flash case (where fragmentation isn't such an issue as long as our blocks are reasonbly sized).  For disk we may beed to be moderately clever.
> 
>  - We'll need a fsck to ensure our internal metadata is consistent.  The good news is it'll just need to validate what we have stored in the kv store.
> 
> Other thoughts:
> 
>  - We might want to consider whether dm-thin or bcache or other block layers might help us with elasticity of file vs block areas.
> 
>  - Rocksdb can push colder data to a second directory, so we could 
> have a fast ssd primary area (for wal and most metadata) and a second 
> hdd directory for stuff it has to push off.  Then have a conservative 
> amount of file space on the hdd.  If our block fills up, use the 
> existing file mechanism to put data there too.  (But then we have to 
> maintain both the current kv + file approach and not go all-in on kv + 
> block.)
> 
> Thoughts?
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> 
> ________________________________
> 
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html


* RE: newstore direction
  2015-10-19 19:49 newstore direction Sage Weil
                   ` (2 preceding siblings ...)
  2015-10-19 21:18 ` Wido den Hollander
@ 2015-10-19 22:40 ` Varada Kari
  2015-10-20  0:48 ` John Spray
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 71+ messages in thread
From: Varada Kari @ 2015-10-19 22:40 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

Hi Sage,

If we are managing the raw device, does it make sense to have a key/value store manage the whole space?
Keeping metadata for the allocator might cause some other consistency problems. Getting an fsck for that implementation could be tougher; we might have to have strict crc computations on the data, and we would have to manage the sanity of the DB managing them.
If we can have a common mechanism that keeps data and metadata in the same key/value store, it will improve performance.
We have integrated a custom-made key/value store which works on a raw device as the key/value store backend, and we have observed better bandwidth utilization and IOPS.
Reads/writes can be faster and no fs lookup is needed. We have tools like fsck to take care of the consistency of the DB.
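
As a rough illustration of the crc side of that: store a checksum next
to each value on write and have fsck/scrub recompute it on read (zlib's
crc32 is used here purely as an example of the mechanics, not a full
scrub design):

  #include <zlib.h>
  #include <cstdint>
  #include <cstring>
  #include <string>

  // Append a crc32 to the value when writing...
  std::string pack_with_crc(const std::string& value) {
    uint32_t crc = crc32(0L, Z_NULL, 0);
    crc = crc32(crc, reinterpret_cast<const Bytef*>(value.data()),
                value.size());
    std::string out = value;
    out.append(reinterpret_cast<const char*>(&crc), sizeof(crc));
    return out;
  }

  // ...and have fsck/scrub verify it when reading the value back.
  bool verify_crc(const std::string& packed, std::string* value) {
    if (packed.size() < sizeof(uint32_t)) return false;
    size_t len = packed.size() - sizeof(uint32_t);
    uint32_t stored;
    memcpy(&stored, packed.data() + len, sizeof(stored));
    uint32_t crc = crc32(0L, Z_NULL, 0);
    crc = crc32(crc, reinterpret_cast<const Bytef*>(packed.data()), len);
    if (crc != stored) return false;
    *value = packed.substr(0, len);
    return true;
  }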

Couple of comments inline.

Thanks,
Varada

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Tuesday, October 20, 2015 1:19 AM
> To: ceph-devel@vger.kernel.org
> Subject: newstore direction
> 
> The current design is based on two simple ideas:
> 
>  1) a key/value interface is better way to manage all of our internal metadata
> (object metadata, attrs, layout, collection membership, write-ahead logging,
> overlay data, etc.)
> 
>  2) a file system is well suited for storage object data (as files).
> 
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
> things:
> 
>  - We currently write the data to the file, fsync, then commit the kv
> transaction.  That's at least 3 IOs: one for the data, one for the fs journal, one
> for the kv txn to commit (at least once my rocksdb changes land... the kv
> commit is currently 2-3).  So two people are managing metadata, here: the fs
> managing the file metadata (with its own
> journal) and the kv backend (with its journal).
> 
>  - On read we have to open files by name, which means traversing the fs
> namespace.  Newstore tries to keep it as flat and simple as possible, but at a
> minimum it is a couple btree lookups.  We'd love to use open by handle
> (which would reduce this to 1 btree traversal), but running the daemon as
> ceph and not root makes that hard...
> 
>  - ...and file systems insist on updating mtime on writes, even when it is a
> overwrite with no allocation changes.  (We don't care about mtime.)
> O_NOCMTIME patches exist but it is hard to get these past the kernel
> brainfreeze.
> 
>  - XFS is (probably) never going going to give us data checksums, which we
> want desperately.
> 
> But what's the alternative?  My thought is to just bite the bullet and consume
> a raw block device directly.  Write an allocator, hopefully keep it pretty
> simple, and manage it in kv store along with all of our other metadata.
> 
> Wins:
> 
>  - 2 IOs for most: one to write the data to unused space in the block device,
> one to commit our transaction (vs 4+ before).  For overwrites, we'd have one
> io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+
> before).
> 
>  - No concern about mtime getting in the way
> 
>  - Faster reads (no fs lookup)
> 
>  - Similarly sized metadata for most objects.  If we assume most objects are
> not fragmented, then the metadata to store the block offsets is about the
> same size as the metadata to store the filenames we have now.
> 
> Problems:
> 
>  - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage.  Maybe we do this anyway (put metadata on
> SSD!) so it won't matter.  But what happens when we are storing gobs of rgw
> index data or cephfs metadata?  Suddenly we are pulling storage out of a
> different pool and those aren't currently fungible.

[Varada Kari]  Ideally, if we can manage the raw device as a key/value store indirection that handles both metadata and data, we can benefit from faster lookups and writes (if the KV store supports batched atomic transactional writes). SSDs might suffer more write amplification if we put only the metadata there; if we can make this part (the KV store dealing with the raw device) also handle small writes, we can avoid write amplification and get better throughput from the device.

>  - We have to write and maintain an allocator.  I'm still optimistic this can be
> reasonbly simple, especially for the flash case (where fragmentation isn't
> such an issue as long as our blocks are reasonbly sized).  For disk we may
> beed to be moderately clever.
> 
[Varada Kari] Yes. If the writes are aligned to the flash programmable page size, that will not cause any issues. But writes smaller than the programmable page size will cause internal fragmentation, and repeated overwrites of the same page will cause more write amplification.

>  - We'll need a fsck to ensure our internal metadata is consistent.  The good
> news is it'll just need to validate what we have stored in the kv store.
> 
> Other thoughts:
> 
>  - We might want to consider whether dm-thin or bcache or other block
> layers might help us with elasticity of file vs block areas.
> 
>  - Rocksdb can push colder data to a second directory, so we could have a fast
> ssd primary area (for wal and most metadata) and a second hdd directory for
> stuff it has to push off.  Then have a conservative amount of file space on the
> hdd.  If our block fills up, use the existing file mechanism to put data there
> too.  (But then we have to maintain both the current kv + file approach and
> not go all-in on kv + block.)
> 
> Thoughts?
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html


* Re: newstore direction
  2015-10-19 19:49 newstore direction Sage Weil
                   ` (3 preceding siblings ...)
  2015-10-19 22:40 ` Varada Kari
@ 2015-10-20  0:48 ` John Spray
  2015-10-20 20:00   ` Sage Weil
  2015-10-20  2:08 ` Haomai Wang
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 71+ messages in thread
From: John Spray @ 2015-10-20  0:48 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development

On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil <sweil@redhat.com> wrote:
>  - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage.  Maybe we do this anyway (put metadata on
> SSD!) so it won't matter.  But what happens when we are storing gobs of
> rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
> a different pool and those aren't currently fungible.

This is the concerning bit for me -- for the other parts one "just" has
to get the code right, but this problem could linger and be something we
have to keep explaining to users indefinitely.  It reminds me of cases
in other systems where users had to make an educated guess about inode
size up front, depending on whether they expected to efficiently store
a lot of xattrs.

In practice it's rare for users to make these kinds of decisions well
up-front: it really needs to be adjustable later, ideally
automatically.  That could be pretty straightforward if the KV part
was stored directly on block storage, instead of having XFS in the
mix.  I'm not quite up with the state of the art in this area: are
there any reasonable alternatives for the KV part that would consume
some defined range of a block device from userspace, instead of
sitting on top of a filesystem?

John


* Re: newstore direction
  2015-10-19 19:49 newstore direction Sage Weil
                   ` (4 preceding siblings ...)
  2015-10-20  0:48 ` John Spray
@ 2015-10-20  2:08 ` Haomai Wang
  2015-10-20 12:25   ` Sage Weil
  2015-10-20  7:06 ` Dałek, Piotr
  2015-10-20 18:31 ` Ric Wheeler
  7 siblings, 1 reply; 71+ messages in thread
From: Haomai Wang @ 2015-10-20  2:08 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Tue, Oct 20, 2015 at 3:49 AM, Sage Weil <sweil@redhat.com> wrote:
> The current design is based on two simple ideas:
>
>  1) a key/value interface is better way to manage all of our internal
> metadata (object metadata, attrs, layout, collection membership,
> write-ahead logging, overlay data, etc.)
>
>  2) a file system is well suited for storage object data (as files).
>
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
> things:
>
>  - We currently write the data to the file, fsync, then commit the kv
> transaction.  That's at least 3 IOs: one for the data, one for the fs
> journal, one for the kv txn to commit (at least once my rocksdb changes
> land... the kv commit is currently 2-3).  So two people are managing
> metadata, here: the fs managing the file metadata (with its own
> journal) and the kv backend (with its journal).
>
>  - On read we have to open files by name, which means traversing the fs
> namespace.  Newstore tries to keep it as flat and simple as possible, but
> at a minimum it is a couple btree lookups.  We'd love to use open by
> handle (which would reduce this to 1 btree traversal), but running
> the daemon as ceph and not root makes that hard...
>
>  - ...and file systems insist on updating mtime on writes, even when it is
> a overwrite with no allocation changes.  (We don't care about mtime.)
> O_NOCMTIME patches exist but it is hard to get these past the kernel
> brainfreeze.
>
>  - XFS is (probably) never going going to give us data checksums, which we
> want desperately.
>
> But what's the alternative?  My thought is to just bite the bullet and
> consume a raw block device directly.  Write an allocator, hopefully keep
> it pretty simple, and manage it in kv store along with all of our other
> metadata.

This is really a tough decision, although the idea of a block device
based objectstore has never left my mind these past two years.

What concerns me is how space utilization would compare to a local fs,
the potential for bugs, and the time it would take to build even a tiny
local filesystem. I'm a little afraid of what we could get stuck in....

>
> Wins:
>
>  - 2 IOs for most: one to write the data to unused space in the block
> device, one to commit our transaction (vs 4+ before).  For overwrites,
> we'd have one io to do our write-ahead log (kv journal), then do
> the overwrite async (vs 4+ before).

Compared to FileJournal, the key/value DB doesn't seem to play well in
the WAL area, judging from my perf results.

>
>  - No concern about mtime getting in the way
>
>  - Faster reads (no fs lookup)
>
>  - Similarly sized metadata for most objects.  If we assume most objects
> are not fragmented, then the metadata to store the block offsets is about
> the same size as the metadata to store the filenames we have now.
>
> Problems:
>
>  - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage.  Maybe we do this anyway (put metadata on
> SSD!) so it won't matter.  But what happens when we are storing gobs of
> rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
> a different pool and those aren't currently fungible.
>
>  - We have to write and maintain an allocator.  I'm still optimistic this
> can be reasonbly simple, especially for the flash case (where
> fragmentation isn't such an issue as long as our blocks are reasonbly
> sized).  For disk we may beed to be moderately clever.
>
>  - We'll need a fsck to ensure our internal metadata is consistent.  The
> good news is it'll just need to validate what we have stored in the kv
> store.
>
> Other thoughts:
>
>  - We might want to consider whether dm-thin or bcache or other block
> layers might help us with elasticity of file vs block areas.
>
>  - Rocksdb can push colder data to a second directory, so we could have a
> fast ssd primary area (for wal and most metadata) and a second hdd
> directory for stuff it has to push off.  Then have a conservative amount
> of file space on the hdd.  If our block fills up, use the existing file
> mechanism to put data there too.  (But then we have to maintain both the
> current kv + file approach and not go all-in on kv + block.)

A complex way...

Actually I would like to pursue a FileStore2 implementation, which
means we still use FileJournal (or something like it), but we use more
memory to keep metadata/xattrs and use aio+dio to flush to disk. A
userspace pagecache would need to be implemented. Then we can skip the
journal for full writes: because the OSD isolates work per PG, we could
use a per-PG barrier when skipping the journal. @Sage, are there other
concerns with FileStore skipping the journal?
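
To be concrete about the aio+dio building block (just the syscall
pattern for flushing from a userspace cache without the kernel page
cache -- not a FileStore2 design; requires libaio):

  #include <fcntl.h>
  #include <libaio.h>
  #include <unistd.h>
  #include <cstdlib>
  #include <cstring>

  int dio_write_block(const char* path, off_t offset,
                      const char* data, size_t len) {
    int fd = open(path, O_WRONLY | O_DIRECT);
    if (fd < 0) return -1;

    // O_DIRECT needs an aligned buffer, offset and length (4K here).
    void* buf = nullptr;
    if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return -1; }
    memset(buf, 0, 4096);
    memcpy(buf, data, len < 4096 ? len : 4096);

    io_context_t ctx = 0;
    if (io_setup(1, &ctx) < 0) { free(buf); close(fd); return -1; }

    struct iocb cb;
    struct iocb* cbs[1] = { &cb };
    io_prep_pwrite(&cb, fd, buf, 4096, offset);
    int rc = io_submit(ctx, 1, cbs);

    struct io_event ev;
    if (rc == 1) io_getevents(ctx, 1, 1, &ev, nullptr);  // wait for completion

    io_destroy(ctx);
    free(buf);
    close(fd);
    return (rc == 1 && ev.res == 4096) ? 0 : -1;
  }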

In short, I like the model that FileStore uses, but it would need a big
refactor of the existing implementation.

Sorry to interrupt the train of thought....

>
> Thoughts?
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat


* RE: newstore direction
  2015-10-19 22:21     ` James (Fei) Liu-SSI
@ 2015-10-20  2:24       ` Chen, Xiaoxi
  2015-10-20 12:30         ` Sage Weil
  2015-10-20  2:32       ` Varada Kari
  2015-10-20 12:34       ` Sage Weil
  2 siblings, 1 reply; 71+ messages in thread
From: Chen, Xiaoxi @ 2015-10-20  2:24 UTC (permalink / raw)
  To: James (Fei) Liu-SSI, Sage Weil, Somnath Roy; +Cc: ceph-devel

+1.  Nowadays K-V DBs care more about very small key-value pairs, say several bytes to a few KB, but in the SSD case we only care about 4KB or 8KB. In this regard, NVMKV is a good design, and it seems some of the SSD vendors are also trying to build this kind of interface; we have an NVM-L library, but it is still under development.
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
> Sent: Tuesday, October 20, 2015 6:21 AM
> To: Sage Weil; Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: newstore direction
> 
> Hi Sage and Somnath,
>   In my humble opinion, There is another more aggressive  solution than raw
> block device base keyvalue store as backend for objectstore. The new key
> value  SSD device with transaction support would be  ideal to solve the issues.
> First of all, it is raw SSD device. Secondly , It provides key value interface
> directly from SSD. Thirdly, it can provide transaction support, consistency will
> be guaranteed by hardware device. It pretty much satisfied all of objectstore
> needs without any extra overhead since there is not any extra layer in
> between device and objectstore.
>    Either way, I strongly support to have CEPH own data format instead of
> relying on filesystem.
> 
>   Regards,
>   James
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, October 19, 2015 1:55 PM
> To: Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: newstore direction
> 
> On Mon, 19 Oct 2015, Somnath Roy wrote:
> > Sage,
> > I fully support that.  If we want to saturate SSDs , we need to get
> > rid of this filesystem overhead (which I am in process of measuring).
> > Also, it will be good if we can eliminate the dependency on the k/v
> > dbs (for storing allocators and all). The reason is the unknown write
> > amps they causes.
> 
> My hope is to keep behing the KeyValueDB interface (and/more change it as
> appropriate) so that other backends can be easily swapped in (e.g. a btree-
> based one for high-end flash).
> 
> sage
> 
> 
> >
> > Thanks & Regards
> > Somnath
> >
> >
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org
> > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Monday, October 19, 2015 12:49 PM
> > To: ceph-devel@vger.kernel.org
> > Subject: newstore direction
> >
> > The current design is based on two simple ideas:
> >
> >  1) a key/value interface is better way to manage all of our internal
> > metadata (object metadata, attrs, layout, collection membership,
> > write-ahead logging, overlay data, etc.)
> >
> >  2) a file system is well suited for storage object data (as files).
> >
> > So far 1 is working out well, but I'm questioning the wisdom of #2.  A
> > few
> > things:
> >
> >  - We currently write the data to the file, fsync, then commit the kv
> > transaction.  That's at least 3 IOs: one for the data, one for the fs
> > journal, one for the kv txn to commit (at least once my rocksdb
> > changes land... the kv commit is currently 2-3).  So two people are
> > managing metadata, here: the fs managing the file metadata (with its
> > own
> > journal) and the kv backend (with its journal).
> >
> >  - On read we have to open files by name, which means traversing the fs
> namespace.  Newstore tries to keep it as flat and simple as possible, but at a
> minimum it is a couple btree lookups.  We'd love to use open by handle
> (which would reduce this to 1 btree traversal), but running the daemon as
> ceph and not root makes that hard...
> >
> >  - ...and file systems insist on updating mtime on writes, even when it is a
> overwrite with no allocation changes.  (We don't care about mtime.)
> O_NOCMTIME patches exist but it is hard to get these past the kernel
> brainfreeze.
> >
> >  - XFS is (probably) never going going to give us data checksums, which we
> want desperately.
> >
> > But what's the alternative?  My thought is to just bite the bullet and
> consume a raw block device directly.  Write an allocator, hopefully keep it
> pretty simple, and manage it in kv store along with all of our other metadata.
> >
> > Wins:
> >
> >  - 2 IOs for most: one to write the data to unused space in the block device,
> one to commit our transaction (vs 4+ before).  For overwrites, we'd have one
> io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+
> before).
> >
> >  - No concern about mtime getting in the way
> >
> >  - Faster reads (no fs lookup)
> >
> >  - Similarly sized metadata for most objects.  If we assume most objects are
> not fragmented, then the metadata to store the block offsets is about the
> same size as the metadata to store the filenames we have now.
> >
> > Problems:
> >
> >  - We have to size the kv backend storage (probably still an XFS
> > partition) vs the block storage.  Maybe we do this anyway (put
> > metadata on
> > SSD!) so it won't matter.  But what happens when we are storing gobs of
> rgw index data or cephfs metadata?  Suddenly we are pulling storage out of a
> different pool and those aren't currently fungible.
> >
> >  - We have to write and maintain an allocator.  I'm still optimistic this can be
> reasonbly simple, especially for the flash case (where fragmentation isn't
> such an issue as long as our blocks are reasonbly sized).  For disk we may
> beed to be moderately clever.
> >
> >  - We'll need a fsck to ensure our internal metadata is consistent.  The good
> news is it'll just need to validate what we have stored in the kv store.
> >
> > Other thoughts:
> >
> >  - We might want to consider whether dm-thin or bcache or other block
> layers might help us with elasticity of file vs block areas.
> >
> >  - Rocksdb can push colder data to a second directory, so we could
> > have a fast ssd primary area (for wal and most metadata) and a second
> > hdd directory for stuff it has to push off.  Then have a conservative
> > amount of file space on the hdd.  If our block fills up, use the
> > existing file mechanism to put data there too.  (But then we have to
> > maintain both the current kv + file approach and not go all-in on kv +
> > block.)
> >
> > Thoughts?
> > sage
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org More
> majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
> >
> > ________________________________
> >
> > PLEASE NOTE: The information contained in this electronic mail message is
> intended only for the use of the designated recipient(s) named above. If the
> reader of this message is not the intended recipient, you are hereby notified
> that you have received this message in error and that any review,
> dissemination, distribution, or copying of this message is strictly prohibited. If
> you have received this communication in error, please notify the sender by
> telephone or e-mail (as shown above) immediately and destroy any and all
> copies of this message in your possession (whether hard copies or
> electronically stored copies).
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org More
> majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
> >
> >
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html


* RE: newstore direction
  2015-10-19 22:21     ` James (Fei) Liu-SSI
  2015-10-20  2:24       ` Chen, Xiaoxi
@ 2015-10-20  2:32       ` Varada Kari
  2015-10-20  2:40         ` Chen, Xiaoxi
  2015-10-20 12:34       ` Sage Weil
  2 siblings, 1 reply; 71+ messages in thread
From: Varada Kari @ 2015-10-20  2:32 UTC (permalink / raw)
  To: James (Fei) Liu-SSI, Sage Weil, Somnath Roy; +Cc: ceph-devel

Hi James,

Are you referring to SCSI OSD (http://www.t10.org/drafts.htm#OSD_Family)?  If SCSI OSD is what you mean, the drive has to support all of the OSD functionality specified by T10.
If not, we would have to implement the same functionality in the kernel or have a wrapper in user space to convert the calls to reads/writes.  That seems like more effort.

Varada

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
> Sent: Tuesday, October 20, 2015 3:51 AM
> To: Sage Weil <sweil@redhat.com>; Somnath Roy
> <Somnath.Roy@sandisk.com>
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: newstore direction
> 
> Hi Sage and Somnath,
>   In my humble opinion, There is another more aggressive  solution than raw
> block device base keyvalue store as backend for objectstore. The new key
> value  SSD device with transaction support would be  ideal to solve the
> issues. First of all, it is raw SSD device. Secondly , It provides key value
> interface directly from SSD. Thirdly, it can provide transaction support,
> consistency will be guaranteed by hardware device. It pretty much satisfied
> all of objectstore needs without any extra overhead since there is not any
> extra layer in between device and objectstore.
>    Either way, I strongly support to have CEPH own data format instead of
> relying on filesystem.
> 
>   Regards,
>   James
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, October 19, 2015 1:55 PM
> To: Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: newstore direction
> 
> On Mon, 19 Oct 2015, Somnath Roy wrote:
> > Sage,
> > I fully support that.  If we want to saturate SSDs , we need to get
> > rid of this filesystem overhead (which I am in process of measuring).
> > Also, it will be good if we can eliminate the dependency on the k/v
> > dbs (for storing allocators and all). The reason is the unknown write
> > amps they causes.
> 
> My hope is to keep behing the KeyValueDB interface (and/more change it as
> appropriate) so that other backends can be easily swapped in (e.g. a btree-
> based one for high-end flash).
> 
> sage
> 
> 
> >
> > Thanks & Regards
> > Somnath
> >
> >
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org
> > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Monday, October 19, 2015 12:49 PM
> > To: ceph-devel@vger.kernel.org
> > Subject: newstore direction
> >
> > The current design is based on two simple ideas:
> >
> >  1) a key/value interface is better way to manage all of our internal
> > metadata (object metadata, attrs, layout, collection membership,
> > write-ahead logging, overlay data, etc.)
> >
> >  2) a file system is well suited for storage object data (as files).
> >
> > So far 1 is working out well, but I'm questioning the wisdom of #2.  A
> > few
> > things:
> >
> >  - We currently write the data to the file, fsync, then commit the kv
> > transaction.  That's at least 3 IOs: one for the data, one for the fs
> > journal, one for the kv txn to commit (at least once my rocksdb
> > changes land... the kv commit is currently 2-3).  So two people are
> > managing metadata, here: the fs managing the file metadata (with its
> > own
> > journal) and the kv backend (with its journal).
> >
> >  - On read we have to open files by name, which means traversing the fs
> namespace.  Newstore tries to keep it as flat and simple as possible, but at a
> minimum it is a couple btree lookups.  We'd love to use open by handle
> (which would reduce this to 1 btree traversal), but running the daemon as
> ceph and not root makes that hard...
> >
> >  - ...and file systems insist on updating mtime on writes, even when it is a
> overwrite with no allocation changes.  (We don't care about mtime.)
> O_NOCMTIME patches exist but it is hard to get these past the kernel
> brainfreeze.
> >
> >  - XFS is (probably) never going going to give us data checksums, which we
> want desperately.
> >
> > But what's the alternative?  My thought is to just bite the bullet and
> consume a raw block device directly.  Write an allocator, hopefully keep it
> pretty simple, and manage it in kv store along with all of our other metadata.
> >
> > Wins:
> >
> >  - 2 IOs for most: one to write the data to unused space in the block device,
> one to commit our transaction (vs 4+ before).  For overwrites, we'd have one
> io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+
> before).
> >
> >  - No concern about mtime getting in the way
> >
> >  - Faster reads (no fs lookup)
> >
> >  - Similarly sized metadata for most objects.  If we assume most objects are
> not fragmented, then the metadata to store the block offsets is about the
> same size as the metadata to store the filenames we have now.
> >
> > Problems:
> >
> >  - We have to size the kv backend storage (probably still an XFS
> > partition) vs the block storage.  Maybe we do this anyway (put
> > metadata on
> > SSD!) so it won't matter.  But what happens when we are storing gobs of
> rgw index data or cephfs metadata?  Suddenly we are pulling storage out of a
> different pool and those aren't currently fungible.
> >
> >  - We have to write and maintain an allocator.  I'm still optimistic this can be
> reasonbly simple, especially for the flash case (where fragmentation isn't
> such an issue as long as our blocks are reasonbly sized).  For disk we may
> beed to be moderately clever.
> >
> >  - We'll need a fsck to ensure our internal metadata is consistent.  The good
> news is it'll just need to validate what we have stored in the kv store.
> >
> > Other thoughts:
> >
> >  - We might want to consider whether dm-thin or bcache or other block
> layers might help us with elasticity of file vs block areas.
> >
> >  - Rocksdb can push colder data to a second directory, so we could
> > have a fast ssd primary area (for wal and most metadata) and a second
> > hdd directory for stuff it has to push off.  Then have a conservative
> > amount of file space on the hdd.  If our block fills up, use the
> > existing file mechanism to put data there too.  (But then we have to
> > maintain both the current kv + file approach and not go all-in on kv +
> > block.)
> >
> > Thoughts?
> > sage

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: newstore direction
  2015-10-20  2:32       ` Varada Kari
@ 2015-10-20  2:40         ` Chen, Xiaoxi
  0 siblings, 0 replies; 71+ messages in thread
From: Chen, Xiaoxi @ 2015-10-20  2:40 UTC (permalink / raw)
  To: Varada Kari, James (Fei) Liu-SSI, Sage Weil, Somnath Roy; +Cc: ceph-devel

There is something like http://pmem.io/nvml/libpmemobj/ to adapt NVMe to transactional object storage.

But it definitely needs some more work.
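
For reference, a minimal libpmemobj transaction looks roughly like the sketch below (the pool path, layout name, and sizes are made up, and error handling is mostly omitted):

#include <libpmemobj.h>
#include <cstring>
#include <cstdio>

int main() {
  // Create a small pool; in practice this would live on a pmem-aware mount.
  PMEMobjpool *pop = pmemobj_create("/mnt/pmem/demo.pool", "demo_layout",
                                    PMEMOBJ_MIN_POOL, 0666);
  if (pop == NULL) { perror("pmemobj_create"); return 1; }

  const char payload[] = "object data";

  // Allocation and the data copy commit (or roll back) together.
  TX_BEGIN(pop) {
    PMEMoid oid = pmemobj_tx_zalloc(4096, 0);
    pmemobj_tx_add_range(oid, 0, sizeof(payload));
    memcpy(pmemobj_direct(oid), payload, sizeof(payload));
  } TX_END

  pmemobj_close(pop);
  return 0;
}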

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Varada Kari
> Sent: Tuesday, October 20, 2015 10:33 AM
> To: James (Fei) Liu-SSI; Sage Weil; Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: newstore direction
> 
> Hi James,
> 
> Are you mentioning SCSI OSD (http://www.t10.org/drafts.htm#OSD_Family) ?
> If SCSI OSD is what you are mentioning, drive has to support all osd
> functionality mentioned by T10.
> If not, we have to implement the same functionality in kernel or have a
> wrapper in user space to convert them to read/write calls.  This seems more
> effort.
> 
> Varada
> 
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > owner@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
> > Sent: Tuesday, October 20, 2015 3:51 AM
> > To: Sage Weil <sweil@redhat.com>; Somnath Roy
> > <Somnath.Roy@sandisk.com>
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: newstore direction
> >
> > Hi Sage and Somnath,
> >   In my humble opinion, There is another more aggressive  solution
> > than raw block device base keyvalue store as backend for objectstore.
> > The new key value  SSD device with transaction support would be  ideal
> > to solve the issues. First of all, it is raw SSD device. Secondly , It
> > provides key value interface directly from SSD. Thirdly, it can
> > provide transaction support, consistency will be guaranteed by
> > hardware device. It pretty much satisfied all of objectstore needs
> > without any extra overhead since there is not any extra layer in between
> device and objectstore.
> >    Either way, I strongly support to have CEPH own data format instead
> > of relying on filesystem.
> >
> >   Regards,
> >   James
> >
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > owner@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Monday, October 19, 2015 1:55 PM
> > To: Somnath Roy
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: newstore direction
> >
> > On Mon, 19 Oct 2015, Somnath Roy wrote:
> > > Sage,
> > > I fully support that.  If we want to saturate SSDs , we need to get
> > > rid of this filesystem overhead (which I am in process of measuring).
> > > Also, it will be good if we can eliminate the dependency on the k/v
> > > dbs (for storing allocators and all). The reason is the unknown
> > > write amps they causes.
> >
> > My hope is to keep behing the KeyValueDB interface (and/more change it
> > as
> > appropriate) so that other backends can be easily swapped in (e.g. a
> > btree- based one for high-end flash).
> >
> > sage
> >
> >
> > >
> > > Thanks & Regards
> > > Somnath
> > >
> > >
> > > -----Original Message-----
> > > From: ceph-devel-owner@vger.kernel.org
> > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> > > Sent: Monday, October 19, 2015 12:49 PM
> > > To: ceph-devel@vger.kernel.org
> > > Subject: newstore direction
> > >
> > > The current design is based on two simple ideas:
> > >
> > >  1) a key/value interface is better way to manage all of our
> > > internal metadata (object metadata, attrs, layout, collection
> > > membership, write-ahead logging, overlay data, etc.)
> > >
> > >  2) a file system is well suited for storage object data (as files).
> > >
> > > So far 1 is working out well, but I'm questioning the wisdom of #2.
> > > A few
> > > things:
> > >
> > >  - We currently write the data to the file, fsync, then commit the
> > > kv transaction.  That's at least 3 IOs: one for the data, one for
> > > the fs journal, one for the kv txn to commit (at least once my
> > > rocksdb changes land... the kv commit is currently 2-3).  So two
> > > people are managing metadata, here: the fs managing the file
> > > metadata (with its own
> > > journal) and the kv backend (with its journal).
> > >
> > >  - On read we have to open files by name, which means traversing the
> > > fs
> > namespace.  Newstore tries to keep it as flat and simple as possible,
> > but at a minimum it is a couple btree lookups.  We'd love to use open
> > by handle (which would reduce this to 1 btree traversal), but running
> > the daemon as ceph and not root makes that hard...
> > >
> > >  - ...and file systems insist on updating mtime on writes, even when
> > > it is a
> > overwrite with no allocation changes.  (We don't care about mtime.)
> > O_NOCMTIME patches exist but it is hard to get these past the kernel
> > brainfreeze.
> > >
> > >  - XFS is (probably) never going going to give us data checksums,
> > > which we
> > want desperately.
> > >
> > > But what's the alternative?  My thought is to just bite the bullet
> > > and
> > consume a raw block device directly.  Write an allocator, hopefully
> > keep it pretty simple, and manage it in kv store along with all of our other
> metadata.
> > >
> > > Wins:
> > >
> > >  - 2 IOs for most: one to write the data to unused space in the
> > > block device,
> > one to commit our transaction (vs 4+ before).  For overwrites, we'd
> > have one io to do our write-ahead log (kv journal), then do the
> > overwrite async (vs 4+ before).
> > >
> > >  - No concern about mtime getting in the way
> > >
> > >  - Faster reads (no fs lookup)
> > >
> > >  - Similarly sized metadata for most objects.  If we assume most
> > > objects are
> > not fragmented, then the metadata to store the block offsets is about
> > the same size as the metadata to store the filenames we have now.
> > >
> > > Problems:
> > >
> > >  - We have to size the kv backend storage (probably still an XFS
> > > partition) vs the block storage.  Maybe we do this anyway (put
> > > metadata on
> > > SSD!) so it won't matter.  But what happens when we are storing gobs
> > > of
> > rgw index data or cephfs metadata?  Suddenly we are pulling storage
> > out of a different pool and those aren't currently fungible.
> > >
> > >  - We have to write and maintain an allocator.  I'm still optimistic
> > > this can be
> > reasonbly simple, especially for the flash case (where fragmentation
> > isn't such an issue as long as our blocks are reasonbly sized).  For
> > disk we may beed to be moderately clever.
> > >
> > >  - We'll need a fsck to ensure our internal metadata is consistent.
> > > The good
> > news is it'll just need to validate what we have stored in the kv store.
> > >
> > > Other thoughts:
> > >
> > >  - We might want to consider whether dm-thin or bcache or other
> > > block
> > layers might help us with elasticity of file vs block areas.
> > >
> > >  - Rocksdb can push colder data to a second directory, so we could
> > > have a fast ssd primary area (for wal and most metadata) and a
> > > second hdd directory for stuff it has to push off.  Then have a
> > > conservative amount of file space on the hdd.  If our block fills
> > > up, use the existing file mechanism to put data there too.  (But
> > > then we have to maintain both the current kv + file approach and not
> > > go all-in on kv +
> > > block.)
> > >
> > > Thoughts?
> > > sage

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: newstore direction
  2015-10-19 19:49 newstore direction Sage Weil
                   ` (5 preceding siblings ...)
  2015-10-20  2:08 ` Haomai Wang
@ 2015-10-20  7:06 ` Dałek, Piotr
  2015-10-20 18:31 ` Ric Wheeler
  7 siblings, 0 replies; 71+ messages in thread
From: Dałek, Piotr @ 2015-10-20  7:06 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, October 19, 2015 9:49 PM
> 
> The current design is based on two simple ideas:
> 
>  1) a key/value interface is better way to manage all of our internal metadata
> (object metadata, attrs, layout, collection membership, write-ahead logging,
> overlay data, etc.)
> 
>  2) a file system is well suited for storage object data (as files).
> 
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
> things:
> 
> [..]
> 
> But what's the alternative?  My thought is to just bite the bullet and consume
> a raw block device directly.  Write an allocator, hopefully keep it pretty
> simple, and manage it in kv store along with all of our other metadata.

This is pretty much reinventing the file system, but...

I actually did something similar for a personal project (an e-mail client), moving from a maildir-like structure (each message was one file) to something resembling mbox (one large file per mail folder, containing pre-decoded structures for fast and easy access). And this worked out really well, especially with searches and bulk processing (filtering by body contents, and so on). I don't remember the exact figures, but the performance benefit was at least an order of magnitude. If huge numbers of small-to-medium (0-128k) objects are the target, this is the way to go.

The most serious issue was fragmentation. Since I put my box files on top of an actual FS (here: NTFS), low-level fragmentation was not a problem (each message was read and written in one fread/fwrite anyway). High-level fragmentation was an issue: each time a message was moved away, it still occupied space. To combat this, I wrote a space reclaimer that moved messages within the box file (consolidating them) and maintained a bitmap of free 4k slots, so I could reuse unused space without spending too much time iterating through messages and without calling the reclaimer. The reclaimer was also smart enough not to move messages one by one: it loaded up to n messages in at most n reads (usually fewer), wrote them out in one call, and only kept working until some space was actually reclaimed, instead of doing a full garbage collection. The machinery was also aware that messages were (mostly) appended to the end of the box, so it moved the end-of-box pointer back once messages at the end were deleted.
The other issue was reliability. Obviously, I had the option of a secondary temp file, but still, everything above is doable without that.
Benefits included reduced requirements for metadata storage. Instead of generating a unique ID (filename) for each message (apparently the message-id header is not reliable in that regard), I just stored an offset and size (8+4 bytes per message), which for 300 thousand messages worked out to just 3.5MB and could be kept in RAM. I/O performance also improved thanks to a less random access pattern (messages were physically close to each other instead of being scattered all over the drive).
For Ceph, the benefits could be even greater. I can imagine faster deep scrubs that are way more efficient on spinning drives; efficient object storage (no per-object fragmentation and less disk-intensive object readahead, maybe with better support from hardware); possibly more reliability (when we fsync, we actually fsync - we don't get cheated by the underlying FS); and we could optimize for particular devices (for example, most SSDs suck like vacuum on I/Os below 4k, so we could enforce I/Os of at least 4k).
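
As a toy illustration of the 4k free-space bitmap idea above (and of the kind of allocator the rest of this thread is discussing), assuming a fixed 4 KiB block size and a naive first-fit search:

#include <cstdint>
#include <cstdio>
#include <optional>
#include <vector>

class BlockBitmap {
 public:
  explicit BlockBitmap(uint64_t nblocks) : used_(nblocks, false) {}

  // First-fit search for `count` contiguous free 4 KiB blocks.
  std::optional<uint64_t> allocate(uint64_t count) {
    uint64_t run = 0;
    for (uint64_t i = 0; i < used_.size(); ++i) {
      run = used_[i] ? 0 : run + 1;
      if (run == count) {
        uint64_t start = i + 1 - count;
        for (uint64_t j = start; j <= i; ++j) used_[j] = true;
        return start;                      // block index; byte offset = start * 4096
      }
    }
    return std::nullopt;                   // no contiguous run found
  }

  void release(uint64_t start, uint64_t count) {
    for (uint64_t j = start; j < start + count; ++j) used_[j] = false;
  }

 private:
  std::vector<bool> used_;
};

int main() {
  BlockBitmap bm(1024);                    // 4 MiB worth of 4 KiB blocks
  auto a = bm.allocate(4);                 // a 16 KiB object
  if (a) {
    std::printf("allocated at offset %llu\n",
                (unsigned long long)(*a * 4096));
    bm.release(*a, 4);
  }
  return 0;
}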

Just my 0.02$.

With best regards / Pozdrawiam
Piotr Dałek



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-20  2:08 ` Haomai Wang
@ 2015-10-20 12:25   ` Sage Weil
  0 siblings, 0 replies; 71+ messages in thread
From: Sage Weil @ 2015-10-20 12:25 UTC (permalink / raw)
  To: Haomai Wang; +Cc: ceph-devel

On Tue, 20 Oct 2015, Haomai Wang wrote:
> On Tue, Oct 20, 2015 at 3:49 AM, Sage Weil <sweil@redhat.com> wrote:
> > The current design is based on two simple ideas:
> >
> >  1) a key/value interface is better way to manage all of our internal
> > metadata (object metadata, attrs, layout, collection membership,
> > write-ahead logging, overlay data, etc.)
> >
> >  2) a file system is well suited for storage object data (as files).
> >
> > So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
> > things:
> >
> >  - We currently write the data to the file, fsync, then commit the kv
> > transaction.  That's at least 3 IOs: one for the data, one for the fs
> > journal, one for the kv txn to commit (at least once my rocksdb changes
> > land... the kv commit is currently 2-3).  So two people are managing
> > metadata, here: the fs managing the file metadata (with its own
> > journal) and the kv backend (with its journal).
> >
> >  - On read we have to open files by name, which means traversing the fs
> > namespace.  Newstore tries to keep it as flat and simple as possible, but
> > at a minimum it is a couple btree lookups.  We'd love to use open by
> > handle (which would reduce this to 1 btree traversal), but running
> > the daemon as ceph and not root makes that hard...
> >
> >  - ...and file systems insist on updating mtime on writes, even when it is
> > a overwrite with no allocation changes.  (We don't care about mtime.)
> > O_NOCMTIME patches exist but it is hard to get these past the kernel
> > brainfreeze.
> >
> >  - XFS is (probably) never going going to give us data checksums, which we
> > want desperately.
> >
> > But what's the alternative?  My thought is to just bite the bullet and
> > consume a raw block device directly.  Write an allocator, hopefully keep
> > it pretty simple, and manage it in kv store along with all of our other
> > metadata.
> 
> This is really a tough decision, although the idea of a block-device-based
> objectstore has never left my mind over the past two years.
> 
> We would be much more concerned about space-utilization efficiency compared
> to a local fs, the bugs, and the time it takes to build a tiny local
> filesystem. I'm a little afraid of what we would get stuck in....
> 
> >
> > Wins:
> >
> >  - 2 IOs for most: one to write the data to unused space in the block
> > device, one to commit our transaction (vs 4+ before).  For overwrites,
> > we'd have one io to do our write-ahead log (kv journal), then do
> > the overwrite async (vs 4+ before).
> 
> Compared to FileJournal, it seemed the key/value DB doesn't play well in the
> WAL area, based on my perf results.

With this change it is close to parity:

	https://github.com/facebook/rocksdb/pull/746

> >  - No concern about mtime getting in the way
> >
> >  - Faster reads (no fs lookup)
> >
> >  - Similarly sized metadata for most objects.  If we assume most objects
> > are not fragmented, then the metadata to store the block offsets is about
> > the same size as the metadata to store the filenames we have now.
> >
> > Problems:
> >
> >  - We have to size the kv backend storage (probably still an XFS
> > partition) vs the block storage.  Maybe we do this anyway (put metadata on
> > SSD!) so it won't matter.  But what happens when we are storing gobs of
> > rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
> > a different pool and those aren't currently fungible.
> >
> >  - We have to write and maintain an allocator.  I'm still optimistic this
> > can be reasonbly simple, especially for the flash case (where
> > fragmentation isn't such an issue as long as our blocks are reasonbly
> > sized).  For disk we may beed to be moderately clever.
> >
> >  - We'll need a fsck to ensure our internal metadata is consistent.  The
> > good news is it'll just need to validate what we have stored in the kv
> > store.
> >
> > Other thoughts:
> >
> >  - We might want to consider whether dm-thin or bcache or other block
> > layers might help us with elasticity of file vs block areas.
> >
> >  - Rocksdb can push colder data to a second directory, so we could have a
> > fast ssd primary area (for wal and most metadata) and a second hdd
> > directory for stuff it has to push off.  Then have a conservative amount
> > of file space on the hdd.  If our block fills up, use the existing file
> > mechanism to put data there too.  (But then we have to maintain both the
> > current kv + file approach and not go all-in on kv + block.)
> 
> A complex way...
> 
> Actually I would like to pursue a FileStore2 implementation, which means we
> still use FileJournal (or something like it), but use more memory to keep
> metadata/xattrs and use aio+dio to flush to disk. A userspace pagecache
> would need to be implemented. Then we can skip the journal for full writes:
> because each OSD PG is isolated, we could put a barrier on a single PG when
> skipping the journal. @Sage, are there other concerns with FileStore
> skipping the journal?
> 
> In a word, I like the model that FileStore owns, but we need a big refactor
> of the existing implementation.
> 
> Sorry to disturb the thought....
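
A minimal sketch of the aio + O_DIRECT write path referred to above, assuming libaio (link with -laio); the target path is only an example and needs to sit on a filesystem or device that actually supports O_DIRECT:

#include <libaio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
  int fd = ::open("./dio-test.bin", O_RDWR | O_CREAT | O_DIRECT, 0644);
  if (fd < 0) { perror("open"); return 1; }

  // O_DIRECT requires an aligned buffer and an aligned, block-multiple size.
  void *buf = nullptr;
  if (posix_memalign(&buf, 4096, 4096) != 0) return 1;
  memset(buf, 'x', 4096);

  io_context_t ctx = 0;
  if (io_setup(8, &ctx) != 0) return 1;

  struct iocb cb;
  struct iocb *cbs[1] = { &cb };
  io_prep_pwrite(&cb, fd, buf, 4096, 0);   // queue one 4 KiB write at offset 0
  if (io_submit(ctx, 1, cbs) != 1) return 1;

  struct io_event ev[1];
  io_getevents(ctx, 1, 1, ev, nullptr);    // wait for the completion
  printf("write returned %ld\n", (long)ev[0].res);

  io_destroy(ctx);
  free(buf);
  ::close(fd);
  return 0;
}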

I think the directory (re)hashing strategy in filestore is too expensive, 
and I don't see how it can be fixed without managing the namespace 
ourselves (as newstore does).

If we want a middle-road approach where we still rely on a file system for 
doing block allocation, then IMO the current incarnation of newstore is the 
right path...
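
To make the "managing the namespace ourselves" point concrete, here is a toy version of a flat key encoding: one ordered kv key per object instead of hashed directory levels. The encoding details are invented for illustration and are not what newstore actually does:

#include <cstdio>
#include <map>
#include <string>

// One flat, ordered key per object: "<collection>!<escaped object name>".
// Assumes collection ids contain no '!'; a real encoding needs a proper
// escaping scheme and sort-order guarantees for enumeration.
static std::string object_key(const std::string &coll, const std::string &name) {
  std::string k = coll + "!";
  for (char c : name)
    k += (c == '!') ? std::string("!!") : std::string(1, c);
  return k;
}

int main() {
  std::map<std::string, std::string> kv;   // stand-in for the kv backend
  kv[object_key("pg_1.2", "rbd_data.1234")] = "<onode metadata>";

  // A lookup is a single ordered-map probe, not a walk through hashed dirs.
  auto it = kv.find(object_key("pg_1.2", "rbd_data.1234"));
  if (it != kv.end())
    std::printf("found: %s\n", it->second.c_str());
  return 0;
}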

sage

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: newstore direction
  2015-10-20  2:24       ` Chen, Xiaoxi
@ 2015-10-20 12:30         ` Sage Weil
  2015-10-20 13:19           ` Mark Nelson
  0 siblings, 1 reply; 71+ messages in thread
From: Sage Weil @ 2015-10-20 12:30 UTC (permalink / raw)
  To: Chen, Xiaoxi; +Cc: James (Fei) Liu-SSI, Somnath Roy, ceph-devel

On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:
> +1. Nowadays K-V DBs care more about very small key-value pairs, say 
> several bytes to a few KB, but in the SSD case we only care about 4KB or 
> 8KB. In this way, NVMKV is a good design, and it seems some of the SSD 
> vendors are also trying to build this kind of interface; we have an NVM-L 
> library but it is still under development.

Do you have an NVMKV link?  I see a paper and a stale github repo.. not 
sure if I'm looking at the right thing.

My concern with using a key/value interface for the object data is that 
you end up with lots of key/value pairs (e.g., $inode_$offset = 
$4kb_of_data) that are pretty inefficient to store and (depending on the 
implementation) tend to break alignment.  I don't think these interfaces 
are targeted toward block-sized/aligned payloads.  Storing just the 
metadata (block allocation map) w/ the kv api and storing the data 
directly on a block/page interface makes more sense to me.
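
A small sketch of that split, with all names and types invented: the kv side holds only the object's extent list (its block allocation map), while the payload goes straight to the "block device" (a plain file here) at the allocated offset:

#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Invented record: where an object's bytes live on the block device.
struct Extent { uint64_t offset; uint64_t length; };

int main() {
  // A plain file stands in for the raw block device, and a std::map for
  // the kv store holding only the allocation map (not the data itself).
  std::map<std::string, std::vector<Extent>> kv_extents;
  int fd = ::open("./fake-block-dev.img", O_RDWR | O_CREAT, 0644);
  if (fd < 0) { perror("open"); return 1; }

  // "Allocate" 8 KiB at offset 1 MiB and write the object data there.
  Extent e{1ULL << 20, 8192};
  std::vector<char> data(e.length, 'a');
  if (::pwrite(fd, data.data(), e.length, e.offset) < 0) {
    perror("pwrite");
    return 1;
  }

  // The kv entry records only offsets/lengths; committing it plus the data
  // write above is the 2-IO path described in the original post.
  kv_extents["object.0001"] = {e};

  std::printf("object.0001 -> %zu extent(s)\n", kv_extents["object.0001"].size());
  ::close(fd);
  return 0;
}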

sage


> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > owner@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
> > Sent: Tuesday, October 20, 2015 6:21 AM
> > To: Sage Weil; Somnath Roy
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: newstore direction
> > 
> > Hi Sage and Somnath,
> >   In my humble opinion, There is another more aggressive  solution than raw
> > block device base keyvalue store as backend for objectstore. The new key
> > value  SSD device with transaction support would be  ideal to solve the issues.
> > First of all, it is raw SSD device. Secondly , It provides key value interface
> > directly from SSD. Thirdly, it can provide transaction support, consistency will
> > be guaranteed by hardware device. It pretty much satisfied all of objectstore
> > needs without any extra overhead since there is not any extra layer in
> > between device and objectstore.
> >    Either way, I strongly support to have CEPH own data format instead of
> > relying on filesystem.
> > 
> >   Regards,
> >   James
> > 
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > owner@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Monday, October 19, 2015 1:55 PM
> > To: Somnath Roy
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: newstore direction
> > 
> > On Mon, 19 Oct 2015, Somnath Roy wrote:
> > > Sage,
> > > I fully support that.  If we want to saturate SSDs , we need to get
> > > rid of this filesystem overhead (which I am in process of measuring).
> > > Also, it will be good if we can eliminate the dependency on the k/v
> > > dbs (for storing allocators and all). The reason is the unknown write
> > > amps they causes.
> > 
> > My hope is to keep behing the KeyValueDB interface (and/more change it as
> > appropriate) so that other backends can be easily swapped in (e.g. a btree-
> > based one for high-end flash).
> > 
> > sage
> > 
> > 
> > >
> > > Thanks & Regards
> > > Somnath
> > >
> > >
> > > -----Original Message-----
> > > From: ceph-devel-owner@vger.kernel.org
> > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> > > Sent: Monday, October 19, 2015 12:49 PM
> > > To: ceph-devel@vger.kernel.org
> > > Subject: newstore direction
> > >
> > > The current design is based on two simple ideas:
> > >
> > >  1) a key/value interface is better way to manage all of our internal
> > > metadata (object metadata, attrs, layout, collection membership,
> > > write-ahead logging, overlay data, etc.)
> > >
> > >  2) a file system is well suited for storage object data (as files).
> > >
> > > So far 1 is working out well, but I'm questioning the wisdom of #2.  A
> > > few
> > > things:
> > >
> > >  - We currently write the data to the file, fsync, then commit the kv
> > > transaction.  That's at least 3 IOs: one for the data, one for the fs
> > > journal, one for the kv txn to commit (at least once my rocksdb
> > > changes land... the kv commit is currently 2-3).  So two people are
> > > managing metadata, here: the fs managing the file metadata (with its
> > > own
> > > journal) and the kv backend (with its journal).
> > >
> > >  - On read we have to open files by name, which means traversing the fs
> > namespace.  Newstore tries to keep it as flat and simple as possible, but at a
> > minimum it is a couple btree lookups.  We'd love to use open by handle
> > (which would reduce this to 1 btree traversal), but running the daemon as
> > ceph and not root makes that hard...
> > >
> > >  - ...and file systems insist on updating mtime on writes, even when it is a
> > overwrite with no allocation changes.  (We don't care about mtime.)
> > O_NOCMTIME patches exist but it is hard to get these past the kernel
> > brainfreeze.
> > >
> > >  - XFS is (probably) never going going to give us data checksums, which we
> > want desperately.
> > >
> > > But what's the alternative?  My thought is to just bite the bullet and
> > consume a raw block device directly.  Write an allocator, hopefully keep it
> > pretty simple, and manage it in kv store along with all of our other metadata.
> > >
> > > Wins:
> > >
> > >  - 2 IOs for most: one to write the data to unused space in the block device,
> > one to commit our transaction (vs 4+ before).  For overwrites, we'd have one
> > io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+
> > before).
> > >
> > >  - No concern about mtime getting in the way
> > >
> > >  - Faster reads (no fs lookup)
> > >
> > >  - Similarly sized metadata for most objects.  If we assume most objects are
> > not fragmented, then the metadata to store the block offsets is about the
> > same size as the metadata to store the filenames we have now.
> > >
> > > Problems:
> > >
> > >  - We have to size the kv backend storage (probably still an XFS
> > > partition) vs the block storage.  Maybe we do this anyway (put
> > > metadata on
> > > SSD!) so it won't matter.  But what happens when we are storing gobs of
> > rgw index data or cephfs metadata?  Suddenly we are pulling storage out of a
> > different pool and those aren't currently fungible.
> > >
> > >  - We have to write and maintain an allocator.  I'm still optimistic this can be
> > reasonbly simple, especially for the flash case (where fragmentation isn't
> > such an issue as long as our blocks are reasonbly sized).  For disk we may
> > beed to be moderately clever.
> > >
> > >  - We'll need a fsck to ensure our internal metadata is consistent.  The good
> > news is it'll just need to validate what we have stored in the kv store.
> > >
> > > Other thoughts:
> > >
> > >  - We might want to consider whether dm-thin or bcache or other block
> > layers might help us with elasticity of file vs block areas.
> > >
> > >  - Rocksdb can push colder data to a second directory, so we could
> > > have a fast ssd primary area (for wal and most metadata) and a second
> > > hdd directory for stuff it has to push off.  Then have a conservative
> > > amount of file space on the hdd.  If our block fills up, use the
> > > existing file mechanism to put data there too.  (But then we have to
> > > maintain both the current kv + file approach and not go all-in on kv +
> > > block.)
> > >
> > > Thoughts?
> > > sage

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: newstore direction
  2015-10-19 22:21     ` James (Fei) Liu-SSI
  2015-10-20  2:24       ` Chen, Xiaoxi
  2015-10-20  2:32       ` Varada Kari
@ 2015-10-20 12:34       ` Sage Weil
  2015-10-20 20:18         ` Martin Millnert
  2015-10-20 20:32         ` James (Fei) Liu-SSI
  2 siblings, 2 replies; 71+ messages in thread
From: Sage Weil @ 2015-10-20 12:34 UTC (permalink / raw)
  To: James (Fei) Liu-SSI; +Cc: Somnath Roy, ceph-devel

On Mon, 19 Oct 2015, James (Fei) Liu-SSI wrote:
> Hi Sage and Somnath,
>   In my humble opinion, There is another more aggressive solution than 
> raw block device base keyvalue store as backend for objectstore. The new 
> key value SSD device with transaction support would be ideal to solve 
> the issues. First of all, it is raw SSD device. Secondly , It provides 
> key value interface directly from SSD. Thirdly, it can provide 
> transaction support, consistency will be guaranteed by hardware device. 
> It pretty much satisfied all of objectstore needs without any extra 
> overhead since there is not any extra layer in between device and 
> objectstore.

Are you talking about open-channel SSDs?  Or something else?  Everything 
I'm familiar with that is currently shipping exposes either a vanilla block 
interface (conventional SSDs) that hides all of that, or NVMe (which isn't 
much better).

If there is a low-level KV interface we can consume, that would be 
great--especially if we can glue it to our KeyValueDB abstract API.  Even 
so, we need to make sure that the object *data* also gets an API 
we can utilize that efficiently handles block-sized/aligned data.
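
Roughly the kind of glue that would be involved, as a sketch: "DeviceKV" below is an invented stand-in for whatever a kv-native device might expose, and "SimpleKVDB" is only a toy stand-in for an abstract kv API, not the real KeyValueDB interface:

#include <cstdio>
#include <map>
#include <memory>
#include <string>

class SimpleKVDB {                           // toy stand-in for the abstract API
 public:
  virtual ~SimpleKVDB() = default;
  virtual int set(const std::string &k, const std::string &v) = 0;
  virtual int get(const std::string &k, std::string *v) = 0;
};

class DeviceKV {                             // invented "device" backend
 public:
  int put(const std::string &k, const std::string &v) { m_[k] = v; return 0; }
  int lookup(const std::string &k, std::string *v) {
    auto it = m_.find(k);
    if (it == m_.end()) return -1;
    *v = it->second;
    return 0;
  }
 private:
  std::map<std::string, std::string> m_;
};

class DeviceKVAdapter : public SimpleKVDB {  // the glue layer
 public:
  int set(const std::string &k, const std::string &v) override { return dev_.put(k, v); }
  int get(const std::string &k, std::string *v) override { return dev_.lookup(k, v); }
 private:
  DeviceKV dev_;
};

int main() {
  std::unique_ptr<SimpleKVDB> db = std::make_unique<DeviceKVAdapter>();
  db->set("onode.foo", "metadata blob");
  std::string out;
  if (db->get("onode.foo", &out) == 0) std::printf("%s\n", out.c_str());
  return 0;
}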

sage


>    Either way, I strongly support to have CEPH own data format instead 
> of relying on filesystem.
> 
>   Regards,
>   James
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, October 19, 2015 1:55 PM
> To: Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: newstore direction
> 
> On Mon, 19 Oct 2015, Somnath Roy wrote:
> > Sage,
> > I fully support that.  If we want to saturate SSDs , we need to get 
> > rid of this filesystem overhead (which I am in process of measuring). 
> > Also, it will be good if we can eliminate the dependency on the k/v 
> > dbs (for storing allocators and all). The reason is the unknown write 
> > amps they causes.
> 
> My hope is to keep behing the KeyValueDB interface (and/more change it as
> appropriate) so that other backends can be easily swapped in (e.g. a btree-based one for high-end flash).
> 
> sage
> 
> 
> > 
> > Thanks & Regards
> > Somnath
> > 
> > 
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org 
> > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Monday, October 19, 2015 12:49 PM
> > To: ceph-devel@vger.kernel.org
> > Subject: newstore direction
> > 
> > The current design is based on two simple ideas:
> > 
> >  1) a key/value interface is better way to manage all of our internal 
> > metadata (object metadata, attrs, layout, collection membership, 
> > write-ahead logging, overlay data, etc.)
> > 
> >  2) a file system is well suited for storage object data (as files).
> > 
> > So far 1 is working out well, but I'm questioning the wisdom of #2.  A 
> > few
> > things:
> > 
> >  - We currently write the data to the file, fsync, then commit the kv 
> > transaction.  That's at least 3 IOs: one for the data, one for the fs 
> > journal, one for the kv txn to commit (at least once my rocksdb 
> > changes land... the kv commit is currently 2-3).  So two people are 
> > managing metadata, here: the fs managing the file metadata (with its 
> > own
> > journal) and the kv backend (with its journal).
> > 
> >  - On read we have to open files by name, which means traversing the fs namespace.  Newstore tries to keep it as flat and simple as possible, but at a minimum it is a couple btree lookups.  We'd love to use open by handle (which would reduce this to 1 btree traversal), but running the daemon as ceph and not root makes that hard...
> > 
> >  - ...and file systems insist on updating mtime on writes, even when it is a overwrite with no allocation changes.  (We don't care about mtime.) O_NOCMTIME patches exist but it is hard to get these past the kernel brainfreeze.
> > 
> >  - XFS is (probably) never going going to give us data checksums, which we want desperately.
> > 
> > But what's the alternative?  My thought is to just bite the bullet and consume a raw block device directly.  Write an allocator, hopefully keep it pretty simple, and manage it in kv store along with all of our other metadata.
> > 
> > Wins:
> > 
> >  - 2 IOs for most: one to write the data to unused space in the block device, one to commit our transaction (vs 4+ before).  For overwrites, we'd have one io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ before).
> > 
> >  - No concern about mtime getting in the way
> > 
> >  - Faster reads (no fs lookup)
> > 
> >  - Similarly sized metadata for most objects.  If we assume most objects are not fragmented, then the metadata to store the block offsets is about the same size as the metadata to store the filenames we have now.
> > 
> > Problems:
> > 
> >  - We have to size the kv backend storage (probably still an XFS
> > partition) vs the block storage.  Maybe we do this anyway (put 
> > metadata on
> > SSD!) so it won't matter.  But what happens when we are storing gobs of rgw index data or cephfs metadata?  Suddenly we are pulling storage out of a different pool and those aren't currently fungible.
> > 
> >  - We have to write and maintain an allocator.  I'm still optimistic this can be reasonbly simple, especially for the flash case (where fragmentation isn't such an issue as long as our blocks are reasonbly sized).  For disk we may beed to be moderately clever.
> > 
> >  - We'll need a fsck to ensure our internal metadata is consistent.  The good news is it'll just need to validate what we have stored in the kv store.
> > 
> > Other thoughts:
> > 
> >  - We might want to consider whether dm-thin or bcache or other block layers might help us with elasticity of file vs block areas.
> > 
> >  - Rocksdb can push colder data to a second directory, so we could 
> > have a fast ssd primary area (for wal and most metadata) and a second 
> > hdd directory for stuff it has to push off.  Then have a conservative 
> > amount of file space on the hdd.  If our block fills up, use the 
> > existing file mechanism to put data there too.  (But then we have to 
> > maintain both the current kv + file approach and not go all-in on kv + 
> > block.)
> > 
> > Thoughts?
> > sage

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-20 12:30         ` Sage Weil
@ 2015-10-20 13:19           ` Mark Nelson
  2015-10-20 17:04             ` kernel neophyte
  2015-10-21 10:06             ` Allen Samuels
  0 siblings, 2 replies; 71+ messages in thread
From: Mark Nelson @ 2015-10-20 13:19 UTC (permalink / raw)
  To: Sage Weil, Chen, Xiaoxi; +Cc: James (Fei) Liu-SSI, Somnath Roy, ceph-devel

On 10/20/2015 07:30 AM, Sage Weil wrote:
> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:
>> +1, nowadays K-V DB care more about very small key-value pairs, say
>> several bytes to a few KB, but in SSD case we only care about 4KB or
>> 8KB. In this way, NVMKV is a good design and seems some of the SSD
>> vendor are also trying to build this kind of interface, we had a NVM-L
>> library but still under development.
>
> Do you have an NVMKV link?  I see a paper and a stale github repo.. not
> sure if I'm looking at the right thing.
>
> My concern with using a key/value interface for the object data is that
> you end up with lots of key/value pairs (e.g., $inode_$offset =
> $4kb_of_data) that is pretty inefficient to store and (depending on the
> implementation) tends to break alignment.  I don't think these interfaces
> are targetted toward block-sized/aligned payloads.  Storing just the
> metadata (block allocation map) w/ the kv api and storing the data
> directly on a block/page interface makes more sense to me.
>
> sage

I get the feeling that some of the folks that were involved with nvmkv 
at Fusion IO have left.  Nisha Talagala is now out at Parallel Systems 
for instance.  http://pmem.io might be a better bet, though I haven't 
looked closely at it.

Mark

>
>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>> owner@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
>>> Sent: Tuesday, October 20, 2015 6:21 AM
>>> To: Sage Weil; Somnath Roy
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: RE: newstore direction
>>>
>>> Hi Sage and Somnath,
>>>    In my humble opinion, There is another more aggressive  solution than raw
>>> block device base keyvalue store as backend for objectstore. The new key
>>> value  SSD device with transaction support would be  ideal to solve the issues.
>>> First of all, it is raw SSD device. Secondly , It provides key value interface
>>> directly from SSD. Thirdly, it can provide transaction support, consistency will
>>> be guaranteed by hardware device. It pretty much satisfied all of objectstore
>>> needs without any extra overhead since there is not any extra layer in
>>> between device and objectstore.
>>>     Either way, I strongly support to have CEPH own data format instead of
>>> relying on filesystem.
>>>
>>>    Regards,
>>>    James
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>> owner@vger.kernel.org] On Behalf Of Sage Weil
>>> Sent: Monday, October 19, 2015 1:55 PM
>>> To: Somnath Roy
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: RE: newstore direction
>>>
>>> On Mon, 19 Oct 2015, Somnath Roy wrote:
>>>> Sage,
>>>> I fully support that.  If we want to saturate SSDs , we need to get
>>>> rid of this filesystem overhead (which I am in process of measuring).
>>>> Also, it will be good if we can eliminate the dependency on the k/v
>>>> dbs (for storing allocators and all). The reason is the unknown write
>>>> amps they causes.
>>>
>>> My hope is to keep behing the KeyValueDB interface (and/more change it as
>>> appropriate) so that other backends can be easily swapped in (e.g. a btree-
>>> based one for high-end flash).
>>>
>>> sage
>>>
>>>
>>>>
>>>> Thanks & Regards
>>>> Somnath
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org
>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
>>>> Sent: Monday, October 19, 2015 12:49 PM
>>>> To: ceph-devel@vger.kernel.org
>>>> Subject: newstore direction
>>>>
>>>> The current design is based on two simple ideas:
>>>>
>>>>   1) a key/value interface is better way to manage all of our internal
>>>> metadata (object metadata, attrs, layout, collection membership,
>>>> write-ahead logging, overlay data, etc.)
>>>>
>>>>   2) a file system is well suited for storage object data (as files).
>>>>
>>>> So far 1 is working out well, but I'm questioning the wisdom of #2.  A
>>>> few
>>>> things:
>>>>
>>>>   - We currently write the data to the file, fsync, then commit the kv
>>>> transaction.  That's at least 3 IOs: one for the data, one for the fs
>>>> journal, one for the kv txn to commit (at least once my rocksdb
>>>> changes land... the kv commit is currently 2-3).  So two people are
>>>> managing metadata, here: the fs managing the file metadata (with its
>>>> own
>>>> journal) and the kv backend (with its journal).
>>>>
>>>>   - On read we have to open files by name, which means traversing the fs
>>> namespace.  Newstore tries to keep it as flat and simple as possible, but at a
>>> minimum it is a couple btree lookups.  We'd love to use open by handle
>>> (which would reduce this to 1 btree traversal), but running the daemon as
>>> ceph and not root makes that hard...
>>>>
>>>>   - ...and file systems insist on updating mtime on writes, even when it is a
>>> overwrite with no allocation changes.  (We don't care about mtime.)
>>> O_NOCMTIME patches exist but it is hard to get these past the kernel
>>> brainfreeze.
>>>>
>>>>   - XFS is (probably) never going going to give us data checksums, which we
>>> want desperately.
>>>>
>>>> But what's the alternative?  My thought is to just bite the bullet and
>>> consume a raw block device directly.  Write an allocator, hopefully keep it
>>> pretty simple, and manage it in kv store along with all of our other metadata.
>>>>
>>>> Wins:
>>>>
>>>>   - 2 IOs for most: one to write the data to unused space in the block device,
>>> one to commit our transaction (vs 4+ before).  For overwrites, we'd have one
>>> io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+
>>> before).
>>>>
>>>>   - No concern about mtime getting in the way
>>>>
>>>>   - Faster reads (no fs lookup)
>>>>
>>>>   - Similarly sized metadata for most objects.  If we assume most objects are
>>> not fragmented, then the metadata to store the block offsets is about the
>>> same size as the metadata to store the filenames we have now.
>>>>
>>>> Problems:
>>>>
>>>>   - We have to size the kv backend storage (probably still an XFS
>>>> partition) vs the block storage.  Maybe we do this anyway (put
>>>> metadata on
>>>> SSD!) so it won't matter.  But what happens when we are storing gobs of
>>> rgw index data or cephfs metadata?  Suddenly we are pulling storage out of a
>>> different pool and those aren't currently fungible.
>>>>
>>>>   - We have to write and maintain an allocator.  I'm still optimistic this can be
>>> reasonbly simple, especially for the flash case (where fragmentation isn't
>>> such an issue as long as our blocks are reasonbly sized).  For disk we may
>>> beed to be moderately clever.
>>>>
>>>>   - We'll need a fsck to ensure our internal metadata is consistent.  The good
>>> news is it'll just need to validate what we have stored in the kv store.
>>>>
>>>> Other thoughts:
>>>>
>>>>   - We might want to consider whether dm-thin or bcache or other block
>>> layers might help us with elasticity of file vs block areas.
>>>>
>>>>   - Rocksdb can push colder data to a second directory, so we could
>>>> have a fast ssd primary area (for wal and most metadata) and a second
>>>> hdd directory for stuff it has to push off.  Then have a conservative
>>>> amount of file space on the hdd.  If our block fills up, use the
>>>> existing file mechanism to put data there too.  (But then we have to
>>>> maintain both the current kv + file approach and not go all-in on kv +
>>>> block.)
>>>>
>>>> Thoughts?
>>>> sage

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-20 13:19           ` Mark Nelson
@ 2015-10-20 17:04             ` kernel neophyte
  2015-10-21 10:06             ` Allen Samuels
  1 sibling, 0 replies; 71+ messages in thread
From: kernel neophyte @ 2015-10-20 17:04 UTC (permalink / raw)
  To: Mark Nelson
  Cc: Sage Weil, Chen, Xiaoxi, James (Fei) Liu-SSI, Somnath Roy, ceph-devel

On Tue, Oct 20, 2015 at 6:19 AM, Mark Nelson <mnelson@redhat.com> wrote:
> On 10/20/2015 07:30 AM, Sage Weil wrote:
>>
>> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:
>>>
>>> +1, nowadays K-V DB care more about very small key-value pairs, say
>>> several bytes to a few KB, but in SSD case we only care about 4KB or
>>> 8KB. In this way, NVMKV is a good design and seems some of the SSD
>>> vendor are also trying to build this kind of interface, we had a NVM-L
>>> library but still under development.
>>
>>
>> Do you have an NVMKV link?  I see a paper and a stale github repo.. not
>> sure if I'm looking at the right thing.
>>
>> My concern with using a key/value interface for the object data is that
>> you end up with lots of key/value pairs (e.g., $inode_$offset =
>> $4kb_of_data) that is pretty inefficient to store and (depending on the
>> implementation) tends to break alignment.  I don't think these interfaces
>> are targetted toward block-sized/aligned payloads.  Storing just the
>> metadata (block allocation map) w/ the kv api and storing the data
>> directly on a block/page interface makes more sense to me.
>>
>> sage
>
>
> I get the feeling that some of the folks that were involved with nvmkv at
> Fusion IO have left.  Nisha Talagala is now out at Parallel Systems for
> instance.  http://pmem.io might be a better bet, though I haven't looked
> closely at it.
>

IMO pmem.io is more suited to SCM (Storage Class Memory) than to SSDs.

If Newstore is targeted at production deployments (eventually replacing
FileStore someday), then IMO I agree with sage, i.e. rely on a file system
for doing block allocation.

-Neo


> Mark
>
>
>>
>>
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>>> owner@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
>>>> Sent: Tuesday, October 20, 2015 6:21 AM
>>>> To: Sage Weil; Somnath Roy
>>>> Cc: ceph-devel@vger.kernel.org
>>>> Subject: RE: newstore direction
>>>>
>>>> Hi Sage and Somnath,
>>>>    In my humble opinion, There is another more aggressive  solution than
>>>> raw
>>>> block device base keyvalue store as backend for objectstore. The new key
>>>> value  SSD device with transaction support would be  ideal to solve the
>>>> issues.
>>>> First of all, it is raw SSD device. Secondly , It provides key value
>>>> interface
>>>> directly from SSD. Thirdly, it can provide transaction support,
>>>> consistency will
>>>> be guaranteed by hardware device. It pretty much satisfied all of
>>>> objectstore
>>>> needs without any extra overhead since there is not any extra layer in
>>>> between device and objectstore.
>>>>     Either way, I strongly support to have CEPH own data format instead
>>>> of
>>>> relying on filesystem.
>>>>
>>>>    Regards,
>>>>    James
>>>>
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>>> owner@vger.kernel.org] On Behalf Of Sage Weil
>>>> Sent: Monday, October 19, 2015 1:55 PM
>>>> To: Somnath Roy
>>>> Cc: ceph-devel@vger.kernel.org
>>>> Subject: RE: newstore direction
>>>>
>>>> On Mon, 19 Oct 2015, Somnath Roy wrote:
>>>>>
>>>>> Sage,
>>>>> I fully support that.  If we want to saturate SSDs , we need to get
>>>>> rid of this filesystem overhead (which I am in process of measuring).
>>>>> Also, it will be good if we can eliminate the dependency on the k/v
>>>>> dbs (for storing allocators and all). The reason is the unknown write
>>>>> amps they causes.
>>>>
>>>>
>>>> My hope is to keep behing the KeyValueDB interface (and/more change it
>>>> as
>>>> appropriate) so that other backends can be easily swapped in (e.g. a
>>>> btree-
>>>> based one for high-end flash).
>>>>
>>>> sage
>>>>
>>>>
>>>>>
>>>>> Thanks & Regards
>>>>> Somnath
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
>>>>> Sent: Monday, October 19, 2015 12:49 PM
>>>>> To: ceph-devel@vger.kernel.org
>>>>> Subject: newstore direction
>>>>>
>>>>> The current design is based on two simple ideas:
>>>>>
>>>>>   1) a key/value interface is better way to manage all of our internal
>>>>> metadata (object metadata, attrs, layout, collection membership,
>>>>> write-ahead logging, overlay data, etc.)
>>>>>
>>>>>   2) a file system is well suited for storage object data (as files).
>>>>>
>>>>> So far 1 is working out well, but I'm questioning the wisdom of #2.  A
>>>>> few
>>>>> things:
>>>>>
>>>>>   - We currently write the data to the file, fsync, then commit the kv
>>>>> transaction.  That's at least 3 IOs: one for the data, one for the fs
>>>>> journal, one for the kv txn to commit (at least once my rocksdb
>>>>> changes land... the kv commit is currently 2-3).  So two people are
>>>>> managing metadata, here: the fs managing the file metadata (with its
>>>>> own
>>>>> journal) and the kv backend (with its journal).
>>>>>
>>>>>   - On read we have to open files by name, which means traversing the
>>>>> fs
>>>>
>>>> namespace.  Newstore tries to keep it as flat and simple as possible,
>>>> but at a
>>>> minimum it is a couple btree lookups.  We'd love to use open by handle
>>>> (which would reduce this to 1 btree traversal), but running the daemon
>>>> as
>>>> ceph and not root makes that hard...
>>>>>
>>>>>
>>>>>   - ...and file systems insist on updating mtime on writes, even when
>>>>> it is a
>>>>
>>>> overwrite with no allocation changes.  (We don't care about mtime.)
>>>> O_NOCMTIME patches exist but it is hard to get these past the kernel
>>>> brainfreeze.
>>>>>
>>>>>
>>>>>   - XFS is (probably) never going going to give us data checksums,
>>>>> which we
>>>>
>>>> want desperately.
>>>>>
>>>>>
>>>>> But what's the alternative?  My thought is to just bite the bullet and
>>>>
>>>> consume a raw block device directly.  Write an allocator, hopefully keep
>>>> it
>>>> pretty simple, and manage it in kv store along with all of our other
>>>> metadata.
>>>>>
>>>>>
>>>>> Wins:
>>>>>
>>>>>   - 2 IOs for most: one to write the data to unused space in the block
>>>>> device,
>>>>
>>>> one to commit our transaction (vs 4+ before).  For overwrites, we'd have
>>>> one
>>>> io to do our write-ahead log (kv journal), then do the overwrite async
>>>> (vs 4+
>>>> before).
>>>>>
>>>>>
>>>>>   - No concern about mtime getting in the way
>>>>>
>>>>>   - Faster reads (no fs lookup)
>>>>>
>>>>>   - Similarly sized metadata for most objects.  If we assume most
>>>>> objects are
>>>>
>>>> not fragmented, then the metadata to store the block offsets is about
>>>> the
>>>> same size as the metadata to store the filenames we have now.
>>>>>
>>>>>
>>>>> Problems:
>>>>>
>>>>>   - We have to size the kv backend storage (probably still an XFS
>>>>> partition) vs the block storage.  Maybe we do this anyway (put
>>>>> metadata on
>>>>> SSD!) so it won't matter.  But what happens when we are storing gobs of
>>>>
>>>> rgw index data or cephfs metadata?  Suddenly we are pulling storage out
>>>> of a
>>>> different pool and those aren't currently fungible.
>>>>>
>>>>>
>>>>>   - We have to write and maintain an allocator.  I'm still optimistic
>>>>> this can be
>>>>
>>>> reasonbly simple, especially for the flash case (where fragmentation
>>>> isn't
>>>> such an issue as long as our blocks are reasonbly sized).  For disk we
>>>> may
>>>> beed to be moderately clever.
>>>>>
>>>>>
>>>>>   - We'll need a fsck to ensure our internal metadata is consistent.
>>>>> The good
>>>>
>>>> news is it'll just need to validate what we have stored in the kv store.
>>>>>
>>>>>
>>>>> Other thoughts:
>>>>>
>>>>>   - We might want to consider whether dm-thin or bcache or other block
>>>>
>>>> layers might help us with elasticity of file vs block areas.
>>>>>
>>>>>
>>>>>   - Rocksdb can push colder data to a second directory, so we could
>>>>> have a fast ssd primary area (for wal and most metadata) and a second
>>>>> hdd directory for stuff it has to push off.  Then have a conservative
>>>>> amount of file space on the hdd.  If our block fills up, use the
>>>>> existing file mechanism to put data there too.  (But then we have to
>>>>> maintain both the current kv + file approach and not go all-in on kv +
>>>>> block.)
>>>>>
>>>>> Thoughts?
>>>>> sage

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-19 19:49 newstore direction Sage Weil
                   ` (6 preceding siblings ...)
  2015-10-20  7:06 ` Dałek, Piotr
@ 2015-10-20 18:31 ` Ric Wheeler
  2015-10-20 19:44   ` Sage Weil
                     ` (3 more replies)
  7 siblings, 4 replies; 71+ messages in thread
From: Ric Wheeler @ 2015-10-20 18:31 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

On 10/19/2015 03:49 PM, Sage Weil wrote:
> The current design is based on two simple ideas:
>
>   1) a key/value interface is better way to manage all of our internal
> metadata (object metadata, attrs, layout, collection membership,
> write-ahead logging, overlay data, etc.)
>
>   2) a file system is well suited for storage object data (as files).
>
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
> things:
>
>   - We currently write the data to the file, fsync, then commit the kv
> transaction.  That's at least 3 IOs: one for the data, one for the fs
> journal, one for the kv txn to commit (at least once my rocksdb changes
> land... the kv commit is currently 2-3).  So two people are managing
> metadata, here: the fs managing the file metadata (with its own
> journal) and the kv backend (with its journal).

If all of the fsync()'s fall into the same backing file system, are you sure 
that each fsync() takes the same time? Depending on the local FS implementation 
of course, but the order of issuing those fsync()'s can effectively make some of 
them no-ops.

>
>   - On read we have to open files by name, which means traversing the fs
> namespace.  Newstore tries to keep it as flat and simple as possible, but
> at a minimum it is a couple btree lookups.  We'd love to use open by
> handle (which would reduce this to 1 btree traversal), but running
> the daemon as ceph and not root makes that hard...

This seems like a pretty low hurdle to overcome.

>
>   - ...and file systems insist on updating mtime on writes, even when it is
> a overwrite with no allocation changes.  (We don't care about mtime.)
> O_NOCMTIME patches exist but it is hard to get these past the kernel
> brainfreeze.

Are you using O_DIRECT? Seems like there should be some enterprisey database 
tricks that we can use here.

>
>   - XFS is (probably) never going going to give us data checksums, which we
> want desperately.

What is the goal of having the file system do the checksums? How strong do they 
need to be and what size are the chunks?

If you update this on each IO, this will certainly generate more IO (each write 
will possibly generate at least one other write to update that new checksum).

>
> But what's the alternative?  My thought is to just bite the bullet and
> consume a raw block device directly.  Write an allocator, hopefully keep
> it pretty simple, and manage it in kv store along with all of our other
> metadata.

The big problem with consuming block devices directly is that you ultimately end 
up recreating most of the features that you had in the file system. Even 
enterprise databases like Oracle and DB2 have been migrating away from running 
on raw block devices in favor of file systems over time.  In effect, you are 
looking at making a simple on-disk file system, which is always easier to start 
than it is to get to a stable, production-ready state.

I think that it might be quicker and more maintainable to spend some time 
working with the local file system people (XFS or other) to see if we can 
jointly address the concerns you have.
>
> Wins:
>
>   - 2 IOs for most: one to write the data to unused space in the block
> device, one to commit our transaction (vs 4+ before).  For overwrites,
> we'd have one io to do our write-ahead log (kv journal), then do
> the overwrite async (vs 4+ before).
>
>   - No concern about mtime getting in the way
>
>   - Faster reads (no fs lookup)
>
>   - Similarly sized metadata for most objects.  If we assume most objects
> are not fragmented, then the metadata to store the block offsets is about
> the same size as the metadata to store the filenames we have now.
>
> Problems:
>
>   - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage.  Maybe we do this anyway (put metadata on
> SSD!) so it won't matter.  But what happens when we are storing gobs of
> rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
> a different pool and those aren't currently fungible.
>
>   - We have to write and maintain an allocator.  I'm still optimistic this
> can be reasonbly simple, especially for the flash case (where
> fragmentation isn't such an issue as long as our blocks are reasonbly
> sized).  For disk we may beed to be moderately clever.
>
>   - We'll need a fsck to ensure our internal metadata is consistent.  The
> good news is it'll just need to validate what we have stored in the kv
> store.
>
> Other thoughts:
>
>   - We might want to consider whether dm-thin or bcache or other block
> layers might help us with elasticity of file vs block areas.
>
>   - Rocksdb can push colder data to a second directory, so we could have a
> fast ssd primary area (for wal and most metadata) and a second hdd
> directory for stuff it has to push off.  Then have a conservative amount
> of file space on the hdd.  If our block fills up, use the existing file
> mechanism to put data there too.  (But then we have to maintain both the
> current kv + file approach and not go all-in on kv + block.)
>
> Thoughts?
> sage
> --

I really hate the idea of making a new file system type (even if we call it a 
raw block store!).

In addition to the technical hurdles, there are also production worries like how 
long will it take for distros to pick up formal support?  How do we test it 
properly?

Regards,

Ric



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-20 18:31 ` Ric Wheeler
@ 2015-10-20 19:44   ` Sage Weil
  2015-10-20 21:43     ` Ric Wheeler
  2015-10-20 19:44   ` Yehuda Sadeh-Weinraub
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 71+ messages in thread
From: Sage Weil @ 2015-10-20 19:44 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: ceph-devel

On Tue, 20 Oct 2015, Ric Wheeler wrote:
> On 10/19/2015 03:49 PM, Sage Weil wrote:
> > The current design is based on two simple ideas:
> > 
> >   1) a key/value interface is better way to manage all of our internal
> > metadata (object metadata, attrs, layout, collection membership,
> > write-ahead logging, overlay data, etc.)
> > 
> >   2) a file system is well suited for storage object data (as files).
> > 
> > So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
> > things:
> > 
> >   - We currently write the data to the file, fsync, then commit the kv
> > transaction.  That's at least 3 IOs: one for the data, one for the fs
> > journal, one for the kv txn to commit (at least once my rocksdb changes
> > land... the kv commit is currently 2-3).  So two people are managing
> > metadata, here: the fs managing the file metadata (with its own
> > journal) and the kv backend (with its journal).
> 
> If all of the fsync()'s fall into the same backing file system, are you sure
> that each fsync() takes the same time? Depending on the local FS
> implementation of course, but the order of issuing those fsync()'s can
> effectively make some of them no-ops.

Surely, yes, but the fact remains we are maintaining two journals: one 
internal to the fs that manages the allocation metadata, and one layered 
on top that handles the kv store's write stream.  The lower bound on any 
write is 3 IOs (unless we're talking about a COW fs).
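
To make the IO counting concrete, here is a minimal sketch of the two write 
paths on a raw device.  Everything below (Allocator, KVTxn, BlockDevice) is a 
hypothetical stand-in for the pieces under discussion, not actual newstore 
code:

  // Hypothetical sketch: count the IOs in the proposed raw-block write path.
  #include <cstdint>
  #include <map>
  #include <string>

  struct Extent { uint64_t offset = 0, length = 0; };

  struct Allocator {            // free space would be persisted in the kv store
    uint64_t next = 0;
    Extent allocate(uint64_t len) { Extent e{next, len}; next += len; return e; }
  };

  struct KVTxn {                // batched kv mutations; submit() == one journal IO
    std::map<std::string, std::string> ops;
    int* io_counter;
    explicit KVTxn(int* c) : io_counter(c) {}
    void set(const std::string& k, const std::string& v) { ops[k] = v; }
    void submit() { ++*io_counter; }
  };

  struct BlockDevice {          // raw device; each write() == one IO
    int* io_counter;
    explicit BlockDevice(int* c) : io_counter(c) {}
    void write(const Extent&, const void*) { ++*io_counter; }
  };

  // New data: 2 IOs (data write + kv commit), vs 3+ when a file system is
  // also journaling its own allocation metadata underneath us.
  void write_new(BlockDevice& dev, Allocator& alloc, KVTxn txn,
                 const std::string& oid, const void* data, uint64_t len) {
    Extent e = alloc.allocate(len);
    dev.write(e, data);                         // IO 1: data to unused space
    txn.set("onode:" + oid, "extent map including e");
    txn.submit();                               // IO 2: single kv commit
  }

  // Overwrite: 1 synchronous IO (kv write-ahead log) before we ack, then the
  // overwrite itself is applied asynchronously and the wal entry trimmed.
  void do_overwrite(BlockDevice& dev, KVTxn txn,
                    const std::string& oid, Extent e, const void* data) {
    txn.set("wal:" + oid, "payload + target extent");
    txn.submit();                               // IO 1
    dev.write(e, data);                         // deferred
  }

  int main() {
    int ios = 0;
    Allocator alloc;
    BlockDevice dev(&ios);
    write_new(dev, alloc, KVTxn(&ios), "obj1", "hello", 5);      // ios == 2
    do_overwrite(dev, KVTxn(&ios), "obj1", Extent{0, 5}, "hi");  // 1 sync + 1 async
    return ios == 4 ? 0 : 1;
  }

With a file system in the loop, the same operations carry the extra fs 
journal commit(s) on top of this.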

> >   - On read we have to open files by name, which means traversing the fs
> > namespace.  Newstore tries to keep it as flat and simple as possible, but
> > at a minimum it is a couple btree lookups.  We'd love to use open by
> > handle (which would reduce this to 1 btree traversal), but running
> > the daemon as ceph and not root makes that hard...
> 
> This seems like a a pretty low hurdle to overcome.

I wish you luck convincing upstream to allow unprivileged access to 
open_by_handle or the XFS ioctl.  :)  But even if we had that, any object 
access requires multiple metadata lookups: one in our kv db, and a second 
to get the inode for the backing file.  Again, there's an unnecessary 
lower bound on the number of IOs needed to access a cold object.
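
For the read path specifically, the comparison looks roughly like this; 
lookup_onode() is a hypothetical single kv get returning the extent map, not 
an existing newstore call:

  // Sketch of the cold-read comparison (hypothetical helper, not newstore code).
  #include <cstdint>
  #include <string>
  #include <fcntl.h>
  #include <unistd.h>

  struct Extent { uint64_t offset = 0, length = 0; };

  Extent lookup_onode(const std::string& /*oid*/) {
    return Extent{0, 4096};                    // stub: would be one kv read
  }

  // Raw block: one metadata lookup (kv), then read the data at its offset.
  ssize_t read_raw_block(int bdev_fd, const std::string& oid, void* buf) {
    Extent e = lookup_onode(oid);
    return pread(bdev_fd, buf, e.length, e.offset);
  }

  // File-backed: after our own kv lookup for the name (elided), the fs still
  // has to walk its namespace and load the inode before we see any data.
  ssize_t read_file_backed(const std::string& path, void* buf, size_t len) {
    int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0) return -1;
    ssize_t r = pread(fd, buf, len, 0);
    close(fd);
    return r;
  }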

> >   - ...and file systems insist on updating mtime on writes, even when it is
> > a overwrite with no allocation changes.  (We don't care about mtime.)
> > O_NOCMTIME patches exist but it is hard to get these past the kernel
> > brainfreeze.
> 
> Are you using O_DIRECT? Seems like there should be some enterprisey database
> tricks that we can use here.

It's not about the data path, but about avoiding the useless bookkeeping 
the file system is doing that we don't want or need.  See the recent 
reception of Zach's O_NOCMTIME patches on linux-fsdevel:

	http://marc.info/?t=143094969800001&r=1&w=2

I'm generally an optimist when it comes to introducing new APIs upstream, 
but I still found this to be an unbelievably frustrating exchange.

> >   - XFS is (probably) never going going to give us data checksums, which we
> > want desperately.
> 
> What is the goal of having the file system do the checksums? How strong do
> they need to be and what size are the chunks?
> 
> If you update this on each IO, this will certainly generate more IO (each
> write will possibly generate at least one other write to update that new
> checksum).

Not if we keep the checksums with the allocation metadata, in the 
onode/inode, which we're already doing an IO to persist.  But whether that 
is practical depends on the granularity (4KB or 16K or 128K or ...), which may 
in turn depend on the object (RBD block that'll service random 4K reads 
and writes?  or RGW fragment that is always written sequentially?).  I'm 
highly skeptical we'd ever get anything from a general-purpose file system 
that would work well here (if anything at all).
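
Something like the following is the shape I have in mind for keeping them in 
the onode; the layout and the 16K chunk size below are purely illustrative:

  // Illustrative onode layout with checksums riding along with the
  // allocation metadata; field names and chunk size are made up.
  #include <cstdint>
  #include <cstddef>
  #include <map>
  #include <vector>

  struct Extent { uint64_t offset = 0, length = 0; };

  struct Onode {
    static const uint32_t csum_chunk = 16384;   // the granularity question above
    std::map<uint64_t, Extent> extent_map;      // logical offset -> physical extent
    std::vector<uint32_t> csums;                // one crc32c per csum_chunk of data
  };

  // The checksum update is folded into the same kv commit that already
  // persists the extent map, so it costs no additional IO.
  void update_csum(Onode& o, uint64_t logical_off, uint32_t crc) {
    size_t idx = logical_off / Onode::csum_chunk;
    if (o.csums.size() <= idx)
      o.csums.resize(idx + 1);
    o.csums[idx] = crc;
  }

  // On read, recompute the crc of the chunk and compare before returning data.
  bool csum_ok(const Onode& o, uint64_t logical_off, uint32_t computed_crc) {
    size_t idx = logical_off / Onode::csum_chunk;
    return idx < o.csums.size() && o.csums[idx] == computed_crc;
  }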

> > But what's the alternative?  My thought is to just bite the bullet and
> > consume a raw block device directly.  Write an allocator, hopefully keep
> > it pretty simple, and manage it in kv store along with all of our other
> > metadata.
> 
> The big problem with consuming block devices directly is that you ultimately
> end up recreating most of the features that you had in the file system. Even
> enterprise databases like Oracle and DB2 have been migrating away from running
> on raw block devices in favor of file systems over time.  In effect, you are
> looking at making a simple on disk file system which is always easier to start
> than it is to get back to a stable, production ready state.

This was why we abandoned ebofs ~4 years ago... btrfs had arrived and had 
everything we were implementing and more: mainly, copy on write and data 
checksums.  But in practice the fact that it's general purpose means it 
targets very different workloads and APIs than what we need.

Now that I've realized the POSIX file namespace is a bad fit for what we 
need and opted to manage that directly, things are vastly simpler: we no 
longer have the horrific directory hashing tricks to allow PG splits (not 
because we are scared of big directories but because we need ordered 
enumeration of objects) and the transactions have exactly the granularity 
we want.  In fact, it turns out that pretty much the *only* thing the file 
system provides that we need is block allocation; everything else is 
overhead we have to play tricks to work around (batched fsync, O_NOCMTIME, 
open by handle), or something that we want but the fs will likely never 
provide (like checksums).

> I think that it might be quicker and more maintainable to spend some time
> working with the local file system people (XFS or other) to see if we can
> jointly address the concerns you have.

I have been, in cases where what we want is something that makes sense for 
other file system users.  But mostly I think that the problem is more 
that what we want isn't a file system, but an allocator + block device.

And the end result is that slotting a file system into the stack puts an 
upper bound on our performance.  On its face this isn't surprising, but 
I'm running up against it in gory detail in my efforts to make the Ceph 
OSD faster, and the question becomes whether we want to be fast or 
layered.  (I don't think 'simple' is really an option given the effort to 
work around the POSIX vs ObjectStore impedance mismatch.)

> I really hate the idea of making a new file system type (even if we call it a
> raw block store!).

Just to be clear, this isn't a new kernel file system--it's userland 
consuming a block device (ala oracle).  (But yeah, I hate it too.)

> In addition to the technical hurdles, there are also production worries like
> how long will it take for distros to pick up formal support?  How do we test
> it properly?

This actually means less for the distros to support: we'll consume 
/dev/sdb instead of an XFS mount.  Testing will be the same as before... 
the usual forced-kill and power cycle testing under the stress and 
correctness testing workloads.

What we (Ceph) will support in its place will be a combination of a kv 
store (which we already need) and a block allocator.
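
For a sense of scale, the allocator can start out as something like a 
free-extent map with first-fit allocation and merge-on-release.  This toy 
sketch deliberately ignores alignment, persistence (which would go through 
the kv store), and the HDD fragmentation concerns raised earlier:

  // Toy first-fit extent allocator of the "hopefully pretty simple" flavour
  // discussed in this thread; not production code.
  #include <cstdint>
  #include <iterator>
  #include <map>

  class ExtentAllocator {
    std::map<uint64_t, uint64_t> free_;   // offset -> length of free extents
  public:
    explicit ExtentAllocator(uint64_t dev_size) { free_[0] = dev_size; }

    // First fit; returns offset, or UINT64_MAX if no extent is big enough.
    uint64_t allocate(uint64_t len) {
      for (auto it = free_.begin(); it != free_.end(); ++it) {
        if (it->second < len) continue;
        uint64_t off = it->first, rem = it->second - len;
        free_.erase(it);
        if (rem) free_[off + len] = rem;
        return off;
      }
      return UINT64_MAX;
    }

    // Return an extent, merging with free neighbours to limit fragmentation.
    void release(uint64_t off, uint64_t len) {
      auto next = free_.lower_bound(off);
      if (next != free_.begin()) {
        auto prev = std::prev(next);
        if (prev->first + prev->second == off) {          // merge left
          off = prev->first; len += prev->second; free_.erase(prev);
        }
      }
      if (next != free_.end() && off + len == next->first) {   // merge right
        len += next->second; free_.erase(next);
      }
      free_[off] = len;
    }
  };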

sage

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-20 18:31 ` Ric Wheeler
  2015-10-20 19:44   ` Sage Weil
@ 2015-10-20 19:44   ` Yehuda Sadeh-Weinraub
  2015-10-21  8:22   ` Orit Wasserman
  2015-10-21 10:06   ` Allen Samuels
  3 siblings, 0 replies; 71+ messages in thread
From: Yehuda Sadeh-Weinraub @ 2015-10-20 19:44 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Sage Weil, ceph-devel

On Tue, Oct 20, 2015 at 11:31 AM, Ric Wheeler <rwheeler@redhat.com> wrote:
> On 10/19/2015 03:49 PM, Sage Weil wrote:
>>
>> The current design is based on two simple ideas:
>>
>>   1) a key/value interface is better way to manage all of our internal
>> metadata (object metadata, attrs, layout, collection membership,
>> write-ahead logging, overlay data, etc.)
>>
>>   2) a file system is well suited for storage object data (as files).
>>
>> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
>> things:
>>
>>   - We currently write the data to the file, fsync, then commit the kv
>> transaction.  That's at least 3 IOs: one for the data, one for the fs
>> journal, one for the kv txn to commit (at least once my rocksdb changes
>> land... the kv commit is currently 2-3).  So two people are managing
>> metadata, here: the fs managing the file metadata (with its own
>> journal) and the kv backend (with its journal).
>
>
> If all of the fsync()'s fall into the same backing file system, are you sure
> that each fsync() takes the same time? Depending on the local FS
> implementation of course, but the order of issuing those fsync()'s can
> effectively make some of them no-ops.
>
>>
>>   - On read we have to open files by name, which means traversing the fs
>> namespace.  Newstore tries to keep it as flat and simple as possible, but
>> at a minimum it is a couple btree lookups.  We'd love to use open by
>> handle (which would reduce this to 1 btree traversal), but running
>> the daemon as ceph and not root makes that hard...
>
>
> This seems like a a pretty low hurdle to overcome.
>
>>
>>   - ...and file systems insist on updating mtime on writes, even when it
>> is
>> a overwrite with no allocation changes.  (We don't care about mtime.)
>> O_NOCMTIME patches exist but it is hard to get these past the kernel
>> brainfreeze.
>
>
> Are you using O_DIRECT? Seems like there should be some enterprisey database
> tricks that we can use here.
>
>>
>>   - XFS is (probably) never going going to give us data checksums, which
>> we
>> want desperately.
>
>
> What is the goal of having the file system do the checksums? How strong do
> they need to be and what size are the chunks?
>
> If you update this on each IO, this will certainly generate more IO (each
> write will possibly generate at least one other write to update that new
> checksum).
>
>>
>> But what's the alternative?  My thought is to just bite the bullet and
>> consume a raw block device directly.  Write an allocator, hopefully keep
>> it pretty simple, and manage it in kv store along with all of our other
>> metadata.
>
>
> The big problem with consuming block devices directly is that you ultimately
> end up recreating most of the features that you had in the file system. Even
> enterprise databases like Oracle and DB2 have been migrating away from
> running on raw block devices in favor of file systems over time.  In effect,
> you are looking at making a simple on disk file system which is always
> easier to start than it is to get back to a stable, production ready state.
>
> I think that it might be quicker and more maintainable to spend some time
> working with the local file system people (XFS or other) to see if we can
> jointly address the concerns you have.
>
>>
>> Wins:
>>
>>   - 2 IOs for most: one to write the data to unused space in the block
>> device, one to commit our transaction (vs 4+ before).  For overwrites,
>> we'd have one io to do our write-ahead log (kv journal), then do
>> the overwrite async (vs 4+ before).
>>
>>   - No concern about mtime getting in the way
>>
>>   - Faster reads (no fs lookup)
>>
>>   - Similarly sized metadata for most objects.  If we assume most objects
>> are not fragmented, then the metadata to store the block offsets is about
>> the same size as the metadata to store the filenames we have now.
>>
>> Problems:
>>
>>   - We have to size the kv backend storage (probably still an XFS
>> partition) vs the block storage.  Maybe we do this anyway (put metadata on
>> SSD!) so it won't matter.  But what happens when we are storing gobs of
>> rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
>> a different pool and those aren't currently fungible.
>>
>>   - We have to write and maintain an allocator.  I'm still optimistic this
>> can be reasonbly simple, especially for the flash case (where
>> fragmentation isn't such an issue as long as our blocks are reasonbly
>> sized).  For disk we may beed to be moderately clever.
>>
>>   - We'll need a fsck to ensure our internal metadata is consistent.  The
>> good news is it'll just need to validate what we have stored in the kv
>> store.
>>
>> Other thoughts:
>>
>>   - We might want to consider whether dm-thin or bcache or other block
>> layers might help us with elasticity of file vs block areas.
>>
>>   - Rocksdb can push colder data to a second directory, so we could have a
>> fast ssd primary area (for wal and most metadata) and a second hdd
>> directory for stuff it has to push off.  Then have a conservative amount
>> of file space on the hdd.  If our block fills up, use the existing file
>> mechanism to put data there too.  (But then we have to maintain both the
>> current kv + file approach and not go all-in on kv + block.)
>>
>> Thoughts?
>> sage
>> --
>
>
> I really hate the idea of making a new file system type (even if we call it
> a raw block store!).

While I mostly agree with the sentiment (and I also believe that, as
with any project like that, you know where you start but 5 years later
you still don't know when you're going to end), I do think that it
seems quite different in requirements and functionality from a normal
filesystem (e.g., no need for directories or filenames?). Maybe we need
to have a proper understanding of the requirements, and then we can
weigh what the proper solution is.
>
> In addition to the technical hurdles, there are also production worries like
> how long will it take for distros to pick up formal support?  How do we test
> it properly?
>

Does it even need to be a kernel module?

Yehuda

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-20  0:48 ` John Spray
@ 2015-10-20 20:00   ` Sage Weil
  2015-10-20 20:36     ` Gregory Farnum
                       ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: Sage Weil @ 2015-10-20 20:00 UTC (permalink / raw)
  To: John Spray; +Cc: Ceph Development

On Tue, 20 Oct 2015, John Spray wrote:
> On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil <sweil@redhat.com> wrote:
> >  - We have to size the kv backend storage (probably still an XFS
> > partition) vs the block storage.  Maybe we do this anyway (put metadata on
> > SSD!) so it won't matter.  But what happens when we are storing gobs of
> > rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
> > a different pool and those aren't currently fungible.
> 
> This is the concerning bit for me -- the other parts one "just" has to
> get the code right, but this problem could linger and be something we
> have to keep explaining to users indefinitely.  It reminds me of cases
> in other systems where users had to make an educated guess about inode
> size up front, depending on whether you're expecting to efficiently
> store a lot of xattrs.
> 
> In practice it's rare for users to make these kinds of decisions well
> up-front: it really needs to be adjustable later, ideally
> automatically.  That could be pretty straightforward if the KV part
> was stored directly on block storage, instead of having XFS in the
> mix.  I'm not quite up with the state of the art in this area: are
> there any reasonable alternatives for the KV part that would consume
> some defined range of a block device from userspace, instead of
> sitting on top of a filesystem?

I agree: this is my primary concern with the raw block approach.

There are some KV alternatives that could consume block, but the problem 
would be similar: we need to dynamically size up or down the kv portion of 
the device.

I see two basic options:

1) Wire into the Env abstraction in rocksdb to provide something just 
smart enough to let rocksdb work.  It isn't much: named files (not that 
many--we could easily keep the file table in ram), always written 
sequentially, to be read later with random access. All of the code is 
written around abstractions of SequentialFileWriter so that everything 
posix is neatly hidden in env_posix (and there are various other env 
implementations for in-memory mock tests etc.).

2) Use something like dm-thin to sit between the raw block device and XFS 
(for rocksdb) and the block device consumed by newstore.  As long as XFS 
doesn't fragment horrifically (it shouldn't, given we *always* write ~4mb 
files in their entirety) we can fstrim and size down the fs portion.  If 
we similarly make newstore's allocator stick to large blocks only, we would 
be able to size down the block portion as well.  Typical dm-thin block 
sizes seem to range from 64KB to 512KB, which seems reasonable enough to 
me.  In fact, we could likely just size the fs volume at something 
conservatively large (like 90%) and rely on -o discard or periodic fstrim 
to keep its actual utilization in check.
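
For (1), the file table really is tiny.  A hypothetical sketch of the in-RAM 
mapping the Env glue would sit on top of (the real thing would subclass 
rocksdb::Env and do the actual read/write plumbing):

  // Sketch only: map rocksdb's named files onto extents of a reserved
  // block-device region, with the whole file table kept in RAM.
  #include <cstdint>
  #include <map>
  #include <string>
  #include <vector>

  struct Extent { uint64_t offset = 0, length = 0; };

  class TinyFileTable {
    std::map<std::string, std::vector<Extent>> files_;  // dozens of .sst/.log files
  public:
    // rocksdb only ever appends, so creation just registers the name.
    void create(const std::string& name) { files_[name]; }

    // Record a newly written chunk (writes are strictly sequential).
    void append(const std::string& name, Extent e) { files_[name].push_back(e); }

    // Random-access read: translate (file, offset) to a device offset.
    bool map(const std::string& name, uint64_t off, uint64_t* dev_off) const {
      auto it = files_.find(name);
      if (it == files_.end()) return false;
      for (const Extent& e : it->second) {
        if (off < e.length) { *dev_off = e.offset + off; return true; }
        off -= e.length;
      }
      return false;
    }

    // Deleting a file would hand its extents back to the allocator.
    void remove(const std::string& name) { files_.erase(name); }
  };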

sage

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-20 12:34       ` Sage Weil
@ 2015-10-20 20:18         ` Martin Millnert
  2015-10-20 20:32         ` James (Fei) Liu-SSI
  1 sibling, 0 replies; 71+ messages in thread
From: Martin Millnert @ 2015-10-20 20:18 UTC (permalink / raw)
  To: Sage Weil; +Cc: James (Fei) Liu-SSI, Somnath Roy, ceph-devel

Adding to this,

On Tue, 2015-10-20 at 05:34 -0700, Sage Weil wrote:
> On Mon, 19 Oct 2015, James (Fei) Liu-SSI wrote:
> > Hi Sage and Somnath,
> >   In my humble opinion, There is another more aggressive solution than 
> > raw block device base keyvalue store as backend for objectstore. The new 
> > key value SSD device with transaction support would be ideal to solve 
> > the issues. First of all, it is raw SSD device. Secondly , It provides 
> > key value interface directly from SSD. Thirdly, it can provide 
> > transaction support, consistency will be guaranteed by hardware device. 
> > It pretty much satisfied all of objectstore needs without any extra 
> > overhead since there is not any extra layer in between device and 
> > objectstore.
> 
> Are you talking about open channel SSDs?  Or something else?  Everything 
> I'm familiar with that is currently shipping is exposing a vanilla block 
> interface (conventional SSDs) that hides all of that or NVMe (which isn't 
> much better).
> 
> If there is a low-level KV interface we can consume that would be 
> great--especially if we can glue it to our KeyValueDB abstract API.  Even 
> so, we need to make sure that the object *data* also has an efficient API 
> we can utilize that efficiently handles block-sized/aligned data.

If there's a way to efficiently utilize more generic NVRAM-based block
devices for quick metadata ops such that payload data can fly without
much delay, I'd be quite happy. 

Also, a current concern of mine is backing up the metadata in some
fashion, given the risk of (human configuration error || device
malfunction) && (cluster-wide power outage): some type of flushing to
underlying consistent media, and/or snapshot-like backups.

As long as the constructs aren't too exotic, perhaps this could be
addressed using standard Linux FS or device mapper code (bcache, or
other).

Not sure how popular journals on NVRAM are, but here's one user at least.

/M


> sage
> 
> 
> >    Either way, I strongly support to have CEPH own data format instead 
> > of relying on filesystem.
> > 
> >   Regards,
> >   James
> > 
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Monday, October 19, 2015 1:55 PM
> > To: Somnath Roy
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: newstore direction
> > 
> > On Mon, 19 Oct 2015, Somnath Roy wrote:
> > > Sage,
> > > I fully support that.  If we want to saturate SSDs , we need to get 
> > > rid of this filesystem overhead (which I am in process of measuring). 
> > > Also, it will be good if we can eliminate the dependency on the k/v 
> > > dbs (for storing allocators and all). The reason is the unknown write 
> > > amps they causes.
> > 
> > My hope is to keep behing the KeyValueDB interface (and/more change it as
> > appropriate) so that other backends can be easily swapped in (e.g. a btree-based one for high-end flash).
> > 
> > sage
> > 
> > 
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > 
> > > -----Original Message-----
> > > From: ceph-devel-owner@vger.kernel.org 
> > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> > > Sent: Monday, October 19, 2015 12:49 PM
> > > To: ceph-devel@vger.kernel.org
> > > Subject: newstore direction
> > > 
> > > The current design is based on two simple ideas:
> > > 
> > >  1) a key/value interface is better way to manage all of our internal 
> > > metadata (object metadata, attrs, layout, collection membership, 
> > > write-ahead logging, overlay data, etc.)
> > > 
> > >  2) a file system is well suited for storage object data (as files).
> > > 
> > > So far 1 is working out well, but I'm questioning the wisdom of #2.  A 
> > > few
> > > things:
> > > 
> > >  - We currently write the data to the file, fsync, then commit the kv 
> > > transaction.  That's at least 3 IOs: one for the data, one for the fs 
> > > journal, one for the kv txn to commit (at least once my rocksdb 
> > > changes land... the kv commit is currently 2-3).  So two people are 
> > > managing metadata, here: the fs managing the file metadata (with its 
> > > own
> > > journal) and the kv backend (with its journal).
> > > 
> > >  - On read we have to open files by name, which means traversing the fs namespace.  Newstore tries to keep it as flat and simple as possible, but at a minimum it is a couple btree lookups.  We'd love to use open by handle (which would reduce this to 1 btree traversal), but running the daemon as ceph and not root makes that hard...
> > > 
> > >  - ...and file systems insist on updating mtime on writes, even when it is a overwrite with no allocation changes.  (We don't care about mtime.) O_NOCMTIME patches exist but it is hard to get these past the kernel brainfreeze.
> > > 
> > >  - XFS is (probably) never going going to give us data checksums, which we want desperately.
> > > 
> > > But what's the alternative?  My thought is to just bite the bullet and consume a raw block device directly.  Write an allocator, hopefully keep it pretty simple, and manage it in kv store along with all of our other metadata.
> > > 
> > > Wins:
> > > 
> > >  - 2 IOs for most: one to write the data to unused space in the block device, one to commit our transaction (vs 4+ before).  For overwrites, we'd have one io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ before).
> > > 
> > >  - No concern about mtime getting in the way
> > > 
> > >  - Faster reads (no fs lookup)
> > > 
> > >  - Similarly sized metadata for most objects.  If we assume most objects are not fragmented, then the metadata to store the block offsets is about the same size as the metadata to store the filenames we have now.
> > > 
> > > Problems:
> > > 
> > >  - We have to size the kv backend storage (probably still an XFS
> > > partition) vs the block storage.  Maybe we do this anyway (put 
> > > metadata on
> > > SSD!) so it won't matter.  But what happens when we are storing gobs of rgw index data or cephfs metadata?  Suddenly we are pulling storage out of a different pool and those aren't currently fungible.
> > > 
> > >  - We have to write and maintain an allocator.  I'm still optimistic this can be reasonbly simple, especially for the flash case (where fragmentation isn't such an issue as long as our blocks are reasonbly sized).  For disk we may beed to be moderately clever.
> > > 
> > >  - We'll need a fsck to ensure our internal metadata is consistent.  The good news is it'll just need to validate what we have stored in the kv store.
> > > 
> > > Other thoughts:
> > > 
> > >  - We might want to consider whether dm-thin or bcache or other block layers might help us with elasticity of file vs block areas.
> > > 
> > >  - Rocksdb can push colder data to a second directory, so we could 
> > > have a fast ssd primary area (for wal and most metadata) and a second 
> > > hdd directory for stuff it has to push off.  Then have a conservative 
> > > amount of file space on the hdd.  If our block fills up, use the 
> > > existing file mechanism to put data there too.  (But then we have to 
> > > maintain both the current kv + file approach and not go all-in on kv + 
> > > block.)
> > > 
> > > Thoughts?
> > > sage



^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: newstore direction
  2015-10-20 12:34       ` Sage Weil
  2015-10-20 20:18         ` Martin Millnert
@ 2015-10-20 20:32         ` James (Fei) Liu-SSI
  2015-10-20 20:39           ` James (Fei) Liu-SSI
  2015-10-20 21:20           ` Sage Weil
  1 sibling, 2 replies; 71+ messages in thread
From: James (Fei) Liu-SSI @ 2015-10-20 20:32 UTC (permalink / raw)
  To: Sage Weil; +Cc: Somnath Roy, ceph-devel

Hi Sage, 
   Sorry for the confusion. SSDs with key/value interfaces are still under development by several vendors. They take a totally different design approach than Open Channel SSDs. I met Matias several months ago and we discussed possibilities for key/value interface support with Open Channel SSD; I have not been following the progress since then. If Matias is in this group, he can definitely give us a better explanation. Here is his presentation on key/value support with Open Channel SSD for your reference.

http://events.linuxfoundation.org/sites/events/files/slides/LightNVM-Vault2015.pdf


  Regards,
  James  

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com] 
Sent: Tuesday, October 20, 2015 5:34 AM
To: James (Fei) Liu-SSI
Cc: Somnath Roy; ceph-devel@vger.kernel.org
Subject: RE: newstore direction

On Mon, 19 Oct 2015, James (Fei) Liu-SSI wrote:
> Hi Sage and Somnath,
>   In my humble opinion, There is another more aggressive solution than 
> raw block device base keyvalue store as backend for objectstore. The 
> new key value SSD device with transaction support would be ideal to 
> solve the issues. First of all, it is raw SSD device. Secondly , It 
> provides key value interface directly from SSD. Thirdly, it can 
> provide transaction support, consistency will be guaranteed by hardware device.
> It pretty much satisfied all of objectstore needs without any extra 
> overhead since there is not any extra layer in between device and 
> objectstore.

Are you talking about open channel SSDs?  Or something else?  Everything I'm familiar with that is currently shipping is exposing a vanilla block interface (conventional SSDs) that hides all of that or NVMe (which isn't much better).

If there is a low-level KV interface we can consume that would be great--especially if we can glue it to our KeyValueDB abstract API.  Even so, we need to make sure that the object *data* also has an efficient API we can utilize that efficiently handles block-sized/aligned data.

sage


>    Either way, I strongly support to have CEPH own data format instead 
> of relying on filesystem.
> 
>   Regards,
>   James
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, October 19, 2015 1:55 PM
> To: Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: newstore direction
> 
> On Mon, 19 Oct 2015, Somnath Roy wrote:
> > Sage,
> > I fully support that.  If we want to saturate SSDs , we need to get 
> > rid of this filesystem overhead (which I am in process of measuring).
> > Also, it will be good if we can eliminate the dependency on the k/v 
> > dbs (for storing allocators and all). The reason is the unknown 
> > write amps they causes.
> 
> My hope is to keep behing the KeyValueDB interface (and/more change it 
> as
> appropriate) so that other backends can be easily swapped in (e.g. a btree-based one for high-end flash).
> 
> sage
> 
> 
> > 
> > Thanks & Regards
> > Somnath
> > 
> > 
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org 
> > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Monday, October 19, 2015 12:49 PM
> > To: ceph-devel@vger.kernel.org
> > Subject: newstore direction
> > 
> > The current design is based on two simple ideas:
> > 
> >  1) a key/value interface is better way to manage all of our 
> > internal metadata (object metadata, attrs, layout, collection 
> > membership, write-ahead logging, overlay data, etc.)
> > 
> >  2) a file system is well suited for storage object data (as files).
> > 
> > So far 1 is working out well, but I'm questioning the wisdom of #2.  
> > A few
> > things:
> > 
> >  - We currently write the data to the file, fsync, then commit the 
> > kv transaction.  That's at least 3 IOs: one for the data, one for 
> > the fs journal, one for the kv txn to commit (at least once my 
> > rocksdb changes land... the kv commit is currently 2-3).  So two 
> > people are managing metadata, here: the fs managing the file 
> > metadata (with its own
> > journal) and the kv backend (with its journal).
> > 
> >  - On read we have to open files by name, which means traversing the fs namespace.  Newstore tries to keep it as flat and simple as possible, but at a minimum it is a couple btree lookups.  We'd love to use open by handle (which would reduce this to 1 btree traversal), but running the daemon as ceph and not root makes that hard...
> > 
> >  - ...and file systems insist on updating mtime on writes, even when it is a overwrite with no allocation changes.  (We don't care about mtime.) O_NOCMTIME patches exist but it is hard to get these past the kernel brainfreeze.
> > 
> >  - XFS is (probably) never going going to give us data checksums, which we want desperately.
> > 
> > But what's the alternative?  My thought is to just bite the bullet and consume a raw block device directly.  Write an allocator, hopefully keep it pretty simple, and manage it in kv store along with all of our other metadata.
> > 
> > Wins:
> > 
> >  - 2 IOs for most: one to write the data to unused space in the block device, one to commit our transaction (vs 4+ before).  For overwrites, we'd have one io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ before).
> > 
> >  - No concern about mtime getting in the way
> > 
> >  - Faster reads (no fs lookup)
> > 
> >  - Similarly sized metadata for most objects.  If we assume most objects are not fragmented, then the metadata to store the block offsets is about the same size as the metadata to store the filenames we have now.
> > 
> > Problems:
> > 
> >  - We have to size the kv backend storage (probably still an XFS
> > partition) vs the block storage.  Maybe we do this anyway (put 
> > metadata on
> > SSD!) so it won't matter.  But what happens when we are storing gobs of rgw index data or cephfs metadata?  Suddenly we are pulling storage out of a different pool and those aren't currently fungible.
> > 
> >  - We have to write and maintain an allocator.  I'm still optimistic this can be reasonbly simple, especially for the flash case (where fragmentation isn't such an issue as long as our blocks are reasonbly sized).  For disk we may beed to be moderately clever.
> > 
> >  - We'll need a fsck to ensure our internal metadata is consistent.  The good news is it'll just need to validate what we have stored in the kv store.
> > 
> > Other thoughts:
> > 
> >  - We might want to consider whether dm-thin or bcache or other block layers might help us with elasticity of file vs block areas.
> > 
> >  - Rocksdb can push colder data to a second directory, so we could 
> > have a fast ssd primary area (for wal and most metadata) and a 
> > second hdd directory for stuff it has to push off.  Then have a 
> > conservative amount of file space on the hdd.  If our block fills 
> > up, use the existing file mechanism to put data there too.  (But 
> > then we have to maintain both the current kv + file approach and not 
> > go all-in on kv +
> > block.)
> > 
> > Thoughts?
> > sage

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-20 20:00   ` Sage Weil
@ 2015-10-20 20:36     ` Gregory Farnum
  2015-10-20 21:47       ` Sage Weil
  2015-10-20 20:42     ` Matt Benjamin
  2015-10-22 12:32     ` Milosz Tanski
  2 siblings, 1 reply; 71+ messages in thread
From: Gregory Farnum @ 2015-10-20 20:36 UTC (permalink / raw)
  To: Sage Weil; +Cc: John Spray, Ceph Development

On Tue, Oct 20, 2015 at 12:44 PM, Sage Weil <sweil@redhat.com> wrote:
> On Tue, 20 Oct 2015, Ric Wheeler wrote:
>> The big problem with consuming block devices directly is that you ultimately
>> end up recreating most of the features that you had in the file system. Even
>> enterprise databases like Oracle and DB2 have been migrating away from running
>> on raw block devices in favor of file systems over time.  In effect, you are
>> looking at making a simple on disk file system which is always easier to start
>> than it is to get back to a stable, production ready state.
>
> This was why we abandoned ebofs ~4 years ago... btrfs had arrived and had
> everything we were implementing and more: mainly, copy on write and data
> checksums.  But in practice the fact that its general purpose means it
> targets a very different workloads and APIs than what we need.

Try 7 years since ebofs...
That's one of my concerns, though. You ditched ebofs once already
because it had metastasized into an entire FS, and had reached its
limits of maintainability. What makes you think a second time through
would work better? :/

On Mon, Oct 19, 2015 at 12:49 PM, Sage Weil <sweil@redhat.com> wrote:
>  - 2 IOs for most: one to write the data to unused space in the block
> device, one to commit our transaction (vs 4+ before).  For overwrites,
> we'd have one io to do our write-ahead log (kv journal), then do
> the overwrite async (vs 4+ before).

I can't work this one out. If you're doing one write for the data and
one for the kv journal (which is on another filesystem), how does the
commit sequence work that it's only 2 IOs instead of the same 3 we
already have? Or are you planning to ditch the LevelDB/RocksDB store
for our journaling and just use something within the block layer?


If we do want to go down this road, we shouldn't need to write an
allocator from scratch. I don't remember exactly which ones it is but
we've read/seen at least a few storage papers where people have reused
existing allocators  — I think the one from ext2? And somebody managed
to get it running in userspace.

Of course, then we also need to figure out how to get checksums on the
block data, since if we're going to put in the effort to reimplement
this much of the stack we'd better get our full data integrity
guarantees along with it!

On Tue, Oct 20, 2015 at 1:00 PM, Sage Weil <sweil@redhat.com> wrote:
> On Tue, 20 Oct 2015, John Spray wrote:
>> On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil <sweil@redhat.com> wrote:
>> >  - We have to size the kv backend storage (probably still an XFS
>> > partition) vs the block storage.  Maybe we do this anyway (put metadata on
>> > SSD!) so it won't matter.  But what happens when we are storing gobs of
>> > rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
>> > a different pool and those aren't currently fungible.
>>
>> This is the concerning bit for me -- the other parts one "just" has to
>> get the code right, but this problem could linger and be something we
>> have to keep explaining to users indefinitely.  It reminds me of cases
>> in other systems where users had to make an educated guess about inode
>> size up front, depending on whether you're expecting to efficiently
>> store a lot of xattrs.
>>
>> In practice it's rare for users to make these kinds of decisions well
>> up-front: it really needs to be adjustable later, ideally
>> automatically.  That could be pretty straightforward if the KV part
>> was stored directly on block storage, instead of having XFS in the
>> mix.  I'm not quite up with the state of the art in this area: are
>> there any reasonable alternatives for the KV part that would consume
>> some defined range of a block device from userspace, instead of
>> sitting on top of a filesystem?
>
> I agree: this is my primary concern with the raw block approach.
>
> There are some KV alternatives that could consume block, but the problem
> would be similar: we need to dynamically size up or down the kv portion of
> the device.
>
> I see two basic options:
>
> 1) Wire into the Env abstraction in rocksdb to provide something just
> smart enough to let rocksdb work.  It isn't much: named files (not that
> many--we could easily keep the file table in ram), always written
> sequentially, to be read later with random access. All of the code is
> written around abstractions of SequentialFileWriter so that everything
> posix is neatly hidden in env_posix (and there are various other env
> implementations for in-memory mock tests etc.).

This seems like the obviously correct move to me? Except we might want
to include the rocksdb store on flash instead of hard drives, which
means maybe we do want some unified storage system which can handle
multiple physical storage devices as a single piece of storage space.
(Not that any of those exist in "almost done" hell, or that we're
going through requirements expansion or anything!)
-Greg

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: newstore direction
  2015-10-20 20:32         ` James (Fei) Liu-SSI
@ 2015-10-20 20:39           ` James (Fei) Liu-SSI
  2015-10-20 21:20           ` Sage Weil
  1 sibling, 0 replies; 71+ messages in thread
From: James (Fei) Liu-SSI @ 2015-10-20 20:39 UTC (permalink / raw)
  To: James (Fei) Liu-SSI, Sage Weil, Varada Kari; +Cc: Somnath Roy, ceph-devel

Varada,

Hopefully it will answer your question too. It is going to be a new type of key/value device rather than a traditional hard-drive-based OSD device, and it will have its own storage stack rather than the traditional block-based storage stack. I have to admit it is a little bit more aggressive than the block-based approach.

Regards,
James

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
Sent: Tuesday, October 20, 2015 1:33 PM
To: Sage Weil
Cc: Somnath Roy; ceph-devel@vger.kernel.org
Subject: RE: newstore direction

Hi Sage, 
   Sorry for the confusion. SSDs with key-value interfaces are still under development by several vendors.  They take a totally different design approach than Open-Channel SSDs. I met Matias several months ago and discussed the possibility of key-value interface support on Open-Channel SSDs; I have not followed the progress since then. If Matias is in this group, he can certainly give us a better explanation. Here is his presentation on key-value support with Open-Channel SSDs for your reference.

http://events.linuxfoundation.org/sites/events/files/slides/LightNVM-Vault2015.pdf


  Regards,
  James  

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com]
Sent: Tuesday, October 20, 2015 5:34 AM
To: James (Fei) Liu-SSI
Cc: Somnath Roy; ceph-devel@vger.kernel.org
Subject: RE: newstore direction

On Mon, 19 Oct 2015, James (Fei) Liu-SSI wrote:
> Hi Sage and Somnath,
>   In my humble opinion, There is another more aggressive solution than 
> raw block device base keyvalue store as backend for objectstore. The 
> new key value SSD device with transaction support would be ideal to 
> solve the issues. First of all, it is raw SSD device. Secondly , It 
> provides key value interface directly from SSD. Thirdly, it can 
> provide transaction support, consistency will be guaranteed by hardware device.
> It pretty much satisfied all of objectstore needs without any extra 
> overhead since there is not any extra layer in between device and 
> objectstore.

Are you talking about open channel SSDs?  Or something else?  Everything I'm familiar with that is currently shipping exposes either a vanilla block interface (conventional SSDs) that hides all of that, or NVMe (which isn't much better).

If there is a low-level KV interface we can consume that would be great--especially if we can glue it to our KeyValueDB abstract API.  Even so, we need to make sure that the object *data* also has an API we can utilize that efficiently handles block-sized/aligned data.

sage


>    Either way, I strongly support to have CEPH own data format instead 
> of relying on filesystem.
> 
>   Regards,
>   James
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, October 19, 2015 1:55 PM
> To: Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: newstore direction
> 
> On Mon, 19 Oct 2015, Somnath Roy wrote:
> > Sage,
> > I fully support that.  If we want to saturate SSDs , we need to get 
> > rid of this filesystem overhead (which I am in process of measuring).
> > Also, it will be good if we can eliminate the dependency on the k/v 
> > dbs (for storing allocators and all). The reason is the unknown 
> > write amps they causes.
> 
> My hope is to keep behing the KeyValueDB interface (and/more change it 
> as
> appropriate) so that other backends can be easily swapped in (e.g. a btree-based one for high-end flash).
> 
> sage
> 
> 
> > 
> > Thanks & Regards
> > Somnath
> > 
> > 
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org 
> > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Monday, October 19, 2015 12:49 PM
> > To: ceph-devel@vger.kernel.org
> > Subject: newstore direction
> > 
> > The current design is based on two simple ideas:
> > 
> >  1) a key/value interface is better way to manage all of our 
> > internal metadata (object metadata, attrs, layout, collection 
> > membership, write-ahead logging, overlay data, etc.)
> > 
> >  2) a file system is well suited for storage object data (as files).
> > 
> > So far 1 is working out well, but I'm questioning the wisdom of #2.  
> > A few
> > things:
> > 
> >  - We currently write the data to the file, fsync, then commit the 
> > kv transaction.  That's at least 3 IOs: one for the data, one for 
> > the fs journal, one for the kv txn to commit (at least once my 
> > rocksdb changes land... the kv commit is currently 2-3).  So two 
> > people are managing metadata, here: the fs managing the file 
> > metadata (with its own
> > journal) and the kv backend (with its journal).
> > 
> >  - On read we have to open files by name, which means traversing the fs namespace.  Newstore tries to keep it as flat and simple as possible, but at a minimum it is a couple btree lookups.  We'd love to use open by handle (which would reduce this to 1 btree traversal), but running the daemon as ceph and not root makes that hard...
> > 
> >  - ...and file systems insist on updating mtime on writes, even when it is a overwrite with no allocation changes.  (We don't care about mtime.) O_NOCMTIME patches exist but it is hard to get these past the kernel brainfreeze.
> > 
> >  - XFS is (probably) never going going to give us data checksums, which we want desperately.
> > 
> > But what's the alternative?  My thought is to just bite the bullet and consume a raw block device directly.  Write an allocator, hopefully keep it pretty simple, and manage it in kv store along with all of our other metadata.
> > 
> > Wins:
> > 
> >  - 2 IOs for most: one to write the data to unused space in the block device, one to commit our transaction (vs 4+ before).  For overwrites, we'd have one io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ before).
> > 
> >  - No concern about mtime getting in the way
> > 
> >  - Faster reads (no fs lookup)
> > 
> >  - Similarly sized metadata for most objects.  If we assume most objects are not fragmented, then the metadata to store the block offsets is about the same size as the metadata to store the filenames we have now.
> > 
> > Problems:
> > 
> >  - We have to size the kv backend storage (probably still an XFS
> > partition) vs the block storage.  Maybe we do this anyway (put 
> > metadata on
> > SSD!) so it won't matter.  But what happens when we are storing gobs of rgw index data or cephfs metadata?  Suddenly we are pulling storage out of a different pool and those aren't currently fungible.
> > 
> >  - We have to write and maintain an allocator.  I'm still optimistic this can be reasonbly simple, especially for the flash case (where fragmentation isn't such an issue as long as our blocks are reasonbly sized).  For disk we may beed to be moderately clever.
> > 
> >  - We'll need a fsck to ensure our internal metadata is consistent.  The good news is it'll just need to validate what we have stored in the kv store.
> > 
> > Other thoughts:
> > 
> >  - We might want to consider whether dm-thin or bcache or other block layers might help us with elasticity of file vs block areas.
> > 
> >  - Rocksdb can push colder data to a second directory, so we could 
> > have a fast ssd primary area (for wal and most metadata) and a 
> > second hdd directory for stuff it has to push off.  Then have a 
> > conservative amount of file space on the hdd.  If our block fills 
> > up, use the existing file mechanism to put data there too.  (But 
> > then we have to maintain both the current kv + file approach and not 
> > go all-in on kv +
> > block.)
> > 
> > Thoughts?
> > sage
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-20 20:00   ` Sage Weil
  2015-10-20 20:36     ` Gregory Farnum
@ 2015-10-20 20:42     ` Matt Benjamin
  2015-10-22 12:32     ` Milosz Tanski
  2 siblings, 0 replies; 71+ messages in thread
From: Matt Benjamin @ 2015-10-20 20:42 UTC (permalink / raw)
  To: Sage Weil; +Cc: John Spray, Ceph Development

We mostly assumed that sort-of transactional file systems, perhaps hosted in user space, were the most tractable trajectory.  I have seen newstore and the keyvalue store as essentially congruent approaches using database primitives (and I am interested in what you make of Russell Sears).  I'm skeptical of any hope of keeping things "simple."  Like Martin downthread, most systems I have seen (filers, ZFS) make use of a fast, durable commit log and then flex out...something else.

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309


----- Original Message -----
> From: "Sage Weil" <sweil@redhat.com>
> To: "John Spray" <jspray@redhat.com>
> Cc: "Ceph Development" <ceph-devel@vger.kernel.org>
> Sent: Tuesday, October 20, 2015 4:00:23 PM
> Subject: Re: newstore direction
> 
> On Tue, 20 Oct 2015, John Spray wrote:
> > On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil <sweil@redhat.com> wrote:
> > >  - We have to size the kv backend storage (probably still an XFS
> > > partition) vs the block storage.  Maybe we do this anyway (put metadata
> > > on
> > > SSD!) so it won't matter.  But what happens when we are storing gobs of
> > > rgw index data or cephfs metadata?  Suddenly we are pulling storage out
> > > of
> > > a different pool and those aren't currently fungible.
> > 
> > This is the concerning bit for me -- the other parts one "just" has to
> > get the code right, but this problem could linger and be something we
> > have to keep explaining to users indefinitely.  It reminds me of cases
> > in other systems where users had to make an educated guess about inode
> > size up front, depending on whether you're expecting to efficiently
> > store a lot of xattrs.
> > 
> > In practice it's rare for users to make these kinds of decisions well
> > up-front: it really needs to be adjustable later, ideally
> > automatically.  That could be pretty straightforward if the KV part
> > was stored directly on block storage, instead of having XFS in the
> > mix.  I'm not quite up with the state of the art in this area: are
> > there any reasonable alternatives for the KV part that would consume
> > some defined range of a block device from userspace, instead of
> > sitting on top of a filesystem?
> 
> I agree: this is my primary concern with the raw block approach.
> 
> There are some KV alternatives that could consume block, but the problem
> would be similar: we need to dynamically size up or down the kv portion of
> the device.
> 
> I see two basic options:
> 
> 1) Wire into the Env abstraction in rocksdb to provide something just
> smart enough to let rocksdb work.  It isn't much: named files (not that
> many--we could easily keep the file table in ram), always written
> sequentially, to be read later with random access. All of the code is
> written around abstractions of SequentialFileWriter so that everything
> posix is neatly hidden in env_posix (and there are various other env
> implementations for in-memory mock tests etc.).
> 
> 2) Use something like dm-thin to sit between the raw block device and XFS
> (for rocksdb) and the block device consumed by newstore.  As long as XFS
> doesn't fragment horrifically (it shouldn't, given we *always* write ~4mb
> files in their entirety) we can fstrim and size down the fs portion.  If
> we similarly make newstores allocator stick to large blocks only we would
> be able to size down the block portion as well.  Typical dm-thin block
> sizes seem to range from 64KB to 512KB, which seems reasonable enough to
> me.  In fact, we could likely just size the fs volume at something
> conservatively large (like 90%) and rely on -o discard or periodic fstrim
> to keep its actual utilization in check.
> 
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: newstore direction
  2015-10-20 20:32         ` James (Fei) Liu-SSI
  2015-10-20 20:39           ` James (Fei) Liu-SSI
@ 2015-10-20 21:20           ` Sage Weil
  1 sibling, 0 replies; 71+ messages in thread
From: Sage Weil @ 2015-10-20 21:20 UTC (permalink / raw)
  To: James (Fei) Liu-SSI; +Cc: Somnath Roy, ceph-devel

On Tue, 20 Oct 2015, James (Fei) Liu-SSI wrote:
> Hi Sage, 
>    Sorry for confusing you. SSDs with key value interfaces are still 
> under development by several vendors.  It has totally different design 
> approach than Open Channel SSD. I met Matias several months ago and 
> discussed about possibilities to have key value interface support with 
> Open Channel SSD . I am not following the progress since then. If Matias 
> is in this group, He will definitely can give us better explanations. 
> Here is his presentation for key value support with open channel SSD for 
> your reference.
> 
> http://events.linuxfoundation.org/sites/events/files/slides/LightNVM-Vault2015.pdf

Ok cool.  I saw Matias' talk at Vault and was very pleased to see that 
there is some real effort to get away from black box FTLs.

And I am eagerly awaiting the arrival of SSDs with a kv interface... open 
channel especially, but even proprietary devices exposing kv would be an 
improvement over proprietary devices exposing block.  :)
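
For the sake of discussion, the kind of device interface being hoped for here might
look roughly like the sketch below.  This is purely hypothetical -- neither Ceph's
actual KeyValueDB interface nor any vendor's shipping API -- shown only to illustrate
the shape of the "glue" layer:

// Invented stand-in for whatever a KV-native SSD would expose.
#include <map>
#include <string>
#include <vector>

struct KvDevice {
  virtual ~KvDevice() = default;
  virtual bool begin() = 0;                               // start a device txn
  virtual void put(const std::string& k, const std::string& v) = 0;
  virtual void del(const std::string& k) = 0;
  virtual bool commit() = 0;                              // device-guaranteed atomicity
  virtual bool get(const std::string& k, std::string* v) = 0;
  // In-order range enumeration -- the piece Allen notes is missing from NVMKV,
  // but that an OSD backend needs for object listing and scrubbing.
  virtual std::vector<std::string> range(const std::string& lo,
                                         const std::string& hi) = 0;
};

// Thin adapter exposing a KeyValueDB-style transactional surface on top.
class KvStoreAdapter {
public:
  explicit KvStoreAdapter(KvDevice* dev) : dev_(dev) {}
  bool submit_transaction(const std::map<std::string, std::string>& sets,
                          const std::vector<std::string>& deletes) {
    if (!dev_->begin()) return false;
    for (auto& kv : sets) dev_->put(kv.first, kv.second);
    for (auto& k : deletes) dev_->del(k);
    return dev_->commit();                                // all-or-nothing
  }
  bool get(const std::string& k, std::string* v) { return dev_->get(k, v); }
private:
  KvDevice* dev_;
};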

sage


> 
> 
>   Regards,
>   James  
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com] 
> Sent: Tuesday, October 20, 2015 5:34 AM
> To: James (Fei) Liu-SSI
> Cc: Somnath Roy; ceph-devel@vger.kernel.org
> Subject: RE: newstore direction
> 
> On Mon, 19 Oct 2015, James (Fei) Liu-SSI wrote:
> > Hi Sage and Somnath,
> >   In my humble opinion, There is another more aggressive solution than 
> > raw block device base keyvalue store as backend for objectstore. The 
> > new key value SSD device with transaction support would be ideal to 
> > solve the issues. First of all, it is raw SSD device. Secondly , It 
> > provides key value interface directly from SSD. Thirdly, it can 
> > provide transaction support, consistency will be guaranteed by hardware device.
> > It pretty much satisfied all of objectstore needs without any extra 
> > overhead since there is not any extra layer in between device and 
> > objectstore.
> 
> Are you talking about open channel SSDs?  Or something else?  Everything I'm familiar with that is currently shipping is exposing a vanilla block interface (conventional SSDs) that hides all of that or NVMe (which isn't much better).
> 
> If there is a low-level KV interface we can consume that would be great--especially if we can glue it to our KeyValueDB abstract API.  Even so, we need to make sure that the object *data* also has an efficient API we can utilize that efficiently handles block-sized/aligned data.
> 
> sage
> 
> 
> >    Either way, I strongly support to have CEPH own data format instead 
> > of relying on filesystem.
> > 
> >   Regards,
> >   James
> > 
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org 
> > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Monday, October 19, 2015 1:55 PM
> > To: Somnath Roy
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: newstore direction
> > 
> > On Mon, 19 Oct 2015, Somnath Roy wrote:
> > > Sage,
> > > I fully support that.  If we want to saturate SSDs , we need to get 
> > > rid of this filesystem overhead (which I am in process of measuring).
> > > Also, it will be good if we can eliminate the dependency on the k/v 
> > > dbs (for storing allocators and all). The reason is the unknown 
> > > write amps they causes.
> > 
> > My hope is to keep behing the KeyValueDB interface (and/more change it 
> > as
> > appropriate) so that other backends can be easily swapped in (e.g. a btree-based one for high-end flash).
> > 
> > sage
> > 
> > 
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > 
> > > -----Original Message-----
> > > From: ceph-devel-owner@vger.kernel.org 
> > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> > > Sent: Monday, October 19, 2015 12:49 PM
> > > To: ceph-devel@vger.kernel.org
> > > Subject: newstore direction
> > > 
> > > The current design is based on two simple ideas:
> > > 
> > >  1) a key/value interface is better way to manage all of our 
> > > internal metadata (object metadata, attrs, layout, collection 
> > > membership, write-ahead logging, overlay data, etc.)
> > > 
> > >  2) a file system is well suited for storage object data (as files).
> > > 
> > > So far 1 is working out well, but I'm questioning the wisdom of #2.  
> > > A few
> > > things:
> > > 
> > >  - We currently write the data to the file, fsync, then commit the 
> > > kv transaction.  That's at least 3 IOs: one for the data, one for 
> > > the fs journal, one for the kv txn to commit (at least once my 
> > > rocksdb changes land... the kv commit is currently 2-3).  So two 
> > > people are managing metadata, here: the fs managing the file 
> > > metadata (with its own
> > > journal) and the kv backend (with its journal).
> > > 
> > >  - On read we have to open files by name, which means traversing the fs namespace.  Newstore tries to keep it as flat and simple as possible, but at a minimum it is a couple btree lookups.  We'd love to use open by handle (which would reduce this to 1 btree traversal), but running the daemon as ceph and not root makes that hard...
> > > 
> > >  - ...and file systems insist on updating mtime on writes, even when it is a overwrite with no allocation changes.  (We don't care about mtime.) O_NOCMTIME patches exist but it is hard to get these past the kernel brainfreeze.
> > > 
> > >  - XFS is (probably) never going going to give us data checksums, which we want desperately.
> > > 
> > > But what's the alternative?  My thought is to just bite the bullet and consume a raw block device directly.  Write an allocator, hopefully keep it pretty simple, and manage it in kv store along with all of our other metadata.
> > > 
> > > Wins:
> > > 
> > >  - 2 IOs for most: one to write the data to unused space in the block device, one to commit our transaction (vs 4+ before).  For overwrites, we'd have one io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ before).
> > > 
> > >  - No concern about mtime getting in the way
> > > 
> > >  - Faster reads (no fs lookup)
> > > 
> > >  - Similarly sized metadata for most objects.  If we assume most objects are not fragmented, then the metadata to store the block offsets is about the same size as the metadata to store the filenames we have now.
> > > 
> > > Problems:
> > > 
> > >  - We have to size the kv backend storage (probably still an XFS
> > > partition) vs the block storage.  Maybe we do this anyway (put 
> > > metadata on
> > > SSD!) so it won't matter.  But what happens when we are storing gobs of rgw index data or cephfs metadata?  Suddenly we are pulling storage out of a different pool and those aren't currently fungible.
> > > 
> > >  - We have to write and maintain an allocator.  I'm still optimistic this can be reasonbly simple, especially for the flash case (where fragmentation isn't such an issue as long as our blocks are reasonbly sized).  For disk we may beed to be moderately clever.
> > > 
> > >  - We'll need a fsck to ensure our internal metadata is consistent.  The good news is it'll just need to validate what we have stored in the kv store.
> > > 
> > > Other thoughts:
> > > 
> > >  - We might want to consider whether dm-thin or bcache or other block layers might help us with elasticity of file vs block areas.
> > > 
> > >  - Rocksdb can push colder data to a second directory, so we could 
> > > have a fast ssd primary area (for wal and most metadata) and a 
> > > second hdd directory for stuff it has to push off.  Then have a 
> > > conservative amount of file space on the hdd.  If our block fills 
> > > up, use the existing file mechanism to put data there too.  (But 
> > > then we have to maintain both the current kv + file approach and not 
> > > go all-in on kv +
> > > block.)
> > > 
> > > Thoughts?
> > > sage
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > > info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > > info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> 
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-20 19:44   ` Sage Weil
@ 2015-10-20 21:43     ` Ric Wheeler
  0 siblings, 0 replies; 71+ messages in thread
From: Ric Wheeler @ 2015-10-20 21:43 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On 10/20/2015 03:44 PM, Sage Weil wrote:
> On Tue, 20 Oct 2015, Ric Wheeler wrote:
>> On 10/19/2015 03:49 PM, Sage Weil wrote:
>>> The current design is based on two simple ideas:
>>>
>>>    1) a key/value interface is better way to manage all of our internal
>>> metadata (object metadata, attrs, layout, collection membership,
>>> write-ahead logging, overlay data, etc.)
>>>
>>>    2) a file system is well suited for storage object data (as files).
>>>
>>> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
>>> things:
>>>
>>>    - We currently write the data to the file, fsync, then commit the kv
>>> transaction.  That's at least 3 IOs: one for the data, one for the fs
>>> journal, one for the kv txn to commit (at least once my rocksdb changes
>>> land... the kv commit is currently 2-3).  So two people are managing
>>> metadata, here: the fs managing the file metadata (with its own
>>> journal) and the kv backend (with its journal).
>> If all of the fsync()'s fall into the same backing file system, are you sure
>> that each fsync() takes the same time? Depending on the local FS
>> implementation of course, but the order of issuing those fsync()'s can
>> effectively make some of them no-ops.
> Surely, yes, but the fact remains we are maintaining two journals: one
> internal to the fs that manages the allocation metadata, and one layered
> on top that handles the kv store's write stream.  The lower bound on any
> write is 3 IOs (unless we're talking about a COW fs).

The way storage devices work means that if we can batch these in some way, we 
might get 3 IOs that land in the cache (even for spinning drives) and only one 
that is followed by a cache flush.

The first three IOs are quite quick; you don't need to write through to the 
platter. The cost is mostly in the fsync() call, which waits until storage 
destages the cache to the platter.

With SSDs, we have some different considerations.
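
The pattern being described is roughly the one sketched below: queue the writes
back to back and pay for a single flush at the end.  This is only an illustration
(plain POSIX calls, minimal error handling), not how the OSD actually issues IO:

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
  // "scratch.bin" is a stand-in for the data file / journal on the same fs.
  int fd = open("scratch.bin", O_WRONLY | O_CREAT, 0644);
  if (fd < 0) { perror("open"); return 1; }

  char buf[4096];
  memset(buf, 0xab, sizeof(buf));

  // Three back-to-back writes: think data block, fs journal record, kv journal
  // record.  On a drive with a write cache these typically land in the cache...
  for (off_t off = 0; off < 3 * 4096; off += 4096)
    if (pwrite(fd, buf, sizeof(buf), off) < 0) perror("pwrite");

  // ...and the real cost is paid here, when the cache is flushed/destaged.
  if (fdatasync(fd) != 0) perror("fdatasync");

  close(fd);
  return 0;
}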

>
>>>    - On read we have to open files by name, which means traversing the fs
>>> namespace.  Newstore tries to keep it as flat and simple as possible, but
>>> at a minimum it is a couple btree lookups.  We'd love to use open by
>>> handle (which would reduce this to 1 btree traversal), but running
>>> the daemon as ceph and not root makes that hard...
>> This seems like a a pretty low hurdle to overcome.
> I wish you luck convincing upstream to allow unprivileged access to
> open_by_handle or the XFS ioctl.  :)  But even if we had that, any object
> access requires multiple metadata lookups: one in our kv db, and a second
> to get the inode for the backing file.  Again, there's an unnecessary
> lower bound on the number of IOs needed to access a cold object.

We should dig into what this actually means when you can do open by handle.  If 
you cache the inode (i.e., skip the directory traversal), you still need to 
figure out the mapping back to an actual block on the storage device.  It is not 
clear to me that you need more IOs with the file system doing this than with a 
btree on disk - both will require IO.

>
>>>    - ...and file systems insist on updating mtime on writes, even when it is
>>> a overwrite with no allocation changes.  (We don't care about mtime.)
>>> O_NOCMTIME patches exist but it is hard to get these past the kernel
>>> brainfreeze.
>> Are you using O_DIRECT? Seems like there should be some enterprisey database
>> tricks that we can use here.
> It's not about about the data path, but avoiding the useless bookkeeping
> the file system is doing that we don't want or need.  See the recent
> recent reception of Zach's O_NOCMTIME patches on linux-fsdevel:
>
> 	http://marc.info/?t=143094969800001&r=1&w=2
>
> I'm generally an optimist when it comes to introducing new APIs upstream,
> but I still found this to be an unbelievingly frustrating exchange.

We should talk more about this with the local FS people. Might be other ways to 
solve this.

>
>>>    - XFS is (probably) never going going to give us data checksums, which we
>>> want desperately.
>> What is the goal of having the file system do the checksums? How strong do
>> they need to be and what size are the chunks?
>>
>> If you update this on each IO, this will certainly generate more IO (each
>> write will possibly generate at least one other write to update that new
>> checksum).
> Not if we keep the checksums with the allocation metadata, in the
> onode/inode, which we're also doing and IO to persist.  But whther that is
> practial depends on the granularity (4KB or 16K or 128K or ...), which may
> in turn depend on the object (RBD block that'll service random 4K reads
> and writes?  or RGW fragment that is always written sequentially?).  I'm
> highly skeptical we'd ever get anything from a general-purpose file system
> that would work well here (if anything at all).

XFS (or device mapper) could also store checksums per block. I think that the 
T10 DIF/DIX bits work for enterprise databases (again, bypassing the file 
system). Might be interesting to see if we could put the checksums into dm-thin.

>
>>> But what's the alternative?  My thought is to just bite the bullet and
>>> consume a raw block device directly.  Write an allocator, hopefully keep
>>> it pretty simple, and manage it in kv store along with all of our other
>>> metadata.
>> The big problem with consuming block devices directly is that you ultimately
>> end up recreating most of the features that you had in the file system. Even
>> enterprise databases like Oracle and DB2 have been migrating away from running
>> on raw block devices in favor of file systems over time.  In effect, you are
>> looking at making a simple on disk file system which is always easier to start
>> than it is to get back to a stable, production ready state.
> This was why we abandoned ebofs ~4 years ago... btrfs had arrived and had
> everything we were implementing and more: mainly, copy on write and data
> checksums.  But in practice the fact that its general purpose means it
> targets a very different workloads and APIs than what we need.
>
> Now that I've realized the POSIX file namespace is a bad fit for what we
> need and opted to manage that directly, things are vastly simpler: we no
> longer have the horrific directory hashing tricks to allow PG splits (not
> because we are scared of big directories but because we need ordered
> enumeration of objects) and the transactions have exactly the granularity
> we want.  In fact, it turns out that pretty much the *only* thing the file
> system provides that we need is block allocation; everything else is
> overhead we have to play tricks to work around (batched fsync, O_NOCMTIME,
> open by handle), or something that we want but the fs will likely never
> provide (like checksums).

Database people figured this all out on top of file systems a long time ago; I 
think that we are looking at solving a solved problem here.

>
>> I think that it might be quicker and more maintainable to spend some time
>> working with the local file system people (XFS or other) to see if we can
>> jointly address the concerns you have.
> I have been, in cases where what we want is something that makes sense for
> other file system users.  But mostly I think that the problem is more
> that what we want isn't a file system, but an allocator + block device.

(Broken record) The local fs community already deals with enterprise database 
needs, and they are treated as special cases.

>
> And the end result is that slotting a file system into the stack puts an
> upper bound on our performance.  On its face this isn't surprising, but
> I'm running up against it in gory detail in my efforts to make the Ceph
> OSD faster, and the question becomes whether we want to be fast or
> layered.  (I don't think 'simple' is really an option given the effort to
> work around the POSIX vs ObjectStore impedence mismatch.)

The goal of file systems is to make the underlying storage device the bound on 
performance for IO operations. True, you pay something for metadata updates, but 
you would end up doing that in any case.

That should not be a big deal for ceph I think.

>
>> I really hate the idea of making a new file system type (even if we call it a
>> raw block store!).
> Just to be clear, this isn't a new kernel file system--it's userland
> consuming a block device (ala oracle).  (But yeah, I hate it too.)

Once you need a new fsck-like utility, you *are* a file system :)  
(dm-thin has one; it is in effect a file system as well).

>
>> In addition to the technical hurdles, there are also production worries like
>> how long will it take for distros to pick up formal support?  How do we test
>> it properly?
> This actually means less for the distros to support: we'll consume
> /dev/sdb instead of an XFS mount.  Testing will be the same as before...
> the usual forced-kill and power cycle testing under the stress and
> correctness testing workloads.
>
> What we (Ceph) will support in its place will be a combination of a kv
> store (which we already need) and a block allocator.
>
>

If you are a kernel driver, you need to convince each distro to enable any 
kernel module that you need.  If it stays in user space, you need to get a 
non-root process access to a block device.

Ric


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-20 20:36     ` Gregory Farnum
@ 2015-10-20 21:47       ` Sage Weil
  2015-10-20 22:23         ` Ric Wheeler
  0 siblings, 1 reply; 71+ messages in thread
From: Sage Weil @ 2015-10-20 21:47 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: John Spray, Ceph Development

On Tue, 20 Oct 2015, Gregory Farnum wrote:
> On Tue, Oct 20, 2015 at 12:44 PM, Sage Weil <sweil@redhat.com> wrote:
> > On Tue, 20 Oct 2015, Ric Wheeler wrote:
> >> The big problem with consuming block devices directly is that you ultimately
> >> end up recreating most of the features that you had in the file system. Even
> >> enterprise databases like Oracle and DB2 have been migrating away from running
> >> on raw block devices in favor of file systems over time.  In effect, you are
> >> looking at making a simple on disk file system which is always easier to start
> >> than it is to get back to a stable, production ready state.
> >
> > This was why we abandoned ebofs ~4 years ago... btrfs had arrived and had
> > everything we were implementing and more: mainly, copy on write and data
> > checksums.  But in practice the fact that its general purpose means it
> > targets a very different workloads and APIs than what we need.
> 
> Try 7 years since ebofs...

Sigh...

> That's one of my concerns, though. You ditched ebofs once already
> because it had metastasized into an entire FS, and had reached its
> limits of maintainability. What makes you think a second time through
> would work better? :/

A fair point, and I've given this some thought:

1) We know a *lot* more about our workload than I did in 2005.  The things 
I was worrying about then (fragmentation, mainly) are much easier to 
address now, where we have hints from rados and understand what the write 
patterns look like in practice (randomish 4k-128k ios for rbd, sequential 
writes for rgw, and the cephfs wildcard).

2) Most of the ebofs effort was around doing copy-on-write btrees (with 
checksums) and orchestrating commits.  Here our job is *vastly* simplified 
by assuming the existence of a transactional key/value store.  If you look 
at newstore today, we're already half-way through dealing with the 
complexity of doing allocations... we're essentially "allocating" blocks 
that are 1 MB files on XFS, managing that metadata, and overwriting or 
replacing those blocks on write/truncate/clone.  By the time we add in an 
allocator (get_blocks(len), free_block(offset, len)) and rip out all the 
file handling fiddling (like fsync workqueues, file id allocator, 
file truncation fiddling, etc.) we'll probably have something working 
with about the same amount of code we have now.  (Of course, that'll 
grow as we get more sophisticated, but that'll happen either way.)

> On Mon, Oct 19, 2015 at 12:49 PM, Sage Weil <sweil@redhat.com> wrote:
> >  - 2 IOs for most: one to write the data to unused space in the block
> > device, one to commit our transaction (vs 4+ before).  For overwrites,
> > we'd have one io to do our write-ahead log (kv journal), then do
> > the overwrite async (vs 4+ before).
> 
> I can't work this one out. If you're doing one write for the data and
> one for the kv journal (which is on another filesystem), how does the
> commit sequence work that it's only 2 IOs instead of the same 3 we
> already have? Or are you planning to ditch the LevelDB/RocksDB store
> for our journaling and just use something within the block layer?

Now:
    1 io  to write a new file
  1-2 ios to sync the fs journal (commit the inode, alloc change) 
          (I see 2 journal IOs on XFS and only 1 on ext4...)
    1 io  to commit the rocksdb journal (currently 3, but will drop to 
          1 with xfs fix and my rocksdb change)

With block:
    1 io to write to block device
    1 io to commit to rocksdb journal

> If we do want to go down this road, we shouldn't need to write an
> allocator from scratch. I don't remember exactly which ones it is but
> we've read/seen at least a few storage papers where people have reused
> existing allocators -- I think the one from ext2? And somebody managed
> to get it running in userspace.

Maybe, but the real win is when we combine the allocator state update with 
our kv transaction.  Even if we adopt an existing algorithm we'll need to 
do some significant rejiggering to persist it in the kv store.

My thought is start with something simple that works (e.g., linear sweep 
over free space, simple interval_set<>-style freelist) and once it works 
look at existing state of the art for a clever v2.
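
A minimal sketch of that interval-set-style freelist, with the
get_blocks()/free_block() shape mentioned upthread, might look like the
following.  This is illustrative only -- persisting the map as part of the kv
transaction (the real win described above) is left out:

#include <cstdint>
#include <iterator>
#include <map>
#include <optional>

class SimpleFreelist {
public:
  explicit SimpleFreelist(uint64_t device_size) { free_[0] = device_size; }

  // First-fit linear sweep over free extents; returns the allocated offset.
  std::optional<uint64_t> get_blocks(uint64_t len) {
    for (auto it = free_.begin(); it != free_.end(); ++it) {
      if (it->second < len) continue;
      uint64_t off = it->first;
      uint64_t remaining = it->second - len;
      free_.erase(it);
      if (remaining) free_[off + len] = remaining;   // keep the tail free
      return off;
    }
    return std::nullopt;                             // no extent big enough
  }

  // Return an extent to the freelist, merging with neighbours so the map
  // stays a set of maximal free intervals (double-frees are not checked).
  void free_block(uint64_t off, uint64_t len) {
    auto next = free_.lower_bound(off);
    if (next != free_.begin()) {
      auto prev = std::prev(next);
      if (prev->first + prev->second == off) {       // merge with left neighbour
        off = prev->first;
        len += prev->second;
        free_.erase(prev);
      }
    }
    if (next != free_.end() && off + len == next->first) {  // merge right
      len += next->second;
      free_.erase(next);
    }
    free_[off] = len;
  }

private:
  std::map<uint64_t, uint64_t> free_;                // offset -> length
};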

BTW, I suspect a modest win here would be to simply use the collection/pg 
as a hint for storing related objects.  That's the best indicator we have 
for aligned lifecycle (think PG migrations/deletions vs flash erase 
blocks).  Good luck plumbing that through XFS...

> Of course, then we also need to figure out how to get checksums on the
> block data, since if we're going to put in the effort to reimplement
> this much of the stack we'd better get our full data integrity
> guarantees along with it!

YES!

Here I think we should make judicious use of the rados hints.  For 
example, rgw always writes complete objects, so we can have coarse 
granularity crcs and only pay for very small reads (that have to make 
slightly larger reads for crc verification).  On RBD... we might opt to be 
opportunistic with the write pattern (if the write was 4k, store the crc 
at small granularity), otherwise use a larger one.  Maybe.  In any case, 
we have a lot more flexibility than we would if trying to plumb this 
through the VFS and a file system.
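
As a rough illustration of picking the checksum granularity from the write
pattern, something like the sketch below would do.  The hint enum and chunk
sizes are invented for the example, and csum32() is just a stand-in for a real
crc32c:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

static uint32_t csum32(const char* p, size_t n) {   // placeholder, not crc32c
  uint32_t h = 2166136261u;                          // FNV-1a, good enough here
  for (size_t i = 0; i < n; ++i) { h ^= (uint8_t)p[i]; h *= 16777619u; }
  return h;
}

enum class WriteHint { SEQUENTIAL_WHOLE_OBJECT, RANDOM_SMALL };  // invented

struct ObjectCsums {
  uint32_t chunk_size;                // granularity chosen at write time
  std::vector<uint32_t> chunks;       // one checksum per chunk, kept with the onode

  // rgw-style whole-object writes can afford coarse chunks; rbd-style 4k
  // random writes want fine ones so small reads verify only what they read.
  explicit ObjectCsums(WriteHint hint)
    : chunk_size(hint == WriteHint::SEQUENTIAL_WHOLE_OBJECT ? 128 * 1024
                                                            : 4 * 1024) {}

  void write(const std::string& data) {             // (re)checksum whole object
    chunks.clear();
    for (size_t off = 0; off < data.size(); off += chunk_size)
      chunks.push_back(csum32(data.data() + off,
                              std::min<size_t>(chunk_size, data.size() - off)));
  }

  // Verify a read by rounding it out to chunk boundaries; a small read against
  // coarse chunks pays for a slightly larger read, as described above.
  // (data is the full object contents here, purely to keep the sketch short.)
  bool verify(const std::string& data, size_t off, size_t len) const {
    size_t first = off / chunk_size;
    size_t last  = (off + len + chunk_size - 1) / chunk_size;
    for (size_t c = first; c < last && c < chunks.size(); ++c) {
      size_t coff = c * chunk_size;
      size_t clen = std::min<size_t>(chunk_size, data.size() - coff);
      if (csum32(data.data() + coff, clen) != chunks[c]) return false;
    }
    return true;
  }
};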

> > I see two basic options:
> >
> > 1) Wire into the Env abstraction in rocksdb to provide something just
> > smart enough to let rocksdb work.  It isn't much: named files (not that
> > many--we could easily keep the file table in ram), always written
> > sequentially, to be read later with random access. All of the code is
> > written around abstractions of SequentialFileWriter so that everything
> > posix is neatly hidden in env_posix (and there are various other env
> > implementations for in-memory mock tests etc.).
> 
> This seems like the obviously correct move to me? Except we might want
> to include the rocksdb store on flash instead of hard drives, which
> means maybe we do want some unified storage system which can handle
> multiple physical storage devices as a single piece of storage space.
> (Not that any of those exist in "almost done" hell, or that we're
> going through requirements expansion or anything!)

Yeah, I mostly agree.  It's just more work.  And rocks, for example, 
already has some provisions for managing different storage pools: one for 
wal, one for main ssts, one for cold ssts.  And the same Env is used for 
all three, which means we'd run our toy fs backend even for the flash 
portion.  (Which, if it works, is probably good anyway for performance and 
operational simplicity.  One less thing in the stack to break.)
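
For reference, the rocksdb knobs being referred to here are, as far as I
understand the API of that era (exact option names may differ between
releases), wal_dir and db_paths; the paths below are placeholders:

#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  // Write-ahead log on the fast (ssd) area.
  options.wal_dir = "/srv/osd0/ssd/wal";

  // Hot ssts on ssd up to ~10GB; colder data spills to the hdd path.
  options.db_paths.push_back(rocksdb::DbPath("/srv/osd0/ssd/db", 10ull << 30));
  options.db_paths.push_back(rocksdb::DbPath("/srv/osd0/hdd/db", 1ull << 40));

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/srv/osd0/ssd/db", &db);
  if (!s.ok()) return 1;
  delete db;
  return 0;
}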

It also ties us to rocksdb, and/or whatever other backends we specifically 
support.  Right now you can trivially swap in leveldb and everything works 
the same.  OTOH there is an alternative btree-based kv store I'm 
considering that does much better on flash and consumes block 
directly.  Making it share a device with newstore will be interesting.  
So regardless we'll probably have a pretty short list of kv backends that 
we care about...

sage

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-20 21:47       ` Sage Weil
@ 2015-10-20 22:23         ` Ric Wheeler
  2015-10-21 13:32           ` Sage Weil
  0 siblings, 1 reply; 71+ messages in thread
From: Ric Wheeler @ 2015-10-20 22:23 UTC (permalink / raw)
  To: Sage Weil, Gregory Farnum; +Cc: John Spray, Ceph Development

On 10/20/2015 05:47 PM, Sage Weil wrote:
> On Tue, 20 Oct 2015, Gregory Farnum wrote:
>> On Tue, Oct 20, 2015 at 12:44 PM, Sage Weil <sweil@redhat.com> wrote:
>>> On Tue, 20 Oct 2015, Ric Wheeler wrote:
>>>> The big problem with consuming block devices directly is that you ultimately
>>>> end up recreating most of the features that you had in the file system. Even
>>>> enterprise databases like Oracle and DB2 have been migrating away from running
>>>> on raw block devices in favor of file systems over time.  In effect, you are
>>>> looking at making a simple on disk file system which is always easier to start
>>>> than it is to get back to a stable, production ready state.
>>> This was why we abandoned ebofs ~4 years ago... btrfs had arrived and had
>>> everything we were implementing and more: mainly, copy on write and data
>>> checksums.  But in practice the fact that its general purpose means it
>>> targets a very different workloads and APIs than what we need.
>> Try 7 years since ebofs...
> Sigh...
>
>> That's one of my concerns, though. You ditched ebofs once already
>> because it had metastasized into an entire FS, and had reached its
>> limits of maintainability. What makes you think a second time through
>> would work better? :/
> A fair point, and I've given this some thought:
>
> 1) We know a *lot* more about our workload than I did in 2005.  The things
> I was worrying about then (fragmentation, mainly) are much easier to
> address now, where we have hints from rados and understand what the write
> patterns look like in practice (randomish 4k-128k ios for rbd, sequential
> writes for rgw, and the cephfs wildcard).
>
> 2) Most of the ebofs effort was around doing copy-on-write btrees (with
> checksums) and orchestrating commits.  Here our job is *vastly* simplified
> by assuming the existence of a transactional key/value store.  If you look
> at newstore today, we're already half-way through dealing with the
> complexity of doing allocations... we're essentially "allocating" blocks
> that are 1 MB files on XFS, managing that metadata, and overwriting or
> replacing those blocks on write/truncate/clone.  By the time we add in an
> allocator (get_blocks(len), free_block(offset, len)) and rip out all the
> file handling fiddling (like fsync workqueues, file id allocator,
> file truncation fiddling, etc.) we'll probably have something working
> with about the same amount of code we have now.  (Of course, that'll
> grow as we get more sophisticated, but that'll happen either way.)
>
>> On Mon, Oct 19, 2015 at 12:49 PM, Sage Weil <sweil@redhat.com> wrote:
>>>   - 2 IOs for most: one to write the data to unused space in the block
>>> device, one to commit our transaction (vs 4+ before).  For overwrites,
>>> we'd have one io to do our write-ahead log (kv journal), then do
>>> the overwrite async (vs 4+ before).
>> I can't work this one out. If you're doing one write for the data and
>> one for the kv journal (which is on another filesystem), how does the
>> commit sequence work that it's only 2 IOs instead of the same 3 we
>> already have? Or are you planning to ditch the LevelDB/RocksDB store
>> for our journaling and just use something within the block layer?
> Now:
>      1 io  to write a new file
>    1-2 ios to sync the fs journal (commit the inode, alloc change)
>            (I see 2 journal IOs on XFS and only 1 on ext4...)
>      1 io  to commit the rocksdb journal (currently 3, but will drop to
>            1 with xfs fix and my rocksdb change)

I think that might be too pessimistic - the number of discrete IOs sent down to 
a spinning disk has much less impact on performance than the number of 
fsync()'s, since the IOs all land in the write cache.  Some newer spinning 
drives have a non-volatile write cache, so even an fsync() might not end up 
doing the expensive data transfer to the platter.

It would be interesting to get the timings on the IO's you see to measure the 
actual impact.
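
A quick-and-dirty way to get those timings might be something like the
following.  Illustrative only: a real test would use O_DIRECT/O_DSYNC variants,
multiple runs, and the actual devices in question rather than a scratch file:

#include <fcntl.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>
#include <cstring>

int main() {
  int fd = open("timing.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) { perror("open"); return 1; }

  char buf[4096];
  memset(buf, 0x5a, sizeof(buf));

  using stopwatch = std::chrono::steady_clock;

  auto t0 = stopwatch::now();
  for (int i = 0; i < 3; ++i)                       // the "3 IOs" case
    if (pwrite(fd, buf, sizeof(buf), (off_t)i * 4096) < 0) perror("pwrite");
  auto t1 = stopwatch::now();
  fdatasync(fd);                                    // the cache flush / destage
  auto t2 = stopwatch::now();

  auto us = [](auto a, auto b) {
    return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
  };
  printf("writes: %lld us, fdatasync: %lld us\n",
         (long long)us(t0, t1), (long long)us(t1, t2));
  close(fd);
  return 0;
}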


>
> With block:
>      1 io to write to block device
>      1 io to commit to rocksdb journal
>
>> If we do want to go down this road, we shouldn't need to write an
>> allocator from scratch. I don't remember exactly which ones it is but
>> we've read/seen at least a few storage papers where people have reused
>> existing allocators -- I think the one from ext2? And somebody managed
>> to get it running in userspace.
> Maybe, but the real win is when we combine the allocator state update with
> our kv transaction.  Even if we adopt an existing algorithm we'll need to
> do some significant rejiggering to persist it in the kv store.
>
> My thought is start with something simple that works (e.g., linear sweep
> over free space, simple interval_set<>-style freelist) and once it works
> look at existing state of the art for a clever v2.
>
> BTW, I suspect a modest win here would be to simply use the collection/pg
> as a hint for storing related objects.  That's the best indicator we have
> for aligned lifecycle (think PG migrations/deletions vs flash erase
> blocks).  Good luck plumbing that through XFS...
>
>> Of course, then we also need to figure out how to get checksums on the
>> block data, since if we're going to put in the effort to reimplement
>> this much of the stack we'd better get our full data integrity
>> guarantees along with it!
> YES!
>
> Here I think we should make judicious use of the rados hints.  For
> example, rgw always writes complete objects, so we can have coarse
> granularity crcs and only pay for very small reads (that have to make
> slightly larger reads for crc verification).  On RBD... we might opt to be
> opportunistic with the write pattern (if the write was 4k, store the crc
> at small granularity), otherwise use a larger one.  Maybe.  In any case,
> we have a lot more flexibility than we would if trying to plumb this
> through the VFS and a file system.

Plumbing for T10 DIF/DIX already exists; what is missing is a normal block 
device that handles them (as opposed to enterprise SAS / disk-array class devices).

ric

>
>>> I see two basic options:
>>>
>>> 1) Wire into the Env abstraction in rocksdb to provide something just
>>> smart enough to let rocksdb work.  It isn't much: named files (not that
>>> many--we could easily keep the file table in ram), always written
>>> sequentially, to be read later with random access. All of the code is
>>> written around abstractions of SequentialFileWriter so that everything
>>> posix is neatly hidden in env_posix (and there are various other env
>>> implementations for in-memory mock tests etc.).
>> This seems like the obviously correct move to me? Except we might want
>> to include the rocksdb store on flash instead of hard drives, which
>> means maybe we do want some unified storage system which can handle
>> multiple physical storage devices as a single piece of storage space.
>> (Not that any of those exist in "almost done" hell, or that we're
>> going through requirements expansion or anything!)
> Yeah, I mostly agree.  It's just more work.  And rocks, for example,
> already has some provisions for managing different storage pools: one for
> wal, one for main ssts, one for cold ssts.  And the same Env is used for
> all three, which means we'd run our toy fs backend even for the flash
> portion.  (Which, if it works, is probably good anyway for performance and
> operational simplicity.  One less thing in the stack to break.)
>
> It also ties us to rocksdb, and/or whatever other backends we specifically
> support.  Right now you can trivially swap in leveldb and everything works
> the same.  OTOH there is an alternative btree-based kv store I'm
> considering about that does much better on flash and consumes block
> directly.  Making it share a device with newstore will be interesting.
> So regardless we'll probably have a pretty short list of kv backends that
> we care about...
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-20 18:31 ` Ric Wheeler
  2015-10-20 19:44   ` Sage Weil
  2015-10-20 19:44   ` Yehuda Sadeh-Weinraub
@ 2015-10-21  8:22   ` Orit Wasserman
  2015-10-21 11:18     ` Ric Wheeler
  2015-10-21 10:06   ` Allen Samuels
  3 siblings, 1 reply; 71+ messages in thread
From: Orit Wasserman @ 2015-10-21  8:22 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Sage Weil, ceph-devel

On Tue, 2015-10-20 at 14:31 -0400, Ric Wheeler wrote:
> On 10/19/2015 03:49 PM, Sage Weil wrote:
> > The current design is based on two simple ideas:
> >
> >   1) a key/value interface is better way to manage all of our internal
> > metadata (object metadata, attrs, layout, collection membership,
> > write-ahead logging, overlay data, etc.)
> >
> >   2) a file system is well suited for storage object data (as files).
> >
> > So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
> > things:
> >
> >   - We currently write the data to the file, fsync, then commit the kv
> > transaction.  That's at least 3 IOs: one for the data, one for the fs
> > journal, one for the kv txn to commit (at least once my rocksdb changes
> > land... the kv commit is currently 2-3).  So two people are managing
> > metadata, here: the fs managing the file metadata (with its own
> > journal) and the kv backend (with its journal).
> 
> If all of the fsync()'s fall into the same backing file system, are you sure 
> that each fsync() takes the same time? Depending on the local FS implementation 
> of course, but the order of issuing those fsync()'s can effectively make some of 
> them no-ops.
> 
> >
> >   - On read we have to open files by name, which means traversing the fs
> > namespace.  Newstore tries to keep it as flat and simple as possible, but
> > at a minimum it is a couple btree lookups.  We'd love to use open by
> > handle (which would reduce this to 1 btree traversal), but running
> > the daemon as ceph and not root makes that hard...
> 
> This seems like a a pretty low hurdle to overcome.
> 
> >
> >   - ...and file systems insist on updating mtime on writes, even when it is
> > a overwrite with no allocation changes.  (We don't care about mtime.)
> > O_NOCMTIME patches exist but it is hard to get these past the kernel
> > brainfreeze.
> 
> Are you using O_DIRECT? Seems like there should be some enterprisey database 
> tricks that we can use here.
> 
> >
> >   - XFS is (probably) never going going to give us data checksums, which we
> > want desperately.
> 
> What is the goal of having the file system do the checksums? How strong do they 
> need to be and what size are the chunks?
> 
> If you update this on each IO, this will certainly generate more IO (each write 
> will possibly generate at least one other write to update that new checksum).
> 
> >
> > But what's the alternative?  My thought is to just bite the bullet and
> > consume a raw block device directly.  Write an allocator, hopefully keep
> > it pretty simple, and manage it in kv store along with all of our other
> > metadata.
> 
> The big problem with consuming block devices directly is that you ultimately end 
> up recreating most of the features that you had in the file system. Even 
> enterprise databases like Oracle and DB2 have been migrating away from running 
> on raw block devices in favor of file systems over time.  In effect, you are 
> looking at making a simple on disk file system which is always easier to start 
> than it is to get back to a stable, production ready state.

The best performance is still on a raw block device (SAN).
A file system simplifies operational tasks, which is worth the performance
penalty for a database; I don't think that is the case for a storage system.
In many cases they can use their own file system, tailored for the
database.

> I think that it might be quicker and more maintainable to spend some time 
> working with the local file system people (XFS or other) to see if we can 
> jointly address the concerns you have.
> >
> > Wins:
> >
> >   - 2 IOs for most: one to write the data to unused space in the block
> > device, one to commit our transaction (vs 4+ before).  For overwrites,
> > we'd have one io to do our write-ahead log (kv journal), then do
> > the overwrite async (vs 4+ before).
> >
> >   - No concern about mtime getting in the way
> >
> >   - Faster reads (no fs lookup)
> >
> >   - Similarly sized metadata for most objects.  If we assume most objects
> > are not fragmented, then the metadata to store the block offsets is about
> > the same size as the metadata to store the filenames we have now.
> >
> > Problems:
> >
> >   - We have to size the kv backend storage (probably still an XFS
> > partition) vs the block storage.  Maybe we do this anyway (put metadata on
> > SSD!) so it won't matter.  But what happens when we are storing gobs of
> > rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
> > a different pool and those aren't currently fungible.
> >
> >   - We have to write and maintain an allocator.  I'm still optimistic this
> > can be reasonbly simple, especially for the flash case (where
> > fragmentation isn't such an issue as long as our blocks are reasonbly
> > sized).  For disk we may beed to be moderately clever.
> >
> >   - We'll need a fsck to ensure our internal metadata is consistent.  The
> > good news is it'll just need to validate what we have stored in the kv
> > store.
> >
> > Other thoughts:
> >
> >   - We might want to consider whether dm-thin or bcache or other block
> > layers might help us with elasticity of file vs block areas.
> >
> >   - Rocksdb can push colder data to a second directory, so we could have a
> > fast ssd primary area (for wal and most metadata) and a second hdd
> > directory for stuff it has to push off.  Then have a conservative amount
> > of file space on the hdd.  If our block fills up, use the existing file
> > mechanism to put data there too.  (But then we have to maintain both the
> > current kv + file approach and not go all-in on kv + block.)
> >
> > Thoughts?
> > sage
> > --
> 
> I really hate the idea of making a new file system type (even if we call it a 
> raw block store!).
> 

This won't be a file system, just an allocator, which is a very small
part of a file system.

The benefits are not only in reducing the number of IO operations we
perform; we are also removing the file system stack overhead, which will
reduce our latency and make it more predictable.
Removing this layer will give us more control and allow other
optimizations we cannot do today.

I think this is more acute when taking SSD (and even faster
technologies) into account.

> In addition to the technical hurdles, there are also production worries like how 
> long will it take for distros to pick up formal support?  How do we test it 
> properly?
> 

This should be userspace only; I don't think we need it in the kernel
(we will need root access for opening the device).
For users that don't have root access we can use one big file and run
the same allocator inside it.  That can be good for testing too.
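
The nice property is that the open/IO path is identical either way; something
like the sketch below (paths are placeholders, error handling minimal) works
against either a raw device or a big preallocated file:

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int open_backing_store(const char* path) {
  // O_DIRECT to bypass the page cache in both cases; a file-backed store for
  // testing just needs to be preallocated (e.g. with fallocate/truncate).
  int fd = open(path, O_RDWR | O_DIRECT);
  if (fd < 0) perror("open backing store");
  return fd;
}

int main(int argc, char** argv) {
  const char* path = argc > 1 ? argv[1] : "./osd-data.img";  // or /dev/sdb
  int fd = open_backing_store(path);
  if (fd < 0) return 1;
  close(fd);
  return 0;
}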

As someone who has been part of such a move more than once (for example
at Exanet), I can say that the performance gain is very impressive, and
after the change we could remove many workarounds, which simplified the
code.

As the API should be small, the testing effort is reasonable; we do need
to test it well, as a bug in the allocator has really bad consequences.

We won't be able to match (or exceed) our competitors' performance
without making this effort ...

Orit

> Regards,
> 
> Ric
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: newstore direction
  2015-10-20 13:19           ` Mark Nelson
  2015-10-20 17:04             ` kernel neophyte
@ 2015-10-21 10:06             ` Allen Samuels
  2015-10-21 13:35               ` Mark Nelson
  1 sibling, 1 reply; 71+ messages in thread
From: Allen Samuels @ 2015-10-21 10:06 UTC (permalink / raw)
  To: Mark Nelson, Sage Weil, Chen, Xiaoxi
  Cc: James (Fei) Liu-SSI, Somnath Roy, ceph-devel

I doubt that NVMKV will be useful for two reasons:

(1) It relies on the unique sparse-mapping addressing capabilities of the FusionIO VSL interface, so it won't run on standard SSDs
(2) NVMKV doesn't provide any form of in-order enumeration (i.e., no range operations on keys). This is pretty much required for deep scrubbing.
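
For context on point (2): deep scrub needs to walk every object in a
placement group in a stable order, which on a sorted KV backend is just an
ordered prefix scan.  A rough illustration (the key layout and names here
are made up, not Ceph's actual schema):

// Illustrative only: a sorted KV store lets scrub enumerate all object
// records for a PG as one ordered prefix scan, something a hash-addressed
// store like NVMKV cannot offer.
#include <iostream>
#include <map>
#include <string>

int main() {
  // pretend keys are "<pg>!<object>" -> object metadata
  std::map<std::string, std::string> kv = {
    {"pg1.0!obj-a", "meta"}, {"pg1.0!obj-b", "meta"},
    {"pg1.1!obj-c", "meta"},
  };
  const std::string prefix = "pg1.0!";
  for (auto it = kv.lower_bound(prefix);
       it != kv.end() && it->first.compare(0, prefix.size(), prefix) == 0;
       ++it) {
    std::cout << "scrub " << it->first << "\n";  // checksum/compare here
  }
  return 0;
}

A hash-addressed store can answer point lookups, but it has no equivalent
of that ordered walk, which is why the lack of range operations is
disqualifying here.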


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Tuesday, October 20, 2015 6:20 AM
To: Sage Weil <sweil@redhat.com>; Chen, Xiaoxi <xiaoxi.chen@intel.com>
Cc: James (Fei) Liu-SSI <james.liu@ssi.samsung.com>; Somnath Roy <Somnath.Roy@sandisk.com>; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/20/2015 07:30 AM, Sage Weil wrote:
> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:
>> +1, nowadays K-V DB care more about very small key-value pairs, say
>> several bytes to a few KB, but in SSD case we only care about 4KB or
>> 8KB. In this way, NVMKV is a good design and seems some of the SSD
>> vendor are also trying to build this kind of interface, we had a
>> NVM-L library but still under development.
>
> Do you have an NVMKV link?  I see a paper and a stale github repo..
> not sure if I'm looking at the right thing.
>
> My concern with using a key/value interface for the object data is
> that you end up with lots of key/value pairs (e.g., $inode_$offset =
> $4kb_of_data) that is pretty inefficient to store and (depending on
> the
> implementation) tends to break alignment.  I don't think these
> interfaces are targetted toward block-sized/aligned payloads.  Storing
> just the metadata (block allocation map) w/ the kv api and storing the
> data directly on a block/page interface makes more sense to me.
>
> sage

I get the feeling that some of the folks who were involved with nvmkv at Fusion IO have left.  Nisha Talagala is now out at Parallel Systems, for instance.  http://pmem.io might be a better bet, though I haven't looked closely at it.

Mark

>
>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>> owner@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
>>> Sent: Tuesday, October 20, 2015 6:21 AM
>>> To: Sage Weil; Somnath Roy
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: RE: newstore direction
>>>
>>> Hi Sage and Somnath,
>>>    In my humble opinion, There is another more aggressive  solution
>>> than raw block device base keyvalue store as backend for
>>> objectstore. The new key value  SSD device with transaction support would be  ideal to solve the issues.
>>> First of all, it is raw SSD device. Secondly , It provides key value
>>> interface directly from SSD. Thirdly, it can provide transaction
>>> support, consistency will be guaranteed by hardware device. It
>>> pretty much satisfied all of objectstore needs without any extra
>>> overhead since there is not any extra layer in between device and objectstore.
>>>     Either way, I strongly support to have CEPH own data format
>>> instead of relying on filesystem.
>>>
>>>    Regards,
>>>    James
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>> owner@vger.kernel.org] On Behalf Of Sage Weil
>>> Sent: Monday, October 19, 2015 1:55 PM
>>> To: Somnath Roy
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: RE: newstore direction
>>>
>>> On Mon, 19 Oct 2015, Somnath Roy wrote:
>>>> Sage,
>>>> I fully support that.  If we want to saturate SSDs , we need to get
>>>> rid of this filesystem overhead (which I am in process of measuring).
>>>> Also, it will be good if we can eliminate the dependency on the k/v
>>>> dbs (for storing allocators and all). The reason is the unknown
>>>> write amps they causes.
>>>
>>> My hope is to keep behing the KeyValueDB interface (and/more change
>>> it as
>>> appropriate) so that other backends can be easily swapped in (e.g. a
>>> btree- based one for high-end flash).
>>>
>>> sage
>>>
>>>
>>>>
>>>> Thanks & Regards
>>>> Somnath
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org
>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
>>>> Sent: Monday, October 19, 2015 12:49 PM
>>>> To: ceph-devel@vger.kernel.org
>>>> Subject: newstore direction
>>>>
>>>> The current design is based on two simple ideas:
>>>>
>>>>   1) a key/value interface is better way to manage all of our
>>>> internal metadata (object metadata, attrs, layout, collection
>>>> membership, write-ahead logging, overlay data, etc.)
>>>>
>>>>   2) a file system is well suited for storage object data (as files).
>>>>
>>>> So far 1 is working out well, but I'm questioning the wisdom of #2.
>>>> A few
>>>> things:
>>>>
>>>>   - We currently write the data to the file, fsync, then commit the
>>>> kv transaction.  That's at least 3 IOs: one for the data, one for
>>>> the fs journal, one for the kv txn to commit (at least once my
>>>> rocksdb changes land... the kv commit is currently 2-3).  So two
>>>> people are managing metadata, here: the fs managing the file
>>>> metadata (with its own
>>>> journal) and the kv backend (with its journal).
>>>>
>>>>   - On read we have to open files by name, which means traversing
>>>> the fs
>>> namespace.  Newstore tries to keep it as flat and simple as
>>> possible, but at a minimum it is a couple btree lookups.  We'd love
>>> to use open by handle (which would reduce this to 1 btree
>>> traversal), but running the daemon as ceph and not root makes that hard...
>>>>
>>>>   - ...and file systems insist on updating mtime on writes, even
>>>> when it is a
>>> overwrite with no allocation changes.  (We don't care about mtime.)
>>> O_NOCMTIME patches exist but it is hard to get these past the kernel
>>> brainfreeze.
>>>>
>>>>   - XFS is (probably) never going going to give us data checksums,
>>>> which we
>>> want desperately.
>>>>
>>>> But what's the alternative?  My thought is to just bite the bullet
>>>> and
>>> consume a raw block device directly.  Write an allocator, hopefully
>>> keep it pretty simple, and manage it in kv store along with all of our other metadata.
>>>>
>>>> Wins:
>>>>
>>>>   - 2 IOs for most: one to write the data to unused space in the
>>>> block device,
>>> one to commit our transaction (vs 4+ before).  For overwrites, we'd
>>> have one io to do our write-ahead log (kv journal), then do the
>>> overwrite async (vs 4+ before).
>>>>
>>>>   - No concern about mtime getting in the way
>>>>
>>>>   - Faster reads (no fs lookup)
>>>>
>>>>   - Similarly sized metadata for most objects.  If we assume most
>>>> objects are
>>> not fragmented, then the metadata to store the block offsets is
>>> about the same size as the metadata to store the filenames we have now.
>>>>
>>>> Problems:
>>>>
>>>>   - We have to size the kv backend storage (probably still an XFS
>>>> partition) vs the block storage.  Maybe we do this anyway (put
>>>> metadata on
>>>> SSD!) so it won't matter.  But what happens when we are storing
>>>> gobs of
>>> rgw index data or cephfs metadata?  Suddenly we are pulling storage
>>> out of a different pool and those aren't currently fungible.
>>>>
>>>>   - We have to write and maintain an allocator.  I'm still
>>>> optimistic this can be
>>> reasonbly simple, especially for the flash case (where fragmentation
>>> isn't such an issue as long as our blocks are reasonbly sized).  For
>>> disk we may beed to be moderately clever.
>>>>
>>>>   - We'll need a fsck to ensure our internal metadata is
>>>> consistent.  The good
>>> news is it'll just need to validate what we have stored in the kv store.
>>>>
>>>> Other thoughts:
>>>>
>>>>   - We might want to consider whether dm-thin or bcache or other
>>>> block
>>> layers might help us with elasticity of file vs block areas.
>>>>
>>>>   - Rocksdb can push colder data to a second directory, so we could
>>>> have a fast ssd primary area (for wal and most metadata) and a
>>>> second hdd directory for stuff it has to push off.  Then have a
>>>> conservative amount of file space on the hdd.  If our block fills
>>>> up, use the existing file mechanism to put data there too.  (But
>>>> then we have to maintain both the current kv + file approach and
>>>> not go all-in on kv +
>>>> block.)
>>>>
>>>> Thoughts?
>>>> sage


^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: newstore direction
  2015-10-20 18:31 ` Ric Wheeler
                     ` (2 preceding siblings ...)
  2015-10-21  8:22   ` Orit Wasserman
@ 2015-10-21 10:06   ` Allen Samuels
  2015-10-21 11:24     ` Ric Wheeler
  2015-10-21 13:44     ` Mark Nelson
  3 siblings, 2 replies; 71+ messages in thread
From: Allen Samuels @ 2015-10-21 10:06 UTC (permalink / raw)
  To: Ric Wheeler, Sage Weil, ceph-devel

I agree that moving newStore to raw block is going to be a significant development effort. But the current scheme of using a KV store combined with a normal file system is always going to be problematic (FileStore or NewStore). This is caused by the transactional requirements of the ObjectStore interface: essentially you need to make transactionally consistent updates to two indexes, one of which doesn't understand transactions (file systems) and can never be tightly connected to the other one.

You'll always be able to make this "loosely coupled" approach work, but it will never be optimal. The real question is whether the performance difference of a suboptimal implementation is something that you can live with compared to the longer gestation period of the more optimal implementation. Clearly, Sage believes that the performance difference is significant or he wouldn't have kicked off this discussion in the first place.

While I think we can all agree that writing a full-up KV and raw-block ObjectStore is a significant amount of work, I will offer the case that the "loosely coupled" scheme may not have as much time-to-market advantage as it appears to have. One example: NewStore performance is limited due to bugs in XFS that won't be fixed in the field for quite some time (it'll take at least a couple of years before a patched version of XFS is widely deployed in customer environments).

Another example: Sage has just had to substantially rework the journaling code of rocksDB.

In short, as you can tell, I'm full-throated in favor of going down the optimal route.

Internally at Sandisk, we have a KV store that is optimized for flash (it's called ZetaScale). We have extended it with a raw block allocator just as Sage is now proposing to do. Our internal performance measurements show a significant advantage over the current NewStore. That performance advantage stems primarily from two things:

(1) ZetaScale uses a B+-tree internally rather than an LSM tree (LevelDB/RocksDB). LSM trees experience an exponential increase in write amplification (the cost of an insert) as the amount of data under management increases, while B+-tree write amplification is nearly constant, independent of the size of the data under management. As the KV database gets larger (since NewStore is effectively moving the per-file inode into the kv database, not to mention the checksums that Sage wants to add :)), this performance delta swamps all others.
(2) Having a KV and a file system causes a double lookup. This costs CPU time and disk accesses to page in data-structure indexes, and metadata efficiency decreases.

You can't avoid (2) as long as you're using a file system.

Yes, an LSM tree performs better on HDD than a B-tree does, which is a good argument for keeping the KV module pluggable.
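
To make the pluggability point concrete, here is a rough sketch of the narrow
surface such a backend needs to expose (an illustrative interface of my own,
not Ceph's actual KeyValueDB class); an LSM-based store and a B+-tree-based
store can both sit behind it:

// Illustrative pluggable KV backend: batched atomic writes plus ordered
// iteration is roughly all the object store needs from it.
#include <map>
#include <string>
#include <utility>
#include <vector>

struct KVBackend {
  using KV = std::pair<std::string, std::string>;
  virtual ~KVBackend() = default;
  // apply a whole batch atomically (the transaction boundary)
  virtual void submit_batch(const std::vector<KV>& puts,
                            const std::vector<std::string>& deletes) = 0;
  virtual bool get(const std::string& key, std::string* value) = 0;
  // ordered scan starting at 'start'; needed for enumeration and scrub
  virtual std::vector<KV> range(const std::string& start, size_t max) = 0;
};

// Trivial in-memory stand-in; RocksDB (LSM) or a B+-tree store such as
// ZetaScale would be alternative implementations behind the same interface.
struct MemBackend : KVBackend {
  std::map<std::string, std::string> db;
  void submit_batch(const std::vector<KV>& puts,
                    const std::vector<std::string>& deletes) override {
    for (const auto& kv : puts) db[kv.first] = kv.second;
    for (const auto& k : deletes) db.erase(k);
  }
  bool get(const std::string& key, std::string* value) override {
    auto it = db.find(key);
    if (it == db.end()) return false;
    *value = it->second;
    return true;
  }
  std::vector<KV> range(const std::string& start, size_t max) override {
    std::vector<KV> out;
    for (auto it = db.lower_bound(start); it != db.end() && out.size() < max; ++it)
      out.emplace_back(it->first, it->second);
    return out;
  }
};

Keeping the store coded against something this small is what lets the
LSM-versus-B+-tree choice remain a configuration decision rather than a
rewrite.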


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Ric Wheeler
Sent: Tuesday, October 20, 2015 11:32 AM
To: Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/19/2015 03:49 PM, Sage Weil wrote:
> The current design is based on two simple ideas:
>
>   1) a key/value interface is better way to manage all of our internal
> metadata (object metadata, attrs, layout, collection membership,
> write-ahead logging, overlay data, etc.)
>
>   2) a file system is well suited for storage object data (as files).
>
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A
> few
> things:
>
>   - We currently write the data to the file, fsync, then commit the kv
> transaction.  That's at least 3 IOs: one for the data, one for the fs
> journal, one for the kv txn to commit (at least once my rocksdb
> changes land... the kv commit is currently 2-3).  So two people are
> managing metadata, here: the fs managing the file metadata (with its
> own
> journal) and the kv backend (with its journal).

If all of the fsync()'s fall into the same backing file system, are you sure that each fsync() takes the same time? Depending on the local FS implementation of course, but the order of issuing those fsync()'s can effectively make some of them no-ops.

>
>   - On read we have to open files by name, which means traversing the
> fs namespace.  Newstore tries to keep it as flat and simple as
> possible, but at a minimum it is a couple btree lookups.  We'd love to
> use open by handle (which would reduce this to 1 btree traversal), but
> running the daemon as ceph and not root makes that hard...

This seems like a pretty low hurdle to overcome.

>
>   - ...and file systems insist on updating mtime on writes, even when
> it is a overwrite with no allocation changes.  (We don't care about
> mtime.) O_NOCMTIME patches exist but it is hard to get these past the
> kernel brainfreeze.

Are you using O_DIRECT? Seems like there should be some enterprisey database tricks that we can use here.

>
>   - XFS is (probably) never going going to give us data checksums,
> which we want desperately.

What is the goal of having the file system do the checksums? How strong do they need to be and what size are the chunks?

If you update this on each IO, this will certainly generate more IO (each write will possibly generate at least one other write to update that new checksum).

>
> But what's the alternative?  My thought is to just bite the bullet and
> consume a raw block device directly.  Write an allocator, hopefully
> keep it pretty simple, and manage it in kv store along with all of our
> other metadata.

The big problem with consuming block devices directly is that you ultimately end up recreating most of the features that you had in the file system. Even enterprise databases like Oracle and DB2 have been migrating away from running on raw block devices in favor of file systems over time.  In effect, you are looking at making a simple on disk file system which is always easier to start than it is to get back to a stable, production ready state.

I think that it might be quicker and more maintainable to spend some time working with the local file system people (XFS or other) to see if we can jointly address the concerns you have.
>
> Wins:
>
>   - 2 IOs for most: one to write the data to unused space in the block
> device, one to commit our transaction (vs 4+ before).  For overwrites,
> we'd have one io to do our write-ahead log (kv journal), then do the
> overwrite async (vs 4+ before).
>
>   - No concern about mtime getting in the way
>
>   - Faster reads (no fs lookup)
>
>   - Similarly sized metadata for most objects.  If we assume most
> objects are not fragmented, then the metadata to store the block
> offsets is about the same size as the metadata to store the filenames we have now.
>
> Problems:
>
>   - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage.  Maybe we do this anyway (put
> metadata on
> SSD!) so it won't matter.  But what happens when we are storing gobs
> of rgw index data or cephfs metadata?  Suddenly we are pulling storage
> out of a different pool and those aren't currently fungible.
>
>   - We have to write and maintain an allocator.  I'm still optimistic
> this can be reasonbly simple, especially for the flash case (where
> fragmentation isn't such an issue as long as our blocks are reasonbly
> sized).  For disk we may beed to be moderately clever.
>
>   - We'll need a fsck to ensure our internal metadata is consistent.
> The good news is it'll just need to validate what we have stored in
> the kv store.
>
> Other thoughts:
>
>   - We might want to consider whether dm-thin or bcache or other block
> layers might help us with elasticity of file vs block areas.
>
>   - Rocksdb can push colder data to a second directory, so we could
> have a fast ssd primary area (for wal and most metadata) and a second
> hdd directory for stuff it has to push off.  Then have a conservative
> amount of file space on the hdd.  If our block fills up, use the
> existing file mechanism to put data there too.  (But then we have to
> maintain both the current kv + file approach and not go all-in on kv +
> block.)
>
> Thoughts?
> sage
> --

I really hate the idea of making a new file system type (even if we call it a raw block store!).

In addition to the technical hurdles, there are also production worries like how long will it take for distros to pick up formal support?  How do we test it properly?

Regards,

Ric




^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-21  8:22   ` Orit Wasserman
@ 2015-10-21 11:18     ` Ric Wheeler
  2015-10-21 17:30       ` Sage Weil
  2015-10-22 12:50       ` Sage Weil
  0 siblings, 2 replies; 71+ messages in thread
From: Ric Wheeler @ 2015-10-21 11:18 UTC (permalink / raw)
  To: Orit Wasserman; +Cc: Sage Weil, ceph-devel

On 10/21/2015 04:22 AM, Orit Wasserman wrote:
> On Tue, 2015-10-20 at 14:31 -0400, Ric Wheeler wrote:
>> On 10/19/2015 03:49 PM, Sage Weil wrote:
>>> The current design is based on two simple ideas:
>>>
>>>    1) a key/value interface is better way to manage all of our internal
>>> metadata (object metadata, attrs, layout, collection membership,
>>> write-ahead logging, overlay data, etc.)
>>>
>>>    2) a file system is well suited for storage object data (as files).
>>>
>>> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
>>> things:
>>>
>>>    - We currently write the data to the file, fsync, then commit the kv
>>> transaction.  That's at least 3 IOs: one for the data, one for the fs
>>> journal, one for the kv txn to commit (at least once my rocksdb changes
>>> land... the kv commit is currently 2-3).  So two people are managing
>>> metadata, here: the fs managing the file metadata (with its own
>>> journal) and the kv backend (with its journal).
>> If all of the fsync()'s fall into the same backing file system, are you sure
>> that each fsync() takes the same time? Depending on the local FS implementation
>> of course, but the order of issuing those fsync()'s can effectively make some of
>> them no-ops.
>>
>>>    - On read we have to open files by name, which means traversing the fs
>>> namespace.  Newstore tries to keep it as flat and simple as possible, but
>>> at a minimum it is a couple btree lookups.  We'd love to use open by
>>> handle (which would reduce this to 1 btree traversal), but running
>>> the daemon as ceph and not root makes that hard...
>> This seems like a a pretty low hurdle to overcome.
>>
>>>    - ...and file systems insist on updating mtime on writes, even when it is
>>> a overwrite with no allocation changes.  (We don't care about mtime.)
>>> O_NOCMTIME patches exist but it is hard to get these past the kernel
>>> brainfreeze.
>> Are you using O_DIRECT? Seems like there should be some enterprisey database
>> tricks that we can use here.
>>
>>>    - XFS is (probably) never going going to give us data checksums, which we
>>> want desperately.
>> What is the goal of having the file system do the checksums? How strong do they
>> need to be and what size are the chunks?
>>
>> If you update this on each IO, this will certainly generate more IO (each write
>> will possibly generate at least one other write to update that new checksum).
>>
>>> But what's the alternative?  My thought is to just bite the bullet and
>>> consume a raw block device directly.  Write an allocator, hopefully keep
>>> it pretty simple, and manage it in kv store along with all of our other
>>> metadata.
>> The big problem with consuming block devices directly is that you ultimately end
>> up recreating most of the features that you had in the file system. Even
>> enterprise databases like Oracle and DB2 have been migrating away from running
>> on raw block devices in favor of file systems over time.  In effect, you are
>> looking at making a simple on disk file system which is always easier to start
>> than it is to get back to a stable, production ready state.
> The best performance is still on block device (SAN).
> File system simplify the operation tasks which worth the performance
> penalty for a database. I think in a storage system this is not the
> case.
> In many cases they can use their own file system that is tailored for
> the database.

You will have to trust me on this as the Red Hat person who spoke to pretty much 
all of our key customers about local file systems and storage - customers have all 
migrated over to using normal file systems under Oracle/DB2.  Typically, 
they use XFS or ext4.  I don't know of any that use non-standard file systems, and I 
have only seen one account running on a raw block store in 8 years :)

If you have a pre-allocated file and write using O_DIRECT, your IO path is 
identical in terms of IO's sent to the device.
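
As a minimal illustration of that claim (the path and sizes here are made up),
the pattern is: reserve the space once with fallocate, then do aligned
O_DIRECT overwrites into it:

// Sketch: preallocate once, then overwrite in place with O_DIRECT.
// Buffers, offsets, and lengths must be aligned (typically 4KB) for O_DIRECT.
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <cstring>

int main() {
  int fd = open("/var/lib/osd/obj.0001", O_RDWR | O_CREAT | O_DIRECT, 0644);
  if (fd < 0) return 1;
  if (posix_fallocate(fd, 0, 4 << 20) != 0) return 1;   // reserve 4MB up front

  void* buf;
  if (posix_memalign(&buf, 4096, 4096) != 0) return 1;  // aligned 4KB buffer
  memset(buf, 0xab, 4096);

  // An overwrite of preallocated space: one aligned data IO to the device,
  // with no new block allocation needed on the data path.
  if (pwrite(fd, buf, 4096, 64 * 4096) != 4096) return 1;
  fdatasync(fd);

  free(buf);
  close(fd);
  return 0;
}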

If we are causing additional IO's, then we really need to spend some time 
talking to the local file system gurus about this in detail.  I can help with 
that conversation.

>
>> I think that it might be quicker and more maintainable to spend some time
>> working with the local file system people (XFS or other) to see if we can
>> jointly address the concerns you have.
>>> Wins:
>>>
>>>    - 2 IOs for most: one to write the data to unused space in the block
>>> device, one to commit our transaction (vs 4+ before).  For overwrites,
>>> we'd have one io to do our write-ahead log (kv journal), then do
>>> the overwrite async (vs 4+ before).
>>>
>>>    - No concern about mtime getting in the way
>>>
>>>    - Faster reads (no fs lookup)
>>>
>>>    - Similarly sized metadata for most objects.  If we assume most objects
>>> are not fragmented, then the metadata to store the block offsets is about
>>> the same size as the metadata to store the filenames we have now.
>>>
>>> Problems:
>>>
>>>    - We have to size the kv backend storage (probably still an XFS
>>> partition) vs the block storage.  Maybe we do this anyway (put metadata on
>>> SSD!) so it won't matter.  But what happens when we are storing gobs of
>>> rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
>>> a different pool and those aren't currently fungible.
>>>
>>>    - We have to write and maintain an allocator.  I'm still optimistic this
>>> can be reasonbly simple, especially for the flash case (where
>>> fragmentation isn't such an issue as long as our blocks are reasonbly
>>> sized).  For disk we may beed to be moderately clever.
>>>
>>>    - We'll need a fsck to ensure our internal metadata is consistent.  The
>>> good news is it'll just need to validate what we have stored in the kv
>>> store.
>>>
>>> Other thoughts:
>>>
>>>    - We might want to consider whether dm-thin or bcache or other block
>>> layers might help us with elasticity of file vs block areas.
>>>
>>>    - Rocksdb can push colder data to a second directory, so we could have a
>>> fast ssd primary area (for wal and most metadata) and a second hdd
>>> directory for stuff it has to push off.  Then have a conservative amount
>>> of file space on the hdd.  If our block fills up, use the existing file
>>> mechanism to put data there too.  (But then we have to maintain both the
>>> current kv + file approach and not go all-in on kv + block.)
>>>
>>> Thoughts?
>>> sage
>>> --
>> I really hate the idea of making a new file system type (even if we call it a
>> raw block store!).
>>
> This won't be a file system but just an allocator which is a very small
> part of a file system.

That is always the intention, and then we wake up a few years into the project 
with something that looks and smells like a file system, as we slowly bring in 
just one more small thing at a time.

>
> The benefits are not just in reducing the number of IO operations we
> preform, we are also removing the file system stack overhead that will
> reduce our latency and make it more predictable.
> Removing this layer will give use more control and allow us other
> optimization we cannot do today.

I strongly disagree here - we can get that optimal number of IO's if we use the 
file system APIs developed over the years to support enterprise databases.  And 
we can have that today without having to rewrite allocation routines and checkers.

>
> I think this is more acute when taking SSD (and even faster
> technologies) into account.

XFS and ext4 both support DAX, so we can effectively do direct writes to 
persistent memory (no block IO required). Most of the work over the past few 
years in the IO stack has been around driving IOPs at insanely high rates on top 
of the whole stack (file system layer included) and we have really good results.

>
>> In addition to the technical hurdles, there are also production worries like how
>> long will it take for distros to pick up formal support?  How do we test it
>> properly?
>>
> This should be userspace only, I don't think we need it in the kernel
> (will need root access for opening the device).
> For users that don't have root access we can use one big file and use
> the same allocator in it. It can be good for testing too.
>
> As someone that already been part of such a
> move more than once (for example in Exanet) I can say that the
> performance gain is very impressive and after the change we could
> remove many workarounds which simplified the code.
>
> As the API should be small the testing effort is reasonable, we do need
> to test it well as a bug in the allocator has really bad consequences.
>
> We won't be able to match (or exceed) our competitors performance
> without making this effort ...
>
> Orit
>

I don't agree that we will see a performance win over using the file system 
properly.  Certainly, you can measure a slow path through a file system and then 
show an improvement with new user-space block access code, but that is not a 
long-term path to success.  As far as I know, Exanet never published their code or 
performance numbers measured against local file systems, but it would be 
easy to show how well we can drive XFS or ext4.

Regardless of the address space that the code lives in, we will need to test it 
over things that file systems already know how to do.

Regards,

Ric



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-21 10:06   ` Allen Samuels
@ 2015-10-21 11:24     ` Ric Wheeler
  2015-10-21 14:14       ` Mark Nelson
  2015-10-22  0:53       ` Allen Samuels
  2015-10-21 13:44     ` Mark Nelson
  1 sibling, 2 replies; 71+ messages in thread
From: Ric Wheeler @ 2015-10-21 11:24 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil, ceph-devel



On 10/21/2015 06:06 AM, Allen Samuels wrote:
> I agree that moving newStore to raw block is going to be a significant development effort. But the current scheme of using a KV store combined with a normal file system is always going to be problematic (FileStore or NewStore). This is caused by the transactional requirements of the ObjectStore interface, essentially you need to make transactionally consistent updates to two indexes, one of which doesn't understand transactions (File Systems) and can never be tightly-connected to the other one.
>
> You'll always be able to make this "loosely coupled" approach work, but it will never be optimal. The real question is whether the performance difference of a suboptimal implementation is something that you can live with compared to the longer gestation period of the more optimal implementation. Clearly, Sage believes that the performance difference is significant or he wouldn't have kicked off this discussion in the first place.

I think that we need to work with the existing stack - measure and do some 
collaborative analysis - before we throw out decades of work.  It is very hard to 
understand why the local file system is a barrier to performance in this case 
when it is not an issue for existing enterprise applications.

We need some deep analysis with some local file system experts thrown in to 
validate the concerns.

>
> While I think we can all agree that writing a full-up KV and raw-block ObjectStore is a significant amount of work. I will offer the case that the "loosely couple" scheme may not have as much time-to-market advantage as it appears to have. One example: NewStore performance is limited due to bugs in XFS that won't be fixed in the field for quite some time (it'll take at least a couple of years before a patched version of XFS will be widely deployed at customer environments).

It is not clear what bugs you are thinking of, or why you think fixing bugs will take a 
long time to hit the field in XFS.  Red Hat has most of the XFS developers on 
staff, and we actively backport fixes and ship them; other distros do as well.

I have never seen a "bug" fix take a couple of years to reach users.

Regards,

Ric

>
> Another example: Sage has just had to substantially rework the journaling code of rocksDB.
>
> In short, as you can tell, I'm full throated in favor of going down the optimal route.
>
> Internally at Sandisk, we have a KV store that is optimized for flash (it's called ZetaScale). We have extended it with a raw block allocator just as Sage is now proposing to do. Our internal performance measurements show a significant advantage over the current NewStore. That performance advantage stems primarily from two things:
>
> (1) ZetaScale uses a B+-tree internally rather than an LSM tree (levelDB/RocksDB). LSM trees experience exponential increase in write amplification (cost of an insert) as the amount of data under management increases. B+tree write-amplification is nearly constant independent of the size of data under management. As the KV database gets larger (Since newStore is effectively moving the per-file inode into the kv data base. Don't forget checksums that Sage want's to add :)) this performance delta swamps all others.
> (2) Having a KV and a file-system causes a double lookup. This costs CPU time and disk accesses to page in data structure indexes, metadata efficiency decreases.
>
> You can't avoid (2) as long as you're using a file system.
>
> Yes an LSM tree performs better on HDD than does a B-tree, which is a good argument for keeping the KV module pluggable.
>
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Ric Wheeler
> Sent: Tuesday, October 20, 2015 11:32 AM
> To: Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> On 10/19/2015 03:49 PM, Sage Weil wrote:
>> The current design is based on two simple ideas:
>>
>>    1) a key/value interface is better way to manage all of our internal
>> metadata (object metadata, attrs, layout, collection membership,
>> write-ahead logging, overlay data, etc.)
>>
>>    2) a file system is well suited for storage object data (as files).
>>
>> So far 1 is working out well, but I'm questioning the wisdom of #2.  A
>> few
>> things:
>>
>>    - We currently write the data to the file, fsync, then commit the kv
>> transaction.  That's at least 3 IOs: one for the data, one for the fs
>> journal, one for the kv txn to commit (at least once my rocksdb
>> changes land... the kv commit is currently 2-3).  So two people are
>> managing metadata, here: the fs managing the file metadata (with its
>> own
>> journal) and the kv backend (with its journal).
> If all of the fsync()'s fall into the same backing file system, are you sure that each fsync() takes the same time? Depending on the local FS implementation of course, but the order of issuing those fsync()'s can effectively make some of them no-ops.
>
>>    - On read we have to open files by name, which means traversing the
>> fs namespace.  Newstore tries to keep it as flat and simple as
>> possible, but at a minimum it is a couple btree lookups.  We'd love to
>> use open by handle (which would reduce this to 1 btree traversal), but
>> running the daemon as ceph and not root makes that hard...
> This seems like a a pretty low hurdle to overcome.
>
>>    - ...and file systems insist on updating mtime on writes, even when
>> it is a overwrite with no allocation changes.  (We don't care about
>> mtime.) O_NOCMTIME patches exist but it is hard to get these past the
>> kernel brainfreeze.
> Are you using O_DIRECT? Seems like there should be some enterprisey database tricks that we can use here.
>
>>    - XFS is (probably) never going going to give us data checksums,
>> which we want desperately.
> What is the goal of having the file system do the checksums? How strong do they need to be and what size are the chunks?
>
> If you update this on each IO, this will certainly generate more IO (each write will possibly generate at least one other write to update that new checksum).
>
>> But what's the alternative?  My thought is to just bite the bullet and
>> consume a raw block device directly.  Write an allocator, hopefully
>> keep it pretty simple, and manage it in kv store along with all of our
>> other metadata.
> The big problem with consuming block devices directly is that you ultimately end up recreating most of the features that you had in the file system. Even enterprise databases like Oracle and DB2 have been migrating away from running on raw block devices in favor of file systems over time.  In effect, you are looking at making a simple on disk file system which is always easier to start than it is to get back to a stable, production ready state.
>
> I think that it might be quicker and more maintainable to spend some time working with the local file system people (XFS or other) to see if we can jointly address the concerns you have.
>> Wins:
>>
>>    - 2 IOs for most: one to write the data to unused space in the block
>> device, one to commit our transaction (vs 4+ before).  For overwrites,
>> we'd have one io to do our write-ahead log (kv journal), then do the
>> overwrite async (vs 4+ before).
>>
>>    - No concern about mtime getting in the way
>>
>>    - Faster reads (no fs lookup)
>>
>>    - Similarly sized metadata for most objects.  If we assume most
>> objects are not fragmented, then the metadata to store the block
>> offsets is about the same size as the metadata to store the filenames we have now.
>>
>> Problems:
>>
>>    - We have to size the kv backend storage (probably still an XFS
>> partition) vs the block storage.  Maybe we do this anyway (put
>> metadata on
>> SSD!) so it won't matter.  But what happens when we are storing gobs
>> of rgw index data or cephfs metadata?  Suddenly we are pulling storage
>> out of a different pool and those aren't currently fungible.
>>
>>    - We have to write and maintain an allocator.  I'm still optimistic
>> this can be reasonbly simple, especially for the flash case (where
>> fragmentation isn't such an issue as long as our blocks are reasonbly
>> sized).  For disk we may beed to be moderately clever.
>>
>>    - We'll need a fsck to ensure our internal metadata is consistent.
>> The good news is it'll just need to validate what we have stored in
>> the kv store.
>>
>> Other thoughts:
>>
>>    - We might want to consider whether dm-thin or bcache or other block
>> layers might help us with elasticity of file vs block areas.
>>
>>    - Rocksdb can push colder data to a second directory, so we could
>> have a fast ssd primary area (for wal and most metadata) and a second
>> hdd directory for stuff it has to push off.  Then have a conservative
>> amount of file space on the hdd.  If our block fills up, use the
>> existing file mechanism to put data there too.  (But then we have to
>> maintain both the current kv + file approach and not go all-in on kv +
>> block.)
>>
>> Thoughts?
>> sage
>> --
> I really hate the idea of making a new file system type (even if we call it a raw block store!).
>
> In addition to the technical hurdles, there are also production worries like how long will it take for distros to pick up formal support?  How do we test it properly?
>
> Regards,
>
> Ric
>
>


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-20 22:23         ` Ric Wheeler
@ 2015-10-21 13:32           ` Sage Weil
  2015-10-21 13:50             ` Ric Wheeler
  0 siblings, 1 reply; 71+ messages in thread
From: Sage Weil @ 2015-10-21 13:32 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Gregory Farnum, John Spray, Ceph Development

On Tue, 20 Oct 2015, Ric Wheeler wrote:
> > Now:
> >      1 io  to write a new file
> >    1-2 ios to sync the fs journal (commit the inode, alloc change)
> >            (I see 2 journal IOs on XFS and only 1 on ext4...)
> >      1 io  to commit the rocksdb journal (currently 3, but will drop to
> >            1 with xfs fix and my rocksdb change)
> 
> I think that might be too pessimistic - the number of discrete IO's sent down
> to a spinning disk make much less impact on performance than the number of
> fsync()'s since they IO's all land in the write cache.  Some newer spinning
> drives have a non-volatile write cache, so even an fsync() might not end up
> doing the expensive data transfer to the platter.

True, but in XFS's case at least the file data and journal are not 
colocated, so it's 2 seeks for the new file write+fdatasync and another for 
the rocksdb journal commit.  Of course, with a deep queue, we're doing 
lots of these so there'd be fewer journal commits on both counts, but the 
lower bound on latency of a single write is still 3 seeks, and that bound 
is pretty critical when you also have network round trips and replication 
(worst out of 2) on top.

> It would be interesting to get the timings on the IO's you see to measure the
> actual impact.

I observed this with the journaling workload for rocksdb, but I assume the 
journaling behavior is the same regardless of what is being journaled.  
For a 4KB append to a file + fdatasync, I saw ~30ms latency for XFS, and 
blktrace showed an IO to the file, and 2 IOs to the journal.  I believe 
the first one is the record for the inode update, and the second is the 
journal 'commit' record (though I forget how I decided that).  My guess is 
that XFS is being extremely careful about journal integrity here and not 
writing the commit record until it knows that the preceding records landed 
on stable storage.  For ext4, the latency was about 20ms, and blktrace 
showed the IO to the file and then a single journal IO.  When I made the 
rocksdb change to overwrite an existing, prewritten file, the latency 
dropped to ~10ms on ext4, and blktrace showed a single IO as expected.  
(XFS still showed the 2 journal commit IOs, but Dave just posted the fix 
for that on the XFS list today.)
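
If anyone wants to reproduce the numbers, the test is essentially an 
append-plus-fdatasync loop timed per iteration, with blktrace running 
against the device; a minimal stand-alone sketch (not the actual rocksdb 
code path, and the file name is made up):

// Minimal append+fdatasync latency probe; run blktrace on the underlying
// device in parallel to see where the IOs actually go.
#include <fcntl.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
  int fd = open("/mnt/xfs/journal-test", O_WRONLY | O_CREAT | O_APPEND, 0644);
  if (fd < 0) return 1;
  std::vector<char> buf(4096, 'x');
  for (int i = 0; i < 100; ++i) {
    auto t0 = std::chrono::steady_clock::now();
    if (write(fd, buf.data(), buf.size()) != (ssize_t)buf.size()) return 1;
    fdatasync(fd);
    auto ms = std::chrono::duration<double, std::milli>(
                  std::chrono::steady_clock::now() - t0).count();
    printf("append+fdatasync #%d: %.2f ms\n", i, ms);
  }
  close(fd);
  return 0;
}

The overwrite variant is the same loop minus O_APPEND, writing over a 
prewritten region of the file.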

> Plumbing for T10 DIF/DIX already exist, what is missing is the normal block
> device that handles them (not enterprise SAS/disk array class)

Yeah... which unfortunately means that unless the cheap drives 
suddenly start shipping with DIF/DIX support we'll need to do the 
checksums ourselves.  This is probably a good thing anyway as it doesn't 
constrain our choice of checksum or checksum granularity, and will 
still work with other storage devices (SSDs, NVMe, etc.).
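
A sketch of what doing the checksums ourselves could look like: checksum 
each 4KB chunk at write time, store the values next to the object's extent 
map in the kv store, and verify on read.  (zlib's crc32 is used below purely 
as a stand-in; the point is that the algorithm and granularity stay our 
choice.)

// Illustrative per-chunk checksumming, independent of what the device can do.
#include <zlib.h>
#include <algorithm>
#include <cstdint>
#include <vector>

static const size_t CSUM_CHUNK = 4096;

// One checksum per 4KB chunk; the vector would be stored in the kv store
// alongside the object's extent map.
std::vector<uint32_t> checksum_extent(const char* data, size_t len) {
  std::vector<uint32_t> out;
  for (size_t off = 0; off < len; off += CSUM_CHUNK) {
    size_t n = std::min(CSUM_CHUNK, len - off);
    out.push_back(static_cast<uint32_t>(
        crc32(0L, reinterpret_cast<const Bytef*>(data + off), n)));
  }
  return out;
}

// On read (or scrub), recompute and compare against what was stored.
bool verify_extent(const char* data, size_t len,
                   const std::vector<uint32_t>& stored) {
  return checksum_extent(data, len) == stored;
}

Scrub then becomes: read the extent, recompute, and compare against what the 
kv store says, regardless of what the device underneath can verify.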

sage

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-21 10:06             ` Allen Samuels
@ 2015-10-21 13:35               ` Mark Nelson
  2015-10-21 16:10                 ` Chen, Xiaoxi
  0 siblings, 1 reply; 71+ messages in thread
From: Mark Nelson @ 2015-10-21 13:35 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil, Chen, Xiaoxi
  Cc: James (Fei) Liu-SSI, Somnath Roy, ceph-devel

Thanks Allen!  The devil is always in the details.  Know of anything 
else that looks promising?

Mark

On 10/21/2015 05:06 AM, Allen Samuels wrote:
> I doubt that NVMKV will be useful for two reasons:
>
> (1) It relies on the unique sparse-mapping addressing capabilities of the FusionIO VSL interface, it won't run on standard SSDs
> (2) NVMKV doesn't provide any form of in-order enumeration (i.e., no range operations on keys). This is pretty much required for deep scrubbing.
>
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Tuesday, October 20, 2015 6:20 AM
> To: Sage Weil <sweil@redhat.com>; Chen, Xiaoxi <xiaoxi.chen@intel.com>
> Cc: James (Fei) Liu-SSI <james.liu@ssi.samsung.com>; Somnath Roy <Somnath.Roy@sandisk.com>; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> On 10/20/2015 07:30 AM, Sage Weil wrote:
>> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:
>>> +1, nowadays K-V DB care more about very small key-value pairs, say
>>> several bytes to a few KB, but in SSD case we only care about 4KB or
>>> 8KB. In this way, NVMKV is a good design and seems some of the SSD
>>> vendor are also trying to build this kind of interface, we had a
>>> NVM-L library but still under development.
>>
>> Do you have an NVMKV link?  I see a paper and a stale github repo..
>> not sure if I'm looking at the right thing.
>>
>> My concern with using a key/value interface for the object data is
>> that you end up with lots of key/value pairs (e.g., $inode_$offset =
>> $4kb_of_data) that is pretty inefficient to store and (depending on
>> the
>> implementation) tends to break alignment.  I don't think these
>> interfaces are targetted toward block-sized/aligned payloads.  Storing
>> just the metadata (block allocation map) w/ the kv api and storing the
>> data directly on a block/page interface makes more sense to me.
>>
>> sage
>
> I get the feeling that some of the folks that were involved with nvmkv at Fusion IO have left.  Nisha Talagala is now out at Parallel Systems for instance.  http://pmem.io might be a better bet, though I haven't looked closely at it.
>
> Mark
>
>>
>>
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>>> owner@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
>>>> Sent: Tuesday, October 20, 2015 6:21 AM
>>>> To: Sage Weil; Somnath Roy
>>>> Cc: ceph-devel@vger.kernel.org
>>>> Subject: RE: newstore direction
>>>>
>>>> Hi Sage and Somnath,
>>>>     In my humble opinion, There is another more aggressive  solution
>>>> than raw block device base keyvalue store as backend for
>>>> objectstore. The new key value  SSD device with transaction support would be  ideal to solve the issues.
>>>> First of all, it is raw SSD device. Secondly , It provides key value
>>>> interface directly from SSD. Thirdly, it can provide transaction
>>>> support, consistency will be guaranteed by hardware device. It
>>>> pretty much satisfied all of objectstore needs without any extra
>>>> overhead since there is not any extra layer in between device and objectstore.
>>>>      Either way, I strongly support to have CEPH own data format
>>>> instead of relying on filesystem.
>>>>
>>>>     Regards,
>>>>     James
>>>>
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>>> owner@vger.kernel.org] On Behalf Of Sage Weil
>>>> Sent: Monday, October 19, 2015 1:55 PM
>>>> To: Somnath Roy
>>>> Cc: ceph-devel@vger.kernel.org
>>>> Subject: RE: newstore direction
>>>>
>>>> On Mon, 19 Oct 2015, Somnath Roy wrote:
>>>>> Sage,
>>>>> I fully support that.  If we want to saturate SSDs , we need to get
>>>>> rid of this filesystem overhead (which I am in process of measuring).
>>>>> Also, it will be good if we can eliminate the dependency on the k/v
>>>>> dbs (for storing allocators and all). The reason is the unknown
>>>>> write amps they causes.
>>>>
>>>> My hope is to keep behing the KeyValueDB interface (and/more change
>>>> it as
>>>> appropriate) so that other backends can be easily swapped in (e.g. a
>>>> btree- based one for high-end flash).
>>>>
>>>> sage
>>>>
>>>>
>>>>>
>>>>> Thanks & Regards
>>>>> Somnath
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
>>>>> Sent: Monday, October 19, 2015 12:49 PM
>>>>> To: ceph-devel@vger.kernel.org
>>>>> Subject: newstore direction
>>>>>
>>>>> The current design is based on two simple ideas:
>>>>>
>>>>>    1) a key/value interface is better way to manage all of our
>>>>> internal metadata (object metadata, attrs, layout, collection
>>>>> membership, write-ahead logging, overlay data, etc.)
>>>>>
>>>>>    2) a file system is well suited for storage object data (as files).
>>>>>
>>>>> So far 1 is working out well, but I'm questioning the wisdom of #2.
>>>>> A few
>>>>> things:
>>>>>
>>>>>    - We currently write the data to the file, fsync, then commit the
>>>>> kv transaction.  That's at least 3 IOs: one for the data, one for
>>>>> the fs journal, one for the kv txn to commit (at least once my
>>>>> rocksdb changes land... the kv commit is currently 2-3).  So two
>>>>> people are managing metadata, here: the fs managing the file
>>>>> metadata (with its own
>>>>> journal) and the kv backend (with its journal).
>>>>>
>>>>>    - On read we have to open files by name, which means traversing
>>>>> the fs
>>>> namespace.  Newstore tries to keep it as flat and simple as
>>>> possible, but at a minimum it is a couple btree lookups.  We'd love
>>>> to use open by handle (which would reduce this to 1 btree
>>>> traversal), but running the daemon as ceph and not root makes that hard...
>>>>>
>>>>>    - ...and file systems insist on updating mtime on writes, even
>>>>> when it is a
>>>> overwrite with no allocation changes.  (We don't care about mtime.)
>>>> O_NOCMTIME patches exist but it is hard to get these past the kernel
>>>> brainfreeze.
>>>>>
>>>>>    - XFS is (probably) never going going to give us data checksums,
>>>>> which we
>>>> want desperately.
>>>>>
>>>>> But what's the alternative?  My thought is to just bite the bullet
>>>>> and
>>>> consume a raw block device directly.  Write an allocator, hopefully
>>>> keep it pretty simple, and manage it in kv store along with all of our other metadata.
>>>>>
>>>>> Wins:
>>>>>
>>>>>    - 2 IOs for most: one to write the data to unused space in the
>>>>> block device,
>>>> one to commit our transaction (vs 4+ before).  For overwrites, we'd
>>>> have one io to do our write-ahead log (kv journal), then do the
>>>> overwrite async (vs 4+ before).
>>>>>
>>>>>    - No concern about mtime getting in the way
>>>>>
>>>>>    - Faster reads (no fs lookup)
>>>>>
>>>>>    - Similarly sized metadata for most objects.  If we assume most
>>>>> objects are
>>>> not fragmented, then the metadata to store the block offsets is
>>>> about the same size as the metadata to store the filenames we have now.
>>>>>
>>>>> Problems:
>>>>>
>>>>>    - We have to size the kv backend storage (probably still an XFS
>>>>> partition) vs the block storage.  Maybe we do this anyway (put
>>>>> metadata on
>>>>> SSD!) so it won't matter.  But what happens when we are storing
>>>>> gobs of
>>>> rgw index data or cephfs metadata?  Suddenly we are pulling storage
>>>> out of a different pool and those aren't currently fungible.
>>>>>
>>>>>    - We have to write and maintain an allocator.  I'm still
>>>>> optimistic this can be
>>>> reasonbly simple, especially for the flash case (where fragmentation
>>>> isn't such an issue as long as our blocks are reasonbly sized).  For
>>>> disk we may beed to be moderately clever.
>>>>>
>>>>>    - We'll need a fsck to ensure our internal metadata is
>>>>> consistent.  The good
>>>> news is it'll just need to validate what we have stored in the kv store.
>>>>>
>>>>> Other thoughts:
>>>>>
>>>>>    - We might want to consider whether dm-thin or bcache or other
>>>>> block
>>>> layers might help us with elasticity of file vs block areas.
>>>>>
>>>>>    - Rocksdb can push colder data to a second directory, so we could
>>>>> have a fast ssd primary area (for wal and most metadata) and a
>>>>> second hdd directory for stuff it has to push off.  Then have a
>>>>> conservative amount of file space on the hdd.  If our block fills
>>>>> up, use the existing file mechanism to put data there too.  (But
>>>>> then we have to maintain both the current kv + file approach and
>>>>> not go all-in on kv +
>>>>> block.)
>>>>>
>>>>> Thoughts?
>>>>> sage

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-21 10:06   ` Allen Samuels
  2015-10-21 11:24     ` Ric Wheeler
@ 2015-10-21 13:44     ` Mark Nelson
  2015-10-22  1:39       ` Allen Samuels
  1 sibling, 1 reply; 71+ messages in thread
From: Mark Nelson @ 2015-10-21 13:44 UTC (permalink / raw)
  To: Allen Samuels, Ric Wheeler, Sage Weil, ceph-devel

On 10/21/2015 05:06 AM, Allen Samuels wrote:
> I agree that moving newStore to raw block is going to be a significant development effort. But the current scheme of using a KV store combined with a normal file system is always going to be problematic (FileStore or NewStore). This is caused by the transactional requirements of the ObjectStore interface, essentially you need to make transactionally consistent updates to two indexes, one of which doesn't understand transactions (File Systems) and can never be tightly-connected to the other one.
>
> You'll always be able to make this "loosely coupled" approach work, but it will never be optimal. The real question is whether the performance difference of a suboptimal implementation is something that you can live with compared to the longer gestation period of the more optimal implementation. Clearly, Sage believes that the performance difference is significant or he wouldn't have kicked off this discussion in the first place.
>
> While I think we can all agree that writing a full-up KV and raw-block ObjectStore is a significant amount of work. I will offer the case that the "loosely couple" scheme may not have as much time-to-market advantage as it appears to have. One example: NewStore performance is limited due to bugs in XFS that won't be fixed in the field for quite some time (it'll take at least a couple of years before a patched version of XFS will be widely deployed at customer environments).
>
> Another example: Sage has just had to substantially rework the journaling code of rocksDB.
>
> In short, as you can tell, I'm full throated in favor of going down the optimal route.
>
> Internally at Sandisk, we have a KV store that is optimized for flash (it's called ZetaScale). We have extended it with a raw block allocator just as Sage is now proposing to do. Our internal performance measurements show a significant advantage over the current NewStore. That performance advantage stems primarily from two things:

Has there been any discussion regarding open-sourcing ZetaScale?

>
> (1) ZetaScale uses a B+-tree internally rather than an LSM tree (levelDB/RocksDB). LSM trees experience exponential increase in write amplification (cost of an insert) as the amount of data under management increases. B+tree write-amplification is nearly constant independent of the size of data under management. As the KV database gets larger (Since newStore is effectively moving the per-file inode into the kv data base. Don't forget checksums that Sage want's to add :)) this performance delta swamps all others.
> (2) Having a KV and a file-system causes a double lookup. This costs CPU time and disk accesses to page in data structure indexes, metadata efficiency decreases.
>
> You can't avoid (2) as long as you're using a file system.
>
> Yes an LSM tree performs better on HDD than does a B-tree, which is a good argument for keeping the KV module pluggable.
>
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Ric Wheeler
> Sent: Tuesday, October 20, 2015 11:32 AM
> To: Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> On 10/19/2015 03:49 PM, Sage Weil wrote:
>> The current design is based on two simple ideas:
>>
>>    1) a key/value interface is better way to manage all of our internal
>> metadata (object metadata, attrs, layout, collection membership,
>> write-ahead logging, overlay data, etc.)
>>
>>    2) a file system is well suited for storage object data (as files).
>>
>> So far 1 is working out well, but I'm questioning the wisdom of #2.  A
>> few
>> things:
>>
>>    - We currently write the data to the file, fsync, then commit the kv
>> transaction.  That's at least 3 IOs: one for the data, one for the fs
>> journal, one for the kv txn to commit (at least once my rocksdb
>> changes land... the kv commit is currently 2-3).  So two people are
>> managing metadata, here: the fs managing the file metadata (with its
>> own
>> journal) and the kv backend (with its journal).
>
> If all of the fsync()'s fall into the same backing file system, are you sure that each fsync() takes the same time? Depending on the local FS implementation of course, but the order of issuing those fsync()'s can effectively make some of them no-ops.
>
>>
>>    - On read we have to open files by name, which means traversing the
>> fs namespace.  Newstore tries to keep it as flat and simple as
>> possible, but at a minimum it is a couple btree lookups.  We'd love to
>> use open by handle (which would reduce this to 1 btree traversal), but
>> running the daemon as ceph and not root makes that hard...
>
> This seems like a a pretty low hurdle to overcome.
>
>>
>>    - ...and file systems insist on updating mtime on writes, even when
>> it is a overwrite with no allocation changes.  (We don't care about
>> mtime.) O_NOCMTIME patches exist but it is hard to get these past the
>> kernel brainfreeze.
>
> Are you using O_DIRECT? Seems like there should be some enterprisey database tricks that we can use here.
>
>>
>>    - XFS is (probably) never going going to give us data checksums,
>> which we want desperately.
>
> What is the goal of having the file system do the checksums? How strong do they need to be and what size are the chunks?
>
> If you update this on each IO, this will certainly generate more IO (each write will possibly generate at least one other write to update that new checksum).
>
>>
>> But what's the alternative?  My thought is to just bite the bullet and
>> consume a raw block device directly.  Write an allocator, hopefully
>> keep it pretty simple, and manage it in kv store along with all of our
>> other metadata.
>
> The big problem with consuming block devices directly is that you ultimately end up recreating most of the features that you had in the file system. Even enterprise databases like Oracle and DB2 have been migrating away from running on raw block devices in favor of file systems over time.  In effect, you are looking at making a simple on disk file system which is always easier to start than it is to get back to a stable, production ready state.
>
> I think that it might be quicker and more maintainable to spend some time working with the local file system people (XFS or other) to see if we can jointly address the concerns you have.
>>
>> Wins:
>>
>>    - 2 IOs for most: one to write the data to unused space in the block
>> device, one to commit our transaction (vs 4+ before).  For overwrites,
>> we'd have one io to do our write-ahead log (kv journal), then do the
>> overwrite async (vs 4+ before).
>>
>>    - No concern about mtime getting in the way
>>
>>    - Faster reads (no fs lookup)
>>
>>    - Similarly sized metadata for most objects.  If we assume most
>> objects are not fragmented, then the metadata to store the block
>> offsets is about the same size as the metadata to store the filenames we have now.
>>
>> Problems:
>>
>>    - We have to size the kv backend storage (probably still an XFS
>> partition) vs the block storage.  Maybe we do this anyway (put
>> metadata on
>> SSD!) so it won't matter.  But what happens when we are storing gobs
>> of rgw index data or cephfs metadata?  Suddenly we are pulling storage
>> out of a different pool and those aren't currently fungible.
>>
>>    - We have to write and maintain an allocator.  I'm still optimistic
>> this can be reasonbly simple, especially for the flash case (where
>> fragmentation isn't such an issue as long as our blocks are reasonbly
>> sized).  For disk we may beed to be moderately clever.
>>
>>    - We'll need a fsck to ensure our internal metadata is consistent.
>> The good news is it'll just need to validate what we have stored in
>> the kv store.
>>
>> Other thoughts:
>>
>>    - We might want to consider whether dm-thin or bcache or other block
>> layers might help us with elasticity of file vs block areas.
>>
>>    - Rocksdb can push colder data to a second directory, so we could
>> have a fast ssd primary area (for wal and most metadata) and a second
>> hdd directory for stuff it has to push off.  Then have a conservative
>> amount of file space on the hdd.  If our block fills up, use the
>> existing file mechanism to put data there too.  (But then we have to
>> maintain both the current kv + file approach and not go all-in on kv +
>> block.)
>>
>> Thoughts?
>> sage
>> --
>
> I really hate the idea of making a new file system type (even if we call it a raw block store!).
>
> In addition to the technical hurdles, there are also production worries like how long will it take for distros to pick up formal support?  How do we test it properly?
>
> Regards,
>
> Ric
>
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-21 13:32           ` Sage Weil
@ 2015-10-21 13:50             ` Ric Wheeler
  2015-10-23  6:21               ` Howard Chu
  0 siblings, 1 reply; 71+ messages in thread
From: Ric Wheeler @ 2015-10-21 13:50 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, John Spray, Ceph Development

On 10/21/2015 09:32 AM, Sage Weil wrote:
> On Tue, 20 Oct 2015, Ric Wheeler wrote:
>>> Now:
>>>       1 io  to write a new file
>>>     1-2 ios to sync the fs journal (commit the inode, alloc change)
>>>             (I see 2 journal IOs on XFS and only 1 on ext4...)
>>>       1 io  to commit the rocksdb journal (currently 3, but will drop to
>>>             1 with xfs fix and my rocksdb change)
>> I think that might be too pessimistic - the number of discrete IO's sent down
>> to a spinning disk make much less impact on performance than the number of
>> fsync()'s since they IO's all land in the write cache.  Some newer spinning
>> drives have a non-volatile write cache, so even an fsync() might not end up
>> doing the expensive data transfer to the platter.
> True, but in XFS's case at least the file data and journal are not
> colocated, so its 2 seeks for the new file write+fdatasync and another for
> the rocksdb journal commit.  Of course, with a deep queue, we're doing
> lots of these so there's be fewer journal commits on both counts, but the
> lower bound on latency of a single write is still 3 seeks, and that bound
> is pretty critical when you also have network round trips and replication
> (worst out of 2) on top.

What are the performance goals we are looking for?

Small, synchronous writes/second?

File creates/second?

I suspect that looking at things like seeks per write is probably looking at the 
wrong level of the performance challenge.  Again, when you write to a modern drive, 
you write to its write cache and it decides internally when/how to destage to 
the platter.

If you look at the performance of XFS with streaming workloads, it will tend to 
max out the bandwidth of the underlying storage.

If we need IOPS, file creates/writes per second, etc., we should be clear on what 
we are aiming at.

>
>> It would be interesting to get the timings on the IO's you see to measure the
>> actual impact.
> I observed this with the journaling workload for rocksdb, but I assume the
> journaling behavior is the same regardless of what is being journaled.
> For a 4KB append to a file + fdatasync, I saw ~30ms latency for XFS, and
> blktrace showed an IO to the file, and 2 IOs to the journal.  I believe
> the first one is the record for the inode update, and the second is the
> journal 'commit' record (though I forget how I decided that).  My guess is
> that XFS is being extremely careful about journal integrity here and not
> writing the commit record until it knows that the preceding records landed
> on stable storage.  For ext4, the latency was about ~20ms, and blktrace
> showed the IO to the file and then a single journal IO.  When I made the
> rocksdb change to overwrite an existing, prewritten file, the latency
> dropped to ~10ms on ext4, and blktrace showed a single IO as expected.
> (XFS still showed the 2 journal commit IOs, but Dave just posted the fix
> for that on the XFS list today.)

Right, if we want to avoid metadata-related IOs, we can preallocate a file and 
use O_DIRECT. Effectively, there should be no updates outside of the data write 
itself.  This is not only a performance optimization; we would also avoid redoing 
allocation and defragmentation work.

Normally, best practice is to use batching to avoid paying worst-case latency 
on every synchronous IO. Write a batch of files or appends without fsync, then go 
back and fsync, and you pay that latency once per batch (not per file/op), roughly 
as in the sketch below.
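A minimal sketch of that pattern (file names, batch size, and write size are 
made up here, and error handling is trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BATCH 16

int main(void)
{
    int fds[BATCH];
    char name[32];
    static const char payload[4096];    /* zero-filled 4KB append */

    /* Submit the whole batch without syncing anything yet. */
    for (int i = 0; i < BATCH; i++) {
        snprintf(name, sizeof(name), "obj-%d.dat", i);
        fds[i] = open(name, O_CREAT | O_WRONLY | O_APPEND, 0600);
        if (fds[i] < 0 || write(fds[i], payload, sizeof(payload)) < 0)
            return 1;
    }

    /* One sync pass: by the time the later fsync()s run, most of the data
     * and journal blocks are already in flight, so the worst-case commit
     * latency is paid roughly once per batch instead of once per file. */
    for (int i = 0; i < BATCH; i++) {
        if (fsync(fds[i]) < 0)
            return 1;
        close(fds[i]);
    }
    return 0;
}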

>
>> Plumbing for T10 DIF/DIX already exist, what is missing is the normal block
>> device that handles them (not enterprise SAS/disk array class)
> Yeah... which unfortunately means that unless the cheap drives
> suddenly start shipping if DIF/DIX support we'll need to do the
> checksums ourselves.  This is probably a good thing anyway as it doesn't
> constrain our choice of checksum or checksum granularity, and will
> still work with other storage devices (ssds, nvme, etc.).
>
> sage

Might be interesting to see if a device mapper target could be written to 
support DIF/DIX.  For what it's worth, XFS developers have talked loosely about 
looking at data block checksums (they could do something like btrfs does and 
store the checksums in another btree).
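If we do end up doing checksums ourselves above the block layer, the mechanics 
are not the hard part.  A rough sketch (CRC-32C and 4 KiB granularity are picked 
arbitrarily here for illustration; both are open design questions):

#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli); slow but dependency-free, for illustration. */
static uint32_t crc32c(uint32_t crc, const uint8_t *p, size_t len)
{
    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

#define CSUM_BLOCK 4096u

/* Fill out[] with one checksum per 4 KiB block of buf (out needs room for
 * len/4096 entries, rounded up); returns the number of blocks.  The caller
 * would persist out[] in the same kv transaction that commits the write so
 * that data and checksums stay consistent. */
static size_t csum_blocks(const uint8_t *buf, size_t len, uint32_t *out)
{
    size_t n = 0;
    for (size_t off = 0; off < len; off += CSUM_BLOCK, n++) {
        size_t chunk = (len - off < CSUM_BLOCK) ? len - off : CSUM_BLOCK;
        out[n] = crc32c(0, buf + off, chunk);
    }
    return n;
}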

ric


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-21 11:24     ` Ric Wheeler
@ 2015-10-21 14:14       ` Mark Nelson
  2015-10-21 15:51         ` Ric Wheeler
  2015-10-22  0:53       ` Allen Samuels
  1 sibling, 1 reply; 71+ messages in thread
From: Mark Nelson @ 2015-10-21 14:14 UTC (permalink / raw)
  To: Ric Wheeler, Allen Samuels, Sage Weil, ceph-devel



On 10/21/2015 06:24 AM, Ric Wheeler wrote:
>
>
> On 10/21/2015 06:06 AM, Allen Samuels wrote:
>> I agree that moving newStore to raw block is going to be a significant
>> development effort. But the current scheme of using a KV store
>> combined with a normal file system is always going to be problematic
>> (FileStore or NewStore). This is caused by the transactional
>> requirements of the ObjectStore interface, essentially you need to
>> make transactionally consistent updates to two indexes, one of which
>> doesn't understand transactions (File Systems) and can never be
>> tightly-connected to the other one.
>>
>> You'll always be able to make this "loosely coupled" approach work,
>> but it will never be optimal. The real question is whether the
>> performance difference of a suboptimal implementation is something
>> that you can live with compared to the longer gestation period of the
>> more optimal implementation. Clearly, Sage believes that the
>> performance difference is significant or he wouldn't have kicked off
>> this discussion in the first place.
>
> I think that we need to work with the existing stack - measure and do
> some collaborative analysis - before we throw out decades of work.  Very
> hard to understand why the local file system is a barrier for
> performance in this case when it is not an issue in existing enterprise
> applications.
>
> We need some deep analysis with some local file system experts thrown in
> to validate the concerns.

I think Sage has been working pretty closely with the XFS guys to 
uncover these kinds of issues.  I know if I encounter something fairly 
FS specific I try to drag Eric or Dave in.  I think the core of the 
problem is that we often find ourselves exercising filesystems in pretty 
unusual ways.  While it's probably good that we add this kind of 
coverage and help work out somewhat esoteric bugs, I think it does make 
our job of making Ceph perform well harder.  One example:  I had been 
telling folks for several years to favor dentry and inode cache due to 
the way our PG directory splitting works (backed by test results), but 
then Sage discovered:

http://www.spinics.net/lists/ceph-devel/msg25644.html

This is just one example of how very nuanced our performance story is. 
I can keep many users at least semi-engaged when talking about objects 
being laid out in a nested directory structure, how dentry/inode cache 
affects that in a general sense, etc.  But combine the kind of subtlety 
in the link above with the vastness of things in the data path that can 
hurt performance, and people generally just can't wrap their heads 
around all of it (With the exception of some of the very smart folks on 
this mailing list!)

One of my biggest concerns going forward is reducing the user-facing 
complexity of our performance story.  The question I ask myself is: 
Does keeping Ceph on a FS help us or hurt us in that regard?

>
>>
>> While I think we can all agree that writing a full-up KV and raw-block
>> ObjectStore is a significant amount of work. I will offer the case
>> that the "loosely couple" scheme may not have as much time-to-market
>> advantage as it appears to have. One example: NewStore performance is
>> limited due to bugs in XFS that won't be fixed in the field for quite
>> some time (it'll take at least a couple of years before a patched
>> version of XFS will be widely deployed at customer environments).
>
> Not clear what bugs you are thinking of or why you think fixing bugs
> will take a long time to hit the field in XFS. Red Hat has most of the
> XFS developers on staff and we actively backport fixes and ship them,
> other distros do as well.
>
> Never seen a "bug" take a couple of years to hit users.

Maybe a good way to start out would be to see how quickly we can get the 
patch dchinner posted here:

http://oss.sgi.com/archives/xfs/2015-10/msg00545.html

rolled out into RHEL/CentOS/Ubuntu.  I have no idea how long these 
things typically take, but this might be a good test case.

>
> Regards,
>
> Ric
>
>>
>> Another example: Sage has just had to substantially rework the
>> journaling code of rocksDB.
>>
>> In short, as you can tell, I'm full throated in favor of going down
>> the optimal route.
>>
>> Internally at Sandisk, we have a KV store that is optimized for flash
>> (it's called ZetaScale). We have extended it with a raw block
>> allocator just as Sage is now proposing to do. Our internal
>> performance measurements show a significant advantage over the current
>> NewStore. That performance advantage stems primarily from two things:
>>
>> (1) ZetaScale uses a B+-tree internally rather than an LSM tree
>> (levelDB/RocksDB). LSM trees experience exponential increase in write
>> amplification (cost of an insert) as the amount of data under
>> management increases. B+tree write-amplification is nearly constant
>> independent of the size of data under management. As the KV database
>> gets larger (Since newStore is effectively moving the per-file inode
>> into the kv data base. Don't forget checksums that Sage want's to add
>> :)) this performance delta swamps all others.
>> (2) Having a KV and a file-system causes a double lookup. This costs
>> CPU time and disk accesses to page in data structure indexes, metadata
>> efficiency decreases.
>>
>> You can't avoid (2) as long as you're using a file system.
>>
>> Yes an LSM tree performs better on HDD than does a B-tree, which is a
>> good argument for keeping the KV module pluggable.
>>
>>
>> Allen Samuels
>> Software Architect, Fellow, Systems and Software Solutions
>>
>> 2880 Junction Avenue, San Jose, CA 95134
>> T: +1 408 801 7030| M: +1 408 780 6416
>> allen.samuels@SanDisk.com
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Ric Wheeler
>> Sent: Tuesday, October 20, 2015 11:32 AM
>> To: Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org
>> Subject: Re: newstore direction
>>
>> On 10/19/2015 03:49 PM, Sage Weil wrote:
>>> The current design is based on two simple ideas:
>>>
>>>    1) a key/value interface is better way to manage all of our internal
>>> metadata (object metadata, attrs, layout, collection membership,
>>> write-ahead logging, overlay data, etc.)
>>>
>>>    2) a file system is well suited for storage object data (as files).
>>>
>>> So far 1 is working out well, but I'm questioning the wisdom of #2.  A
>>> few
>>> things:
>>>
>>>    - We currently write the data to the file, fsync, then commit the kv
>>> transaction.  That's at least 3 IOs: one for the data, one for the fs
>>> journal, one for the kv txn to commit (at least once my rocksdb
>>> changes land... the kv commit is currently 2-3).  So two people are
>>> managing metadata, here: the fs managing the file metadata (with its
>>> own
>>> journal) and the kv backend (with its journal).
>> If all of the fsync()'s fall into the same backing file system, are
>> you sure that each fsync() takes the same time? Depending on the local
>> FS implementation of course, but the order of issuing those fsync()'s
>> can effectively make some of them no-ops.
>>
>>>    - On read we have to open files by name, which means traversing the
>>> fs namespace.  Newstore tries to keep it as flat and simple as
>>> possible, but at a minimum it is a couple btree lookups.  We'd love to
>>> use open by handle (which would reduce this to 1 btree traversal), but
>>> running the daemon as ceph and not root makes that hard...
>> This seems like a a pretty low hurdle to overcome.
>>
>>>    - ...and file systems insist on updating mtime on writes, even when
>>> it is a overwrite with no allocation changes.  (We don't care about
>>> mtime.) O_NOCMTIME patches exist but it is hard to get these past the
>>> kernel brainfreeze.
>> Are you using O_DIRECT? Seems like there should be some enterprisey
>> database tricks that we can use here.
>>
>>>    - XFS is (probably) never going going to give us data checksums,
>>> which we want desperately.
>> What is the goal of having the file system do the checksums? How
>> strong do they need to be and what size are the chunks?
>>
>> If you update this on each IO, this will certainly generate more IO
>> (each write will possibly generate at least one other write to update
>> that new checksum).
>>
>>> But what's the alternative?  My thought is to just bite the bullet and
>>> consume a raw block device directly.  Write an allocator, hopefully
>>> keep it pretty simple, and manage it in kv store along with all of our
>>> other metadata.
>> The big problem with consuming block devices directly is that you
>> ultimately end up recreating most of the features that you had in the
>> file system. Even enterprise databases like Oracle and DB2 have been
>> migrating away from running on raw block devices in favor of file
>> systems over time.  In effect, you are looking at making a simple on
>> disk file system which is always easier to start than it is to get
>> back to a stable, production ready state.
>>
>> I think that it might be quicker and more maintainable to spend some
>> time working with the local file system people (XFS or other) to see
>> if we can jointly address the concerns you have.
>>> Wins:
>>>
>>>    - 2 IOs for most: one to write the data to unused space in the block
>>> device, one to commit our transaction (vs 4+ before).  For overwrites,
>>> we'd have one io to do our write-ahead log (kv journal), then do the
>>> overwrite async (vs 4+ before).
>>>
>>>    - No concern about mtime getting in the way
>>>
>>>    - Faster reads (no fs lookup)
>>>
>>>    - Similarly sized metadata for most objects.  If we assume most
>>> objects are not fragmented, then the metadata to store the block
>>> offsets is about the same size as the metadata to store the filenames
>>> we have now.
>>>
>>> Problems:
>>>
>>>    - We have to size the kv backend storage (probably still an XFS
>>> partition) vs the block storage.  Maybe we do this anyway (put
>>> metadata on
>>> SSD!) so it won't matter.  But what happens when we are storing gobs
>>> of rgw index data or cephfs metadata?  Suddenly we are pulling storage
>>> out of a different pool and those aren't currently fungible.
>>>
>>>    - We have to write and maintain an allocator.  I'm still optimistic
>>> this can be reasonbly simple, especially for the flash case (where
>>> fragmentation isn't such an issue as long as our blocks are reasonbly
>>> sized).  For disk we may beed to be moderately clever.
>>>
>>>    - We'll need a fsck to ensure our internal metadata is consistent.
>>> The good news is it'll just need to validate what we have stored in
>>> the kv store.
>>>
>>> Other thoughts:
>>>
>>>    - We might want to consider whether dm-thin or bcache or other block
>>> layers might help us with elasticity of file vs block areas.
>>>
>>>    - Rocksdb can push colder data to a second directory, so we could
>>> have a fast ssd primary area (for wal and most metadata) and a second
>>> hdd directory for stuff it has to push off.  Then have a conservative
>>> amount of file space on the hdd.  If our block fills up, use the
>>> existing file mechanism to put data there too.  (But then we have to
>>> maintain both the current kv + file approach and not go all-in on kv +
>>> block.)
>>>
>>> Thoughts?
>>> sage
>>> --
>> I really hate the idea of making a new file system type (even if we
>> call it a raw block store!).
>>
>> In addition to the technical hurdles, there are also production
>> worries like how long will it take for distros to pick up formal
>> support?  How do we test it properly?
>>
>> Regards,
>>
>> Ric
>>
>>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-21 14:14       ` Mark Nelson
@ 2015-10-21 15:51         ` Ric Wheeler
  2015-10-21 19:37           ` Mark Nelson
  0 siblings, 1 reply; 71+ messages in thread
From: Ric Wheeler @ 2015-10-21 15:51 UTC (permalink / raw)
  To: Mark Nelson, Allen Samuels, Sage Weil, ceph-devel

On 10/21/2015 10:14 AM, Mark Nelson wrote:
>
>
> On 10/21/2015 06:24 AM, Ric Wheeler wrote:
>>
>>
>> On 10/21/2015 06:06 AM, Allen Samuels wrote:
>>> I agree that moving newStore to raw block is going to be a significant
>>> development effort. But the current scheme of using a KV store
>>> combined with a normal file system is always going to be problematic
>>> (FileStore or NewStore). This is caused by the transactional
>>> requirements of the ObjectStore interface, essentially you need to
>>> make transactionally consistent updates to two indexes, one of which
>>> doesn't understand transactions (File Systems) and can never be
>>> tightly-connected to the other one.
>>>
>>> You'll always be able to make this "loosely coupled" approach work,
>>> but it will never be optimal. The real question is whether the
>>> performance difference of a suboptimal implementation is something
>>> that you can live with compared to the longer gestation period of the
>>> more optimal implementation. Clearly, Sage believes that the
>>> performance difference is significant or he wouldn't have kicked off
>>> this discussion in the first place.
>>
>> I think that we need to work with the existing stack - measure and do
>> some collaborative analysis - before we throw out decades of work.  Very
>> hard to understand why the local file system is a barrier for
>> performance in this case when it is not an issue in existing enterprise
>> applications.
>>
>> We need some deep analysis with some local file system experts thrown in
>> to validate the concerns.
>
> I think Sage has been working pretty closely with the XFS guys to uncover 
> these kinds of issues.  I know if I encounter something fairly FS specific I 
> try to drag Eric or Dave in.  I think the core of the problem is that we often 
> find ourselves exercising filesystems in pretty unusual ways.  While it's 
> probably good that we add this kind of coverage and help work out somewhat 
> esoteric bugs, I think it does make our job of making Ceph perform well 
> harder.  One example:  I had been telling folks for several years to favor 
> dentry and inode cache due to the way our PG directory splitting works (backed 
> by test results), but then Sage discovered:
>
> http://www.spinics.net/lists/ceph-devel/msg25644.html
>
> This is just one example of how very nuanced our performance story is. I can 
> keep many users at least semi-engaged when talking about objects being laid 
> out in a nested directory structure, how dentry/inode cache affects that in a 
> general sense, etc.  But combine the kind of subtlety in the link above with 
> the vastness of things in the data path that can hurt performance, and people 
> generally just can't wrap their heads around all of it (With the exception of 
> some of the very smart folks on this mailing list!)
>
> One of my biggest concerns going forward is reducing the user-facing 
> complexity of our performance story.  The question I ask myself is: Does 
> keeping Ceph on a FS help us or hurt us in that regard?

The upshot of that is that this kind of micro-optimization is already handled by 
the file system, so the application's job should be easier. Better to fsync() each 
file you care about from the application than to worry about using more obscure 
calls.

>
>>
>>>
>>> While I think we can all agree that writing a full-up KV and raw-block
>>> ObjectStore is a significant amount of work. I will offer the case
>>> that the "loosely couple" scheme may not have as much time-to-market
>>> advantage as it appears to have. One example: NewStore performance is
>>> limited due to bugs in XFS that won't be fixed in the field for quite
>>> some time (it'll take at least a couple of years before a patched
>>> version of XFS will be widely deployed at customer environments).
>>
>> Not clear what bugs you are thinking of or why you think fixing bugs
>> will take a long time to hit the field in XFS. Red Hat has most of the
>> XFS developers on staff and we actively backport fixes and ship them,
>> other distros do as well.
>>
>> Never seen a "bug" take a couple of years to hit users.
>
> Maybe a good way to start out would be to see how quickly we can get the patch 
> dchinner posted here:
>
> http://oss.sgi.com/archives/xfs/2015-10/msg00545.html
>
> rolled out into RHEL/CentOS/Ubuntu.  I have no idea how long these things 
> typically take, but this might be a good test case.

How quickly things land in a distro is up to the interested parties making the 
case for it.

Ric

>
>>
>> Regards,
>>
>> Ric
>>
>>>
>>> Another example: Sage has just had to substantially rework the
>>> journaling code of rocksDB.
>>>
>>> In short, as you can tell, I'm full throated in favor of going down
>>> the optimal route.
>>>
>>> Internally at Sandisk, we have a KV store that is optimized for flash
>>> (it's called ZetaScale). We have extended it with a raw block
>>> allocator just as Sage is now proposing to do. Our internal
>>> performance measurements show a significant advantage over the current
>>> NewStore. That performance advantage stems primarily from two things:
>>>
>>> (1) ZetaScale uses a B+-tree internally rather than an LSM tree
>>> (levelDB/RocksDB). LSM trees experience exponential increase in write
>>> amplification (cost of an insert) as the amount of data under
>>> management increases. B+tree write-amplification is nearly constant
>>> independent of the size of data under management. As the KV database
>>> gets larger (Since newStore is effectively moving the per-file inode
>>> into the kv data base. Don't forget checksums that Sage want's to add
>>> :)) this performance delta swamps all others.
>>> (2) Having a KV and a file-system causes a double lookup. This costs
>>> CPU time and disk accesses to page in data structure indexes, metadata
>>> efficiency decreases.
>>>
>>> You can't avoid (2) as long as you're using a file system.
>>>
>>> Yes an LSM tree performs better on HDD than does a B-tree, which is a
>>> good argument for keeping the KV module pluggable.
>>>
>>>
>>> Allen Samuels
>>> Software Architect, Fellow, Systems and Software Solutions
>>>
>>> 2880 Junction Avenue, San Jose, CA 95134
>>> T: +1 408 801 7030| M: +1 408 780 6416
>>> allen.samuels@SanDisk.com
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Ric Wheeler
>>> Sent: Tuesday, October 20, 2015 11:32 AM
>>> To: Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org
>>> Subject: Re: newstore direction
>>>
>>> On 10/19/2015 03:49 PM, Sage Weil wrote:
>>>> The current design is based on two simple ideas:
>>>>
>>>>    1) a key/value interface is better way to manage all of our internal
>>>> metadata (object metadata, attrs, layout, collection membership,
>>>> write-ahead logging, overlay data, etc.)
>>>>
>>>>    2) a file system is well suited for storage object data (as files).
>>>>
>>>> So far 1 is working out well, but I'm questioning the wisdom of #2.  A
>>>> few
>>>> things:
>>>>
>>>>    - We currently write the data to the file, fsync, then commit the kv
>>>> transaction.  That's at least 3 IOs: one for the data, one for the fs
>>>> journal, one for the kv txn to commit (at least once my rocksdb
>>>> changes land... the kv commit is currently 2-3).  So two people are
>>>> managing metadata, here: the fs managing the file metadata (with its
>>>> own
>>>> journal) and the kv backend (with its journal).
>>> If all of the fsync()'s fall into the same backing file system, are
>>> you sure that each fsync() takes the same time? Depending on the local
>>> FS implementation of course, but the order of issuing those fsync()'s
>>> can effectively make some of them no-ops.
>>>
>>>>    - On read we have to open files by name, which means traversing the
>>>> fs namespace.  Newstore tries to keep it as flat and simple as
>>>> possible, but at a minimum it is a couple btree lookups. We'd love to
>>>> use open by handle (which would reduce this to 1 btree traversal), but
>>>> running the daemon as ceph and not root makes that hard...
>>> This seems like a a pretty low hurdle to overcome.
>>>
>>>>    - ...and file systems insist on updating mtime on writes, even when
>>>> it is a overwrite with no allocation changes.  (We don't care about
>>>> mtime.) O_NOCMTIME patches exist but it is hard to get these past the
>>>> kernel brainfreeze.
>>> Are you using O_DIRECT? Seems like there should be some enterprisey
>>> database tricks that we can use here.
>>>
>>>>    - XFS is (probably) never going going to give us data checksums,
>>>> which we want desperately.
>>> What is the goal of having the file system do the checksums? How
>>> strong do they need to be and what size are the chunks?
>>>
>>> If you update this on each IO, this will certainly generate more IO
>>> (each write will possibly generate at least one other write to update
>>> that new checksum).
>>>
>>>> But what's the alternative?  My thought is to just bite the bullet and
>>>> consume a raw block device directly.  Write an allocator, hopefully
>>>> keep it pretty simple, and manage it in kv store along with all of our
>>>> other metadata.
>>> The big problem with consuming block devices directly is that you
>>> ultimately end up recreating most of the features that you had in the
>>> file system. Even enterprise databases like Oracle and DB2 have been
>>> migrating away from running on raw block devices in favor of file
>>> systems over time.  In effect, you are looking at making a simple on
>>> disk file system which is always easier to start than it is to get
>>> back to a stable, production ready state.
>>>
>>> I think that it might be quicker and more maintainable to spend some
>>> time working with the local file system people (XFS or other) to see
>>> if we can jointly address the concerns you have.
>>>> Wins:
>>>>
>>>>    - 2 IOs for most: one to write the data to unused space in the block
>>>> device, one to commit our transaction (vs 4+ before).  For overwrites,
>>>> we'd have one io to do our write-ahead log (kv journal), then do the
>>>> overwrite async (vs 4+ before).
>>>>
>>>>    - No concern about mtime getting in the way
>>>>
>>>>    - Faster reads (no fs lookup)
>>>>
>>>>    - Similarly sized metadata for most objects.  If we assume most
>>>> objects are not fragmented, then the metadata to store the block
>>>> offsets is about the same size as the metadata to store the filenames
>>>> we have now.
>>>>
>>>> Problems:
>>>>
>>>>    - We have to size the kv backend storage (probably still an XFS
>>>> partition) vs the block storage.  Maybe we do this anyway (put
>>>> metadata on
>>>> SSD!) so it won't matter.  But what happens when we are storing gobs
>>>> of rgw index data or cephfs metadata?  Suddenly we are pulling storage
>>>> out of a different pool and those aren't currently fungible.
>>>>
>>>>    - We have to write and maintain an allocator.  I'm still optimistic
>>>> this can be reasonbly simple, especially for the flash case (where
>>>> fragmentation isn't such an issue as long as our blocks are reasonbly
>>>> sized).  For disk we may beed to be moderately clever.
>>>>
>>>>    - We'll need a fsck to ensure our internal metadata is consistent.
>>>> The good news is it'll just need to validate what we have stored in
>>>> the kv store.
>>>>
>>>> Other thoughts:
>>>>
>>>>    - We might want to consider whether dm-thin or bcache or other block
>>>> layers might help us with elasticity of file vs block areas.
>>>>
>>>>    - Rocksdb can push colder data to a second directory, so we could
>>>> have a fast ssd primary area (for wal and most metadata) and a second
>>>> hdd directory for stuff it has to push off.  Then have a conservative
>>>> amount of file space on the hdd.  If our block fills up, use the
>>>> existing file mechanism to put data there too.  (But then we have to
>>>> maintain both the current kv + file approach and not go all-in on kv +
>>>> block.)
>>>>
>>>> Thoughts?
>>>> sage
>>>> -- 
>>> I really hate the idea of making a new file system type (even if we
>>> call it a raw block store!).
>>>
>>> In addition to the technical hurdles, there are also production
>>> worries like how long will it take for distros to pick up formal
>>> support?  How do we test it properly?
>>>
>>> Regards,
>>>
>>> Ric
>>>
>>>


^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: newstore direction
  2015-10-21 13:35               ` Mark Nelson
@ 2015-10-21 16:10                 ` Chen, Xiaoxi
  2015-10-22  1:09                   ` Allen Samuels
  0 siblings, 1 reply; 71+ messages in thread
From: Chen, Xiaoxi @ 2015-10-21 16:10 UTC (permalink / raw)
  To: Mark Nelson, Allen Samuels, Sage Weil
  Cc: James (Fei) Liu-SSI, Somnath Roy, ceph-devel

We did evaluate whether NVMKV could be implemented on non-FusionIO SSDs, i.e. whether we could re-invent an NVMKV. The conclusion was that it is not hard with persistent memory (which will be available soon), but NVMKV will not work if no PM is present---persisting the hash table to SSD is not practical.

Range queries seem like less of an issue, since the random read performance of today's SSDs is more than enough: even if we turn all sequential reads into random ones (typically 70-80K IOPS, which is ~300MB/s), performance is still good enough.

Anyway, for the high-IOPS case I think it is hard for a consumer of the device to do the right thing across SSDs from different vendors. It would be better to leave that to the SSD vendor, with something like OpenStack Cinder's structure: each vendor is responsible for maintaining its driver for Ceph and for its performance.
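Just to illustrate the idea, the vendor-facing surface could be as small as a 
table of hooks like the one below (the names are invented for this sketch; this 
is not the existing KeyValueDB/ObjectStore API).  Each vendor would ship and 
maintain its own implementation:

#include <stddef.h>

/* One logged mutation inside a transaction. */
struct kv_txn_op {
    enum { KV_PUT, KV_DEL } type;
    const void *key;  size_t klen;
    const void *val;  size_t vlen;      /* ignored for KV_DEL */
};

/* Hooks a vendor backend would have to provide. */
struct kv_backend_ops {
    int  (*open)(const char *device, void **handle);
    void (*close)(void *handle);
    int  (*get)(void *handle, const void *key, size_t klen,
                void *val, size_t *vlen);
    /* Ordered iteration: needed for enumeration/scrub, the piece a pure
     * hash-based store like NVMKV does not give you. */
    int  (*iter_lower_bound)(void *handle, const void *key, size_t klen,
                             void **iter);
    int  (*iter_next)(void *iter, const void **key, size_t *klen,
                      const void **val, size_t *vlen);
    /* Atomic batch commit: the transactional guarantee the ObjectStore
     * interface needs from whatever sits underneath. */
    int  (*submit_txn)(void *handle, const struct kv_txn_op *ops,
                       size_t nops);
};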

> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Wednesday, October 21, 2015 9:36 PM
> To: Allen Samuels; Sage Weil; Chen, Xiaoxi
> Cc: James (Fei) Liu-SSI; Somnath Roy; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
> 
> Thanks Allen!  The devil is always in the details.  Know of anything else that
> looks promising?
> 
> Mark
> 
> On 10/21/2015 05:06 AM, Allen Samuels wrote:
> > I doubt that NVMKV will be useful for two reasons:
> >
> > (1) It relies on the unique sparse-mapping addressing capabilities of
> > the FusionIO VSL interface, it won't run on standard SSDs
> > (2) NVMKV doesn't provide any form of in-order enumeration (i.e., no
> range operations on keys). This is pretty much required for deep scrubbing.
> >
> >
> > Allen Samuels
> > Software Architect, Fellow, Systems and Software Solutions
> >
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org
> > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> > Sent: Tuesday, October 20, 2015 6:20 AM
> > To: Sage Weil <sweil@redhat.com>; Chen, Xiaoxi <xiaoxi.chen@intel.com>
> > Cc: James (Fei) Liu-SSI <james.liu@ssi.samsung.com>; Somnath Roy
> > <Somnath.Roy@sandisk.com>; ceph-devel@vger.kernel.org
> > Subject: Re: newstore direction
> >
> > On 10/20/2015 07:30 AM, Sage Weil wrote:
> >> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:
> >>> +1, nowadays K-V DB care more about very small key-value pairs, say
> >>> several bytes to a few KB, but in SSD case we only care about 4KB or
> >>> 8KB. In this way, NVMKV is a good design and seems some of the SSD
> >>> vendor are also trying to build this kind of interface, we had a
> >>> NVM-L library but still under development.
> >>
> >> Do you have an NVMKV link?  I see a paper and a stale github repo..
> >> not sure if I'm looking at the right thing.
> >>
> >> My concern with using a key/value interface for the object data is
> >> that you end up with lots of key/value pairs (e.g., $inode_$offset =
> >> $4kb_of_data) that is pretty inefficient to store and (depending on
> >> the
> >> implementation) tends to break alignment.  I don't think these
> >> interfaces are targetted toward block-sized/aligned payloads.
> >> Storing just the metadata (block allocation map) w/ the kv api and
> >> storing the data directly on a block/page interface makes more sense to
> me.
> >>
> >> sage
> >
> > I get the feeling that some of the folks that were involved with nvmkv at
> Fusion IO have left.  Nisha Talagala is now out at Parallel Systems for instance.
> http://pmem.io might be a better bet, though I haven't looked closely at it.
> >
> > Mark
> >
> >>
> >>
> >>>> -----Original Message-----
> >>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> >>>> owner@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
> >>>> Sent: Tuesday, October 20, 2015 6:21 AM
> >>>> To: Sage Weil; Somnath Roy
> >>>> Cc: ceph-devel@vger.kernel.org
> >>>> Subject: RE: newstore direction
> >>>>
> >>>> Hi Sage and Somnath,
> >>>>     In my humble opinion, There is another more aggressive
> >>>> solution than raw block device base keyvalue store as backend for
> >>>> objectstore. The new key value  SSD device with transaction support
> would be  ideal to solve the issues.
> >>>> First of all, it is raw SSD device. Secondly , It provides key
> >>>> value interface directly from SSD. Thirdly, it can provide
> >>>> transaction support, consistency will be guaranteed by hardware
> >>>> device. It pretty much satisfied all of objectstore needs without
> >>>> any extra overhead since there is not any extra layer in between device
> and objectstore.
> >>>>      Either way, I strongly support to have CEPH own data format
> >>>> instead of relying on filesystem.
> >>>>
> >>>>     Regards,
> >>>>     James
> >>>>
> >>>> -----Original Message-----
> >>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> >>>> owner@vger.kernel.org] On Behalf Of Sage Weil
> >>>> Sent: Monday, October 19, 2015 1:55 PM
> >>>> To: Somnath Roy
> >>>> Cc: ceph-devel@vger.kernel.org
> >>>> Subject: RE: newstore direction
> >>>>
> >>>> On Mon, 19 Oct 2015, Somnath Roy wrote:
> >>>>> Sage,
> >>>>> I fully support that.  If we want to saturate SSDs , we need to
> >>>>> get rid of this filesystem overhead (which I am in process of
> measuring).
> >>>>> Also, it will be good if we can eliminate the dependency on the
> >>>>> k/v dbs (for storing allocators and all). The reason is the
> >>>>> unknown write amps they causes.
> >>>>
> >>>> My hope is to keep behing the KeyValueDB interface (and/more
> change
> >>>> it as
> >>>> appropriate) so that other backends can be easily swapped in (e.g.
> >>>> a
> >>>> btree- based one for high-end flash).
> >>>>
> >>>> sage
> >>>>
> >>>>
> >>>>>
> >>>>> Thanks & Regards
> >>>>> Somnath
> >>>>>
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: ceph-devel-owner@vger.kernel.org
> >>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> >>>>> Sent: Monday, October 19, 2015 12:49 PM
> >>>>> To: ceph-devel@vger.kernel.org
> >>>>> Subject: newstore direction
> >>>>>
> >>>>> The current design is based on two simple ideas:
> >>>>>
> >>>>>    1) a key/value interface is better way to manage all of our
> >>>>> internal metadata (object metadata, attrs, layout, collection
> >>>>> membership, write-ahead logging, overlay data, etc.)
> >>>>>
> >>>>>    2) a file system is well suited for storage object data (as files).
> >>>>>
> >>>>> So far 1 is working out well, but I'm questioning the wisdom of #2.
> >>>>> A few
> >>>>> things:
> >>>>>
> >>>>>    - We currently write the data to the file, fsync, then commit
> >>>>> the kv transaction.  That's at least 3 IOs: one for the data, one
> >>>>> for the fs journal, one for the kv txn to commit (at least once my
> >>>>> rocksdb changes land... the kv commit is currently 2-3).  So two
> >>>>> people are managing metadata, here: the fs managing the file
> >>>>> metadata (with its own
> >>>>> journal) and the kv backend (with its journal).
> >>>>>
> >>>>>    - On read we have to open files by name, which means traversing
> >>>>> the fs
> >>>> namespace.  Newstore tries to keep it as flat and simple as
> >>>> possible, but at a minimum it is a couple btree lookups.  We'd love
> >>>> to use open by handle (which would reduce this to 1 btree
> >>>> traversal), but running the daemon as ceph and not root makes that
> hard...
> >>>>>
> >>>>>    - ...and file systems insist on updating mtime on writes, even
> >>>>> when it is a
> >>>> overwrite with no allocation changes.  (We don't care about mtime.)
> >>>> O_NOCMTIME patches exist but it is hard to get these past the
> >>>> kernel brainfreeze.
> >>>>>
> >>>>>    - XFS is (probably) never going going to give us data
> >>>>> checksums, which we
> >>>> want desperately.
> >>>>>
> >>>>> But what's the alternative?  My thought is to just bite the bullet
> >>>>> and
> >>>> consume a raw block device directly.  Write an allocator, hopefully
> >>>> keep it pretty simple, and manage it in kv store along with all of our
> other metadata.
> >>>>>
> >>>>> Wins:
> >>>>>
> >>>>>    - 2 IOs for most: one to write the data to unused space in the
> >>>>> block device,
> >>>> one to commit our transaction (vs 4+ before).  For overwrites, we'd
> >>>> have one io to do our write-ahead log (kv journal), then do the
> >>>> overwrite async (vs 4+ before).
> >>>>>
> >>>>>    - No concern about mtime getting in the way
> >>>>>
> >>>>>    - Faster reads (no fs lookup)
> >>>>>
> >>>>>    - Similarly sized metadata for most objects.  If we assume most
> >>>>> objects are
> >>>> not fragmented, then the metadata to store the block offsets is
> >>>> about the same size as the metadata to store the filenames we have
> now.
> >>>>>
> >>>>> Problems:
> >>>>>
> >>>>>    - We have to size the kv backend storage (probably still an XFS
> >>>>> partition) vs the block storage.  Maybe we do this anyway (put
> >>>>> metadata on
> >>>>> SSD!) so it won't matter.  But what happens when we are storing
> >>>>> gobs of
> >>>> rgw index data or cephfs metadata?  Suddenly we are pulling storage
> >>>> out of a different pool and those aren't currently fungible.
> >>>>>
> >>>>>    - We have to write and maintain an allocator.  I'm still
> >>>>> optimistic this can be
> >>>> reasonbly simple, especially for the flash case (where
> >>>> fragmentation isn't such an issue as long as our blocks are
> >>>> reasonbly sized).  For disk we may beed to be moderately clever.
> >>>>>
> >>>>>    - We'll need a fsck to ensure our internal metadata is
> >>>>> consistent.  The good
> >>>> news is it'll just need to validate what we have stored in the kv store.
> >>>>>
> >>>>> Other thoughts:
> >>>>>
> >>>>>    - We might want to consider whether dm-thin or bcache or other
> >>>>> block
> >>>> layers might help us with elasticity of file vs block areas.
> >>>>>
> >>>>>    - Rocksdb can push colder data to a second directory, so we
> >>>>> could have a fast ssd primary area (for wal and most metadata) and
> >>>>> a second hdd directory for stuff it has to push off.  Then have a
> >>>>> conservative amount of file space on the hdd.  If our block fills
> >>>>> up, use the existing file mechanism to put data there too.  (But
> >>>>> then we have to maintain both the current kv + file approach and
> >>>>> not go all-in on kv +
> >>>>> block.)
> >>>>>
> >>>>> Thoughts?
> >>>>> sage

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-21 11:18     ` Ric Wheeler
@ 2015-10-21 17:30       ` Sage Weil
  2015-10-22  8:31         ` Christoph Hellwig
  2015-10-22 12:50       ` Sage Weil
  1 sibling, 1 reply; 71+ messages in thread
From: Sage Weil @ 2015-10-21 17:30 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Orit Wasserman, ceph-devel

On Wed, 21 Oct 2015, Ric Wheeler wrote:
> On 10/21/2015 04:22 AM, Orit Wasserman wrote:
> > On Tue, 2015-10-20 at 14:31 -0400, Ric Wheeler wrote:
> > > On 10/19/2015 03:49 PM, Sage Weil wrote:
> > > > The current design is based on two simple ideas:
> > > > 
> > > >    1) a key/value interface is better way to manage all of our internal
> > > > metadata (object metadata, attrs, layout, collection membership,
> > > > write-ahead logging, overlay data, etc.)
> > > > 
> > > >    2) a file system is well suited for storage object data (as files).
> > > > 
> > > > So far 1 is working out well, but I'm questioning the wisdom of #2.  A
> > > > few
> > > > things:
> > > > 
> > > >    - We currently write the data to the file, fsync, then commit the kv
> > > > transaction.  That's at least 3 IOs: one for the data, one for the fs
> > > > journal, one for the kv txn to commit (at least once my rocksdb changes
> > > > land... the kv commit is currently 2-3).  So two people are managing
> > > > metadata, here: the fs managing the file metadata (with its own
> > > > journal) and the kv backend (with its journal).
> > > If all of the fsync()'s fall into the same backing file system, are you
> > > sure
> > > that each fsync() takes the same time? Depending on the local FS
> > > implementation
> > > of course, but the order of issuing those fsync()'s can effectively make
> > > some of
> > > them no-ops.
> > > 
> > > >    - On read we have to open files by name, which means traversing the
> > > > fs
> > > > namespace.  Newstore tries to keep it as flat and simple as possible,
> > > > but
> > > > at a minimum it is a couple btree lookups.  We'd love to use open by
> > > > handle (which would reduce this to 1 btree traversal), but running
> > > > the daemon as ceph and not root makes that hard...
> > > This seems like a pretty low hurdle to overcome.
> > > 
> > > >    - ...and file systems insist on updating mtime on writes, even when
> > > > it is
> > > > a overwrite with no allocation changes.  (We don't care about mtime.)
> > > > O_NOCMTIME patches exist but it is hard to get these past the kernel
> > > > brainfreeze.
> > > Are you using O_DIRECT? Seems like there should be some enterprisey
> > > database
> > > tricks that we can use here.
> > > 
> > > >    - XFS is (probably) never going going to give us data checksums,
> > > > which we
> > > > want desperately.
> > > What is the goal of having the file system do the checksums? How strong do
> > > they
> > > need to be and what size are the chunks?
> > > 
> > > If you update this on each IO, this will certainly generate more IO (each
> > > write
> > > will possibly generate at least one other write to update that new
> > > checksum).
> > > 
> > > > But what's the alternative?  My thought is to just bite the bullet and
> > > > consume a raw block device directly.  Write an allocator, hopefully keep
> > > > it pretty simple, and manage it in kv store along with all of our other
> > > > metadata.
> > > The big problem with consuming block devices directly is that you
> > > ultimately end
> > > up recreating most of the features that you had in the file system. Even
> > > enterprise databases like Oracle and DB2 have been migrating away from
> > > running
> > > on raw block devices in favor of file systems over time.  In effect, you
> > > are
> > > looking at making a simple on disk file system which is always easier to
> > > start
> > > than it is to get back to a stable, production ready state.
> > The best performance is still on a block device (SAN).
> > A file system simplifies the operational tasks, which is worth the
> > performance penalty for a database. I think in a storage system this
> > is not the case.
> > In many cases they can use their own file system that is tailored for
> > the database.
> 
> You will have to trust me on this as the Red Hat person who spoke to pretty
> much all of our key customers about local file systems and storage - customers
> all have migrated over to using normal file systems under Oracle/DB2.
> Typically, they use XFS or ext4.  I don't know of any non-standard file
> systems and only have seen one account running on a raw block store in 8 years
> :)
> 
> If you have a pre-allocated file and write using O_DIRECT, your IO path is
> identical in terms of IO's sent to the device.

...except it's not.  Preallocating the file gives you contiguous space, 
but you still have to mark the extent written (not zero/prealloc).  The 
only way to get an identical IO pattern is to *pre-write* zeros (or 
whatever) to the file... which is hours on modern HDDs.
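
To make that concrete, here's a rough sketch (path and sizes are made up) 
of what a "preallocated + O_DIRECT" write actually costs on XFS/ext4; the 
data IO looks the same, but durability still drags in the fs journal:

  // A minimal sketch, assuming an invented path/object size: fallocate()
  // reserves contiguous blocks, but XFS/ext4 mark the extent "unwritten",
  // and the first O_DIRECT write into it still has to convert that extent
  // to "written" -- a metadata change that hits the fs journal on fdatasync().
  #include <fcntl.h>
  #include <unistd.h>
  #include <cstdio>
  #include <cstdlib>
  #include <cstring>

  int main() {
    const off_t len = 4 << 20;                         // 4MB preallocated file
    int fd = open("/srv/osd0/prealloc.bin", O_CREAT | O_RDWR | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    // Fast reservation; reads of this range return zeros, extent is unwritten.
    if (fallocate(fd, 0, 0, len) < 0) { perror("fallocate"); return 1; }

    void* buf = nullptr;
    if (posix_memalign(&buf, 4096, 1 << 20)) return 1; // O_DIRECT alignment
    memset(buf, 0xab, 1 << 20);

    // The data IO itself is identical to the raw-device case...
    if (pwrite(fd, buf, 1 << 20, 0) < 0) { perror("pwrite"); return 1; }

    // ...but durability also requires journaling the unwritten->written flip
    // (plus mtime), which is exactly the extra IO a raw block device avoids.
    if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }

    free(buf);
    return close(fd);
  }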

Ted asked for a way to force prealloc to expose preexisting disk bits a 
couple years back at LSF and it was shot down for security reasons (and 
rightly so, IMO).

If you're going down this path, you already have a "file system" in user 
space sitting on top of the preallocated file, and you could just as 
easily use the block device directly.

If you're not, then you're writing smaller files (e.g., megabytes), and 
will be paying the price to write to the {xfs,ext4} journal to update 
allocation and inode metadata.  And that's what we're trying to avoid...

> If we are causing additional IO's, then we really need to spend some time
> talking to the local file system gurus about this in detail.  I can help with
> that conversation.

Happy to sync up with Eric or Dave, but I really don't think the fs is 
doing anything wrong here.  It's just not the right fit.

> > This won't be a file system but just an allocator which is a very small
> > part of a file system.
> 
> That is always the intention and then we wake up a few years into the project
> with something that looks and smells like a file system as we slowly bring in
> just one more small thing at a time.

Probably, yes.  But it will be exactly the small things that *we* need.

> > The benefits are not just in reducing the number of IO operations we
> > perform, we are also removing the file system stack overhead, which will
> > reduce our latency and make it more predictable.
> > Removing this layer will give us more control and allow other
> > optimizations we cannot do today.
> 
> I strongly disagree here - we can get that optimal number of IO's if we use
> the file system API's developed over the years to support enterprise
> databases.  And we can have that today without having to re-write allocation
> routines and checkers.

It will take years and years to get data crcs and the types of IO hints 
that we want in XFS (if we ever get them--my guess is we won't as it's not 
worth the rearchitecting that is required).  We can be much more agile 
this way.  Yes it's an additional burden, but it's also necessary to get 
the performance we need to be competitive: POSIX does not provide the 
atomicity/consistency that we require, and there is no way to unify our 
transaction commit IOs with the underlying FS journals, or get around the 
fact that the fs is maintaining an independent data structure (inode) for 
our per-object metadata record with yet another intervening data structure 
(directories and dentries) that we have 0 use for.  It's not that the fs 
isn't doing what it does really well, it's that it's doing the wrong 
things for our use case.

> > I think this is more acute when taking SSD (and even faster
> > technologies) into account.
> 
> XFS and ext4 both support DAX, so we can effectively do direct writes to
> persistent memory (no block IO required). Most of the work over the past few
> years in the IO stack has been around driving IOPs at insanely high rates on
> top of the whole stack (file system layer included) and we have really good
> results.

Yes.  But ironically much of that hard work is around maintaining the 
existing functionality of the stack while reducing its overhead.  If you 
avoid a layer of the stack entirely it's a moot issue.  Obviously the 
block layer work will still be important for us, but the fs bits won't 
matter.  And in order to capture any of these benefits the code that is 
driving the IO from userspace also has to be equally efficient anyway, so 
it's not like using a file system here gets you anything for free.

> > > In addition to the technical hurdles, there are also production worries
> > > like how
> > > long will it take for distros to pick up formal support?  How do we test
> > > it
> > > properly?
> > > 
> > This should be userspace only, I don't think we need it in the kernel
> > (will need root access for opening the device).
> > For users that don't have root access we can use one big file and use
> > the same allocator in it. It can be good for testing too.
> > 
> > As someone who has already been part of such a
> > move more than once (for example at Exanet), I can say that the
> > performance gain is very impressive, and after the change we could
> > remove many workarounds, which simplified the code.
> > 
> > As the API should be small the testing effort is reasonable, we do need
> > to test it well as a bug in the allocator has really bad consequences.
> > 
> > We won't be able to match (or exceed) our competitors' performance
> > without making this effort ...
> > 
> > Orit
> > 
> 
> I don't agree that we will see a performance win if we use the file system
> properly.  Certainly, you can measure a slow path through a file system and
> then show an improvement with a new, user space block access, but that is not
> a long term path to success.

I've been doing this long enough that I'm pretty confident I'm not 
measuring the slow path.  And yes, there are some things we could do to 
improve the situation, but the complexity required is similar to avoiding 
the fs altogether, and the end result will still be far from optimal.

For example: we need to do an overwrite of an existing object that is 
atomic with respect to a larger ceph transaction (we're updating a bunch 
of other metadata at the same time, possibly overwriting or appending to 
multiple files, etc.).  XFS and ext4 aren't cow file systems, so plugging 
into the transaction infrastructure isn't really an option (and even after 
several years of trying to do it with btrfs it proved to be impractical).  
So: we do write-ahead journaling.  That's okay (even great) for small io 
(the database we're tracking our metadata in is log-structured anyway), 
but if the overwrite is large it's pretty inefficient.  Assuming I have a 
4MB XFS file, how do I do an atomic 1MB overwrite?  Maybe we write to a 
new file, fsync that, and use the defrag ioctl to swap extents.  But then 
we're creating extraneous inodes, forcing additional fsyncs, and relying 
on weakly tested functionality that is much more likely to lead to nasty 
surprises for users (for example, see our use of the xfs extsize ioctl in 
firefly and the resulting data corruption that it causes on 3.2 kernels).  
It would be an extremely delicate solution that relies on very 
careful ordering of fs ioctls and syscalls to ensure both data 
safety and performance... and even then it wouldn't be optimal.

If we manage allocation ourselves this problem is trivial: write to an 
unallocated extent, fua/flush, commit transaction.
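
Roughly something like this (Allocator, KVStore, Txn and encode() here are 
made-up interfaces, declarations only, sketched for illustration -- not 
actual newstore code):

  // Hedged sketch of the flow above: write new data into space nothing else
  // references, make it durable, then flip the metadata in one kv commit.
  #include <unistd.h>
  #include <cstdint>
  #include <string>

  struct Extent { uint64_t offset, length; };

  struct Allocator {              // hands out unused regions of the raw device
    Extent allocate(uint64_t len);
    void   release(const Extent& e);      // takes effect only at txn commit
  };

  struct Txn {                    // one atomic kv batch (rocksdb-style)
    void set(const std::string& key, const std::string& val);
    void note_release(const Extent& e);   // free-list update rides in the batch
  };

  struct KVStore {
    Txn  begin();
    void commit(Txn& t);          // single journaled write; the commit point
  };

  std::string encode(const Extent& e);    // serialize the extent map entry

  // Atomic overwrite of part of an object, no fs journal involved.
  void overwrite(Allocator& alloc, KVStore& kv, int block_fd,
                 const std::string& oid, const Extent& old_ext,
                 const char* buf, uint64_t len) {
    Extent e = alloc.allocate(len);           // never touches live data
    pwrite(block_fd, buf, len, e.offset);     // one data IO to unused space
    fdatasync(block_fd);                      // or FUA on the write itself

    Txn t = kv.begin();
    t.set("object/" + oid + "/extent_map", encode(e));  // point at new extent
    t.note_release(old_ext);                  // old space reclaimed atomically
    kv.commit(t);                             // one IO: txn lands entirely or not
  }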

The allocators in general purpose file systems have to cope with a huge 
spectrum of workloads, and they do admirably well given the challenge.  
Ours will need to cope with a vastly simpler set of constraints.  And most 
importantly will be tied into the same transaction commit mechanism as 
everything else, which means it will not require additional IOs to 
maintain its metadata.  And the metadata we do manage will be exactly the 
metadata we need, and nothing more and nothing less.

sage

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-21 15:51         ` Ric Wheeler
@ 2015-10-21 19:37           ` Mark Nelson
  2015-10-21 21:20             ` Martin Millnert
  0 siblings, 1 reply; 71+ messages in thread
From: Mark Nelson @ 2015-10-21 19:37 UTC (permalink / raw)
  To: Ric Wheeler, Allen Samuels, Sage Weil, ceph-devel

On 10/21/2015 10:51 AM, Ric Wheeler wrote:
> On 10/21/2015 10:14 AM, Mark Nelson wrote:
>>
>>
>> On 10/21/2015 06:24 AM, Ric Wheeler wrote:
>>>
>>>
>>> On 10/21/2015 06:06 AM, Allen Samuels wrote:
>>>> I agree that moving newStore to raw block is going to be a significant
>>>> development effort. But the current scheme of using a KV store
>>>> combined with a normal file system is always going to be problematic
>>>> (FileStore or NewStore). This is caused by the transactional
>>>> requirements of the ObjectStore interface, essentially you need to
>>>> make transactionally consistent updates to two indexes, one of which
>>>> doesn't understand transactions (File Systems) and can never be
>>>> tightly-connected to the other one.
>>>>
>>>> You'll always be able to make this "loosely coupled" approach work,
>>>> but it will never be optimal. The real question is whether the
>>>> performance difference of a suboptimal implementation is something
>>>> that you can live with compared to the longer gestation period of the
>>>> more optimal implementation. Clearly, Sage believes that the
>>>> performance difference is significant or he wouldn't have kicked off
>>>> this discussion in the first place.
>>>
>>> I think that we need to work with the existing stack - measure and do
>>> some collaborative analysis - before we throw out decades of work.  Very
>>> hard to understand why the local file system is a barrier for
>>> performance in this case when it is not an issue in existing enterprise
>>> applications.
>>>
>>> We need some deep analysis with some local file system experts thrown in
>>> to validate the concerns.
>>
>> I think Sage has been working pretty closely with the XFS guys to
>> uncover these kinds of issues.  I know if I encounter something fairly
>> FS specific I try to drag Eric or Dave in.  I think the core of the
>> problem is that we often find ourselves exercising filesystems in
>> pretty unusual ways.  While it's probably good that we add this kind
>> of coverage and help work out somewhat esoteric bugs, I think it does
>> make our job of making Ceph perform well harder.  One example:  I had
>> been telling folks for several years to favor dentry and inode cache
>> due to the way our PG directory splitting works (backed by test
>> results), but then Sage discovered:
>>
>> http://www.spinics.net/lists/ceph-devel/msg25644.html
>>
>> This is just one example of how very nuanced our performance story is.
>> I can keep many users at least semi-engaged when talking about objects
>> being laid out in a nested directory structure, how dentry/inode cache
>> affects that in a general sense, etc.  But combine the kind of
>> subtlety in the link above with the vastness of things in the data
>> path that can hurt performance, and people generally just can't wrap
>> their heads around all of it (With the exception of some of the very
>> smart folks on this mailing list!)
>>
>> One of my biggest concerns going forward is reducing the user-facing
>> complexity of our performance story.  The question I ask myself is:
>> Does keeping Ceph on a FS help us or hurt us in that regard?
>
> The upshot of that is that the kind of micro-optimization is already
> handled by the file system, so the application job should be easier.
> Better to fsync() each file from an application that you care about
> rather than to worry about using more obscure calls.

I hear you, and I don't want to discount the massive amount of work and 
experience that has gone into making XFS and the other filesystems as 
amazing as they are.  I think Sage's argument that the fit isn't right 
has merit though.  There are a lot of things that we end up working 
around.  Take last winter when we ended up pushing past the 254-byte 
inline xattr boundary.  We absolutely want to keep xattrs inlined, so the 
idea now is we break large ones down into smaller chunks to try to work 
around the limitation while continuing to employ a 2K inode size (which, 
from my conversations with Ben, sounds like it's a little controversial 
in its own right).  All of this by itself is fairly inconsequential, but 
you add enough of this kind of thing up and it's tough not to feel like 
we're trying to pound a square peg into a round hole.
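
Very roughly, the chunking workaround looks something like this (the 
250-byte chunk size and the "@N" key suffix are invented for illustration, 
not the actual naming scheme we use):

  // Split an xattr value so each piece stays under the inline-xattr budget
  // of a 2K inode; a hedged sketch only, with made-up key naming.
  #include <sys/xattr.h>
  #include <algorithm>
  #include <string>

  static bool set_chunked_xattr(const std::string& path,
                                const std::string& name,
                                const std::string& value,
                                size_t chunk = 250) {
    for (size_t i = 0, n = 0; i < value.size(); i += chunk, ++n) {
      std::string key = name + "@" + std::to_string(n);   // e.g. user.ceph._@0
      size_t len = std::min(chunk, value.size() - i);
      if (setxattr(path.c_str(), key.c_str(), value.data() + i, len, 0) < 0)
        return false;                     // caller decides how to roll back
    }
    return true;
  }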

>
>>
>>>
>>>>
>>>> While I think we can all agree that writing a full-up KV and raw-block
>>>> ObjectStore is a significant amount of work. I will offer the case
>>>> that the "loosely couple" scheme may not have as much time-to-market
>>>> advantage as it appears to have. One example: NewStore performance is
>>>> limited due to bugs in XFS that won't be fixed in the field for quite
>>>> some time (it'll take at least a couple of years before a patched
>>>> version of XFS will be widely deployed at customer environments).
>>>
>>> Not clear what bugs you are thinking of or why you think fixing bugs
>>> will take a long time to hit the field in XFS. Red Hat has most of the
>>> XFS developers on staff and we actively backport fixes and ship them,
>>> other distros do as well.
>>>
>>> Never seen a "bug" take a couple of years to hit users.
>>
>> Maybe a good way to start out would be to see how quickly we can get
>> the patch dchinner posted here:
>>
>> http://oss.sgi.com/archives/xfs/2015-10/msg00545.html
>>
>> rolled out into RHEL/CentOS/Ubuntu.  I have no idea how long these
>> things typically take, but this might be a good test case.
>
> How quickly things land in a distro is up to the interested parties
> making the case for it.

My thought is that there is some inflection point where the userland 
kvstore/block approach is going to be less work, for everyone I think, 
than trying to quickly discover, understand, fix, and push upstream 
patches that sometimes only really benefit us.  I don't know if we've 
truly hit that point, but it's tough for me to find flaws with 
Sage's argument.

>
> Ric
>
>>
>>>
>>> Regards,
>>>
>>> Ric
>>>
>>>>
>>>> Another example: Sage has just had to substantially rework the
>>>> journaling code of rocksDB.
>>>>
>>>> In short, as you can tell, I'm full throated in favor of going down
>>>> the optimal route.
>>>>
>>>> Internally at Sandisk, we have a KV store that is optimized for flash
>>>> (it's called ZetaScale). We have extended it with a raw block
>>>> allocator just as Sage is now proposing to do. Our internal
>>>> performance measurements show a significant advantage over the current
>>>> NewStore. That performance advantage stems primarily from two things:
>>>>
>>>> (1) ZetaScale uses a B+-tree internally rather than an LSM tree
>>>> (levelDB/RocksDB). LSM trees experience exponential increase in write
>>>> amplification (cost of an insert) as the amount of data under
>>>> management increases. B+tree write-amplification is nearly constant
>>>> independent of the size of data under management. As the KV database
>>>> gets larger (Since newStore is effectively moving the per-file inode
>>>> into the kv data base. Don't forget checksums that Sage want's to add
>>>> :)) this performance delta swamps all others.
>>>> (2) Having a KV and a file-system causes a double lookup. This costs
>>>> CPU time and disk accesses to page in data structure indexes, metadata
>>>> efficiency decreases.
>>>>
>>>> You can't avoid (2) as long as you're using a file system.
>>>>
>>>> Yes an LSM tree performs better on HDD than does a B-tree, which is a
>>>> good argument for keeping the KV module pluggable.
>>>>
>>>>
>>>> Allen Samuels
>>>> Software Architect, Fellow, Systems and Software Solutions
>>>>
>>>> 2880 Junction Avenue, San Jose, CA 95134
>>>> T: +1 408 801 7030| M: +1 408 780 6416
>>>> allen.samuels@SanDisk.com
>>>>
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org
>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Ric Wheeler
>>>> Sent: Tuesday, October 20, 2015 11:32 AM
>>>> To: Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org
>>>> Subject: Re: newstore direction
>>>>
>>>> On 10/19/2015 03:49 PM, Sage Weil wrote:
>>>>> The current design is based on two simple ideas:
>>>>>
>>>>>    1) a key/value interface is better way to manage all of our
>>>>> internal
>>>>> metadata (object metadata, attrs, layout, collection membership,
>>>>> write-ahead logging, overlay data, etc.)
>>>>>
>>>>>    2) a file system is well suited for storage object data (as files).
>>>>>
>>>>> So far 1 is working out well, but I'm questioning the wisdom of #2.  A
>>>>> few
>>>>> things:
>>>>>
>>>>>    - We currently write the data to the file, fsync, then commit
>>>>> the kv
>>>>> transaction.  That's at least 3 IOs: one for the data, one for the fs
>>>>> journal, one for the kv txn to commit (at least once my rocksdb
>>>>> changes land... the kv commit is currently 2-3).  So two people are
>>>>> managing metadata, here: the fs managing the file metadata (with its
>>>>> own
>>>>> journal) and the kv backend (with its journal).
>>>> If all of the fsync()'s fall into the same backing file system, are
>>>> you sure that each fsync() takes the same time? Depending on the local
>>>> FS implementation of course, but the order of issuing those fsync()'s
>>>> can effectively make some of them no-ops.
>>>>
>>>>>    - On read we have to open files by name, which means traversing the
>>>>> fs namespace.  Newstore tries to keep it as flat and simple as
>>>>> possible, but at a minimum it is a couple btree lookups. We'd love to
>>>>> use open by handle (which would reduce this to 1 btree traversal), but
>>>>> running the daemon as ceph and not root makes that hard...
>>>> This seems like a a pretty low hurdle to overcome.
>>>>
>>>>>    - ...and file systems insist on updating mtime on writes, even when
>>>>> it is a overwrite with no allocation changes.  (We don't care about
>>>>> mtime.) O_NOCMTIME patches exist but it is hard to get these past the
>>>>> kernel brainfreeze.
>>>> Are you using O_DIRECT? Seems like there should be some enterprisey
>>>> database tricks that we can use here.
>>>>
>>>>>    - XFS is (probably) never going going to give us data checksums,
>>>>> which we want desperately.
>>>> What is the goal of having the file system do the checksums? How
>>>> strong do they need to be and what size are the chunks?
>>>>
>>>> If you update this on each IO, this will certainly generate more IO
>>>> (each write will possibly generate at least one other write to update
>>>> that new checksum).
>>>>
>>>>> But what's the alternative?  My thought is to just bite the bullet and
>>>>> consume a raw block device directly.  Write an allocator, hopefully
>>>>> keep it pretty simple, and manage it in kv store along with all of our
>>>>> other metadata.
>>>> The big problem with consuming block devices directly is that you
>>>> ultimately end up recreating most of the features that you had in the
>>>> file system. Even enterprise databases like Oracle and DB2 have been
>>>> migrating away from running on raw block devices in favor of file
>>>> systems over time.  In effect, you are looking at making a simple on
>>>> disk file system which is always easier to start than it is to get
>>>> back to a stable, production ready state.
>>>>
>>>> I think that it might be quicker and more maintainable to spend some
>>>> time working with the local file system people (XFS or other) to see
>>>> if we can jointly address the concerns you have.
>>>>> Wins:
>>>>>
>>>>>    - 2 IOs for most: one to write the data to unused space in the
>>>>> block
>>>>> device, one to commit our transaction (vs 4+ before).  For overwrites,
>>>>> we'd have one io to do our write-ahead log (kv journal), then do the
>>>>> overwrite async (vs 4+ before).
>>>>>
>>>>>    - No concern about mtime getting in the way
>>>>>
>>>>>    - Faster reads (no fs lookup)
>>>>>
>>>>>    - Similarly sized metadata for most objects.  If we assume most
>>>>> objects are not fragmented, then the metadata to store the block
>>>>> offsets is about the same size as the metadata to store the filenames
>>>>> we have now.
>>>>>
>>>>> Problems:
>>>>>
>>>>>    - We have to size the kv backend storage (probably still an XFS
>>>>> partition) vs the block storage.  Maybe we do this anyway (put
>>>>> metadata on
>>>>> SSD!) so it won't matter.  But what happens when we are storing gobs
>>>>> of rgw index data or cephfs metadata?  Suddenly we are pulling storage
>>>>> out of a different pool and those aren't currently fungible.
>>>>>
>>>>>    - We have to write and maintain an allocator.  I'm still optimistic
>>>>> this can be reasonbly simple, especially for the flash case (where
>>>>> fragmentation isn't such an issue as long as our blocks are reasonbly
>>>>> sized).  For disk we may beed to be moderately clever.
>>>>>
>>>>>    - We'll need a fsck to ensure our internal metadata is consistent.
>>>>> The good news is it'll just need to validate what we have stored in
>>>>> the kv store.
>>>>>
>>>>> Other thoughts:
>>>>>
>>>>>    - We might want to consider whether dm-thin or bcache or other
>>>>> block
>>>>> layers might help us with elasticity of file vs block areas.
>>>>>
>>>>>    - Rocksdb can push colder data to a second directory, so we could
>>>>> have a fast ssd primary area (for wal and most metadata) and a second
>>>>> hdd directory for stuff it has to push off.  Then have a conservative
>>>>> amount of file space on the hdd.  If our block fills up, use the
>>>>> existing file mechanism to put data there too.  (But then we have to
>>>>> maintain both the current kv + file approach and not go all-in on kv +
>>>>> block.)
>>>>>
>>>>> Thoughts?
>>>>> sage
>>>>> --
>>>> I really hate the idea of making a new file system type (even if we
>>>> call it a raw block store!).
>>>>
>>>> In addition to the technical hurdles, there are also production
>>>> worries like how long will it take for distros to pick up formal
>>>> support?  How do we test it properly?
>>>>
>>>> Regards,
>>>>
>>>> Ric
>>>>
>>>>
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-21 19:37           ` Mark Nelson
@ 2015-10-21 21:20             ` Martin Millnert
  2015-10-22  2:12               ` Allen Samuels
  0 siblings, 1 reply; 71+ messages in thread
From: Martin Millnert @ 2015-10-21 21:20 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Ric Wheeler, Allen Samuels, Sage Weil, ceph-devel

Adding 2c

On Wed, 2015-10-21 at 14:37 -0500, Mark Nelson wrote:
> My thought is that there is some inflection point where the userland 
> kvstore/block approach is going to be less work, for everyone I think, 
> than trying to quickly discover, understand, fix, and push upstream 
> patches that sometimes only really benefit us.  I don't know if we've 
> truly hit that point, but it's tough for me to find flaws with 
> Sage's argument.

Regarding the userland / kernel land aspect of the topic, there are
further aspects AFAIK not yet addressed in the thread:
In the networking world, there's been development on memory-mapped
(multiple approaches exist) userland networking, which for packet
management has the benefit of - for very, very specific applications of
networking code - avoiding e.g. per-packet context switches and
streamlining processor cache management. People have gone as far as
removing CPU cores from the CPU scheduler to completely dedicate them
to the networking task at hand (cache optimizations). There are various
latency/throughput (bulking) optimizations applicable, but at the end of
the day it's about keeping the CPU bus busy with "revenue" bus traffic.

Granted, storage IO operations may be heavy enough in cycle counts that
context switches never appear as a problem in themselves, certainly
for slower SSDs and HDDs. However, when going for truly high performance
IO, *every* hurdle in the data path counts toward the total latency.
(And really, high performance random IO characteristics approach the
networking, per-packet handling characteristics.)  Now, I'm not really
suggesting memory-mapping a storage device to user space, not at all,
but having better control over the data path for a very specific use
case reduces dependency on code that works as well as possible for
the general case, and allows for very purpose-built code that addresses
a narrow set of requirements. ("Ceph storage cluster backend" isn't a
typical FS use case.) It also decouples us from users' upgrade cycles,
i.e. waiting for the next distro release before being able to take up
the benefits of improvements to the storage code.

A random google came up with related data on where "doing something way
different" /can/ have significant benefits:
http://phunq.net/pipermail/tux3/2015-April/002147.html 

I (FWIW) certainly agree there is merit to the idea.
The scientific approach here could perhaps be to simply enumerate all
corner cases of "generic FS" that actually are cause for the experienced
issues, and assess probability of them being solved (and if so when).
That *could* improve chances of approaching consensus which wouldn't
hurt I suppose?

BR,
Martin


^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: newstore direction
  2015-10-21 11:24     ` Ric Wheeler
  2015-10-21 14:14       ` Mark Nelson
@ 2015-10-22  0:53       ` Allen Samuels
  2015-10-22  1:16         ` Ric Wheeler
  1 sibling, 1 reply; 71+ messages in thread
From: Allen Samuels @ 2015-10-22  0:53 UTC (permalink / raw)
  To: Ric Wheeler, Sage Weil, ceph-devel

Fixing the bug doesn't take a long time. Getting it deployed is where the delay is. Many companies standardize on a particular release of a particular distro. Getting them to switch to a new release -- even a "bug fix" point release -- is a major undertaking that often is a complete roadblock. Just my experience. YMMV. 


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com

-----Original Message-----
From: Ric Wheeler [mailto:rwheeler@redhat.com] 
Sent: Wednesday, October 21, 2015 8:24 PM
To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org
Subject: Re: newstore direction



On 10/21/2015 06:06 AM, Allen Samuels wrote:
> I agree that moving newStore to raw block is going to be a significant development effort. But the current scheme of using a KV store combined with a normal file system is always going to be problematic (FileStore or NewStore). This is caused by the transactional requirements of the ObjectStore interface, essentially you need to make transactionally consistent updates to two indexes, one of which doesn't understand transactions (File Systems) and can never be tightly-connected to the other one.
>
> You'll always be able to make this "loosely coupled" approach work, but it will never be optimal. The real question is whether the performance difference of a suboptimal implementation is something that you can live with compared to the longer gestation period of the more optimal implementation. Clearly, Sage believes that the performance difference is significant or he wouldn't have kicked off this discussion in the first place.

I think that we need to work with the existing stack - measure and do some collaborative analysis - before we throw out decades of work.  Very hard to understand why the local file system is a barrier for performance in this case when it is not an issue in existing enterprise applications.

We need some deep analysis with some local file system experts thrown in to validate the concerns.

>
> While I think we can all agree that writing a full-up KV and raw-block ObjectStore is a significant amount of work. I will offer the case that the "loosely couple" scheme may not have as much time-to-market advantage as it appears to have. One example: NewStore performance is limited due to bugs in XFS that won't be fixed in the field for quite some time (it'll take at least a couple of years before a patched version of XFS will be widely deployed at customer environments).

Not clear what bugs you are thinking of or why you think fixing bugs will take a long time to hit the field in XFS. Red Hat has most of the XFS developers on staff and we actively backport fixes and ship them, other distros do as well.

Never seen a "bug" take a couple of years to hit users.

Regards,

Ric

>
> Another example: Sage has just had to substantially rework the journaling code of rocksDB.
>
> In short, as you can tell, I'm full throated in favor of going down the optimal route.
>
> Internally at Sandisk, we have a KV store that is optimized for flash (it's called ZetaScale). We have extended it with a raw block allocator just as Sage is now proposing to do. Our internal performance measurements show a significant advantage over the current NewStore. That performance advantage stems primarily from two things:
>
> (1) ZetaScale uses a B+-tree internally rather than an LSM tree (levelDB/RocksDB). LSM trees experience exponential increase in write amplification (cost of an insert) as the amount of data under management increases. B+tree write-amplification is nearly constant independent of the size of data under management. As the KV database gets larger (Since newStore is effectively moving the per-file inode into the kv data base. Don't forget checksums that Sage want's to add :)) this performance delta swamps all others.
> (2) Having a KV and a file-system causes a double lookup. This costs CPU time and disk accesses to page in data structure indexes, metadata efficiency decreases.
>
> You can't avoid (2) as long as you're using a file system.
>
> Yes an LSM tree performs better on HDD than does a B-tree, which is a good argument for keeping the KV module pluggable.
>
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Ric Wheeler
> Sent: Tuesday, October 20, 2015 11:32 AM
> To: Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> On 10/19/2015 03:49 PM, Sage Weil wrote:
>> The current design is based on two simple ideas:
>>
>>    1) a key/value interface is better way to manage all of our 
>> internal metadata (object metadata, attrs, layout, collection 
>> membership, write-ahead logging, overlay data, etc.)
>>
>>    2) a file system is well suited for storage object data (as files).
>>
>> So far 1 is working out well, but I'm questioning the wisdom of #2.  
>> A few
>> things:
>>
>>    - We currently write the data to the file, fsync, then commit the 
>> kv transaction.  That's at least 3 IOs: one for the data, one for the 
>> fs journal, one for the kv txn to commit (at least once my rocksdb 
>> changes land... the kv commit is currently 2-3).  So two people are 
>> managing metadata, here: the fs managing the file metadata (with its 
>> own
>> journal) and the kv backend (with its journal).
> If all of the fsync()'s fall into the same backing file system, are you sure that each fsync() takes the same time? Depending on the local FS implementation of course, but the order of issuing those fsync()'s can effectively make some of them no-ops.
>
>>    - On read we have to open files by name, which means traversing 
>> the fs namespace.  Newstore tries to keep it as flat and simple as 
>> possible, but at a minimum it is a couple btree lookups.  We'd love 
>> to use open by handle (which would reduce this to 1 btree traversal), 
>> but running the daemon as ceph and not root makes that hard...
> This seems like a a pretty low hurdle to overcome.
>
>>    - ...and file systems insist on updating mtime on writes, even 
>> when it is a overwrite with no allocation changes.  (We don't care 
>> about
>> mtime.) O_NOCMTIME patches exist but it is hard to get these past the 
>> kernel brainfreeze.
> Are you using O_DIRECT? Seems like there should be some enterprisey database tricks that we can use here.
>
>>    - XFS is (probably) never going going to give us data checksums, 
>> which we want desperately.
> What is the goal of having the file system do the checksums? How strong do they need to be and what size are the chunks?
>
> If you update this on each IO, this will certainly generate more IO (each write will possibly generate at least one other write to update that new checksum).
>
>> But what's the alternative?  My thought is to just bite the bullet 
>> and consume a raw block device directly.  Write an allocator, 
>> hopefully keep it pretty simple, and manage it in kv store along with 
>> all of our other metadata.
> The big problem with consuming block devices directly is that you ultimately end up recreating most of the features that you had in the file system. Even enterprise databases like Oracle and DB2 have been migrating away from running on raw block devices in favor of file systems over time.  In effect, you are looking at making a simple on disk file system which is always easier to start than it is to get back to a stable, production ready state.
>
> I think that it might be quicker and more maintainable to spend some time working with the local file system people (XFS or other) to see if we can jointly address the concerns you have.
>> Wins:
>>
>>    - 2 IOs for most: one to write the data to unused space in the 
>> block device, one to commit our transaction (vs 4+ before).  For 
>> overwrites, we'd have one io to do our write-ahead log (kv journal), 
>> then do the overwrite async (vs 4+ before).
>>
>>    - No concern about mtime getting in the way
>>
>>    - Faster reads (no fs lookup)
>>
>>    - Similarly sized metadata for most objects.  If we assume most 
>> objects are not fragmented, then the metadata to store the block 
>> offsets is about the same size as the metadata to store the filenames we have now.
>>
>> Problems:
>>
>>    - We have to size the kv backend storage (probably still an XFS
>> partition) vs the block storage.  Maybe we do this anyway (put 
>> metadata on
>> SSD!) so it won't matter.  But what happens when we are storing gobs 
>> of rgw index data or cephfs metadata?  Suddenly we are pulling 
>> storage out of a different pool and those aren't currently fungible.
>>
>>    - We have to write and maintain an allocator.  I'm still 
>> optimistic this can be reasonbly simple, especially for the flash 
>> case (where fragmentation isn't such an issue as long as our blocks 
>> are reasonbly sized).  For disk we may beed to be moderately clever.
>>
>>    - We'll need a fsck to ensure our internal metadata is consistent.
>> The good news is it'll just need to validate what we have stored in 
>> the kv store.
>>
>> Other thoughts:
>>
>>    - We might want to consider whether dm-thin or bcache or other 
>> block layers might help us with elasticity of file vs block areas.
>>
>>    - Rocksdb can push colder data to a second directory, so we could 
>> have a fast ssd primary area (for wal and most metadata) and a second 
>> hdd directory for stuff it has to push off.  Then have a conservative 
>> amount of file space on the hdd.  If our block fills up, use the 
>> existing file mechanism to put data there too.  (But then we have to 
>> maintain both the current kv + file approach and not go all-in on kv 
>> +
>> block.)
>>
>> Thoughts?
>> sage
>> --
> I really hate the idea of making a new file system type (even if we call it a raw block store!).
>
> In addition to the technical hurdles, there are also production worries like how long will it take for distros to pick up formal support?  How do we test it properly?
>
> Regards,
>
> Ric
>
>


^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: newstore direction
  2015-10-21 16:10                 ` Chen, Xiaoxi
@ 2015-10-22  1:09                   ` Allen Samuels
  0 siblings, 0 replies; 71+ messages in thread
From: Allen Samuels @ 2015-10-22  1:09 UTC (permalink / raw)
  To: Chen, Xiaoxi, Mark Nelson, Sage Weil
  Cc: James (Fei) Liu-SSI, Somnath Roy, ceph-devel

Actually Range queries are an important part of the performance story and random read speed doesn't really solve the problem.

When you're doing a scrub, you need to enumerate the objects in a specific order on multiple nodes -- so that they can compare the contents of their stores in order to determine if data cleaning needs to take place.

If you don't have in-order enumeration in your basic data structure (which NVMKV doesn't have) then you're forced to sort the directory before you can respond to an enumeration. That sort will either consume huge amounts of IOPS OR huge amounts of DRAM. Regardless of the choice, you'll see a significant degradation of performance while the scrub is ongoing -- which is one of the biggest problems with clustered systems (expensive and extensive maintenance operations).
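
A toy illustration of the difference (key names made up): with an ordered 
index the scrub walk is a bounded range scan, while with a hash-only store 
you have to pull everything out and sort it first, paying in DRAM or IO:

  // Hedged sketch: scrub needs to walk objects in a stable sorted order.
  #include <algorithm>
  #include <map>
  #include <string>
  #include <unordered_map>
  #include <vector>

  // Ordered store (B+tree / LSM style): enumerate a range directly.
  std::vector<std::string>
  scrub_range(const std::map<std::string, std::string>& kv,
              const std::string& lo, const std::string& hi) {
    std::vector<std::string> out;
    for (auto it = kv.lower_bound(lo); it != kv.end() && it->first < hi; ++it)
      out.push_back(it->first);
    return out;                      // already in comparable order across OSDs
  }

  // Hash-only store (NVMKV style): no range ops, so scan and sort first.
  std::vector<std::string>
  scrub_range(const std::unordered_map<std::string, std::string>& kv,
              const std::string& lo, const std::string& hi) {
    std::vector<std::string> out;
    for (const auto& p : kv)                         // full scan, no ordering
      if (p.first >= lo && p.first < hi)
        out.push_back(p.first);
    std::sort(out.begin(), out.end());               // the extra DRAM/IO cost
    return out;
  }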


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com

-----Original Message-----
From: Chen, Xiaoxi [mailto:xiaoxi.chen@intel.com]
Sent: Thursday, October 22, 2015 1:10 AM
To: Mark Nelson <mnelson@redhat.com>; Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil <sweil@redhat.com>
Cc: James (Fei) Liu-SSI <james.liu@ssi.samsung.com>; Somnath Roy <Somnath.Roy@sandisk.com>; ceph-devel@vger.kernel.org
Subject: RE: newstore direction

We did evaluate whether NVMKV could be implemented on non-FusionIO SSDs, i.e. re-invent an NVMKV; the conclusion was that it's not hard with persistent memory (which will be available soon).  But yeah, NVMKV will not work if no PM is present -- persisting the hashing table to SSD is not practical.

Range queries don't seem to be a very big issue, as the random read performance of today's SSDs is more than enough; even if we break all sequential reads into random ones (typically 70-80K IOPS, which is ~300MB/s), the performance is still good enough.

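For reference, a quick back-of-the-envelope check of the ~300MB/s figure 
above, assuming 4KB reads:

  // Sanity check of the throughput figure, assuming 4KB random reads.
  #include <cstdio>

  int main() {
    const double block = 4096.0;                     // 4KB per random read
    const double iops_list[] = {70e3, 80e3};
    for (double iops : iops_list)
      std::printf("%.0fK IOPS x 4KB = %.0f MB/s\n",
                  iops / 1e3, iops * block / (1024 * 1024));
    return 0;                    // prints ~273 and ~312 MB/s, i.e. ~300MB/s
  }
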
Anyway, I think for the high IOPS case it's hard for the consumer to play well with SSDs from different vendors; it would be better to leave that to the SSD vendor, something like OpenStack Cinder's structure: a vendor has the responsibility to maintain their driver for Ceph and take care of the performance.

> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Wednesday, October 21, 2015 9:36 PM
> To: Allen Samuels; Sage Weil; Chen, Xiaoxi
> Cc: James (Fei) Liu-SSI; Somnath Roy; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> Thanks Allen!  The devil is always in the details.  Know of anything
> else that looks promising?
>
> Mark
>
> On 10/21/2015 05:06 AM, Allen Samuels wrote:
> > I doubt that NVMKV will be useful for two reasons:
> >
> > (1) It relies on the unique sparse-mapping addressing capabilities
> > of the FusionIO VSL interface, it won't run on standard SSDs
> > (2) NVMKV doesn't provide any form of in-order enumeration (i.e., no
> range operations on keys). This is pretty much required for deep scrubbing.
> >
> >
> > Allen Samuels
> > Software Architect, Fellow, Systems and Software Solutions
> >
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org
> > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> > Sent: Tuesday, October 20, 2015 6:20 AM
> > To: Sage Weil <sweil@redhat.com>; Chen, Xiaoxi
> > <xiaoxi.chen@intel.com>
> > Cc: James (Fei) Liu-SSI <james.liu@ssi.samsung.com>; Somnath Roy
> > <Somnath.Roy@sandisk.com>; ceph-devel@vger.kernel.org
> > Subject: Re: newstore direction
> >
> > On 10/20/2015 07:30 AM, Sage Weil wrote:
> >> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:
> >>> +1, nowadays K-V DB care more about very small key-value pairs,
> >>> +say
> >>> several bytes to a few KB, but in SSD case we only care about 4KB
> >>> or 8KB. In this way, NVMKV is a good design and seems some of the
> >>> SSD vendor are also trying to build this kind of interface, we had
> >>> a NVM-L library but still under development.
> >>
> >> Do you have an NVMKV link?  I see a paper and a stale github repo..
> >> not sure if I'm looking at the right thing.
> >>
> >> My concern with using a key/value interface for the object data is
> >> that you end up with lots of key/value pairs (e.g., $inode_$offset
> >> =
> >> $4kb_of_data) that is pretty inefficient to store and (depending on
> >> the
> >> implementation) tends to break alignment.  I don't think these
> >> interfaces are targetted toward block-sized/aligned payloads.
> >> Storing just the metadata (block allocation map) w/ the kv api and
> >> storing the data directly on a block/page interface makes more
> >> sense to
> me.
> >>
> >> sage
> >
> > I get the feeling that some of the folks that were involved with
> > nvmkv at
> Fusion IO have left.  Nisha Talagala is now out at Parallel Systems for instance.
> http://pmem.io might be a better bet, though I haven't looked closely at it.
> >
> > Mark
> >
> >>
> >>
> >>>> -----Original Message-----
> >>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> >>>> owner@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
> >>>> Sent: Tuesday, October 20, 2015 6:21 AM
> >>>> To: Sage Weil; Somnath Roy
> >>>> Cc: ceph-devel@vger.kernel.org
> >>>> Subject: RE: newstore direction
> >>>>
> >>>> Hi Sage and Somnath,
> >>>>     In my humble opinion, There is another more aggressive
> >>>> solution than raw block device base keyvalue store as backend for
> >>>> objectstore. The new key value  SSD device with transaction
> >>>> support
> would be  ideal to solve the issues.
> >>>> First of all, it is raw SSD device. Secondly , It provides key
> >>>> value interface directly from SSD. Thirdly, it can provide
> >>>> transaction support, consistency will be guaranteed by hardware
> >>>> device. It pretty much satisfied all of objectstore needs without
> >>>> any extra overhead since there is not any extra layer in between
> >>>> device
> and objectstore.
> >>>>      Either way, I strongly support to have CEPH own data format
> >>>> instead of relying on filesystem.
> >>>>
> >>>>     Regards,
> >>>>     James
> >>>>
> >>>> -----Original Message-----
> >>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> >>>> owner@vger.kernel.org] On Behalf Of Sage Weil
> >>>> Sent: Monday, October 19, 2015 1:55 PM
> >>>> To: Somnath Roy
> >>>> Cc: ceph-devel@vger.kernel.org
> >>>> Subject: RE: newstore direction
> >>>>
> >>>> On Mon, 19 Oct 2015, Somnath Roy wrote:
> >>>>> Sage,
> >>>>> I fully support that.  If we want to saturate SSDs , we need to
> >>>>> get rid of this filesystem overhead (which I am in process of
> measuring).
> >>>>> Also, it will be good if we can eliminate the dependency on the
> >>>>> k/v dbs (for storing allocators and all). The reason is the
> >>>>> unknown write amps they causes.
> >>>>
> >>>> My hope is to keep behing the KeyValueDB interface (and/more
> change
> >>>> it as
> >>>> appropriate) so that other backends can be easily swapped in (e.g.
> >>>> a
> >>>> btree- based one for high-end flash).
> >>>>
> >>>> sage
> >>>>
> >>>>
> >>>>>
> >>>>> Thanks & Regards
> >>>>> Somnath
> >>>>>
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: ceph-devel-owner@vger.kernel.org
> >>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> >>>>> Sent: Monday, October 19, 2015 12:49 PM
> >>>>> To: ceph-devel@vger.kernel.org
> >>>>> Subject: newstore direction
> >>>>>
> >>>>> The current design is based on two simple ideas:
> >>>>>
> >>>>>    1) a key/value interface is better way to manage all of our
> >>>>> internal metadata (object metadata, attrs, layout, collection
> >>>>> membership, write-ahead logging, overlay data, etc.)
> >>>>>
> >>>>>    2) a file system is well suited for storage object data (as files).
> >>>>>
> >>>>> So far 1 is working out well, but I'm questioning the wisdom of #2.
> >>>>> A few
> >>>>> things:
> >>>>>
> >>>>>    - We currently write the data to the file, fsync, then commit
> >>>>> the kv transaction.  That's at least 3 IOs: one for the data,
> >>>>> one for the fs journal, one for the kv txn to commit (at least
> >>>>> once my rocksdb changes land... the kv commit is currently 2-3).
> >>>>> So two people are managing metadata, here: the fs managing the
> >>>>> file metadata (with its own
> >>>>> journal) and the kv backend (with its journal).
> >>>>>
> >>>>>    - On read we have to open files by name, which means
> >>>>> traversing the fs
> >>>> namespace.  Newstore tries to keep it as flat and simple as
> >>>> possible, but at a minimum it is a couple btree lookups.  We'd
> >>>> love to use open by handle (which would reduce this to 1 btree
> >>>> traversal), but running the daemon as ceph and not root makes
> >>>> that
> hard...
> >>>>>
> >>>>>    - ...and file systems insist on updating mtime on writes,
> >>>>> even when it is a
> >>>> overwrite with no allocation changes.  (We don't care about
> >>>> mtime.) O_NOCMTIME patches exist but it is hard to get these past
> >>>> the kernel brainfreeze.
> >>>>>
> >>>>>    - XFS is (probably) never going going to give us data
> >>>>> checksums, which we
> >>>> want desperately.
> >>>>>
> >>>>> But what's the alternative?  My thought is to just bite the
> >>>>> bullet and
> >>>> consume a raw block device directly.  Write an allocator,
> >>>> hopefully keep it pretty simple, and manage it in kv store along
> >>>> with all of our
> other metadata.
> >>>>>
> >>>>> Wins:
> >>>>>
> >>>>>    - 2 IOs for most: one to write the data to unused space in
> >>>>> the block device,
> >>>> one to commit our transaction (vs 4+ before).  For overwrites,
> >>>> we'd have one io to do our write-ahead log (kv journal), then do
> >>>> the overwrite async (vs 4+ before).
> >>>>>
> >>>>>    - No concern about mtime getting in the way
> >>>>>
> >>>>>    - Faster reads (no fs lookup)
> >>>>>
> >>>>>    - Similarly sized metadata for most objects.  If we assume
> >>>>> most objects are
> >>>> not fragmented, then the metadata to store the block offsets is
> >>>> about the same size as the metadata to store the filenames we
> >>>> have
> now.
> >>>>>
> >>>>> Problems:
> >>>>>
> >>>>>    - We have to size the kv backend storage (probably still an
> >>>>> XFS
> >>>>> partition) vs the block storage.  Maybe we do this anyway (put
> >>>>> metadata on
> >>>>> SSD!) so it won't matter.  But what happens when we are storing
> >>>>> gobs of
> >>>> rgw index data or cephfs metadata?  Suddenly we are pulling
> >>>> storage out of a different pool and those aren't currently fungible.
> >>>>>
> >>>>>    - We have to write and maintain an allocator.  I'm still
> >>>>> optimistic this can be
> >>>> reasonbly simple, especially for the flash case (where
> >>>> fragmentation isn't such an issue as long as our blocks are
> >>>> reasonbly sized).  For disk we may beed to be moderately clever.
> >>>>>
> >>>>>    - We'll need a fsck to ensure our internal metadata is
> >>>>> consistent.  The good
> >>>> news is it'll just need to validate what we have stored in the kv store.
> >>>>>
> >>>>> Other thoughts:
> >>>>>
> >>>>>    - We might want to consider whether dm-thin or bcache or
> >>>>> other block
> >>>> layers might help us with elasticity of file vs block areas.
> >>>>>
> >>>>>    - Rocksdb can push colder data to a second directory, so we
> >>>>> could have a fast ssd primary area (for wal and most metadata)
> >>>>> and a second hdd directory for stuff it has to push off.  Then
> >>>>> have a conservative amount of file space on the hdd.  If our
> >>>>> block fills up, use the existing file mechanism to put data
> >>>>> there too.  (But then we have to maintain both the current kv +
> >>>>> file approach and not go all-in on kv +
> >>>>> block.)
> >>>>>
> >>>>> Thoughts?
> >>>>> sage
> >>>>> --



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-22  0:53       ` Allen Samuels
@ 2015-10-22  1:16         ` Ric Wheeler
  2015-10-22  1:22           ` Allen Samuels
  0 siblings, 1 reply; 71+ messages in thread
From: Ric Wheeler @ 2015-10-22  1:16 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil, ceph-devel

On 10/21/2015 08:53 PM, Allen Samuels wrote:
> Fixing the bug doesn't take a long time. Getting it deployed is where the delay is. Many companies standardize on a particular release of a particular distro. Getting them to switch to a new release -- even a "bug fix" point release -- is a major undertaking that often is a complete roadblock. Just my experience. YMMV.
>

Customers do control the pace at which they upgrade their machines, but we put 
out fixes at a very regular pace.  A lot of customers will get fixes without 
having to qualify a full new release (i.e., fixes that come out between major 
and minor releases are easy to take).

If someone is deploying a critical server for storage, then it falls back on the 
storage software team to help guide them and encourage them to update when 
needed (and no promises of success, but people move if the win is big. If it is 
not, they can wait).

ric


^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: newstore direction
  2015-10-22  1:16         ` Ric Wheeler
@ 2015-10-22  1:22           ` Allen Samuels
  2015-10-23  2:10             ` Ric Wheeler
  0 siblings, 1 reply; 71+ messages in thread
From: Allen Samuels @ 2015-10-22  1:22 UTC (permalink / raw)
  To: Ric Wheeler, Sage Weil, ceph-devel

I agree. My only point was that you still have to factor this deployment time into the argument that, by continuing to put NewStore on top of a file system, you'll get to a stable system much sooner than with the longer development path of writing your own raw storage allocator. IMO, once you factor that into the equation, the "on top of an FS" path doesn't look like such a clear winner.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com

-----Original Message-----
From: Ric Wheeler [mailto:rwheeler@redhat.com]
Sent: Thursday, October 22, 2015 10:17 AM
To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/21/2015 08:53 PM, Allen Samuels wrote:
> Fixing the bug doesn't take a long time. Getting it deployed is where the delay is. Many companies standardize on a particular release of a particular distro. Getting them to switch to a new release -- even a "bug fix" point release -- is a major undertaking that often is a complete roadblock. Just my experience. YMMV.
>

Customers do control the pace at which they upgrade their machines, but we put out fixes at a very regular pace.  A lot of customers will get fixes without having to qualify a full new release (i.e., fixes that come out between major and minor releases are easy to pick up).

If someone is deploying a critical server for storage, then it falls back on the storage software team to help guide them and encourage them to update when needed (and no promises of success, but people move if the win is big. If it is not, they can wait).

ric




^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: newstore direction
  2015-10-21 13:44     ` Mark Nelson
@ 2015-10-22  1:39       ` Allen Samuels
  0 siblings, 0 replies; 71+ messages in thread
From: Allen Samuels @ 2015-10-22  1:39 UTC (permalink / raw)
  To: Mark Nelson, Ric Wheeler, Sage Weil, ceph-devel

I am pushing internally to open-source ZetaScale. Recent events may or may not affect that trajectory -- stay tuned.

Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com

-----Original Message-----
From: Mark Nelson [mailto:mnelson@redhat.com] 
Sent: Wednesday, October 21, 2015 10:45 PM
To: Allen Samuels <Allen.Samuels@sandisk.com>; Ric Wheeler <rwheeler@redhat.com>; Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/21/2015 05:06 AM, Allen Samuels wrote:
> I agree that moving newStore to raw block is going to be a significant development effort. But the current scheme of using a KV store combined with a normal file system is always going to be problematic (FileStore or NewStore). This is caused by the transactional requirements of the ObjectStore interface: essentially you need to make transactionally consistent updates to two indexes, one of which doesn't understand transactions (file systems) and can never be tightly connected to the other one.
>
> You'll always be able to make this "loosely coupled" approach work, but it will never be optimal. The real question is whether the performance difference of a suboptimal implementation is something that you can live with compared to the longer gestation period of the more optimal implementation. Clearly, Sage believes that the performance difference is significant or he wouldn't have kicked off this discussion in the first place.
>
> While I think we can all agree that writing a full-up KV and raw-block ObjectStore is a significant amount of work, I will offer the case that the "loosely coupled" scheme may not have as much time-to-market advantage as it appears to have. One example: NewStore performance is limited due to bugs in XFS that won't be fixed in the field for quite some time (it'll take at least a couple of years before a patched version of XFS will be widely deployed in customer environments).
>
> Another example: Sage has just had to substantially rework the journaling code of rocksDB.
>
> In short, as you can tell, I'm full throated in favor of going down the optimal route.
>
> Internally at Sandisk, we have a KV store that is optimized for flash (it's called ZetaScale). We have extended it with a raw block allocator just as Sage is now proposing to do. Our internal performance measurements show a significant advantage over the current NewStore. That performance advantage stems primarily from two things:

Has there been any discussion regarding opensourcing zetascale?

>
> (1) ZetaScale uses a B+-tree internally rather than an LSM tree (levelDB/RocksDB). LSM trees experience exponential increase in write amplification (cost of an insert) as the amount of data under management increases. B+tree write-amplification is nearly constant independent of the size of data under management. As the KV database gets larger (since newStore is effectively moving the per-file inode into the kv database; don't forget the checksums that Sage wants to add :)) this performance delta swamps all others.
> (2) Having a KV and a file-system causes a double lookup. This costs CPU time and disk accesses to page in data structure indexes, and metadata efficiency decreases.
>
> You can't avoid (2) as long as you're using a file system.
>
> Yes an LSM tree performs better on HDD than does a B-tree, which is a good argument for keeping the KV module pluggable.
>
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Ric Wheeler
> Sent: Tuesday, October 20, 2015 11:32 AM
> To: Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> On 10/19/2015 03:49 PM, Sage Weil wrote:
>> The current design is based on two simple ideas:
>>
>>    1) a key/value interface is better way to manage all of our 
>> internal metadata (object metadata, attrs, layout, collection 
>> membership, write-ahead logging, overlay data, etc.)
>>
>>    2) a file system is well suited for storage object data (as files).
>>
>> So far 1 is working out well, but I'm questioning the wisdom of #2.  
>> A few
>> things:
>>
>>    - We currently write the data to the file, fsync, then commit the 
>> kv transaction.  That's at least 3 IOs: one for the data, one for the 
>> fs journal, one for the kv txn to commit (at least once my rocksdb 
>> changes land... the kv commit is currently 2-3).  So two people are 
>> managing metadata, here: the fs managing the file metadata (with its 
>> own
>> journal) and the kv backend (with its journal).
>
> If all of the fsync()'s fall into the same backing file system, are you sure that each fsync() takes the same time? Depending on the local FS implementation of course, but the order of issuing those fsync()'s can effectively make some of them no-ops.
>
>>
>>    - On read we have to open files by name, which means traversing 
>> the fs namespace.  Newstore tries to keep it as flat and simple as 
>> possible, but at a minimum it is a couple btree lookups.  We'd love 
>> to use open by handle (which would reduce this to 1 btree traversal), 
>> but running the daemon as ceph and not root makes that hard...
>
> This seems like a pretty low hurdle to overcome.
>
>>
>>    - ...and file systems insist on updating mtime on writes, even 
>> when it is a overwrite with no allocation changes.  (We don't care 
>> about
>> mtime.) O_NOCMTIME patches exist but it is hard to get these past the 
>> kernel brainfreeze.
>
> Are you using O_DIRECT? Seems like there should be some enterprisey database tricks that we can use here.
>
>>
>>    - XFS is (probably) never going going to give us data checksums, 
>> which we want desperately.
>
> What is the goal of having the file system do the checksums? How strong do they need to be and what size are the chunks?
>
> If you update this on each IO, this will certainly generate more IO (each write will possibly generate at least one other write to update that new checksum).
>
>>
>> But what's the alternative?  My thought is to just bite the bullet 
>> and consume a raw block device directly.  Write an allocator, 
>> hopefully keep it pretty simple, and manage it in kv store along with 
>> all of our other metadata.
>
> The big problem with consuming block devices directly is that you ultimately end up recreating most of the features that you had in the file system. Even enterprise databases like Oracle and DB2 have been migrating away from running on raw block devices in favor of file systems over time.  In effect, you are looking at making a simple on disk file system which is always easier to start than it is to get back to a stable, production ready state.
>
> I think that it might be quicker and more maintainable to spend some time working with the local file system people (XFS or other) to see if we can jointly address the concerns you have.
>>
>> Wins:
>>
>>    - 2 IOs for most: one to write the data to unused space in the 
>> block device, one to commit our transaction (vs 4+ before).  For 
>> overwrites, we'd have one io to do our write-ahead log (kv journal), 
>> then do the overwrite async (vs 4+ before).
>>
>>    - No concern about mtime getting in the way
>>
>>    - Faster reads (no fs lookup)
>>
>>    - Similarly sized metadata for most objects.  If we assume most 
>> objects are not fragmented, then the metadata to store the block 
>> offsets is about the same size as the metadata to store the filenames we have now.
>>
>> Problems:
>>
>>    - We have to size the kv backend storage (probably still an XFS
>> partition) vs the block storage.  Maybe we do this anyway (put 
>> metadata on
>> SSD!) so it won't matter.  But what happens when we are storing gobs 
>> of rgw index data or cephfs metadata?  Suddenly we are pulling 
>> storage out of a different pool and those aren't currently fungible.
>>
>>    - We have to write and maintain an allocator.  I'm still 
>> optimistic this can be reasonbly simple, especially for the flash 
>> case (where fragmentation isn't such an issue as long as our blocks 
>> are reasonbly sized).  For disk we may beed to be moderately clever.
>>
>>    - We'll need a fsck to ensure our internal metadata is consistent.
>> The good news is it'll just need to validate what we have stored in 
>> the kv store.
>>
>> Other thoughts:
>>
>>    - We might want to consider whether dm-thin or bcache or other 
>> block layers might help us with elasticity of file vs block areas.
>>
>>    - Rocksdb can push colder data to a second directory, so we could 
>> have a fast ssd primary area (for wal and most metadata) and a second 
>> hdd directory for stuff it has to push off.  Then have a conservative 
>> amount of file space on the hdd.  If our block fills up, use the 
>> existing file mechanism to put data there too.  (But then we have to 
>> maintain both the current kv + file approach and not go all-in on kv 
>> +
>> block.)
>>
>> Thoughts?
>> sage
>> --
>
> I really hate the idea of making a new file system type (even if we call it a raw block store!).
>
> In addition to the technical hurdles, there are also production worries like how long will it take for distros to pick up formal support?  How do we test it properly?
>
> Regards,
>
> Ric
>
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: newstore direction
  2015-10-21 21:20             ` Martin Millnert
@ 2015-10-22  2:12               ` Allen Samuels
  2015-10-22  8:51                 ` Orit Wasserman
  0 siblings, 1 reply; 71+ messages in thread
From: Allen Samuels @ 2015-10-22  2:12 UTC (permalink / raw)
  To: Martin Millnert, Mark Nelson; +Cc: Ric Wheeler, Sage Weil, ceph-devel

One of the biggest changes that flash is making in the storage world is the way basic trade-offs in storage management software architecture are being affected. In the HDD world CPU time per IOP was relatively inconsequential, i.e., it had little effect on overall performance, which was limited by the physics of the hard drive. Flash is now inverting that situation. When you look at the performance levels being delivered in the latest generation of NVMe SSDs you rapidly see that storage itself is generally no longer the bottleneck (speaking about BW, not latency of course) but rather it's the system sitting in front of the storage that is the bottleneck. Generally it's the CPU cost of an IOP.

When Sandisk first started working with Ceph (Dumpling), the design of librados and the OSD led to a situation where the CPU cost of an IOP was dominated by context switches and network socket handling. Over time, much of that has been addressed. The socket handling code has been re-written (more than once!) and some of the internal queueing in the OSD (and the associated context switches) has been eliminated. As the CPU costs have dropped, performance on flash has improved accordingly.

Because we didn't want to completely re-write the OSD (time-to-market and stability drove that decision), we didn't move it from the current "thread per IOP" model into a truly asynchronous "thread per CPU core" model that essentially eliminates context switches in the IO path. But a fully optimized OSD would go down that path (at least part-way). I believe it's been proposed in the past. Perhaps a hybrid "fast-path" style could get most of the benefits while preserving much of the legacy code.

I believe this trend toward thread-per-core software development will also tend to support the "do it in user-space" trend. That's because most of the kernel and file-system interface is architected around the blocking "thread-per-IOP" model and is unlikely to change in the future.
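Roughly, that model is one pinned thread per core, each with its own queue, running each request to completion instead of handing it off to another thread. A toy sketch (not Ceph code; the request and queue types here are made up purely for illustration):

/* Toy sketch of a thread-per-core, run-to-completion shard: one thread is
 * pinned to one core and drains its own queue.  A real design would use a
 * lock-free queue and block or poll instead of spinning. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>

struct request {
	void (*run)(struct request *);
	struct request *next;
};

struct shard {
	int core;
	pthread_mutex_t lock;
	struct request *head;
};

static void *shard_main(void *arg)
{
	struct shard *s = arg;
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(s->core, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	for (;;) {
		struct request *r;

		pthread_mutex_lock(&s->lock);
		r = s->head;
		if (r)
			s->head = r->next;
		pthread_mutex_unlock(&s->lock);

		if (!r)
			continue;	/* block or poll here in real code */
		r->run(r);		/* run to completion: no handoff, no extra switch */
		free(r);
	}
	return NULL;
}

int main(void)
{
	struct shard s = { .core = 0, .lock = PTHREAD_MUTEX_INITIALIZER };
	pthread_t t;

	pthread_create(&t, NULL, shard_main, &s);
	pthread_join(t, NULL);	/* never returns in this toy */
	return 0;
}

The request never changes threads, so the per-IOP context-switch and cache-miss costs stay bounded no matter how deep the queue gets.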


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com

-----Original Message-----
From: Martin Millnert [mailto:martin@millnert.se]
Sent: Thursday, October 22, 2015 6:20 AM
To: Mark Nelson <mnelson@redhat.com>
Cc: Ric Wheeler <rwheeler@redhat.com>; Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

Adding 2c

On Wed, 2015-10-21 at 14:37 -0500, Mark Nelson wrote:
> My thought is that there is some inflection point where the userland
> kvstore/block approach is going to be less work, for everyone I think,
> than trying to quickly discover, understand, fix, and push upstream
> patches that sometimes only really benefit us.  I don't know if we've
> truly hit that point, but it's tough for me to find flaws with
> Sage's argument.

Regarding the userland / kernel land aspect of the topic, there are further aspects AFAIK not yet addressed in the thread:
In the networking world, there's been development on memory mapped (multiple approaches exist) userland networking, which for packet management has the benefit of - for very, very specific applications of networking code - avoiding e.g. per-packet context switches etc, and streamlining processor cache management performance. People have gone as far as removing CPU cores from CPU scheduler to completely dedicate them to the networking task at hand (cache optimizations). There are various latency/throughput (bulking) optimizations applicable, but at the end of the day, it's about keeping the CPU bus busy with "revenue" bus traffic.

Granted, storage IO operations may be much heavier in cycle counts for context switches to ever appear as a problem in themselves, certainly for slower SSDs and HDDs. However, when going for truly high performance IO, *every* hurdle in the data path counts toward the total latency.
(And really, high performance random IO characteristics approaches the networking, per-packet handling characteristics).  Now, I'm not really suggesting memory-mapping a storage device to user space, not at all, but having better control over the data path for a very specific use case, reduces dependency on the code that works as best as possible for the general case, and allows for very purpose-built code, to address a narrow set of requirements. ("Ceph storage cluster backend" isn't a typical FS use case.) It also decouples dependencies on users i.e.
waiting for the next distro release before being able to take up the benefits of improvements to the storage code.

A random google came up with related data on where "doing something way different" /can/ have significant benefits:
http://phunq.net/pipermail/tux3/2015-April/002147.html

I (FWIW) certainly agree there is merit to the idea.
The scientific approach here could perhaps be to simply enumerate all corner cases of "generic FS" that actually are cause for the experienced issues, and assess probability of them being solved (and if so when).
That *could* improve chances of approaching consensus which wouldn't hurt I suppose?

BR,
Martin




^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-21 17:30       ` Sage Weil
@ 2015-10-22  8:31         ` Christoph Hellwig
  0 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2015-10-22  8:31 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ric Wheeler, Orit Wasserman, ceph-devel

On Wed, Oct 21, 2015 at 10:30:28AM -0700, Sage Weil wrote:
> For example: we need to do an overwrite of an existing object that is 
> atomic with respect to a larger ceph transaction (we're updating a bunch 
> of other metadata at the same time, possibly overwriting or appending to 
> multiple files, etc.).  XFS and ext4 aren't cow file systems, so plugging 
> into the transaction infrastructure isn't really an option (and even after 
> several years of trying to do it with btrfs it proved to be impractical).  

Not that I'm disagreeing with most of your points, but we can do things
like that with swapext-like hacks.  Below is my half-year-old prototype
of an O_ATOMIC implementation for XFS that gives you atomic out of place
writes.

diff --git a/fs/fcntl.c b/fs/fcntl.c
index ee85cd4..001dd49 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -740,7 +740,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
+	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
 		O_RDONLY	| O_WRONLY	| O_RDWR	|
 		O_CREAT		| O_EXCL	| O_NOCTTY	|
 		O_TRUNC		| O_APPEND	| /* O_NONBLOCK	| */
@@ -748,6 +748,7 @@ static int __init fcntl_init(void)
 		O_DIRECT	| O_LARGEFILE	| O_DIRECTORY	|
 		O_NOFOLLOW	| O_NOATIME	| O_CLOEXEC	|
 		__FMODE_EXEC	| O_PATH	| __O_TMPFILE	|
+		O_ATOMIC	|
 		__FMODE_NONOTIFY
 		));
 
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index aeffeaa..8eafca6 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4681,14 +4681,14 @@ xfs_bmap_del_extent(
 	xfs_btree_cur_t		*cur,	/* if null, not a btree */
 	xfs_bmbt_irec_t		*del,	/* data to remove from extents */
 	int			*logflagsp, /* inode logging flags */
-	int			whichfork) /* data or attr fork */
+	int			whichfork, /* data or attr fork */
+	bool			free_blocks) /* free extent at end of routine */
 {
 	xfs_filblks_t		da_new;	/* new delay-alloc indirect blocks */
 	xfs_filblks_t		da_old;	/* old delay-alloc indirect blocks */
 	xfs_fsblock_t		del_endblock=0;	/* first block past del */
 	xfs_fileoff_t		del_endoff;	/* first offset past del */
 	int			delay;	/* current block is delayed allocated */
-	int			do_fx;	/* free extent at end of routine */
 	xfs_bmbt_rec_host_t	*ep;	/* current extent entry pointer */
 	int			error;	/* error return value */
 	int			flags;	/* inode logging flags */
@@ -4712,8 +4712,8 @@ xfs_bmap_del_extent(
 
 	mp = ip->i_mount;
 	ifp = XFS_IFORK_PTR(ip, whichfork);
-	ASSERT((*idx >= 0) && (*idx < ifp->if_bytes /
-		(uint)sizeof(xfs_bmbt_rec_t)));
+	ASSERT(*idx >= 0);
+	ASSERT(*idx < ifp->if_bytes / sizeof(xfs_bmbt_rec_t));
 	ASSERT(del->br_blockcount > 0);
 	ep = xfs_iext_get_ext(ifp, *idx);
 	xfs_bmbt_get_all(ep, &got);
@@ -4746,10 +4746,13 @@ xfs_bmap_del_extent(
 			len = del->br_blockcount;
 			do_div(bno, mp->m_sb.sb_rextsize);
 			do_div(len, mp->m_sb.sb_rextsize);
-			error = xfs_rtfree_extent(tp, bno, (xfs_extlen_t)len);
-			if (error)
-				goto done;
-			do_fx = 0;
+			if (free_blocks) {
+				error = xfs_rtfree_extent(tp, bno,
+						(xfs_extlen_t)len);
+				if (error)
+					goto done;
+				free_blocks = 0;
+			}
 			nblks = len * mp->m_sb.sb_rextsize;
 			qfield = XFS_TRANS_DQ_RTBCOUNT;
 		}
@@ -4757,7 +4760,6 @@ xfs_bmap_del_extent(
 		 * Ordinary allocation.
 		 */
 		else {
-			do_fx = 1;
 			nblks = del->br_blockcount;
 			qfield = XFS_TRANS_DQ_BCOUNT;
 		}
@@ -4777,7 +4779,7 @@ xfs_bmap_del_extent(
 		da_old = startblockval(got.br_startblock);
 		da_new = 0;
 		nblks = 0;
-		do_fx = 0;
+		free_blocks = 0;
 	}
 	/*
 	 * Set flag value to use in switch statement.
@@ -4963,7 +4965,7 @@ xfs_bmap_del_extent(
 	/*
 	 * If we need to, add to list of extents to delete.
 	 */
-	if (do_fx)
+	if (free_blocks)
 		xfs_bmap_add_free(del->br_startblock, del->br_blockcount, flist,
 			mp);
 	/*
@@ -5291,7 +5293,7 @@ xfs_bunmapi(
 			goto error0;
 		}
 		error = xfs_bmap_del_extent(ip, tp, &lastx, flist, cur, &del,
-				&tmp_logflags, whichfork);
+				&tmp_logflags, whichfork, true);
 		logflags |= tmp_logflags;
 		if (error)
 			goto error0;
@@ -5936,3 +5938,291 @@ out:
 	xfs_trans_cancel(tp, XFS_TRANS_RELEASE_LOG_RES | XFS_TRANS_ABORT);
 	return error;
 }
+
+/*
+ * Create an extent tree pointing to an existing allocation.
+ * This is a small subset of the functionality in xfs_bmap_add_extent_hole_real.
+ *
+ * Note: we don't bother merging with neighbours.
+ */
+STATIC int
+xfs_bmap_insert_extent_real(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	struct xfs_bmbt_irec	*new,
+	struct xfs_btree_cur	*cur,
+	xfs_extnum_t		idx,
+	xfs_fsblock_t		*firstblock,
+	struct xfs_bmap_free	*flist,
+	int			*logflags)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	int			error = 0, rval = 0, i;
+
+	ASSERT(idx >= 0);
+	ASSERT(idx <= ip->i_df.if_bytes / sizeof(struct xfs_bmbt_rec));
+	ASSERT(!isnullstartblock(new->br_startblock));
+	ASSERT(!cur || !(cur->bc_private.b.flags & XFS_BTCUR_BPRV_WASDEL));
+
+	XFS_STATS_INC(xs_add_exlist);
+
+	xfs_iext_insert(ip, idx, 1, new, 0);
+	ip->i_d.di_nextents++;
+	ip->i_d.di_nblocks += new->br_blockcount;
+
+	if (cur == NULL) {
+		rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
+	} else {
+		rval = XFS_ILOG_CORE;
+		error = xfs_bmbt_lookup_eq(cur,
+				new->br_startoff,
+				new->br_startblock,
+				new->br_blockcount, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 0, done);
+		cur->bc_rec.b.br_state = new->br_state;
+		error = xfs_btree_insert(cur, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+	}
+
+	/* convert to a btree if necessary */
+	if (xfs_bmap_needs_btree(ip, XFS_DATA_FORK)) {
+		int	tmp_logflags;	/* partial log flag return val */
+
+		ASSERT(cur == NULL);
+		error = xfs_bmap_extents_to_btree(tp, ip, firstblock, flist,
+				&cur, 0, &tmp_logflags, XFS_DATA_FORK);
+		*logflags |= tmp_logflags;
+		if (error)
+			goto done;
+	}
+
+	/* clear out the allocated field, done with it now in any case. */
+	if (cur)
+		cur->bc_private.b.allocated = 0;
+
+	xfs_bmap_check_leaf_extents(cur, ip, XFS_DATA_FORK);
+done:
+	*logflags |= rval;
+	return error;
+}
+
+int
+xfs_bmapi_insert(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	struct xfs_bmbt_irec	*new,
+	xfs_fsblock_t		*firstblock,
+	struct xfs_bmap_free	*flist)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
+	int			whichfork = XFS_DATA_FORK;
+	int			eof;
+	int			error;
+	char			inhole;	
+	char			wasdelay;
+	struct xfs_bmbt_irec	got;
+	struct xfs_bmbt_irec	prev;
+	struct xfs_btree_cur	*cur = NULL;
+	xfs_extnum_t		idx;
+	int			logflags = 0;
+
+	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
+
+	if (unlikely(XFS_TEST_ERROR(
+	    (XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_EXTENTS &&
+	     XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_BTREE),
+	     mp, XFS_ERRTAG_BMAPIFORMAT, XFS_RANDOM_BMAPIFORMAT))) {
+		XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, mp);
+		return -EFSCORRUPTED;
+	}
+
+	if (XFS_FORCED_SHUTDOWN(mp))
+		return -EIO;
+
+	XFS_STATS_INC(xs_blk_mapw);
+
+	if (!(ifp->if_flags & XFS_IFEXTENTS)) {
+		error = xfs_iread_extents(tp, ip, whichfork);
+		if (error)
+			goto error0;
+	}
+
+	xfs_bmap_search_extents(ip, new->br_startoff, whichfork,
+			&eof, &idx, &got, &prev);
+
+	inhole = eof || got.br_startoff > new->br_startoff;
+	wasdelay = !inhole && isnullstartblock(got.br_startblock);
+	ASSERT(!wasdelay);
+	ASSERT(inhole);
+
+	if (ifp->if_flags & XFS_IFBROOT) {
+		cur = xfs_bmbt_init_cursor(mp, tp, ip, whichfork);
+		cur->bc_private.b.flist = flist;
+		cur->bc_private.b.firstblock = *firstblock;
+		cur->bc_private.b.flags = 0;
+	}
+
+	error = xfs_bmap_insert_extent_real(tp, ip, new, cur, idx, firstblock,
+			flist, &logflags);
+	if (error)
+		return error;
+
+	/*
+	 * Transform from btree to extents, give it cur.
+	 */
+	if (xfs_bmap_wants_extents(ip, whichfork)) {
+		int		tmp_logflags = 0;
+
+		ASSERT(cur);
+		error = xfs_bmap_btree_to_extents(tp, ip, cur,
+			&tmp_logflags, whichfork);
+		logflags |= tmp_logflags;
+		if (error)
+			goto error0;
+	}
+
+	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_BTREE ||
+	       XFS_IFORK_NEXTENTS(ip, whichfork) >
+		XFS_IFORK_MAXEXT(ip, whichfork));
+	error = 0;
+error0:
+	/*
+	 * Log everything.  Do this after conversion, there's no point in
+	 * logging the extent records if we've converted to btree format.
+	 */
+	if ((logflags & xfs_ilog_fext(whichfork)) &&
+	    XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_EXTENTS)
+		logflags &= ~xfs_ilog_fext(whichfork);
+	else if ((logflags & xfs_ilog_fbroot(whichfork)) &&
+		 XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_BTREE)
+		logflags &= ~xfs_ilog_fbroot(whichfork);
+	/*
+	 * Log whatever the flags say, even if error.  Otherwise we might miss
+	 * detecting a case where the data is changed, there's an error,
+	 * and it's not logged so we don't shutdown when we should.
+	 */
+	if (logflags)
+		xfs_trans_log_inode(tp, ip, logflags);
+
+	if (cur) {
+		if (!error) {
+			ASSERT(*firstblock == NULLFSBLOCK ||
+			       XFS_FSB_TO_AGNO(mp, *firstblock) ==
+			       XFS_FSB_TO_AGNO(mp,
+				       cur->bc_private.b.firstblock) ||
+			       (flist->xbf_low &&
+				XFS_FSB_TO_AGNO(mp, *firstblock) <
+				XFS_FSB_TO_AGNO(mp,
+					cur->bc_private.b.firstblock)));
+			*firstblock = cur->bc_private.b.firstblock;
+		}
+		xfs_btree_del_cursor(cur,
+			error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
+	}
+	return error;
+}
+
+/*
+ * Remove the extent pointed to by del from the extent map, but do not free
+ * the blocks for it.
+ */
+int
+xfs_bmapi_unmap(
+	struct xfs_trans	*tp,		/* transaction pointer */
+	struct xfs_inode	*ip,		/* incore inode */
+	xfs_extnum_t		idx,		/* extent number to update/delete */
+	struct xfs_bmbt_irec	*del,		/* extent being deleted */
+	xfs_fsblock_t		*firstblock,	/* first allocated block
+						   controls a.g. for allocs */
+	struct xfs_bmap_free	*flist)		/* i/o: list extents to free */
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_ifork	*ifp = &ip->i_df;
+	int			whichfork = XFS_DATA_FORK;
+	struct xfs_btree_cur	*cur;
+	int			error;
+	int			logflags = 0;
+
+	if (unlikely(
+	    XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_EXTENTS &&
+	    XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_BTREE)) {
+		XFS_ERROR_REPORT("xfs_bunmapi", XFS_ERRLEVEL_LOW,
+				 ip->i_mount);
+		return -EFSCORRUPTED;
+	}
+
+	if (XFS_FORCED_SHUTDOWN(mp))
+		return -EIO;
+
+	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
+
+	if (!(ifp->if_flags & XFS_IFEXTENTS)) {
+		error = xfs_iread_extents(tp, ip, whichfork);
+		if (error)
+			return error;
+	}
+
+	XFS_STATS_INC(xs_blk_unmap);
+
+	if (ifp->if_flags & XFS_IFBROOT) {
+		ASSERT(XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE);
+		cur = xfs_bmbt_init_cursor(mp, tp, ip, whichfork);
+		cur->bc_private.b.firstblock = *firstblock;
+		cur->bc_private.b.flist = flist;
+		cur->bc_private.b.flags = 0;
+	} else
+		cur = NULL;
+
+	ASSERT(!isnullstartblock(del->br_startblock));
+	error = xfs_bmap_del_extent(ip, tp, &idx, flist, cur, del,
+			&logflags, whichfork, false);
+	if (error)
+		goto error0;
+
+	/*
+	 * transform from btree to extents, give it cur
+	 */
+	if (xfs_bmap_wants_extents(ip, whichfork)) {
+		int tmp_logflags = 0;
+
+		ASSERT(cur != NULL);
+		error = xfs_bmap_btree_to_extents(tp, ip, cur, &tmp_logflags,
+			whichfork);
+		logflags |= tmp_logflags;
+		if (error)
+			goto error0;
+	}
+
+error0:
+	/*
+	 * Log everything.  Do this after conversion, there's no point in
+	 * logging the extent records if we've converted to btree format.
+	 */
+	if ((logflags & xfs_ilog_fext(whichfork)) &&
+	    XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_EXTENTS)
+		logflags &= ~xfs_ilog_fext(whichfork);
+	else if ((logflags & xfs_ilog_fbroot(whichfork)) &&
+		 XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_BTREE)
+		logflags &= ~xfs_ilog_fbroot(whichfork);
+	/*
+	 * Log inode even in the error case, if the transaction
+	 * is dirty we'll need to shut down the filesystem.
+	 */
+	if (logflags)
+		xfs_trans_log_inode(tp, ip, logflags);
+	if (cur) {
+		if (!error) {
+			*firstblock = cur->bc_private.b.firstblock;
+			cur->bc_private.b.allocated = 0;
+		}
+		xfs_btree_del_cursor(cur,
+			error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
+	}
+	return error;
+}
+		
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 6aaa0c1..394843f 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -221,5 +221,11 @@ int	xfs_bmap_shift_extents(struct xfs_trans *tp, struct xfs_inode *ip,
 		struct xfs_bmap_free *flist, enum shift_direction direction,
 		int num_exts);
 int	xfs_bmap_split_extent(struct xfs_inode *ip, xfs_fileoff_t split_offset);
+int	xfs_bmapi_insert(struct xfs_trans *tp, struct xfs_inode *ip,
+		struct xfs_bmbt_irec *new, xfs_fsblock_t *firstblock,
+		struct xfs_bmap_free *flist);
+int	xfs_bmapi_unmap(struct xfs_trans *tp, struct xfs_inode *ip,
+		xfs_extnum_t idx, struct xfs_bmbt_irec *del,
+		xfs_fsblock_t *firstblock, struct xfs_bmap_free *flist);
 
 #endif	/* __XFS_BMAP_H__ */
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index a56960d..e64ffd80 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1365,6 +1365,9 @@ __xfs_get_blocks(
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
+	if (ip->i_cow && !ip->i_df.if_bytes && !create)
+		ip = ip->i_cow;
+
 	offset = (xfs_off_t)iblock << inode->i_blkbits;
 	ASSERT(bh_result->b_size >= (1 << inode->i_blkbits));
 	size = bh_result->b_size;
@@ -1372,6 +1375,7 @@ __xfs_get_blocks(
 	if (!create && direct && offset >= i_size_read(inode))
 		return 0;
 
+retry:
 	/*
 	 * Direct I/O is usually done on preallocated files, so try getting
 	 * a block mapping without an exclusive lock first.  For buffered
@@ -1397,6 +1401,13 @@ __xfs_get_blocks(
 	if (error)
 		goto out_unlock;
 
+	if (!create && ip->i_cow &&
+	    (!nimaps || imap.br_startblock == HOLESTARTBLOCK)) {
+		xfs_iunlock(ip, lockmode);
+		ip = ip->i_cow;
+		goto retry;
+	}
+
 	if (create &&
 	    (!nimaps ||
 	     (imap.br_startblock == HOLESTARTBLOCK ||
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index a52bbd3..c45f15e 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1918,3 +1918,262 @@ out_trans_cancel:
 	xfs_trans_cancel(tp, 0);
 	goto out;
 }
+
+static int
+xfs_remove_extent(
+	struct xfs_trans	**tpp,
+	struct xfs_inode	*ip,
+	struct xfs_bmbt_irec	*del,
+	bool			*done)
+{
+	struct xfs_trans	*tp = *tpp, *ntp;
+	struct xfs_ifork	*ifp = &ip->i_df;
+	struct xfs_bmap_free	free_list;
+	xfs_fsblock_t		firstblock;
+	int			error, committed;
+	xfs_extnum_t		nextents, idx;
+
+	xfs_trans_ijoin(tp, ip, 0);
+
+	/*
+	 * Always delete the first last extents, this avoids shifting around
+	 * the extent list every time.
+	 *
+	 * XXX: find a way to avoid the transaction allocation without extents?
+	 */
+	nextents = ifp->if_bytes / sizeof(struct xfs_bmbt_rec);
+	if (!nextents) {
+		*done = true;
+		return 0;
+	}
+	idx = nextents - 1;
+	xfs_bmbt_get_all(xfs_iext_get_ext(ifp, idx), del);
+
+	xfs_bmap_init(&free_list, &firstblock);
+	error = xfs_bmapi_unmap(tp, ip, idx, del, &firstblock, &free_list);
+	if (error)
+		goto out_bmap_cancel;
+
+	error = xfs_bmap_finish(&tp, &free_list, &committed);
+	if (error)
+		goto out_bmap_cancel;
+
+	if (committed) {
+		xfs_trans_ijoin(tp, ip, 0);
+		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+	}
+
+	ntp = xfs_trans_dup(tp);
+	error = xfs_trans_commit(tp, 0);
+	tp = ntp;
+	xfs_trans_ijoin(tp, ip, 0);
+
+	if (error) {
+		xfs_trans_cancel(tp, 0);
+		goto out_error;
+	}
+
+	xfs_log_ticket_put(tp->t_ticket);
+	error = xfs_trans_reserve(tp, &M_RES(ip->i_mount)->tr_write, 0, 0);
+	if (error) {
+		xfs_trans_cancel(tp, 0);
+		goto out_error;
+	}
+
+	*tpp = tp;
+	return 0;
+
+out_bmap_cancel:
+	xfs_bmap_cancel(&free_list);
+	xfs_trans_cancel(tp, XFS_TRANS_RELEASE_LOG_RES | XFS_TRANS_ABORT);
+out_error:
+	*tpp = NULL;
+	return error;
+}
+
+static int
+xfs_free_range(
+	struct xfs_trans	**tpp,
+	struct xfs_inode	*ip,
+	struct xfs_bmbt_irec	*del)
+{
+	struct xfs_trans	*tp = *tpp, *ntp;
+	struct xfs_bmap_free	free_list;
+	int			committed;
+	int			done;
+	int			error = 0;
+	xfs_fsblock_t		firstfsb;
+
+	while (!error && !done) {
+		xfs_trans_ijoin(tp, ip, 0);
+
+		xfs_bmap_init(&free_list, &firstfsb);
+		error = xfs_bunmapi(tp, ip, del->br_startoff,
+				del->br_blockcount, 0, 2,
+				&firstfsb, &free_list, &done);
+		if (error)
+			goto out_bmap_cancel;
+
+		error = xfs_bmap_finish(&tp, &free_list, &committed);
+		if (error)
+			goto out_bmap_cancel;
+
+		if (committed) {
+			xfs_trans_ijoin(tp, ip, 0);
+			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+		}
+
+		ntp = xfs_trans_dup(tp);
+		error = xfs_trans_commit(tp, 0);
+		tp = ntp;
+		xfs_trans_ijoin(tp, ip, 0);
+
+		if (error) 
+			goto out_error;
+
+		xfs_log_ticket_put(tp->t_ticket);
+		error = xfs_trans_reserve(tp, &M_RES(ip->i_mount)->tr_write, 0, 0);
+		if (error)
+			goto out_error;
+	}
+
+	*tpp = tp;
+	return 0;
+
+out_bmap_cancel:
+	xfs_bmap_cancel(&free_list);
+out_error:
+	xfs_trans_cancel(tp, XFS_TRANS_RELEASE_LOG_RES | XFS_TRANS_ABORT);
+	*tpp = NULL;
+	return error;
+}
+
+static int
+xfs_insert_extent(
+	struct xfs_trans	**tpp,
+	struct xfs_inode	*ip,
+	struct xfs_bmbt_irec	*r)
+{
+	struct xfs_trans	*tp = *tpp, *ntp;
+	struct xfs_bmap_free	free_list;
+	xfs_fsblock_t		firstblock;
+	int			error, committed;
+
+	xfs_trans_ijoin(tp, ip, 0);
+	xfs_bmap_init(&free_list, &firstblock);
+	error = xfs_bmapi_insert(tp, ip, r, &firstblock, &free_list);
+	if (error)
+		goto out_bmap_cancel;
+
+	error = xfs_bmap_finish(&tp, &free_list, &committed);
+	if (error)
+		goto out_bmap_cancel;
+
+	ntp = xfs_trans_dup(tp);
+	error = xfs_trans_commit(tp, 0);
+	tp = ntp;
+	xfs_trans_ijoin(tp, ip, 0);
+
+	if (error)
+		goto out_error;
+
+	xfs_log_ticket_put(tp->t_ticket);
+	error = xfs_trans_reserve(tp, &M_RES(ip->i_mount)->tr_write, 0, 0);
+	if (error)
+		goto out_error;
+
+	*tpp = tp;
+	return 0;
+
+out_bmap_cancel:
+	xfs_bmap_cancel(&free_list);
+out_error:
+	xfs_trans_cancel(tp, XFS_TRANS_RELEASE_LOG_RES | XFS_TRANS_ABORT);
+	*tpp = NULL;
+	return error;
+}
+
+int
+xfs_commit_clone(
+	struct file		*file,
+	loff_t			start,
+	loff_t			end)
+{
+	struct xfs_inode	*dest = XFS_I(file_inode(file));
+	struct xfs_inode	*clone = XFS_I(file->f_mapping->host);
+	struct xfs_mount	*mp = clone->i_mount;
+	struct xfs_trans	*tp;
+	uint			lock_flags;
+	bool			done = false;
+	int			error = 0;
+
+	error = xfs_qm_dqattach(clone, 0);
+	if (error)
+		return error;
+
+	error = xfs_qm_dqattach(dest, 0);
+	if (error)
+		return error;
+
+	/*
+	 * Lock the inodes against other IO, page faults and truncate to
+	 * begin with.
+	 */
+	lock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
+	xfs_lock_two_inodes(dest, clone, XFS_IOLOCK_EXCL);
+	xfs_lock_two_inodes(dest, clone, XFS_MMAPLOCK_EXCL);
+
+	inode_dio_wait(VFS_I(clone));
+	error = filemap_write_and_wait(VFS_I(clone)->i_mapping);
+	if (error)
+		goto out_unlock;
+
+	inode_dio_wait(VFS_I(dest));
+	error = filemap_write_and_wait(VFS_I(dest)->i_mapping);
+	if (error)
+		goto out_unlock;
+	truncate_pagecache_range(VFS_I(dest), 0, -1);
+	WARN_ON(VFS_I(dest)->i_mapping->nrpages);
+
+	tp = xfs_trans_alloc(mp, XFS_TRANS_DIOSTRAT);
+	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_write, 0, 0);
+	if (error) {
+		xfs_trans_cancel(tp, 0);
+		return error;
+	}
+
+	xfs_lock_two_inodes(dest, clone, XFS_ILOCK_EXCL);
+	lock_flags |= XFS_ILOCK_EXCL;
+
+	for (;;) {
+		struct xfs_bmbt_irec	del;
+
+		error = xfs_remove_extent(&tp, clone, &del, &done);
+		if (error)
+			goto out_unlock;
+		if (done)
+			break;
+
+		error = xfs_free_range(&tp, dest, &del);
+		if (error)
+			goto out_unlock;
+
+		error = xfs_insert_extent(&tp, dest, &del);
+		if (error)
+			goto out_unlock;
+	}
+
+	xfs_trans_ijoin(tp, dest, 0);
+	xfs_trans_log_inode(tp, dest, XFS_ILOG_CORE);
+
+	i_size_write(VFS_I(dest), VFS_I(clone)->i_size);
+	dest->i_d.di_size = VFS_I(clone)->i_size;
+	xfs_trans_ichgtime(tp, dest, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
+
+	error = xfs_trans_commit(tp, XFS_TRANS_RELEASE_LOG_RES);
+
+out_unlock:
+	xfs_iunlock(dest, lock_flags);
+	xfs_iunlock(clone, lock_flags);
+	return error;
+}
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index af97d9a..1f4de38 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -65,6 +65,7 @@ int	xfs_collapse_file_space(struct xfs_inode *, xfs_off_t offset,
 				xfs_off_t len);
 int	xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
 				xfs_off_t len);
+int	xfs_commit_clone(struct file *file, loff_t start, loff_t end);
 
 /* EOF block manipulation functions */
 bool	xfs_can_free_eofblocks(struct xfs_inode *ip, bool force);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 8121e75..11f60ca 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -199,7 +199,7 @@ xfs_file_fsync(
 	loff_t			end,
 	int			datasync)
 {
-	struct inode		*inode = file->f_mapping->host;
+	struct inode		*inode = file_inode(file);
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
 	int			error = 0;
@@ -208,13 +208,20 @@ xfs_file_fsync(
 
 	trace_xfs_file_fsync(ip);
 
-	error = filemap_write_and_wait_range(inode->i_mapping, start, end);
-	if (error)
-		return error;
-
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
+	if (file->f_mapping->host != inode) {
+		error = xfs_commit_clone(file, start, end);
+		if (error)
+			return error;
+	} else {
+		error = filemap_write_and_wait_range(inode->i_mapping,
+				start, end);
+		if (error)
+			return error;
+	}
+
 	xfs_iflags_clear(ip, XFS_ITRUNCATED);
 
 	if (mp->m_flags & XFS_MOUNT_BARRIER) {
@@ -1002,6 +1009,36 @@ xfs_file_open(
 		return -EFBIG;
 	if (XFS_FORCED_SHUTDOWN(XFS_M(inode->i_sb)))
 		return -EIO;
+
+	if (file->f_flags & O_ATOMIC) {
+		struct dentry *parent;
+		struct xfs_inode *clone;
+		int error;
+	
+		if (XFS_IS_REALTIME_INODE(XFS_I(inode)))
+			return -EINVAL;
+
+		// XXX: also need to prevent setting O_DIRECT using fcntl.
+		if (file->f_flags & O_DIRECT)
+			return -EINVAL;
+
+		error = filemap_write_and_wait(inode->i_mapping);
+		if (error)
+			return error;
+
+		parent = dget_parent(file->f_path.dentry);
+		error = xfs_create_tmpfile(XFS_I(parent->d_inode), NULL,
+				file->f_mode, &clone);
+		dput(parent);
+
+		if (error)
+			return error;
+
+		VFS_I(clone)->i_size = inode->i_size;
+		clone->i_cow = XFS_I(inode);
+		file->f_mapping = VFS_I(clone)->i_mapping;
+		xfs_finish_inode_setup(clone);
+	}
 	return 0;
 }
 
@@ -1032,8 +1069,14 @@ xfs_dir_open(
 STATIC int
 xfs_file_release(
 	struct inode	*inode,
-	struct file	*filp)
+	struct file	*file)
 {
+	if (file->f_mapping->host != inode) {
+		XFS_I(file->f_mapping->host)->i_cow = NULL;
+		IRELE(XFS_I(file->f_mapping->host));
+		return 0;
+	}
+	
 	return xfs_release(XFS_I(inode));
 }
 
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 76a9f27..a43e83a 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -80,6 +80,7 @@ xfs_inode_alloc(
 	ip->i_flags = 0;
 	ip->i_delayed_blks = 0;
 	memset(&ip->i_d, 0, sizeof(xfs_icdinode_t));
+	ip->i_cow = NULL;
 
 	return ip;
 }
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 8f22d20..a7c3f78 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -52,6 +52,8 @@ typedef struct xfs_inode {
 	/* operations vectors */
 	const struct xfs_dir_ops *d_ops;		/* directory ops vector */
 
+	struct xfs_inode	*i_cow;
+
 	/* Transaction and locking information. */
 	struct xfs_inode_log_item *i_itemp;	/* logging information */
 	mrlock_t		i_lock;		/* inode lock */
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 38e633b..d9e177c 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -268,6 +268,13 @@ xfs_iomap_eof_want_preallocate(
 		return 0;
 
 	/*
+	 * Don't preallocate if this a clone for an O_ATOMIC open, as we'd
+	 * overwrite space in the original file with garbage on a commit.
+	 */
+	if (ip->i_cow)
+		return 0;
+
+	/*
 	 * If the file is smaller than the minimum prealloc and we are using
 	 * dynamic preallocation, don't do any preallocation at all as it is
 	 * likely this is the only write to the file that is going to be done.
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index e063eff..26ab762 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -92,6 +92,8 @@
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)      
 
+#define O_ATOMIC	040000000
+
 #ifndef O_NDELAY
 #define O_NDELAY	O_NONBLOCK
 #endif

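Driving this from userspace is then roughly: open with O_ATOMIC so writes land in the hidden clone inode, then fsync() to swap the new extents back into the original file (the xfs_commit_clone path above). A rough usage sketch; O_ATOMIC only exists with this patch applied, so its value is repeated here for illustration, and the path is made up:

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

#ifndef O_ATOMIC
#define O_ATOMIC 040000000	/* value added by the patch above */
#endif

int main(void)
{
	const char buf[] = "new object contents";
	int fd = open("/mnt/xfs/object", O_RDWR | O_ATOMIC);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* buffered write into the hidden clone inode (the patch rejects
	 * O_DIRECT together with O_ATOMIC in xfs_file_open) */
	if (pwrite(fd, buf, sizeof(buf) - 1, 0) < 0) {
		perror("pwrite");
		return 1;
	}
	/* fsync() takes the xfs_commit_clone path: the new extents are
	 * swapped into the original file in one transaction */
	if (fsync(fd) < 0) {
		perror("fsync");
		return 1;
	}
	close(fd);
	return 0;
}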
^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-22  2:12               ` Allen Samuels
@ 2015-10-22  8:51                 ` Orit Wasserman
  0 siblings, 0 replies; 71+ messages in thread
From: Orit Wasserman @ 2015-10-22  8:51 UTC (permalink / raw)
  To: Allen Samuels
  Cc: Martin Millnert, Mark Nelson, Ric Wheeler, Sage Weil, ceph-devel

On Thu, 2015-10-22 at 02:12 +0000, Allen Samuels wrote:
> One of the biggest changes that flash is making in the storage world is the way basic trade-offs in storage management software architecture are being affected. In the HDD world CPU time per IOP was relatively inconsequential, i.e., it had little effect on overall performance, which was limited by the physics of the hard drive. Flash is now inverting that situation. When you look at the performance levels being delivered in the latest generation of NVMe SSDs you rapidly see that storage itself is generally no longer the bottleneck (speaking about BW, not latency of course) but rather it's the system sitting in front of the storage that is the bottleneck. Generally it's the CPU cost of an IOP.
> 
> When Sandisk first started working with Ceph (Dumpling), the design of librados and the OSD led to a situation where the CPU cost of an IOP was dominated by context switches and network socket handling. Over time, much of that has been addressed. The socket handling code has been re-written (more than once!) and some of the internal queueing in the OSD (and the associated context switches) has been eliminated. As the CPU costs have dropped, performance on flash has improved accordingly.
> 
> Because we didn't want to completely re-write the OSD (time-to-market and stability drove that decision), we didn't move it from the current "thread per IOP" model into a truly asynchronous "thread per CPU core" model that essentially eliminates context switches in the IO path. But a fully optimized OSD would go down that path (at least part-way). I believe it's been proposed in the past. Perhaps a hybrid "fast-path" style could get most of the benefits while preserving much of the legacy code.
> 

+1
It's not just about reducing context switches but also about removing
contention and data copies and getting better cache utilization.

Scylladb just did this to cassandra (using seastar library):
http://www.zdnet.com/article/kvm-creators-open-source-fast-cassandra-drop-in-replacement-scylla/

Orit

> I believe this trend toward thread-per-core software development will also tend to support the "do it in user-space" trend. That's because most of the kernel and file-system interface is architected around the blocking "thread-per-IOP" model and is unlikely to change in the future.
> 
> 
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
> 
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
> 
> -----Original Message-----
> From: Martin Millnert [mailto:martin@millnert.se]
> Sent: Thursday, October 22, 2015 6:20 AM
> To: Mark Nelson <mnelson@redhat.com>
> Cc: Ric Wheeler <rwheeler@redhat.com>; Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
> 
> Adding 2c
> 
> On Wed, 2015-10-21 at 14:37 -0500, Mark Nelson wrote:
> > My thought is that there is some inflection point where the userland
> > kvstore/block approach is going to be less work, for everyone I think,
> > than trying to quickly discover, understand, fix, and push upstream
> > patches that sometimes only really benefit us.  I don't know if we've
> > truly hit that point, but it's tough for me to find flaws with
> > Sage's argument.
> 
> Regarding the userland / kernel land aspect of the topic, there are further aspects AFAIK not yet addressed in the thread:
> In the networking world, there's been development on memory mapped (multiple approaches exist) userland networking, which for packet management has the benefit of - for very, very specific applications of networking code - avoiding e.g. per-packet context switches etc, and streamlining processor cache management performance. People have gone as far as removing CPU cores from CPU scheduler to completely dedicate them to the networking task at hand (cache optimizations). There are various latency/throughput (bulking) optimizations applicable, but at the end of the day, it's about keeping the CPU bus busy with "revenue" bus traffic.
> 
> Granted, storage IO operations may be much heavier in cycle counts for context switches to ever appear as a problem in themselves, certainly for slower SSDs and HDDs. However, when going for truly high performance IO, *every* hurdle in the data path counts toward the total latency.
> (And really, high performance random IO characteristics approaches the networking, per-packet handling characteristics).  Now, I'm not really suggesting memory-mapping a storage device to user space, not at all, but having better control over the data path for a very specific use case, reduces dependency on the code that works as best as possible for the general case, and allows for very purpose-built code, to address a narrow set of requirements. ("Ceph storage cluster backend" isn't a typical FS use case.) It also decouples dependencies on users i.e.
> waiting for the next distro release before being able to take up the benefits of improvements to the storage code.
> 
> A random google came up with related data on where "doing something way different" /can/ have significant benefits:
> http://phunq.net/pipermail/tux3/2015-April/002147.html
> 
> I (FWIW) certainly agree there is merit to the idea.
> The scientific approach here could perhaps be to simply enumerate all corner cases of "generic FS" that actually are cause for the experienced issues, and assess probability of them being solved (and if so when).
> That *could* improve chances of approaching consensus which wouldn't hurt I suppose?
> 
> BR,
> Martin
> 
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-20 20:00   ` Sage Weil
  2015-10-20 20:36     ` Gregory Farnum
  2015-10-20 20:42     ` Matt Benjamin
@ 2015-10-22 12:32     ` Milosz Tanski
  2015-10-23  3:16       ` Howard Chu
  2 siblings, 1 reply; 71+ messages in thread
From: Milosz Tanski @ 2015-10-22 12:32 UTC (permalink / raw)
  To: Sage Weil; +Cc: John Spray, Ceph Development

On Tue, Oct 20, 2015 at 4:00 PM, Sage Weil <sweil@redhat.com> wrote:
> On Tue, 20 Oct 2015, John Spray wrote:
>> On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil <sweil@redhat.com> wrote:
>> >  - We have to size the kv backend storage (probably still an XFS
>> > partition) vs the block storage.  Maybe we do this anyway (put metadata on
>> > SSD!) so it won't matter.  But what happens when we are storing gobs of
>> > rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
>> > a different pool and those aren't currently fungible.
>>
>> This is the concerning bit for me -- the other parts one "just" has to
>> get the code right, but this problem could linger and be something we
>> have to keep explaining to users indefinitely.  It reminds me of cases
>> in other systems where users had to make an educated guess about inode
>> size up front, depending on whether you're expecting to efficiently
>> store a lot of xattrs.
>>
>> In practice it's rare for users to make these kinds of decisions well
>> up-front: it really needs to be adjustable later, ideally
>> automatically.  That could be pretty straightforward if the KV part
>> was stored directly on block storage, instead of having XFS in the
>> mix.  I'm not quite up with the state of the art in this area: are
>> there any reasonable alternatives for the KV part that would consume
>> some defined range of a block device from userspace, instead of
>> sitting on top of a filesystem?
>
> I agree: this is my primary concern with the raw block approach.
>
> There are some KV alternatives that could consume block, but the problem
> would be similar: we need to dynamically size up or down the kv portion of
> the device.
>
> I see two basic options:
>
> 1) Wire into the Env abstraction in rocksdb to provide something just
> smart enough to let rocksdb work.  It isn't much: named files (not that
> many--we could easily keep the file table in ram), always written
> sequentially, to be read later with random access. All of the code is
> written around abstractions of SequentialFileWriter so that everything
> posix is neatly hidden in env_posix (and there are various other env
> implementations for in-memory mock tests etc.).
>
> 2) Use something like dm-thin to sit between the raw block device and XFS
> (for rocksdb) and the block device consumed by newstore.  As long as XFS
> doesn't fragment horrifically (it shouldn't, given we *always* write ~4mb
> files in their entirety) we can fstrim and size down the fs portion.  If
> we similarly make newstores allocator stick to large blocks only we would
> be able to size down the block portion as well.  Typical dm-thin block
> sizes seem to range from 64KB to 512KB, which seems reasonable enough to
> me.  In fact, we could likely just size the fs volume at something
> conservatively large (like 90%) and rely on -o discard or periodic fstrim
> to keep its actual utilization in check.
>

I think you could prototype a raw block device OSD store using LMDB as
a starting point. I know there have been some experiments using LMDB as
a KV store before, with positive read numbers and not-so-great write
numbers.

1. It mmaps; just mmap the raw disk device / partition. I've done this
as an experiment before and can dig up a patch for LMDB.
2. It already has a free space management strategy. It's probably not
right for the OSDs in the long term, but there's something to start
with there.
3. It already supports transactions / COW.
4. LMDB isn't a huge code base so it might be a good place to start /
evolve code from.
5. You're not starting a multi-year effort at the 0 point.

As to the not-so-great write performance, that could be addressed by
write transaction merging (what MySQL implemented a few years ago).
Here you have an opportunity to do it two ways. One, you can do it in
the application layer while waiting for the fsync of a transaction to
complete. This is probably the easier route. Two, you can do it in the
DB layer (the LMDB transaction handling / locking), where you start
processing the following transactions using the currently committing
transaction (COW) as a starting point. This is harder, mostly because
of the synchronization involved.
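
To make the application-layer route concrete, here is a rough sketch
(assuming stock LMDB over a regular file opened with MDB_NOSUBDIR;
pointing it at a raw partition still needs the mmap patch mentioned
above, and the path and key/value shapes are made up): several pending
object updates get folded into one LMDB write transaction so they share
a single commit and fsync.

/* Sketch only: fold a batch of pending object updates into one LMDB write
 * transaction so they all share a single durable commit. */
#include <lmdb.h>
#include <stdio.h>
#include <string.h>

struct update { const char *key; const char *val; };

int main(void)
{
	struct update batch[] = {
		{ "obj.0001", "data for object 1" },
		{ "obj.0002", "data for object 2" },
		{ "obj.0003", "data for object 3" },
	};
	MDB_env *env;
	MDB_txn *txn;
	MDB_dbi dbi;
	int rc, i;

	mdb_env_create(&env);
	mdb_env_set_mapsize(env, 1UL << 30);	/* 1 GB map, made up */
	rc = mdb_env_open(env, "/var/lib/osd0/store.mdb", MDB_NOSUBDIR, 0600);
	if (rc) { fprintf(stderr, "mdb_env_open: %s\n", mdb_strerror(rc)); return 1; }

	/* one write txn for the whole batch == one commit, one fsync */
	mdb_txn_begin(env, NULL, 0, &txn);
	mdb_dbi_open(txn, NULL, 0, &dbi);
	for (i = 0; i < 3; i++) {
		MDB_val k = { strlen(batch[i].key), (void *)batch[i].key };
		MDB_val v = { strlen(batch[i].val), (void *)batch[i].val };
		mdb_put(txn, dbi, &k, &v, 0);
	}
	rc = mdb_txn_commit(txn);
	if (rc)
		fprintf(stderr, "commit: %s\n", mdb_strerror(rc));

	mdb_env_close(env);
	return 0;
}

The DB-layer variant would do the same grouping inside LMDB's writer
lock instead, which is where the synchronization gets harder.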

I've actually spent some time thinking about doing LMDB write
transaction merging outside the OSD context. This was for another
project.

My 2 cents.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-21 11:18     ` Ric Wheeler
  2015-10-21 17:30       ` Sage Weil
@ 2015-10-22 12:50       ` Sage Weil
  2015-10-22 17:42         ` James (Fei) Liu-SSI
  2015-10-23  2:06         ` Ric Wheeler
  1 sibling, 2 replies; 71+ messages in thread
From: Sage Weil @ 2015-10-22 12:50 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Orit Wasserman, ceph-devel

On Wed, 21 Oct 2015, Ric Wheeler wrote:
> You will have to trust me on this as the Red Hat person who spoke to pretty
> much all of our key customers about local file systems and storage - customers
> all have migrated over to using normal file systems under Oracle/DB2.
> Typically, they use XFS or ext4.  I don't know of any non-standard file
> systems and only have seen one account running on a raw block store in 8 years
> :)
> 
> If you have a pre-allocated file and write using O_DIRECT, your IO path is
> identical in terms of IO's sent to the device.
> 
> If we are causing additional IO's, then we really need to spend some time
> talking to the local file system gurus about this in detail.  I can help with
> that conversation.

If the file is truly preallocated (that is, prewritten with zeros... 
fallocate doesn't help here because the extents are marked unwritten), then 
sure: there is very little change in the data path.

But at that point, what is the point?  This only works if you have one (or 
a few) huge files and the user space app already has all the complexity of 
a filesystem-like thing (with its own internal journal, allocators, 
garbage collection, etc.).  Do they just do this to ease administrative 
tasks like backup?


This is the fundamental tradeoff:

1) We have a file per object.  We fsync like crazy and the fact that 
there are two independent layers journaling and managing different types 
of consistency penalizes us.

1b) We get clever and start using obscure and/or custom ioctls in the file 
system to work around its usual behavior: we swap extents to avoid 
write-ahead (see Christoph's patch), O_NOMTIME, unprivileged 
open-by-handle, batch fsync, O_ATOMIC, setext ioctl, etc.

2) We preallocate huge files and write a user-space object system that 
lives within it (pretending the file is a block device).  The file system 
rarely gets in the way (assuming the file is prewritten and we don't do 
anything stupid).  But it doesn't give us anything a block device 
wouldn't, and it doesn't save us any complexity in our code.

At the end of the day, 1 and 1b are always going to be slower than 2.  
And although 1b performs a bit better than 1, it has similar (user-space) 
complexity to 2.  On the other hand, if you step back and view the 
entire stack (ceph-osd + XFS), 1 and 1b are *significantly* more complex 
than 2... and yet still slower.  Given we ultimately have to support both 
(both as an upstream and as a distro), that's not very attractive.

Also note that every time we have strayed off the reservation from the 
beaten path (1) to anything mildly exotic (1b) we have been bitten by 
obscure file system bugs.  And that's assuming we get everything we need 
upstream... which is probably a year's endeavour.

Don't get me wrong: I'm all for making changes to file systems to better 
support systems like Ceph.  Things like O_NOCMTIME and O_ATOMIC make a 
huge amount of sense for a ton of different systems.  But our situation is 
a bit different: we always own the entire device (and often the server), 
so there is no need to share with other users or apps (and when you do, 
you just use the existing FileStore backend).  And as you know performance 
is a huge pain point.  We are already handicapped by virtue of being 
distributed and strongly consistent; we can't afford to give away more to 
a storage layer that isn't providing us much (or the right) value.

And I'm tired of half measures.  I want the OSD to be as fast as we can 
make it given the architectural constraints (RADOS consistency and 
ordering semantics).  This is truly low-hanging fruit: it's modular, 
self-contained, pluggable, and this will be my third time around this 
particular block.

sage

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: newstore direction
  2015-10-22 12:50       ` Sage Weil
@ 2015-10-22 17:42         ` James (Fei) Liu-SSI
  2015-10-22 23:42           ` Samuel Just
  2015-10-23  2:06         ` Ric Wheeler
  1 sibling, 1 reply; 71+ messages in thread
From: James (Fei) Liu-SSI @ 2015-10-22 17:42 UTC (permalink / raw)
  To: Sage Weil, Ric Wheeler; +Cc: Orit Wasserman, ceph-devel

Hi Sage and other fellow cephers,
  I truly share the pain with you all about filesystems while I am working on the objectstore to improve performance. As mentioned, there is nothing wrong with the filesystem itself; it's just that Ceph, as one use case, needs more support than filesystems will provide in the near future, for whatever reasons.

   There are so many techniques popping up which can help improve the performance of the OSD.  A user space driver (DPDK from Intel) is one of them. It not only gives you the storage allocator, it also gives you thread scheduling support, CPU affinity, NUMA friendliness, and polling, which might fundamentally change the performance of the objectstore.  It should not be hard to improve CPU utilization 3x~5x, get higher IOPS, etc.
    I totally agree that the goal of filestore is to give enough support for the filesystem with either the 1, 1b, or 2 solutions. In my humble opinion, the design goal of the new objectstore should focus on giving the best performance for the OSD with new techniques. These two goals are not going to conflict with each other.  They just serve different purposes, to make Ceph not only more stable but also better.

  Scylla, mentioned by Orit, is a good example.

  Thanks all.

  Regards,
  James   

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
Sent: Thursday, October 22, 2015 5:50 AM
To: Ric Wheeler
Cc: Orit Wasserman; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On Wed, 21 Oct 2015, Ric Wheeler wrote:
> You will have to trust me on this as the Red Hat person who spoke to 
> pretty much all of our key customers about local file systems and 
> storage - customers all have migrated over to using normal file systems under Oracle/DB2.
> Typically, they use XFS or ext4.  I don't know of any non-standard 
> file systems and only have seen one account running on a raw block 
> store in 8 years
> :)
> 
> If you have a pre-allocated file and write using O_DIRECT, your IO 
> path is identical in terms of IO's sent to the device.
> 
> If we are causing additional IO's, then we really need to spend some 
> time talking to the local file system gurus about this in detail.  I 
> can help with that conversation.

If the file is truly preallocated (that is, prewritten with zeros... 
fallocate doesn't help here because the extents is marked unwritten), then
sure: there is very little change in the data path.

But at that point, what is the point?  This only works if you have one (or a few) huge files and the user space app already has all the complexity of a filesystem-like thing (with its own internal journal, allocators, garbage collection, etc.).  Do they just do this to ease administrative tasks like backup?


This is the fundamental tradeoff:

1) We have a file per object.  We fsync like crazy and the fact that there are two independent layers journaling and managing different types of consistency penalizes us.

1b) We get clever and start using obscure and/or custom ioctls in the file system to work around what it is used to: we swap extents to avoid write-ahead (see Christoph's patch), O_NOMTIME, unprivileged open-by-handle, batch fsync, O_ATOMIC, setext ioctl, etc.

2) We preallocate huge files and write a user-space object system that lives within it (pretending the file is a block device).  The file system rarely gets in the way (assuming the file is prewritten and we don't do anything stupid).  But it doesn't give us anything a block device wouldn't, and it doesn't save us any complexity in our code.

At the end of the day, 1 and 1b are always going to be slower than 2.  
And although 1b performs a bit better than 1, it has similar (user-space) complexity to 2.  On the other hand, if you step back and view teh entire stack (ceph-osd + XFS), 1 and 1b are *significantly* more complex than 2... and yet still slower.  Given we ultimately have to support both (both as an upstream and as a distro), that's not very attractive.

Also note that every time we have strayed off the reservation from the beaten path (1) to anything mildly exotic (1b) we have been bitten by obscure file systems bugs.  And that's assume we get everything we need upstream... which is probably a year's endeavour.

Don't get me wrong: I'm all for making changes to file systems to better support systems like Ceph.  Things like O_NOCMTIME and O_ATOMIC make a huge amount of sense of a ton of different systems.  But our situations is a bit different: we always own the entire device (and often the server), so there is no need to share with other users or apps (and when you do, you just use the existing FileStore backend).  And as you know performance is a huge pain point.  We are already handicapped by virtue of being distributed and strongly consistent; we can't afford to give away more to a storage layer that isn't providing us much (or the right) value.

And I'm tired of half measures.  I want the OSD to be as fast as we can make it given the architectural constraints (RADOS consistency and ordering semantics).  This is truly low-hanging fruit: it's modular, self-contained, pluggable, and this will be my third time around this particular block.

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-22 17:42         ` James (Fei) Liu-SSI
@ 2015-10-22 23:42           ` Samuel Just
  2015-10-23  0:10             ` Samuel Just
  2015-10-23  1:26             ` Allen Samuels
  0 siblings, 2 replies; 71+ messages in thread
From: Samuel Just @ 2015-10-22 23:42 UTC (permalink / raw)
  To: James (Fei) Liu-SSI; +Cc: Sage Weil, Ric Wheeler, Orit Wasserman, ceph-devel

Since the changes which moved the pg log and the pg info into the pg
object space, I think it's now the case that any transaction submitted
to the objectstore updates a disjoint range of objects determined by
the sequencer.  It might be easier to exploit that parallelism if we
control allocation and allocation related metadata.  We could split
the store into N pieces which partition the pg space (one additional
one for the meta sequencer?) with one rocksdb instance for each.
Space could then be parcelled out in large pieces (small frequency of
global allocation decisions) and managed more finely within each
partition.  The main challenge would be avoiding internal
fragmentation of those, but at least defragmentation can be managed on
a per-partition basis.  Such parallelism is probably necessary to
exploit the full throughput of some ssds.
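
A minimal sketch of the routing side of that idea, with invented names (not
actual Ceph classes): hash the pg/sequencer id to pick one of N self-contained
shards, each of which would own its own rocksdb instance and allocator.

#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

struct ShardStore {
    // In a real store these would be a rocksdb::DB handle and an allocator
    // covering this shard's slice of the device.
    std::string kv_path;
    uint64_t    device_offset;
    uint64_t    device_length;
};

class PartitionedStore {
public:
    explicit PartitionedStore(std::vector<ShardStore> shards)
        : shards_(std::move(shards)) {}

    // Route by pg id so each sequencer's transactions always land on the
    // same shard; disjoint sequencers can then commit fully in parallel.
    ShardStore& shard_for_pg(const std::string& pgid) {
        size_t idx = std::hash<std::string>{}(pgid) % shards_.size();
        return shards_[idx];
    }

private:
    std::vector<ShardStore> shards_;
};
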
-Sam

On Thu, Oct 22, 2015 at 10:42 AM, James (Fei) Liu-SSI
<james.liu@ssi.samsung.com> wrote:
> Hi Sage and other fellow cephers,
>   I truly share the pains with you  all about filesystem while I am working on  objectstore to improve the performance. As mentioned , there is nothing wrong with filesystem. Just the Ceph as one of  use case need more supports but not provided in near future by filesystem no matter what reasons.
>
>    There are so many techniques  pop out which can help to improve performance of OSD.  User space driver(DPDK from Intel) is one of them. It not only gives you the storage allocator,  also gives you the thread scheduling support,  CPU affinity , NUMA friendly, polling  which  might fundamentally change the performance of objectstore.  It should not be hard to improve CPU utilization 3x~5x times, higher IOPS etc.
>     I totally agreed that goal of filestore is to gives enough support for filesystem with either 1 ,1b, or 2 solutions. In my humble opinion , The new design goal of objectstore should focus on giving the best  performance for OSD with new techniques. These two goals are not going to conflict with each other.  They are just for different purposes to make Ceph not only more stable but also better.
>
>   Scylla mentioned by Orit is a good example .
>
>   Thanks all.
>
>   Regards,
>   James
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Thursday, October 22, 2015 5:50 AM
> To: Ric Wheeler
> Cc: Orit Wasserman; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> On Wed, 21 Oct 2015, Ric Wheeler wrote:
>> You will have to trust me on this as the Red Hat person who spoke to
>> pretty much all of our key customers about local file systems and
>> storage - customers all have migrated over to using normal file systems under Oracle/DB2.
>> Typically, they use XFS or ext4.  I don't know of any non-standard
>> file systems and only have seen one account running on a raw block
>> store in 8 years
>> :)
>>
>> If you have a pre-allocated file and write using O_DIRECT, your IO
>> path is identical in terms of IO's sent to the device.
>>
>> If we are causing additional IO's, then we really need to spend some
>> time talking to the local file system gurus about this in detail.  I
>> can help with that conversation.
>
> If the file is truly preallocated (that is, prewritten with zeros...
> fallocate doesn't help here because the extents is marked unwritten), then
> sure: there is very little change in the data path.
>
> But at that point, what is the point?  This only works if you have one (or a few) huge files and the user space app already has all the complexity of a filesystem-like thing (with its own internal journal, allocators, garbage collection, etc.).  Do they just do this to ease administrative tasks like backup?
>
>
> This is the fundamental tradeoff:
>
> 1) We have a file per object.  We fsync like crazy and the fact that there are two independent layers journaling and managing different types of consistency penalizes us.
>
> 1b) We get clever and start using obscure and/or custom ioctls in the file system to work around what it is used to: we swap extents to avoid write-ahead (see Christoph's patch), O_NOMTIME, unprivileged open-by-handle, batch fsync, O_ATOMIC, setext ioctl, etc.
>
> 2) We preallocate huge files and write a user-space object system that lives within it (pretending the file is a block device).  The file system rarely gets in the way (assuming the file is prewritten and we don't do anything stupid).  But it doesn't give us anything a block device wouldn't, and it doesn't save us any complexity in our code.
>
> At the end of the day, 1 and 1b are always going to be slower than 2.
> And although 1b performs a bit better than 1, it has similar (user-space) complexity to 2.  On the other hand, if you step back and view teh entire stack (ceph-osd + XFS), 1 and 1b are *significantly* more complex than 2... and yet still slower.  Given we ultimately have to support both (both as an upstream and as a distro), that's not very attractive.
>
> Also note that every time we have strayed off the reservation from the beaten path (1) to anything mildly exotic (1b) we have been bitten by obscure file systems bugs.  And that's assume we get everything we need upstream... which is probably a year's endeavour.
>
> Don't get me wrong: I'm all for making changes to file systems to better support systems like Ceph.  Things like O_NOCMTIME and O_ATOMIC make a huge amount of sense of a ton of different systems.  But our situations is a bit different: we always own the entire device (and often the server), so there is no need to share with other users or apps (and when you do, you just use the existing FileStore backend).  And as you know performance is a huge pain point.  We are already handicapped by virtue of being distributed and strongly consistent; we can't afford to give away more to a storage layer that isn't providing us much (or the right) value.
>
> And I'm tired of half measures.  I want the OSD to be as fast as we can make it given the architectural constraints (RADOS consistency and ordering semantics).  This is truly low-hanging fruit: it's modular, self-contained, pluggable, and this will be my third time around this particular block.
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-22 23:42           ` Samuel Just
@ 2015-10-23  0:10             ` Samuel Just
  2015-10-23  1:26             ` Allen Samuels
  1 sibling, 0 replies; 71+ messages in thread
From: Samuel Just @ 2015-10-23  0:10 UTC (permalink / raw)
  To: James (Fei) Liu-SSI; +Cc: Sage Weil, Ric Wheeler, Orit Wasserman, ceph-devel

Ah, except for the snapmapper.  We can split the snapmapper in the
same way, though, as long as we are careful with the name.
-Sam

On Thu, Oct 22, 2015 at 4:42 PM, Samuel Just <sjust@redhat.com> wrote:
> Since the changes which moved the pg log and the pg info into the pg
> object space, I think it's now the case that any transaction submitted
> to the objectstore updates a disjoint range of objects determined by
> the sequencer.  It might be easier to exploit that parallelism if we
> control allocation and allocation related metadata.  We could split
> the store into N pieces which partition the pg space (one additional
> one for the meta sequencer?) with one rocksdb instance for each.
> Space could then be parcelled out in large pieces (small frequency of
> global allocation decisions) and managed more finely within each
> partition.  The main challenge would be avoiding internal
> fragmentation of those, but at least defragmentation can be managed on
> a per-partition basis.  Such parallelism is probably necessary to
> exploit the full throughput of some ssds.
> -Sam
>
> On Thu, Oct 22, 2015 at 10:42 AM, James (Fei) Liu-SSI
> <james.liu@ssi.samsung.com> wrote:
>> Hi Sage and other fellow cephers,
>>   I truly share the pains with you  all about filesystem while I am working on  objectstore to improve the performance. As mentioned , there is nothing wrong with filesystem. Just the Ceph as one of  use case need more supports but not provided in near future by filesystem no matter what reasons.
>>
>>    There are so many techniques  pop out which can help to improve performance of OSD.  User space driver(DPDK from Intel) is one of them. It not only gives you the storage allocator,  also gives you the thread scheduling support,  CPU affinity , NUMA friendly, polling  which  might fundamentally change the performance of objectstore.  It should not be hard to improve CPU utilization 3x~5x times, higher IOPS etc.
>>     I totally agreed that goal of filestore is to gives enough support for filesystem with either 1 ,1b, or 2 solutions. In my humble opinion , The new design goal of objectstore should focus on giving the best  performance for OSD with new techniques. These two goals are not going to conflict with each other.  They are just for different purposes to make Ceph not only more stable but also better.
>>
>>   Scylla mentioned by Orit is a good example .
>>
>>   Thanks all.
>>
>>   Regards,
>>   James
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
>> Sent: Thursday, October 22, 2015 5:50 AM
>> To: Ric Wheeler
>> Cc: Orit Wasserman; ceph-devel@vger.kernel.org
>> Subject: Re: newstore direction
>>
>> On Wed, 21 Oct 2015, Ric Wheeler wrote:
>>> You will have to trust me on this as the Red Hat person who spoke to
>>> pretty much all of our key customers about local file systems and
>>> storage - customers all have migrated over to using normal file systems under Oracle/DB2.
>>> Typically, they use XFS or ext4.  I don't know of any non-standard
>>> file systems and only have seen one account running on a raw block
>>> store in 8 years
>>> :)
>>>
>>> If you have a pre-allocated file and write using O_DIRECT, your IO
>>> path is identical in terms of IO's sent to the device.
>>>
>>> If we are causing additional IO's, then we really need to spend some
>>> time talking to the local file system gurus about this in detail.  I
>>> can help with that conversation.
>>
>> If the file is truly preallocated (that is, prewritten with zeros...
>> fallocate doesn't help here because the extents is marked unwritten), then
>> sure: there is very little change in the data path.
>>
>> But at that point, what is the point?  This only works if you have one (or a few) huge files and the user space app already has all the complexity of a filesystem-like thing (with its own internal journal, allocators, garbage collection, etc.).  Do they just do this to ease administrative tasks like backup?
>>
>>
>> This is the fundamental tradeoff:
>>
>> 1) We have a file per object.  We fsync like crazy and the fact that there are two independent layers journaling and managing different types of consistency penalizes us.
>>
>> 1b) We get clever and start using obscure and/or custom ioctls in the file system to work around what it is used to: we swap extents to avoid write-ahead (see Christoph's patch), O_NOMTIME, unprivileged open-by-handle, batch fsync, O_ATOMIC, setext ioctl, etc.
>>
>> 2) We preallocate huge files and write a user-space object system that lives within it (pretending the file is a block device).  The file system rarely gets in the way (assuming the file is prewritten and we don't do anything stupid).  But it doesn't give us anything a block device wouldn't, and it doesn't save us any complexity in our code.
>>
>> At the end of the day, 1 and 1b are always going to be slower than 2.
>> And although 1b performs a bit better than 1, it has similar (user-space) complexity to 2.  On the other hand, if you step back and view teh entire stack (ceph-osd + XFS), 1 and 1b are *significantly* more complex than 2... and yet still slower.  Given we ultimately have to support both (both as an upstream and as a distro), that's not very attractive.
>>
>> Also note that every time we have strayed off the reservation from the beaten path (1) to anything mildly exotic (1b) we have been bitten by obscure file systems bugs.  And that's assume we get everything we need upstream... which is probably a year's endeavour.
>>
>> Don't get me wrong: I'm all for making changes to file systems to better support systems like Ceph.  Things like O_NOCMTIME and O_ATOMIC make a huge amount of sense of a ton of different systems.  But our situations is a bit different: we always own the entire device (and often the server), so there is no need to share with other users or apps (and when you do, you just use the existing FileStore backend).  And as you know performance is a huge pain point.  We are already handicapped by virtue of being distributed and strongly consistent; we can't afford to give away more to a storage layer that isn't providing us much (or the right) value.
>>
>> And I'm tired of half measures.  I want the OSD to be as fast as we can make it given the architectural constraints (RADOS consistency and ordering semantics).  This is truly low-hanging fruit: it's modular, self-contained, pluggable, and this will be my third time around this particular block.
>>
>> sage
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: newstore direction
  2015-10-22 23:42           ` Samuel Just
  2015-10-23  0:10             ` Samuel Just
@ 2015-10-23  1:26             ` Allen Samuels
  1 sibling, 0 replies; 71+ messages in thread
From: Allen Samuels @ 2015-10-23  1:26 UTC (permalink / raw)
  To: Samuel Just, James (Fei) Liu-SSI
  Cc: Sage Weil, Ric Wheeler, Orit Wasserman, ceph-devel

How would this kind of split affect small transactions? Will each split be separately transactionally consistent or is there some kind of meta-transaction that synchronizes each of the splits?


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Samuel Just
Sent: Friday, October 23, 2015 8:42 AM
To: James (Fei) Liu-SSI <james.liu@ssi.samsung.com>
Cc: Sage Weil <sweil@redhat.com>; Ric Wheeler <rwheeler@redhat.com>; Orit Wasserman <owasserm@redhat.com>; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

Since the changes which moved the pg log and the pg info into the pg object space, I think it's now the case that any transaction submitted to the objectstore updates a disjoint range of objects determined by the sequencer.  It might be easier to exploit that parallelism if we control allocation and allocation related metadata.  We could split the store into N pieces which partition the pg space (one additional one for the meta sequencer?) with one rocksdb instance for each.
Space could then be parcelled out in large pieces (small frequency of global allocation decisions) and managed more finely within each partition.  The main challenge would be avoiding internal fragmentation of those, but at least defragmentation can be managed on a per-partition basis.  Such parallelism is probably necessary to exploit the full throughput of some ssds.
-Sam

On Thu, Oct 22, 2015 at 10:42 AM, James (Fei) Liu-SSI <james.liu@ssi.samsung.com> wrote:
> Hi Sage and other fellow cephers,
>   I truly share the pains with you  all about filesystem while I am working on  objectstore to improve the performance. As mentioned , there is nothing wrong with filesystem. Just the Ceph as one of  use case need more supports but not provided in near future by filesystem no matter what reasons.
>
>    There are so many techniques  pop out which can help to improve performance of OSD.  User space driver(DPDK from Intel) is one of them. It not only gives you the storage allocator,  also gives you the thread scheduling support,  CPU affinity , NUMA friendly, polling  which  might fundamentally change the performance of objectstore.  It should not be hard to improve CPU utilization 3x~5x times, higher IOPS etc.
>     I totally agreed that goal of filestore is to gives enough support for filesystem with either 1 ,1b, or 2 solutions. In my humble opinion , The new design goal of objectstore should focus on giving the best  performance for OSD with new techniques. These two goals are not going to conflict with each other.  They are just for different purposes to make Ceph not only more stable but also better.
>
>   Scylla mentioned by Orit is a good example .
>
>   Thanks all.
>
>   Regards,
>   James
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Thursday, October 22, 2015 5:50 AM
> To: Ric Wheeler
> Cc: Orit Wasserman; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> On Wed, 21 Oct 2015, Ric Wheeler wrote:
>> You will have to trust me on this as the Red Hat person who spoke to
>> pretty much all of our key customers about local file systems and
>> storage - customers all have migrated over to using normal file systems under Oracle/DB2.
>> Typically, they use XFS or ext4.  I don't know of any non-standard
>> file systems and only have seen one account running on a raw block
>> store in 8 years
>> :)
>>
>> If you have a pre-allocated file and write using O_DIRECT, your IO
>> path is identical in terms of IO's sent to the device.
>>
>> If we are causing additional IO's, then we really need to spend some
>> time talking to the local file system gurus about this in detail.  I
>> can help with that conversation.
>
> If the file is truly preallocated (that is, prewritten with zeros...
> fallocate doesn't help here because the extents is marked unwritten),
> then
> sure: there is very little change in the data path.
>
> But at that point, what is the point?  This only works if you have one (or a few) huge files and the user space app already has all the complexity of a filesystem-like thing (with its own internal journal, allocators, garbage collection, etc.).  Do they just do this to ease administrative tasks like backup?
>
>
> This is the fundamental tradeoff:
>
> 1) We have a file per object.  We fsync like crazy and the fact that there are two independent layers journaling and managing different types of consistency penalizes us.
>
> 1b) We get clever and start using obscure and/or custom ioctls in the file system to work around what it is used to: we swap extents to avoid write-ahead (see Christoph's patch), O_NOMTIME, unprivileged open-by-handle, batch fsync, O_ATOMIC, setext ioctl, etc.
>
> 2) We preallocate huge files and write a user-space object system that lives within it (pretending the file is a block device).  The file system rarely gets in the way (assuming the file is prewritten and we don't do anything stupid).  But it doesn't give us anything a block device wouldn't, and it doesn't save us any complexity in our code.
>
> At the end of the day, 1 and 1b are always going to be slower than 2.
> And although 1b performs a bit better than 1, it has similar (user-space) complexity to 2.  On the other hand, if you step back and view teh entire stack (ceph-osd + XFS), 1 and 1b are *significantly* more complex than 2... and yet still slower.  Given we ultimately have to support both (both as an upstream and as a distro), that's not very attractive.
>
> Also note that every time we have strayed off the reservation from the beaten path (1) to anything mildly exotic (1b) we have been bitten by obscure file systems bugs.  And that's assume we get everything we need upstream... which is probably a year's endeavour.
>
> Don't get me wrong: I'm all for making changes to file systems to better support systems like Ceph.  Things like O_NOCMTIME and O_ATOMIC make a huge amount of sense of a ton of different systems.  But our situations is a bit different: we always own the entire device (and often the server), so there is no need to share with other users or apps (and when you do, you just use the existing FileStore backend).  And as you know performance is a huge pain point.  We are already handicapped by virtue of being distributed and strongly consistent; we can't afford to give away more to a storage layer that isn't providing us much (or the right) value.
>
> And I'm tired of half measures.  I want the OSD to be as fast as we can make it given the architectural constraints (RADOS consistency and ordering semantics).  This is truly low-hanging fruit: it's modular, self-contained, pluggable, and this will be my third time around this particular block.
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> in the body of a message to majordomo@vger.kernel.org More majordomo
> info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> in the body of a message to majordomo@vger.kernel.org More majordomo
> info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-22 12:50       ` Sage Weil
  2015-10-22 17:42         ` James (Fei) Liu-SSI
@ 2015-10-23  2:06         ` Ric Wheeler
  1 sibling, 0 replies; 71+ messages in thread
From: Ric Wheeler @ 2015-10-23  2:06 UTC (permalink / raw)
  To: Sage Weil; +Cc: Orit Wasserman, ceph-devel

On 10/22/2015 08:50 AM, Sage Weil wrote:
> On Wed, 21 Oct 2015, Ric Wheeler wrote:
>> You will have to trust me on this as the Red Hat person who spoke to pretty
>> much all of our key customers about local file systems and storage - customers
>> all have migrated over to using normal file systems under Oracle/DB2.
>> Typically, they use XFS or ext4.  I don't know of any non-standard file
>> systems and only have seen one account running on a raw block store in 8 years
>> :)
>>
>> If you have a pre-allocated file and write using O_DIRECT, your IO path is
>> identical in terms of IO's sent to the device.
>>
>> If we are causing additional IO's, then we really need to spend some time
>> talking to the local file system gurus about this in detail.  I can help with
>> that conversation.
> If the file is truly preallocated (that is, prewritten with zeros...
> fallocate doesn't help here because the extents is marked unwritten), then
> sure: there is very little change in the data path.
>
> But at that point, what is the point?  This only works if you have one (or
> a few) huge files and the user space app already has all the complexity of
> a filesystem-like thing (with its own internal journal, allocators,
> garbage collection, etc.).  Do they just do this to ease administrative
> tasks like backup?

I think that the key here is that if we fsync() like crazy - regardless of 
writing to a file system or to some new, yet to be defined block device primitive 
store - we are limited to the IOP's of that particular block device.

Ignoring exotic hardware configs, for anyone not going all-SSD we will have 
rotating, high capacity, slow spinning drives for *a long time* as the 
eventual tier.  Given that assumption, we need to do better than to be limited 
to synchronous IOP's for a slow drive.  When we have commodity pricing for 
things like persistent DRAM, then I agree that writing directly to that medium 
makes sense (but you can do that with DAX by effectively mapping that into the 
process address space).

Specifically, moving from a file system with some inefficiencies will only boost 
performance from say 20-30 IOP's to roughly 40-50 IOP's.

The way this has been handled traditionally for things like databases, etc is:

* batch up the transactions that need to be destaged
* issue an O_DIRECT async IO for all of the elements that need to be written 
(bypassing the page cache, direct to the backing store)
* wait for completion

We should probably add to that sequence an fsync() of the directory (or a file 
in the file system) to ensure that any volatile write cache is flushed, but 
there is *no* reason to fsync() each file.
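
Roughly, the batching pattern looks like the sketch below (plain buffered
POSIX calls and an invented helper for brevity; a real implementation would
use O_DIRECT async IO with aligned buffers):

#include <fcntl.h>
#include <string>
#include <unistd.h>
#include <vector>

struct PendingWrite { std::string path; std::string data; };

static void close_all(const std::vector<int>& fds) { for (int fd : fds) close(fd); }

bool write_batch(const std::string& dir, const std::vector<PendingWrite>& batch) {
    std::vector<int> fds;
    // Phase 1: issue all of the writes; nothing is synced yet, so they all
    // land in the cache and can be destaged together.
    for (const PendingWrite& w : batch) {
        int fd = open((dir + "/" + w.path).c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { close_all(fds); return false; }
        fds.push_back(fd);
        if (write(fd, w.data.data(), w.data.size()) != (ssize_t)w.data.size()) {
            close_all(fds); return false;
        }
    }
    // Phase 2: a single destage pass.  The fsync()s run back to back, so the
    // worst-case sync latency is paid roughly once per batch, not once per file.
    bool ok = true;
    for (int fd : fds) {
        if (fsync(fd) != 0) ok = false;
        close(fd);
    }
    // Finally sync the directory so the new names themselves are durable.
    int dfd = open(dir.c_str(), O_RDONLY | O_DIRECTORY);
    if (dfd < 0) return false;
    if (fsync(dfd) != 0) ok = false;
    close(dfd);
    return ok;
}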

I think that we need to look at why the write pattern is so heavily synchronous 
and single threaded if we are hoping to extract from any given storage tier its 
maximum performance.

Doing this can raise your file creations per second (or allocations per second) 
from a few dozen to a few hundred or more per second.

The complexity that you take on by writing a new block level allocation strategy (i.e., what the file system saves you from today) is:

* if you lay out a lot of small objects on the block store that can grow, we 
will quickly end up doing very complicated techniques that file systems solved a 
long time ago (pre-allocation, etc)
* multi-stream aware allocation if you have multiple processes writing to the 
same store
* tracking things like allocated but unwritten (can happen if some process 
"pokes" a hole in an object, common with things like virtual machine images)

Once we end up handling all of that in new, untested code, I think that we end up 
with a lot of pain and only minimal gain in terms of performance.

ric

>
>
> This is the fundamental tradeoff:
>
> 1) We have a file per object.  We fsync like crazy and the fact that
> there are two independent layers journaling and managing different types
> of consistency penalizes us.
>
> 1b) We get clever and start using obscure and/or custom ioctls in the file
> system to work around what it is used to: we swap extents to avoid
> write-ahead (see Christoph's patch), O_NOMTIME, unprivileged
> open-by-handle, batch fsync, O_ATOMIC, setext ioctl, etc.
>
> 2) We preallocate huge files and write a user-space object system that
> lives within it (pretending the file is a block device).  The file system
> rarely gets in the way (assuming the file is prewritten and we don't do
> anything stupid).  But it doesn't give us anything a block device
> wouldn't, and it doesn't save us any complexity in our code.
>
> At the end of the day, 1 and 1b are always going to be slower than 2.
> And although 1b performs a bit better than 1, it has similar (user-space)
> complexity to 2.  On the other hand, if you step back and view teh
> entire stack (ceph-osd + XFS), 1 and 1b are *significantly* more complex
> than 2... and yet still slower.  Given we ultimately have to support both
> (both as an upstream and as a distro), that's not very attractive.
>
> Also note that every time we have strayed off the reservation from the
> beaten path (1) to anything mildly exotic (1b) we have been bitten by
> obscure file systems bugs.  And that's assume we get everything we need
> upstream... which is probably a year's endeavour.
>
> Don't get me wrong: I'm all for making changes to file systems to better
> support systems like Ceph.  Things like O_NOCMTIME and O_ATOMIC make a
> huge amount of sense of a ton of different systems.  But our situations is
> a bit different: we always own the entire device (and often the server),
> so there is no need to share with other users or apps (and when you do,
> you just use the existing FileStore backend).  And as you know performance
> is a huge pain point.  We are already handicapped by virtue of being
> distributed and strongly consistent; we can't afford to give away more to
> a storage layer that isn't providing us much (or the right) value.
>
> And I'm tired of half measures.  I want the OSD to be as fast as we can
> make it given the architectural constraints (RADOS consistency and
> ordering semantics).  This is truly low-hanging fruit: it's modular,
> self-contained, pluggable, and this will be my third time around this
> particular block.
>
> sage


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-22  1:22           ` Allen Samuels
@ 2015-10-23  2:10             ` Ric Wheeler
  0 siblings, 0 replies; 71+ messages in thread
From: Ric Wheeler @ 2015-10-23  2:10 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil, ceph-devel

I disagree with your point still - your argument was that customers don't like 
to update their code so we cannot rely on them moving to better file system 
code.  Those same customers would be *just* as reluctant to upgrade OSD code.  
Been there, done that in pure block storage, pure object storage and in file 
system code (customers just don't care about the protocol, the conservative 
nature is consistent).

This is not a casual observation; I have been building storage systems since the mid-80's.

Regards,

Ric

On 10/21/2015 09:22 PM, Allen Samuels wrote:
> I agree. My only point was that you still have to factor this time into the argument that by continuing to put NewStore on top of a file system you'll get to a stable system much sooner than the longer development path of doing your own raw storage allocator. IMO, once you factor that into the equation the "on top of an FS" path doesn't look like such a clear winner.
>
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
> -----Original Message-----
> From: Ric Wheeler [mailto:rwheeler@redhat.com]
> Sent: Thursday, October 22, 2015 10:17 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> On 10/21/2015 08:53 PM, Allen Samuels wrote:
>> Fixing the bug doesn't take a long time. Getting it deployed is where the delay is. Many companies standardize on a particular release of a particular distro. Getting them to switch to a new release -- even a "bug fix" point release -- is a major undertaking that often is a complete roadblock. Just my experience. YMMV.
>>
> Customers do control the pace that they upgrade their machines, but we put out fixes on a very regular pace.  A lot of customers will get fixes without having to qualify a full new release (i.e., fixes come out between major and minor releases are easy).
>
> If someone is deploying a critical server for storage, then it falls back on the storage software team to help guide them and encourage them to update when needed (and no promises of success, but people move if the win is big. If it is not, they can wait).
>
> ric
>
>


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-22 12:32     ` Milosz Tanski
@ 2015-10-23  3:16       ` Howard Chu
  2015-10-23 13:27         ` Milosz Tanski
  0 siblings, 1 reply; 71+ messages in thread
From: Howard Chu @ 2015-10-23  3:16 UTC (permalink / raw)
  To: ceph-devel

Milosz Tanski <milosz <at> adfin.com> writes:

> 
> On Tue, Oct 20, 2015 at 4:00 PM, Sage Weil <sweil <at> redhat.com> wrote:
> > On Tue, 20 Oct 2015, John Spray wrote:
> >> On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil <sweil <at> redhat.com> wrote:
> >> >  - We have to size the kv backend storage (probably still an XFS
> >> > partition) vs the block storage.  Maybe we do this anyway (put
metadata on
> >> > SSD!) so it won't matter.  But what happens when we are storing gobs of
> >> > rgw index data or cephfs metadata?  Suddenly we are pulling storage
out of
> >> > a different pool and those aren't currently fungible.
> >>
> >> This is the concerning bit for me -- the other parts one "just" has to
> >> get the code right, but this problem could linger and be something we
> >> have to keep explaining to users indefinitely.  It reminds me of cases
> >> in other systems where users had to make an educated guess about inode
> >> size up front, depending on whether you're expecting to efficiently
> >> store a lot of xattrs.
> >>
> >> In practice it's rare for users to make these kinds of decisions well
> >> up-front: it really needs to be adjustable later, ideally
> >> automatically.  That could be pretty straightforward if the KV part
> >> was stored directly on block storage, instead of having XFS in the
> >> mix.  I'm not quite up with the state of the art in this area: are
> >> there any reasonable alternatives for the KV part that would consume
> >> some defined range of a block device from userspace, instead of
> >> sitting on top of a filesystem?
> >
> > I agree: this is my primary concern with the raw block approach.
> >
> > There are some KV alternatives that could consume block, but the problem
> > would be similar: we need to dynamically size up or down the kv portion of
> > the device.
> >
> > I see two basic options:
> >
> > 1) Wire into the Env abstraction in rocksdb to provide something just
> > smart enough to let rocksdb work.  It isn't much: named files (not that
> > many--we could easily keep the file table in ram), always written
> > sequentially, to be read later with random access. All of the code is
> > written around abstractions of SequentialFileWriter so that everything
> > posix is neatly hidden in env_posix (and there are various other env
> > implementations for in-memory mock tests etc.).
> >
> > 2) Use something like dm-thin to sit between the raw block device and XFS
> > (for rocksdb) and the block device consumed by newstore.  As long as XFS
> > doesn't fragment horrifically (it shouldn't, given we *always* write ~4mb
> > files in their entirety) we can fstrim and size down the fs portion.  If
> > we similarly make newstores allocator stick to large blocks only we would
> > be able to size down the block portion as well.  Typical dm-thin block
> > sizes seem to range from 64KB to 512KB, which seems reasonable enough to
> > me.  In fact, we could likely just size the fs volume at something
> > conservatively large (like 90%) and rely on -o discard or periodic fstrim
> > to keep its actual utilization in check.
> >
> 
> I think you could prototype a raw block device OSD store using LMDB as
> a starting point. I know there's been some experiments using LMDB as
> KV store before with positive read numbers and not great write
> numbers.
> 
> 1. It mmaps, just mmap the raw disk device / partition. I've done this
> as an experiment before, I can dig up a patch for LMDB.
> 2. It already has a free space management strategy. I'm prob it's not
> right for the OSDs in the long term but there's something to start
> there with.
> 3. It's already supports transactions / COW.
> 4. LMDB isn't a huge code base so it might be a good place to start /
> evolve code from.
> 5. You're not starting a multi-year effort at the 0 point.
> 
> As to the not great write performance, that could be addressed by
> write transaction merging (what mysql implemented a few years ago).

We have a heavily hacked version of LMDB contributed by VMware that
implements a WAL. In my preliminary testing it performs synchronous writes
30x faster (on average) than current LMDB. Their version unfortunately
slashed'n'burned a lot of LMDB features that other folks actually need, so
we can't use it as-is. Currently working on rationalizing the approach and
merging it into mdb.master.

The reasons for the WAL approach:
  1) obviously sequential writes are cheaper than random writes.
  2) fsync() of a small log file will always be faster than fsync() of a
large DB. I.e., fsync() latency is proportional to the total number of pages
in the file, not just the number of dirty pages.
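
In sketch form (hypothetical code, not the actual VMware/LMDB patch), the
commit path only ever appends to and syncs the small log; the main DB pages
are written back later, outside the commit path:

#include <cstddef>
#include <string>
#include <fcntl.h>
#include <unistd.h>

class WriteAheadLog {
public:
    explicit WriteAheadLog(const std::string& path)
        : fd_(open(path.c_str(), O_WRONLY | O_CREAT | O_APPEND, 0644)) {}
    ~WriteAheadLog() { if (fd_ >= 0) close(fd_); }

    // Durable commit: sequential append + sync of the (small) log only.
    bool commit(const void* record, size_t len) {
        if (fd_ < 0) return false;
        if (write(fd_, record, len) != (ssize_t)len) return false;
        return fdatasync(fd_) == 0;
    }

    // Called at checkpoint time: once the main DB pages covered by the log
    // have been written back and synced, the log can be truncated.
    bool truncate_after_checkpoint() {
        return fd_ >= 0 && ftruncate(fd_, 0) == 0;
    }

private:
    int fd_;
};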

LMDB on a raw block device is a simpler proposition, and one we intend to
integrate soon as well. (Milosz, did you ever submit your changes?)

> Here you have an opportunity to do it two days. One, you can do it in
> the application layer while waiting for the fsync from transaction to
> complete. This is probably the easier route. Two, you can do it in the
> DB layer (the LMDB transaction handling / locking) where you're
> already started processing the following transactions using the
> currently committing transaction (COW) as a starting point. This is
> harder mostly because of the synchronization needed or involved.
> 
> I've actually spend some time thinking about doing LMDB write
> transaction merging outside the OSD context. This was for another
> project.
> 
> My 2 cents.

For my 2 cents, a number of approaches have been mentioned on this thread
that I think are worth touching on:

First of all LevelDB-style LSMs are an inherently poor design choice -
requiring multiple files to be opened/closed during routine operation is
inherently fragile. Inside a service that is also opening/closing many
network sockets, if you hit your filedescriptor limit in the middle of a DB
op you lose the DB. If you get a system crash in the middle of a sequence of
open/close/rename/delete ops you lose the DB. Etc. etc. (LevelDB
unreliability is already well researched and well proven, I'm not saying
anything new here
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/pillai
)

User-level pagecache management - also an inherently poor design choice.
  1) The kernel has hardware-assist - it will always be more efficient than
any user-level code.
  2) The kernel knows about the entire system state - user level can only
easily know about a single process' resource usage. If your process is
sharing with any other services on the machine your performance will be
sub-optimal.
  3) In this day of virtual machines/cloud processing with
hardware-accelerated VMs, kernel-managed paging passes thru straight to the
hypervisor, so it is always efficient. User-level paging might know about
the current guest machine image's resource consumption, but won't know about
the actual state of the world in the hypervisor or host machine. It will be
prone to (and exacerbate) thrashing in ways that kernel-managed paging won't.

User-level pagecache management only works when your application is the only
thing running on the box. (In that case, it can certainly work very well.)
That's not the reality for most of today's computing landscape, nor the
foreseeable future.

-- 
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/ 


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-21 13:50             ` Ric Wheeler
@ 2015-10-23  6:21               ` Howard Chu
  2015-10-23 11:06                 ` Ric Wheeler
  0 siblings, 1 reply; 71+ messages in thread
From: Howard Chu @ 2015-10-23  6:21 UTC (permalink / raw)
  To: ceph-devel

Ric Wheeler <rwheeler <at> redhat.com> writes:

> 
> On 10/21/2015 09:32 AM, Sage Weil wrote:
> > On Tue, 20 Oct 2015, Ric Wheeler wrote:
> >>> Now:
> >>>       1 io  to write a new file
> >>>     1-2 ios to sync the fs journal (commit the inode, alloc change)
> >>>             (I see 2 journal IOs on XFS and only 1 on ext4...)
> >>>       1 io  to commit the rocksdb journal (currently 3, but will drop to
> >>>             1 with xfs fix and my rocksdb change)
> >> I think that might be too pessimistic - the number of discrete IO's
sent down
> >> to a spinning disk make much less impact on performance than the number of
> >> fsync()'s since they IO's all land in the write cache.  Some newer spinning
> >> drives have a non-volatile write cache, so even an fsync() might not end up
> >> doing the expensive data transfer to the platter.
> > True, but in XFS's case at least the file data and journal are not
> > colocated, so its 2 seeks for the new file write+fdatasync and another for
> > the rocksdb journal commit.  Of course, with a deep queue, we're doing
> > lots of these so there's be fewer journal commits on both counts, but the
> > lower bound on latency of a single write is still 3 seeks, and that bound
> > is pretty critical when you also have network round trips and replication
> > (worst out of 2) on top.
> 
> What are the performance goals we are looking for?
> 
> Small, synchronous writes/second?
> 
> File creates/second?
> 
> I suspect that looking at things like seeks/write is probably looking at the 
> wrong level of performance challenges.  Again, when you write to a modern
drive, 
> you write to its write cache and it decides internally when/how to destage to 
> the platter.
> 
> If you look at the performance of XFS with streaming workloads, it will
tend to 
> max out the bandwidth of the underlaying storage.
> 
> If we need IOP's/file writes, etc, we should be clear on what we are
aiming at.
> 
> >
> >> It would be interesting to get the timings on the IO's you see to
measure the
> >> actual impact.
> > I observed this with the journaling workload for rocksdb, but I assume the
> > journaling behavior is the same regardless of what is being journaled.
> > For a 4KB append to a file + fdatasync, I saw ~30ms latency for XFS, and
> > blktrace showed an IO to the file, and 2 IOs to the journal.  I believe
> > the first one is the record for the inode update, and the second is the
> > journal 'commit' record (though I forget how I decided that).  My guess is
> > that XFS is being extremely careful about journal integrity here and not
> > writing the commit record until it knows that the preceding records landed
> > on stable storage.  For ext4, the latency was about ~20ms, and blktrace
> > showed the IO to the file and then a single journal IO.  When I made the
> > rocksdb change to overwrite an existing, prewritten file, the latency
> > dropped to ~10ms on ext4, and blktrace showed a single IO as expected.
> > (XFS still showed the 2 journal commit IOs, but Dave just posted the fix
> > for that on the XFS list today.)

> Normally, best practice is to use batching to avoid paying worst case latency 
> when you do a synchronous IO. Write a batch of files or appends without
fsync, 
> then go back and fsync and you will pay that latency once (not per file/op).

If filesystems supported ordered writes, you wouldn't need to fsync at
all. Just spit out a stream of writes and declare that batch N must be
written before batch N+1. (Note that this is not identical to "write
barriers", which imposed the same latencies as fsync by blocking all I/Os at
a barrier boundary. Ordered writes may be freely interleaved with un-ordered
writes, so normal I/O traffic can proceed unhindered. Their ordering is only
enforced wrt other ordered writes.)

A bit of a shame that Linux's SCSI drivers support Ordering attributes but
nothing above that layer makes use of it.
-- 
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/ 


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-23  6:21               ` Howard Chu
@ 2015-10-23 11:06                 ` Ric Wheeler
  2015-10-23 11:47                   ` Ric Wheeler
  0 siblings, 1 reply; 71+ messages in thread
From: Ric Wheeler @ 2015-10-23 11:06 UTC (permalink / raw)
  To: Howard Chu, ceph-devel

On 10/23/2015 02:21 AM, Howard Chu wrote:
>> Normally, best practice is to use batching to avoid paying worst case latency
>> >when you do a synchronous IO. Write a batch of files or appends without
> fsync,
>> >then go back and fsync and you will pay that latency once (not per file/op).
> If filesystems would support ordered writes you wouldn't need to fsync at
> all. Just spit out a stream of writes and declare that batch N must be
> written before batch N+1. (Note that this is not identical to "write
> barriers", which imposed the same latencies as fsync by blocking all I/Os at
> a barrier boundary. Ordered writes may be freely interleaved with un-ordered
> writes, so normal I/O traffic can proceed unhindered. Their ordering is only
> enforced wrt other ordered writes.)
>
> A bit of a shame that Linux's SCSI drivers support Ordering attributes but
> nothing above that layer makes use of it.

I think that if the stream on either side of the barrier is large enough, using 
ordered tags (SCSI speak) versus doing stream1, fsync(), stream2, should have 
the same performance.

Not clear to me if we could do away with an fsync to trigger a cache flush here 
either - do SCSI ordered tags require that the writes be acknowledged only when 
durable, or can the device ack them once the target has them (including in a 
volatile write cache)?

Ric


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-23 11:06                 ` Ric Wheeler
@ 2015-10-23 11:47                   ` Ric Wheeler
  2015-10-23 14:59                     ` Howard Chu
  0 siblings, 1 reply; 71+ messages in thread
From: Ric Wheeler @ 2015-10-23 11:47 UTC (permalink / raw)
  To: Howard Chu, ceph-devel

On 10/23/2015 07:06 AM, Ric Wheeler wrote:
> On 10/23/2015 02:21 AM, Howard Chu wrote:
>>> Normally, best practice is to use batching to avoid paying worst case latency
>>> when you do a synchronous IO. Write a batch of files or appends without fsync,
>>> then go back and fsync and you will pay that latency once (not per file/op).
>> If filesystems would support ordered writes you wouldn't need to fsync at
>> all. Just spit out a stream of writes and declare that batch N must be
>> written before batch N+1. (Note that this is not identical to "write
>> barriers", which imposed the same latencies as fsync by blocking all I/Os at
>> a barrier boundary. Ordered writes may be freely interleaved with un-ordered
>> writes, so normal I/O traffic can proceed unhindered. Their ordering is only
>> enforced wrt other ordered writes.)
>>
>> A bit of a shame that Linux's SCSI drivers support Ordering attributes but
>> nothing above that layer makes use of it.
>
> I think that if the stream on either side of the barrier is large enough, 
> using ordered tags (SCSI speak) versus doing stream1, fsync(), stream2, should 
> have the same performance.
>
> Not clear to me if we could do away with an fsync to trigger a cache flush 
> here either - do SCSI ordered tags require that the writes be acknowledged 
> only when durable, or can the device ack them once the target has them 
> (including in a volatile write cache)?
>
> Ric
>
>

One other note: the file & storage kernel people discussed using ordering years 
ago. One of the issues is that the devices themselves need to support it. While 
SATA devices are presented as SCSI devices in the kernel, ATA did not (and still 
does not, as far as I know) support ordered tags.

Regards,

Ric



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-23  3:16       ` Howard Chu
@ 2015-10-23 13:27         ` Milosz Tanski
  0 siblings, 0 replies; 71+ messages in thread
From: Milosz Tanski @ 2015-10-23 13:27 UTC (permalink / raw)
  To: Howard Chu; +Cc: ceph-devel

On Thu, Oct 22, 2015 at 11:16 PM, Howard Chu <hyc@symas.com> wrote:
> Milosz Tanski <milosz <at> adfin.com> writes:
>
>>
>> On Tue, Oct 20, 2015 at 4:00 PM, Sage Weil <sweil <at> redhat.com> wrote:
>> > On Tue, 20 Oct 2015, John Spray wrote:
>> >> On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil <sweil <at> redhat.com> wrote:
>> >> >  - We have to size the kv backend storage (probably still an XFS
>> >> > partition) vs the block storage.  Maybe we do this anyway (put
> metadata on
>> >> > SSD!) so it won't matter.  But what happens when we are storing gobs of
>> >> > rgw index data or cephfs metadata?  Suddenly we are pulling storage
> out of
>> >> > a different pool and those aren't currently fungible.
>> >>
>> >> This is the concerning bit for me -- the other parts one "just" has to
>> >> get the code right, but this problem could linger and be something we
>> >> have to keep explaining to users indefinitely.  It reminds me of cases
>> >> in other systems where users had to make an educated guess about inode
>> >> size up front, depending on whether you're expecting to efficiently
>> >> store a lot of xattrs.
>> >>
>> >> In practice it's rare for users to make these kinds of decisions well
>> >> up-front: it really needs to be adjustable later, ideally
>> >> automatically.  That could be pretty straightforward if the KV part
>> >> was stored directly on block storage, instead of having XFS in the
>> >> mix.  I'm not quite up with the state of the art in this area: are
>> >> there any reasonable alternatives for the KV part that would consume
>> >> some defined range of a block device from userspace, instead of
>> >> sitting on top of a filesystem?
>> >
>> > I agree: this is my primary concern with the raw block approach.
>> >
>> > There are some KV alternatives that could consume block, but the problem
>> > would be similar: we need to dynamically size up or down the kv portion of
>> > the device.
>> >
>> > I see two basic options:
>> >
>> > 1) Wire into the Env abstraction in rocksdb to provide something just
>> > smart enough to let rocksdb work.  It isn't much: named files (not that
>> > many--we could easily keep the file table in ram), always written
>> > sequentially, to be read later with random access. All of the code is
>> > written around abstractions of SequentialFileWriter so that everything
>> > posix is neatly hidden in env_posix (and there are various other env
>> > implementations for in-memory mock tests etc.).
>> >
>> > 2) Use something like dm-thin to sit between the raw block device and XFS
>> > (for rocksdb) and the block device consumed by newstore.  As long as XFS
>> > doesn't fragment horrifically (it shouldn't, given we *always* write ~4mb
>> > files in their entirety) we can fstrim and size down the fs portion.  If
>> > we similarly make newstore's allocator stick to large blocks only, we would
>> > be able to size down the block portion as well.  Typical dm-thin block
>> > sizes seem to range from 64KB to 512KB, which seems reasonable enough to
>> > me.  In fact, we could likely just size the fs volume at something
>> > conservatively large (like 90%) and rely on -o discard or periodic fstrim
>> > to keep its actual utilization in check.
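
(For concreteness, a minimal sketch of the kind of shim option 1 above implies:
a file table kept entirely in RAM that maps names to extents on the raw device,
append-only writes, random-access reads. The class and names here are
hypothetical -- this is not the actual rocksdb Env API:)

  #include <cstdint>
  #include <map>
  #include <string>
  #include <vector>

  struct Extent { uint64_t offset; uint64_t length; };   // location on the raw device

  class BlockFiles {
   public:
      explicit BlockFiles(uint64_t device_size) : next_free_(0), size_(device_size) {}

      // Append-only write: grab space with a bump allocator and remember the extent.
      // The caller then pwrite()s its data at out->offset.
      bool Append(const std::string& name, uint64_t len, Extent* out) {
          if (next_free_ + len > size_) return false;     // toy allocator: never reuses space
          Extent e{next_free_, len};
          next_free_ += len;
          table_[name].push_back(e);
          if (out) *out = e;
          return true;
      }

      // Random-access read: translate (name, file offset) into a device offset.
      bool Lookup(const std::string& name, uint64_t file_off, uint64_t* dev_off) const {
          auto it = table_.find(name);
          if (it == table_.end()) return false;
          for (const Extent& e : it->second) {
              if (file_off < e.length) { *dev_off = e.offset + file_off; return true; }
              file_off -= e.length;
          }
          return false;
      }

      void Delete(const std::string& name) { table_.erase(name); }  // leaks space: toy only

   private:
      std::map<std::string, std::vector<Extent>> table_;  // the whole "namespace", in RAM
      uint64_t next_free_, size_;
  };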
>> >
>>
>> I think you could prototype a raw block device OSD store using LMDB as
>> a starting point. I know there's been some experiments using LMDB as
>> KV store before with positive read numbers and not great write
>> numbers.
>>
>> 1. It mmaps, just mmap the raw disk device / partition. I've done this
>> as an experiment before, I can dig up a patch for LMDB.
>> 2. It already has a free space management strategy. It's probably not
>> right for the OSDs in the long term, but it's something to start
>> with.
>> 3. It already supports transactions / COW.
>> 4. LMDB isn't a huge code base so it might be a good place to start /
>> evolve code from.
>> 5. You're not starting a multi-year effort at the 0 point.
>>
>> As for the not-so-great write performance, that could be addressed by
>> write transaction merging (what MySQL implemented a few years ago).
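
(A rough sketch of point 1 above -- mmapping a raw partition the way LMDB mmaps
a data file. Linux-specific (BLKGETSIZE64), and the device path is just a
placeholder:)

  #include <fcntl.h>
  #include <linux/fs.h>      // BLKGETSIZE64
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <unistd.h>
  #include <cstdint>
  #include <cstdio>

  int main() {
      const char* dev = "/dev/sdb1";                     // placeholder device path
      int fd = open(dev, O_RDWR);
      if (fd < 0) { perror("open"); return 1; }

      uint64_t size = 0;
      if (ioctl(fd, BLKGETSIZE64, &size) != 0) {         // block device size in bytes
          perror("ioctl(BLKGETSIZE64)");
          return 1;
      }

      void* map = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      if (map == MAP_FAILED) { perror("mmap"); return 1; }

      // ... a B-tree pager would operate on 'map' exactly as it does on a
      // mapped file; msync()/fsync() still control durability ...

      munmap(map, size);
      close(fd);
      return 0;
  }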
>
> We have a heavily hacked version of LMDB contributed by VMware that
> implements a WAL. In my preliminary testing it performs synchronous writes
> 30x faster (on average) than current LMDB. Their version unfortunately
> slashed'n'burned a lot of LMDB features that other folks actually need, so
> we can't use it as-is. Currently working on rationalizing the approach and
> merging it into mdb.master.
>
> The reasons for the WAL approach:
>   1) obviously sequential writes are cheaper than random writes.
>   2) fsync() of a small log file will always be faster than fsync() of a
> large DB. I.e., fsync() latency is proportional to the total number of pages
> in the file, not just the number of dirty pages.
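
(A minimal sketch of the WAL shape being described: commits append to a small
log and fsync only that, while the large main file is synced rarely, at
checkpoints. File names are placeholders and most error handling is omitted:)

  #include <fcntl.h>
  #include <unistd.h>
  #include <string>

  class WalSketch {
   public:
      WalSketch() {
          wal_fd_ = open("db.wal", O_CREAT | O_WRONLY | O_APPEND, 0644);
          db_fd_  = open("db.main", O_CREAT | O_RDWR, 0644);
      }
      // Commit = sequential append + fdatasync of the small log only.
      bool Commit(const std::string& record) {
          if (write(wal_fd_, record.data(), record.size()) < 0) return false;
          return fdatasync(wal_fd_) == 0;
      }
      // Checkpoint = fold logged changes into the big file, then truncate the log.
      bool Checkpoint() {
          // ... apply the logged records to db_fd_ with pwrite() ...
          if (fsync(db_fd_) != 0) return false;           // the expensive sync, paid rarely
          return ftruncate(wal_fd_, 0) == 0;
      }
   private:
      int wal_fd_, db_fd_;
  };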

This is a bit off topic (from newstore) - more for Howard, about LMDB
internals and write serialization.

Howard, there is a way to make progress on pending transactions without a
WAL. LMDB is already COW, so hypothetically further write transactions
could proceed one at a time using the previously committed (but not yet
fsynced) transaction as a starting point. When one fsync is complete, you
can fsync the next group. This breaks ACID because it violates the
Isolation principle: transactions become dependent on the previous
transaction, and if that fails to fsync then the following transactions
fail as well. I'm not sure this is that important for a lot of apps.

Here's the conceptual model: http://i.imgur.com/wUCplq1.png

The way the LMDB code is organized (the data structures) makes it seem
like it would be straightforward. Synchronization is where this becomes
painful, as there needs to be a lot more coordination between writers
(waiters) than there is today (a simple writer mutex).
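
(A sketch of the application-layer version of this write merging -- group
commit around a single log fd, where whoever arrives first fsyncs on behalf
of everyone queued so far. The class and member names are made up:)

  #include <unistd.h>
  #include <condition_variable>
  #include <cstdint>
  #include <mutex>

  class GroupCommitter {
   public:
      explicit GroupCommitter(int log_fd) : log_fd_(log_fd) {}

      // A writer calls this after its records have been written (not yet synced).
      // Returns once an fsync that covers this writer has completed.
      bool WaitDurable() {
          std::unique_lock<std::mutex> lk(mu_);
          const uint64_t my_seq = ++last_queued_;
          while (last_durable_ < my_seq) {
              if (flushing_) { cv_.wait(lk); continue; }   // another thread is syncing
              flushing_ = true;                            // become the flusher for the batch
              const uint64_t flush_up_to = last_queued_;
              lk.unlock();
              const bool ok = (fsync(log_fd_) == 0);       // one sync covers the whole batch
              lk.lock();
              flushing_ = false;
              if (ok) last_durable_ = flush_up_to;
              cv_.notify_all();
              if (!ok) return false;                       // later writers will retry the sync
          }
          return true;
      }

   private:
      std::mutex mu_;
      std::condition_variable cv_;
      const int log_fd_;
      uint64_t last_queued_ = 0, last_durable_ = 0;
      bool flushing_ = false;
  };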

>
> LMDB on a raw block device is a simpler proposition, and one we intend to
> integrate soon as well. (Milosz, did you ever submit your changes?)

I'll dig out my changes from my work environment, see if anything
needs to be cleaned up and send it out. I got context switched out to
something else :/

>
>> Here you have an opportunity to do it two ways. One, you can do it in
>> the application layer while waiting for the fsync of the transaction to
>> complete. This is probably the easier route. Two, you can do it in the
>> DB layer (the LMDB transaction handling / locking), where you've
>> already started processing the following transactions using the
>> currently committing transaction (COW) as a starting point. This is
>> harder, mostly because of the synchronization involved.
>>
>> I've actually spent some time thinking about doing LMDB write
>> transaction merging outside the OSD context. This was for another
>> project.
>>
>> My 2 cents.
>
> For my 2 cents, a number of approaches have been mentioned on this thread
> that I think are worth touching on:
>
> First of all, LevelDB-style LSMs are an inherently poor design choice -
> requiring multiple files to be opened/closed during routine operation is
> fragile. Inside a service that is also opening/closing many network
> sockets, if you hit your file descriptor limit in the middle of a DB op
> you lose the DB. If you get a system crash in the middle of a sequence of
> open/close/rename/delete ops you lose the DB. Etc. etc. (LevelDB
> unreliability is already well researched and well proven; I'm not saying
> anything new here:
> https://www.usenix.org/conference/osdi14/technical-sessions/presentation/pillai
> )
>
> User-level pagecache management - also an inherently poor design choice.
>   1) The kernel has hardware-assist - it will always be more efficient than
> any user-level code.
>   2) The kernel knows about the entire system state - user level can only
> easily know about a single process' resource usage. If your process is
> sharing with any other services on the machine your performance will be
> sub-optimal.
>   3) In this day of virtual machines/cloud processing with
> hardware-accelerated VMs, kernel-managed paging passes thru straight to the
> hypervisor, so it is always efficient. User-level paging might know about
> the current guest machine image's resource consumption, but won't know about
> the actual state of the world in the hypervisor or host machine. It will be
> prone to (and exacerbate) thrashing in ways that kernel-managed paging won't.
>
> User-level pagecache management only works when your application is the only
> thing running on the box. (In that case, it can certainly work very well.)
> That's not the reality for most of today's computing landscape, nor the
> foreseeable future.
>
> --
>   -- Howard Chu
>   CTO, Symas Corp.           http://www.symas.com
>   Director, Highland Sun     http://highlandsun.com/hyc/
>   Chief Architect, OpenLDAP  http://www.openldap.org/project/
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-23 11:47                   ` Ric Wheeler
@ 2015-10-23 14:59                     ` Howard Chu
  2015-10-23 16:37                       ` Ric Wheeler
  2015-10-23 18:59                       ` Gregory Farnum
  0 siblings, 2 replies; 71+ messages in thread
From: Howard Chu @ 2015-10-23 14:59 UTC (permalink / raw)
  To: Ric Wheeler, ceph-devel

Ric Wheeler wrote:
> On 10/23/2015 07:06 AM, Ric Wheeler wrote:
>> On 10/23/2015 02:21 AM, Howard Chu wrote:
>>>> Normally, best practice is to use batching to avoid paying worst case latency
>>>> when you do a synchronous IO. Write a batch of files or appends without fsync,
>>>> then go back and fsync and you will pay that latency once (not per file/op).
>>> If filesystems would support ordered writes you wouldn't need to fsync at
>>> all. Just spit out a stream of writes and declare that batch N must be
>>> written before batch N+1. (Note that this is not identical to "write
>>> barriers", which imposed the same latencies as fsync by blocking all I/Os at
>>> a barrier boundary. Ordered writes may be freely interleaved with un-ordered
>>> writes, so normal I/O traffic can proceed unhindered. Their ordering is only
>>> enforced wrt other ordered writes.)

> One other note: the file & storage kernel people discussed using ordering
> years ago. One of the issues is that the devices themselves need to support it.
> While SATA devices are presented as SCSI devices in the kernel, ATA did not
> (and still does not, as far as I know) support ordered tags.

Yes, that's a bigger problem. ATA NCQ/TCQ aren't up to the job.

 >>> A bit of a shame that Linux's SCSI drivers support Ordering attributes but
 >>> nothing above that layer makes use of it.
 >>
 >> I think that if the stream on either side of the barrier is large enough,
 >> using ordered tags (SCSI speak) versus doing stream1, fsync(), stream2,
 >> should have the same performance.

 >> Not clear to me if we could do away with an fsync to trigger a cache flush
 >> here either - do SCSI ordered tags require that the writes be acknowledged
 >> only when durable, or can the device ack them once the target has them
 >> (including in a volatile write cache)?

fsync() is too blunt a tool; its use gives you both C and D of ACID 
(Consistency and Durability). Ordered tags give you Consistency; there are 
lots of applications that can live without perfect Durability but losing 
Consistency is a major headache.

If the stream of writes is large enough, you could omit fsync because 
everything is being forced out of the cache to disk anyway. In that scenario, 
the only thing that matters is that the writes get forced out in the order you 
intended, so that an interruption or crash leaves you in a known (or knowable) 
state vs unknown.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-23 14:59                     ` Howard Chu
@ 2015-10-23 16:37                       ` Ric Wheeler
  2015-10-23 18:59                       ` Gregory Farnum
  1 sibling, 0 replies; 71+ messages in thread
From: Ric Wheeler @ 2015-10-23 16:37 UTC (permalink / raw)
  To: Howard Chu, ceph-devel

On 10/23/2015 10:59 AM, Howard Chu wrote:
> Ric Wheeler wrote:
>> On 10/23/2015 07:06 AM, Ric Wheeler wrote:
>>> On 10/23/2015 02:21 AM, Howard Chu wrote:
>>>>> Normally, best practice is to use batching to avoid paying worst case latency
>>>>> when you do a synchronous IO. Write a batch of files or appends without fsync,
>>>>> then go back and fsync and you will pay that latency once (not per file/op).
>>>> If filesystems would support ordered writes you wouldn't need to fsync at
>>>> all. Just spit out a stream of writes and declare that batch N must be
>>>> written before batch N+1. (Note that this is not identical to "write
>>>> barriers", which imposed the same latencies as fsync by blocking all I/Os at
>>>> a barrier boundary. Ordered writes may be freely interleaved with un-ordered
>>>> writes, so normal I/O traffic can proceed unhindered. Their ordering is only
>>>> enforced wrt other ordered writes.)
>
>> One other note: the file & storage kernel people discussed using ordering
>> years ago. One of the issues is that the devices themselves need to support it.
>> While SATA devices are presented as SCSI devices in the kernel, ATA did not
>> (and still does not, as far as I know) support ordered tags.
>
> Yes, that's a bigger problem. ATA NCQ/TCQ aren't up to the job.
>
> >>> A bit of a shame that Linux's SCSI drivers support Ordering attributes but
> >>> nothing above that layer makes use of it.
> >>
> >> I think that if the stream on either side of the barrier is large enough,
> >> using ordered tags (SCSI speak) versus doing stream1, fsync(), stream2,
> >> should have the same performance.
>
> >> Not clear to me if we could do away with an fsync to trigger a cache flush
> >> here either - do SCSI ordered tags require that the writes be acknowledged
> >> only when durable, or can the device ack them once the target has them
> >> (including in a volatile write cache)?
>
> fsync() is too blunt a tool; its use gives you both C and D of ACID 
> (Consistency and Durability). Ordered tags give you Consistency; there are 
> lots of applications that can live without perfect Durability but losing 
> Consistency is a major headache.
>
> If the stream of writes is large enough, you could omit fsync because 
> everything is being forced out of the cache to disk anyway. In that scenario, 
> the only thing that matters is that the writes get forced out in the order you 
> intended, so that an interruption or crash leaves you in a known (or knowable) 
> state vs unknown.
>

I do agree that fsync is quite a blunt tool, but you cannot assume that a stream 
of writes will flush the cache - that is extremely firmware dependent.

It's pretty common to leave small IOs in cache and let larger IOs stream directly 
to the backing device (platter, etc.) - those small objects can stay live and 
non-durable for days under some heavy workloads :)

ric


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-23 14:59                     ` Howard Chu
  2015-10-23 16:37                       ` Ric Wheeler
@ 2015-10-23 18:59                       ` Gregory Farnum
  2015-10-23 21:23                         ` Howard Chu
  1 sibling, 1 reply; 71+ messages in thread
From: Gregory Farnum @ 2015-10-23 18:59 UTC (permalink / raw)
  To: Howard Chu; +Cc: Ric Wheeler, ceph-devel

On Fri, Oct 23, 2015 at 7:59 AM, Howard Chu <hyc@symas.com> wrote:
> If the stream of writes is large enough, you could omit fsync because
> everything is being forced out of the cache to disk anyway. In that
> scenario, the only thing that matters is that the writes get forced out in
> the order you intended, so that an interruption or crash leaves you in a
> known (or knowable) state vs unknown.

The RADOS storage semantics actually require that we know it's durable
on disk as well, unfortunately. But ordered writes would probably let
us batch up commit points in ways that are a lot friendlier for the
drives!
-Greg

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: newstore direction
  2015-10-23 18:59                       ` Gregory Farnum
@ 2015-10-23 21:23                         ` Howard Chu
  0 siblings, 0 replies; 71+ messages in thread
From: Howard Chu @ 2015-10-23 21:23 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Ric Wheeler, ceph-devel

Gregory Farnum wrote:
> On Fri, Oct 23, 2015 at 7:59 AM, Howard Chu <hyc@symas.com> wrote:
>> If the stream of writes is large enough, you could omit fsync because
>> everything is being forced out of the cache to disk anyway. In that
>> scenario, the only thing that matters is that the writes get forced out in
>> the order you intended, so that an interruption or crash leaves you in a
>> known (or knowable) state vs unknown.
>
> The RADOS storage semantics actually require that we know it's durable
> on disk as well, unfortunately. But ordered writes would probably let
> us batch up commit points in ways that are a lot friendlier for the
> drives!

Ah, that's too bad. LMDB does fine with only ordering, but never mind.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2015-10-23 21:23 UTC | newest]

Thread overview: 71+ messages
2015-10-19 19:49 newstore direction Sage Weil
2015-10-19 20:22 ` Robert LeBlanc
2015-10-19 20:30 ` Somnath Roy
2015-10-19 20:54   ` Sage Weil
2015-10-19 22:21     ` James (Fei) Liu-SSI
2015-10-20  2:24       ` Chen, Xiaoxi
2015-10-20 12:30         ` Sage Weil
2015-10-20 13:19           ` Mark Nelson
2015-10-20 17:04             ` kernel neophyte
2015-10-21 10:06             ` Allen Samuels
2015-10-21 13:35               ` Mark Nelson
2015-10-21 16:10                 ` Chen, Xiaoxi
2015-10-22  1:09                   ` Allen Samuels
2015-10-20  2:32       ` Varada Kari
2015-10-20  2:40         ` Chen, Xiaoxi
2015-10-20 12:34       ` Sage Weil
2015-10-20 20:18         ` Martin Millnert
2015-10-20 20:32         ` James (Fei) Liu-SSI
2015-10-20 20:39           ` James (Fei) Liu-SSI
2015-10-20 21:20           ` Sage Weil
2015-10-19 21:18 ` Wido den Hollander
2015-10-19 22:40 ` Varada Kari
2015-10-20  0:48 ` John Spray
2015-10-20 20:00   ` Sage Weil
2015-10-20 20:36     ` Gregory Farnum
2015-10-20 21:47       ` Sage Weil
2015-10-20 22:23         ` Ric Wheeler
2015-10-21 13:32           ` Sage Weil
2015-10-21 13:50             ` Ric Wheeler
2015-10-23  6:21               ` Howard Chu
2015-10-23 11:06                 ` Ric Wheeler
2015-10-23 11:47                   ` Ric Wheeler
2015-10-23 14:59                     ` Howard Chu
2015-10-23 16:37                       ` Ric Wheeler
2015-10-23 18:59                       ` Gregory Farnum
2015-10-23 21:23                         ` Howard Chu
2015-10-20 20:42     ` Matt Benjamin
2015-10-22 12:32     ` Milosz Tanski
2015-10-23  3:16       ` Howard Chu
2015-10-23 13:27         ` Milosz Tanski
2015-10-20  2:08 ` Haomai Wang
2015-10-20 12:25   ` Sage Weil
2015-10-20  7:06 ` Dałek, Piotr
2015-10-20 18:31 ` Ric Wheeler
2015-10-20 19:44   ` Sage Weil
2015-10-20 21:43     ` Ric Wheeler
2015-10-20 19:44   ` Yehuda Sadeh-Weinraub
2015-10-21  8:22   ` Orit Wasserman
2015-10-21 11:18     ` Ric Wheeler
2015-10-21 17:30       ` Sage Weil
2015-10-22  8:31         ` Christoph Hellwig
2015-10-22 12:50       ` Sage Weil
2015-10-22 17:42         ` James (Fei) Liu-SSI
2015-10-22 23:42           ` Samuel Just
2015-10-23  0:10             ` Samuel Just
2015-10-23  1:26             ` Allen Samuels
2015-10-23  2:06         ` Ric Wheeler
2015-10-21 10:06   ` Allen Samuels
2015-10-21 11:24     ` Ric Wheeler
2015-10-21 14:14       ` Mark Nelson
2015-10-21 15:51         ` Ric Wheeler
2015-10-21 19:37           ` Mark Nelson
2015-10-21 21:20             ` Martin Millnert
2015-10-22  2:12               ` Allen Samuels
2015-10-22  8:51                 ` Orit Wasserman
2015-10-22  0:53       ` Allen Samuels
2015-10-22  1:16         ` Ric Wheeler
2015-10-22  1:22           ` Allen Samuels
2015-10-23  2:10             ` Ric Wheeler
2015-10-21 13:44     ` Mark Nelson
2015-10-22  1:39       ` Allen Samuels
