* KeyFileStore ?
@ 2014-07-31  5:25 Sage Weil
  2014-07-31  5:49 ` Mark Kirkwood
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Sage Weil @ 2014-07-31  5:25 UTC (permalink / raw)
  To: ceph-devel

After the latest set of bug fixes to the FileStore file naming code I am 
newly inspired to replace it with something less complex.  Right now I'm 
mostly thinking about HDDs, although some of this may map well onto hybrid 
SSD/HDD as well.  It may or may not make sense for pure flash.

Anyway, here are the main flaws with the overall approach that FileStore 
uses:

- It tries to maintain a direct mapping of object names to file names.  
This is problematic because of 255-character filename limits, rados namespaces, pg 
prefixes, and the pg directory hashing we do to allow efficient split, for 
starters.  It is also problematic because we often want to do things like 
rename but can't make it happen atomically in combination with the rest of 
our transaction.

- The PG directory hashing (that we do to allow efficient split) can have 
a big impact on performance, particularly when ingesting lots of data.  
(And when benchmarking.)  It's also complex.

- We often overwrite or replace entire objects.  These are "easy" 
operations to do safely without doing complete data journaling, but the 
current design is not conducive to doing anything clever (and it's complex 
enough that I wouldn't want to add any cleverness on top).

- Objects may contain only key/value data, but we still have to create an 
inode for them and look that up first.  This only matters for some 
workloads (rgw indexes, cephfs directory objects).

Instead, I think we should try a hybrid approach that more heavily 
leverages a key/value db in combination with the file system.  The kv db 
might be leveldb, rocksdb, LMDB, BDB, or whatever else; for now we just 
assume it provides transactional key/value storage and efficient range 
operations.  Here's the basic idea:

- The mapping from names to objects lives in the kv db.  The object 
metadata is in a structure we can call an "onode" to avoid confusing it 
with the inodes in the backing file system.  The mapping is a simple 
ghobject_t -> onode map; there is no per-PG collection.  The PG 
collections still exist, but really only as ranges of those keys.  We 
will need to be slightly clever with the coll_t to distinguish between 
"bare" PGs (that live in this flat mapping) and the other collections 
(*_temp and metadata), but that should be easy.  This makes PG splitting 
"free" as far as the objects go.
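
For example, a rough sketch of a key encoding that makes each PG a 
single contiguous key range (purely illustrative; a real encoding would 
also need to handle shard ids, snapshots, and key escaping):

// Hypothetical flat key layout: 'O' | pool | hash | namespace | name.
// All objects in a PG share a (pool, hash-prefix) and therefore sort
// into one contiguous range; splitting a PG just narrows the range.
#include <cstdint>
#include <string>

static void append_be32(std::string &out, uint32_t v) {
  // big-endian so lexicographic key order matches numeric order
  for (int shift = 24; shift >= 0; shift -= 8)
    out.push_back(static_cast<char>((v >> shift) & 0xff));
}

std::string object_key(int64_t pool, uint32_t hash,
                       const std::string &nspace, const std::string &name) {
  std::string key("O");                  // prefix for the onode keyspace
  append_be32(key, static_cast<uint32_t>(pool));
  append_be32(key, hash);                // placement hash, high bits first
  key += nspace;
  key.push_back('\0');                   // terminate variable-length field
  key += name;
  return key;
}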

- The onodes are relatively small.  They will contain the xattrs and 
basic metadata like object size.  They will also identify the file name of 
the backing file in the file system (if size > 0).
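
Roughly, the onode might look something like this (field names invented 
for illustration):

#include <cstdint>
#include <map>
#include <string>

struct onode_t {
  uint64_t id = 0;           // unique id; omap data is keyed by this, not the name
  uint64_t size = 0;         // logical object size
  std::map<std::string, std::string> xattrs;  // small xattrs kept inline
  std::string backing_path;  // short random file name; empty if size == 0
  std::string file_handle;   // opaque handle for the open-by-handle path below
};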

- The backing file can be a random, short file name.  We can just make a 
one or two level deep set of directories, and let the directories get 
reasonably big... whatever we decide the backing fs can handle 
efficiently.  We can also store a file handle in the onode and use the 
open by handle API; this should let us go directly from onode (in our kv 
db) to the on-disk inode without looking at the directory at all, and fall 
back to using the actual file name only if that fails for some reason 
(say, someone mucked around with the backing files).  The backing file 
need not have any xattrs on it at all (except perhaps some simple id to 
verify it does in fact belong to the referring onode, just as a sanity 
check).
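
A sketch of that fast path (illustrative only; it assumes the handle 
type and bytes from name_to_handle_at() were stashed in the onode when 
the file was created, and it ignores the CAP_DAC_READ_SEARCH 
requirement of open_by_handle_at()):

// Linux/glibc; g++ defines _GNU_SOURCE by default, which exposes
// struct file_handle and open_by_handle_at() from <fcntl.h>.
#include <fcntl.h>
#include <unistd.h>
#include <cstring>
#include <string>
#include <vector>

// mount_fd: any fd on the backing filesystem, e.g. the store's root dir.
int open_backing_file(int mount_fd, int handle_type,
                      const std::string &handle_bytes,
                      const std::string &fallback_path)
{
  std::vector<char> buf(sizeof(struct file_handle) + handle_bytes.size());
  struct file_handle *fh = reinterpret_cast<struct file_handle *>(buf.data());
  fh->handle_bytes = static_cast<unsigned>(handle_bytes.size());
  fh->handle_type = handle_type;
  std::memcpy(fh->f_handle, handle_bytes.data(), handle_bytes.size());

  int fd = open_by_handle_at(mount_fd, fh, O_RDWR);  // no directory lookup
  if (fd < 0)                                        // e.g. ESTALE
    fd = openat(mount_fd, fallback_path.c_str(), O_RDWR);
  return fd;
}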

- The name -> onode mapping can live in a disjoint part of the kv 
namespace so that the other kv data associated with the object (like omap 
pairs or big xattrs or whatever) doesn't bloat that part of the db and 
slow down lookups.

- We can keep a simple LRU of recent onodes in memory and avoid the kv 
lookup for hot objects.
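
E.g., a trivial sketch (illustrative only):

#include <cstdint>
#include <list>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

struct onode_t { uint64_t id = 0; uint64_t size = 0; };  // stand-in

class OnodeLRU {
  size_t max_;
  std::list<std::pair<std::string, std::shared_ptr<onode_t>>> lru_;  // front = hottest
  std::unordered_map<std::string, decltype(lru_)::iterator> index_;
public:
  explicit OnodeLRU(size_t max) : max_(max) {}

  std::shared_ptr<onode_t> get(const std::string &key) {
    auto it = index_.find(key);
    if (it == index_.end())
      return nullptr;                              // miss: go to the kv db
    lru_.splice(lru_.begin(), lru_, it->second);   // bump to front
    return it->second->second;
  }

  void put(const std::string &key, std::shared_ptr<onode_t> o) {
    auto it = index_.find(key);
    if (it != index_.end())
      lru_.erase(it->second);                      // replace existing entry
    lru_.emplace_front(key, std::move(o));
    index_[key] = lru_.begin();
    if (lru_.size() > max_) {                      // evict least recently used
      index_.erase(lru_.back().first);
      lru_.pop_back();
    }
  }
};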

- Previously complicated operations like rename are now trivial: we just 
update the kv db with a transaction.  The backing file never gets renamed, 
ever, and the other object omap data is keyed by a unique (onode) id, not 
the name.
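
As a sketch, using a made-up minimal transaction interface:

#include <map>
#include <string>
#include <vector>

// Hypothetical kv transaction, for illustration only.
struct KVTransaction {
  std::map<std::string, std::string> sets;
  std::vector<std::string> removes;
  void set(const std::string &k, const std::string &v) { sets[k] = v; }
  void rm(const std::string &k) { removes.push_back(k); }
};

// Rename only touches the name -> onode key; the backing file keeps its
// random name and the omap data stays keyed by the onode id.  Submitting
// the transaction makes the whole thing atomic.
void rename_object(KVTransaction &t,
                   const std::string &old_key,
                   const std::string &new_key,
                   const std::string &onode_bytes)
{
  t.set(new_key, onode_bytes);   // same onode value under the new name
  t.rm(old_key);                 // drop the old name key
}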

Initially, for simplicity, we can start with the existing data journaling 
behavior.  However, I think there are opportunities to improve the 
situation there.  There is a pending wip-transactions branch in which I 
started to rejigger the ObjectStore::Transaction interface a bit so that 
you identify objects by handle and then operate on them.  Although it 
doesn't change the encoding yet, once it does, we can make the 
implementation take advantage of that by avoiding duplicate name lookups.  
It will also let us do things like clearly identify when an object is 
entirely new; in that case, we might forgo data journaling and instead 
write the data to the (new) file, fsync, and then commit the journal entry 
with the transaction that uses it.  (On remount a simple cleanup process 
can throw out new but unreferenced backing files.)  It would also make it 
easier to track all recently touched files and bulk fsync them instead of 
doing a syncfs (if we decide that is faster).
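
A sketch of that new-object path (names hypothetical, error handling 
simplified):

#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <string>

// Write a brand-new object's data without journaling it: make the file
// durable first, and only then commit the (journaled) kv transaction that
// creates the onode referencing it.  On replay, backing files that no
// onode references are simply deleted.
bool write_new_object(int dirfd, const std::string &backing_name,
                      const char *data, size_t len)
{
  int fd = openat(dirfd, backing_name.c_str(),
                  O_CREAT | O_EXCL | O_WRONLY, 0644);
  if (fd < 0)
    return false;
  ssize_t r = write(fd, data, len);           // sketch: ignores short writes
  bool ok = (r == static_cast<ssize_t>(len)) && (fsync(fd) == 0);
  close(fd);
  if (!ok)
    unlinkat(dirfd, backing_name.c_str(), 0); // drop the orphaned file
  // on success, queue the kv transaction that adds the onode here
  return ok;
}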

Anyway, at the end of the day, small writes or overwrites would still be 
journaled, but large writes or large new objects would not, which would (I 
think) be a pretty big improvement.  Overall, I think the design will be 
much simpler to reason about, and there are several potential avenues to 
be clever and make improvements.  I'm not sure we can say the same about 
the FileStore design, which suffers from the fact that it has evolved 
slowly over the last 9 years or so.

sage


* Re: KeyFileStore ?
  2014-07-31  5:25 KeyFileStore ? Sage Weil
@ 2014-07-31  5:49 ` Mark Kirkwood
  2014-07-31  6:07   ` Haomai Wang
  2014-07-31 13:18 ` Gregory Farnum
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Mark Kirkwood @ 2014-07-31  5:49 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

On 31/07/14 17:25, Sage Weil wrote:
> After the latest set of bug fixes to the FileStore file naming code I am
> newly inspired to replace it with something less complex.  Right now I'm
> mostly thinking about HDDs, although some of this may map well onto hybrid
> SSD/HDD as well.  It may or may not make sense for pure flash.
>
> Anyway, here are the main flaws with the overall approach that FileStore
> uses:
>
> - It tries to maintain a direct mapping of object names to file names.
> This is problematic because of 255 character limits, rados namespaces, pg
> prefixes, and the pg directory hashing we do to allow efficient split, for
> starters.  It is also problematic because we often want to do things like
> rename but can't make it happen atomically in combination with the rest of
> our transaction.
>
> - The PG directory hashing (that we do to allow efficient split) can have
> a big impact on performance, particularly when injesting lots of data.
> (And when benchmarking.)  It's also complex.
>
> - We often overwrite or replace entire objects.  These are "easy"
> operations to do safely without doing complete data journaling, but the
> current design is not conducive to doing anything clever (and it's complex
> enough that I wouldn't want to add any cleverness on top).
>
> - Objects may contain only key/value data, but we still have to create an
> inode for them and look that up first.  This only matters for some
> workloads (rgw indexes, cephfs directory objects).
>
> Instead, I think we should try a hybrid approach that more heavily
> leverages a key/value db in combination with the file system.  The kv db
> might be leveldb, rocksdb, LMDB, BDB, or whatever else; for now we just
> assume it provides transactional key/value storage and efficient range
> operations.  Here's the basic idea:
>
> - The mapping from names to object lives in the kv db.  The object
> metadata is in a structure we can call an "onode" to avoid confusing it
> with the inodes in the backing file system.  The mapping is simple
> ghobject_t -> onode map; there is no PG collection.  The PG collection
> still exist but really only as ranges of those keys.  We will need to be
> slightly clever with the coll_t to distinguish between "bare" PGs (that
> live in this flat mapping) and the other collections (*_temp and
> metadata), but that should be easy.  This makes PG splitting "free" as far
> as the objects go.
>
> - The onodes are relatively small.  They will contain the xattrs and
> basic metadata like object size.  They will also identify the file name of
> the backing file in the file system (if size > 0).
>
> - The backing file can be a random, short file name.  We can just make a
> one or two level deep set of directories, and let the directories get
> reasonably big... whatever we decide the backing fs can handle
> efficiently.  We can also store a file handle in the onode and use the
> open by handle API; this should let us go directly from onode (in our kv
> db) to the on-disk inode without looking at the directory at all, and fall
> back to using the actual file name only if that fails for some reason
> (say, someone mucked around with the backing files).  The backing file
> need not have any xattrs on it at all (except perhaps some simple id to
> verify it does it fact belong to the referring onode, just as a sanity
> check).
>
> - The name -> onode mapping can live in a disjunct part of the kv
> namespace so that the other kv stuff associated with the file (like omap
> pairs or big xattrs or whatever) don't blow up those parts of the
> db and slow down lookup.
>
> - We can keep a simple LRU of recent onodes in memory and avoid the kv
> lookup for hot objects.
>
> - Previously complicated operations like rename are now trivial: we just
> update the kv db with a transaction.  The backing file never gets renamed,
> ever, and the other object omap data is keyed by a unique (onode) id, not
> the name.
>
> Initially, for simplicity, we can start with the existing data journaling
> behavior.  However, I think there are opportunities to improve the
> situation there.  There is a pending wip-transactions branch in which I
> started to rejigger the ObjectStore::Transaction interface a bit so that
> you identify objects by handle and then operation on them.  Although it
> doesn't change the encoding yet, once it does, we can make the
> implementation take advantage of that, by avoid duplicate name lookups.
> It will also let us do things like clearly identify when an object is
> entirely new; in that case, we might forgo data journaling and instead
> write the data to the (new) file, fsync, and then commit the journal entry
> with the transaction that uses it.  (On remount a simple cleanup process
> can throw out new but unreferenced backing files.)  It would also make it
> easier to track all recently touched files and bulk fsync them instead of
> doing a syncfs (if we decide that is faster).
>
> Anyway, at the end of the day, small writes or overwrites would still be
> journaled, but large writes or large new objects would not, which would (I
> think) be a pretty big improvement.  Overall, I think the design will be
> much simpler to reason about, and there are several potential avenues to
> be clever and make improvements.  I'm not sure we can say the same about
> the FileStore design, which suffers from the fact that it has evolved
> slowly over the last 9 years or so.
>

Certainly makes sense to me - I can recall thinking "that stuff is 
actually a database" when looking at the on-disk FileStore object data 
layout.

regards

Mark



* Re: KeyFileStore ?
  2014-07-31  5:49 ` Mark Kirkwood
@ 2014-07-31  6:07   ` Haomai Wang
  0 siblings, 0 replies; 10+ messages in thread
From: Haomai Wang @ 2014-07-31  6:07 UTC (permalink / raw)
  To: Mark Kirkwood; +Cc: Sage Weil, ceph-devel

Awesome job!

On Thu, Jul 31, 2014 at 1:49 PM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
> On 31/07/14 17:25, Sage Weil wrote:
>>
>> After the latest set of bug fixes to the FileStore file naming code I am
>> newly inspired to replace it with something less complex.  Right now I'm
>> mostly thinking about HDDs, although some of this may map well onto hybrid
>> SSD/HDD as well.  It may or may not make sense for pure flash.
>>
>> Anyway, here are the main flaws with the overall approach that FileStore
>> uses:
>>
>> - It tries to maintain a direct mapping of object names to file names.
>> This is problematic because of 255 character limits, rados namespaces, pg
>> prefixes, and the pg directory hashing we do to allow efficient split, for
>> starters.  It is also problematic because we often want to do things like
>> rename but can't make it happen atomically in combination with the rest of
>> our transaction.
>>
>> - The PG directory hashing (that we do to allow efficient split) can have
>> a big impact on performance, particularly when injesting lots of data.
>> (And when benchmarking.)  It's also complex.

+1, it's too complex now!

>>
>> - We often overwrite or replace entire objects.  These are "easy"
>> operations to do safely without doing complete data journaling, but the
>> current design is not conducive to doing anything clever (and it's complex
>> enough that I wouldn't want to add any cleverness on top).
>>
>> - Objects may contain only key/value data, but we still have to create an
>> inode for them and look that up first.  This only matters for some
>> workloads (rgw indexes, cephfs directory objects).
>>
>> Instead, I think we should try a hybrid approach that more heavily
>> leverages a key/value db in combination with the file system.  The kv db
>> might be leveldb, rocksdb, LMDB, BDB, or whatever else; for now we just
>> assume it provides transactional key/value storage and efficient range
>> operations.  Here's the basic idea:
>>
>> - The mapping from names to object lives in the kv db.  The object
>> metadata is in a structure we can call an "onode" to avoid confusing it
>> with the inodes in the backing file system.  The mapping is simple
>> ghobject_t -> onode map; there is no PG collection.  The PG collection
>> still exist but really only as ranges of those keys.  We will need to be
>> slightly clever with the coll_t to distinguish between "bare" PGs (that
>> live in this flat mapping) and the other collections (*_temp and
>> metadata), but that should be easy.  This makes PG splitting "free" as far
>> as the objects go.
>>
>> - The onodes are relatively small.  They will contain the xattrs and
>> basic metadata like object size.  They will also identify the file name of
>> the backing file in the file system (if size > 0).
>>
>> - The backing file can be a random, short file name.  We can just make a
>> one or two level deep set of directories, and let the directories get
>> reasonably big... whatever we decide the backing fs can handle
>> efficiently.  We can also store a file handle in the onode and use the
>> open by handle API; this should let us go directly from onode (in our kv
>> db) to the on-disk inode without looking at the directory at all, and fall
>> back to using the actual file name only if that fails for some reason
>> (say, someone mucked around with the backing files).  The backing file
>> need not have any xattrs on it at all (except perhaps some simple id to
>> verify it does it fact belong to the referring onode, just as a sanity
>> check).
>>
>> - The name -> onode mapping can live in a disjunct part of the kv
>> namespace so that the other kv stuff associated with the file (like omap
>> pairs or big xattrs or whatever) don't blow up those parts of the
>> db and slow down lookup.
>>
>> - We can keep a simple LRU of recent onodes in memory and avoid the kv
>> lookup for hot objects.
>>
>> - Previously complicated operations like rename are now trivial: we just
>> update the kv db with a transaction.  The backing file never gets renamed,
>> ever, and the other object omap data is keyed by a unique (onode) id, not
>> the name.
>>
>> Initially, for simplicity, we can start with the existing data journaling
>> behavior.  However, I think there are opportunities to improve the
>> situation there.  There is a pending wip-transactions branch in which I
>> started to rejigger the ObjectStore::Transaction interface a bit so that
>> you identify objects by handle and then operation on them.  Although it
>> doesn't change the encoding yet, once it does, we can make the
>> implementation take advantage of that, by avoid duplicate name lookups.
>> It will also let us do things like clearly identify when an object is
>> entirely new; in that case, we might forgo data journaling and instead
>> write the data to the (new) file, fsync, and then commit the journal entry
>> with the transaction that uses it.  (On remount a simple cleanup process
>> can throw out new but unreferenced backing files.)  It would also make it
>> easier to track all recently touched files and bulk fsync them instead of
>> doing a syncfs (if we decide that is faster).
>>
>> Anyway, at the end of the day, small writes or overwrites would still be
>> journaled, but large writes or large new objects would not, which would (I
>> think) be a pretty big improvement.  Overall, I think the design will be
>> much simpler to reason about, and there are several potential avenues to
>> be clever and make improvements.  I'm not sure we can say the same about
>> the FileStore design, which suffers from the fact that it has evolved
>> slowly over the last 9 years or so.

When working on KeyValueStore I already enjoyed the convenience of the
"onode" map. It's really good for ops like rename, move, and truncate.
In particular, the "onode" map is kept together in one place, which
makes onode lookups efficient.

Overall, this really matches my thinking. It combines the advantages of
a native filesystem (read/write ops) with the efficiency of a DB
(lookup, rename, move ops). It solves the slow-lookup problem of
FileStore and avoids the slow read/write implementation of
KeyValueStore.

I hope we can speed this up, and maybe I can do some of the work on it!

>>
>
> Certainly makes sense to me - I can recall thinking "that stuff is actually
> a database" when looking at the on disk filestore object data layout.
>
> regards
>
> Mark
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat


* Re: KeyFileStore ?
  2014-07-31  5:25 KeyFileStore ? Sage Weil
  2014-07-31  5:49 ` Mark Kirkwood
@ 2014-07-31 13:18 ` Gregory Farnum
  2014-07-31 13:59   ` Mark Nelson
  2014-07-31 15:05   ` Yehuda Sadeh
  2014-07-31 13:56 ` Matt W. Benjamin
  2014-08-01 15:08 ` Guang Yang
  3 siblings, 2 replies; 10+ messages in thread
From: Gregory Farnum @ 2014-07-31 13:18 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Thu, Jul 31, 2014 at 1:25 AM, Sage Weil <sweil@redhat.com> wrote:
> After the latest set of bug fixes to the FileStore file naming code I am
> newly inspired to replace it with something less complex.  Right now I'm
> mostly thinking about HDDs, although some of this may map well onto hybrid
> SSD/HDD as well.  It may or may not make sense for pure flash.
>
> Anyway, here are the main flaws with the overall approach that FileStore
> uses:
>
> - It tries to maintain a direct mapping of object names to file names.
> This is problematic because of 255 character limits, rados namespaces, pg
> prefixes, and the pg directory hashing we do to allow efficient split, for
> starters.  It is also problematic because we often want to do things like
> rename but can't make it happen atomically in combination with the rest of
> our transaction.
>
> - The PG directory hashing (that we do to allow efficient split) can have
> a big impact on performance, particularly when injesting lots of data.
> (And when benchmarking.)  It's also complex.
>
> - We often overwrite or replace entire objects.  These are "easy"
> operations to do safely without doing complete data journaling, but the
> current design is not conducive to doing anything clever (and it's complex
> enough that I wouldn't want to add any cleverness on top).
>
> - Objects may contain only key/value data, but we still have to create an
> inode for them and look that up first.  This only matters for some
> workloads (rgw indexes, cephfs directory objects).
>
> Instead, I think we should try a hybrid approach that more heavily
> leverages a key/value db in combination with the file system.  The kv db
> might be leveldb, rocksdb, LMDB, BDB, or whatever else; for now we just
> assume it provides transactional key/value storage and efficient range
> operations.

This all sounds great in theory, but this is a point I'm a little
worried about. We've already seen cases in the field where leveldb
lookups (for whatever reason) are noticeably slower than inode
accesses. We haven't really characterized the circumstances required
(that I'm aware of, anyway), but if we do a bunch of work to create a
new (not-yet-tested...) ObjectStore implementation, it's going to be
very sad if it's slower in practice than our FileStore is. Before
embarking down this path, we should probably experiment with a few
different things to figure out what performance characteristics we can
rely on. (Heck, maybe an embeddable RDBMS is faster for this workload!
We're talking about an awful lot of overwrites.)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

> Here's the basic idea:
>
> - The mapping from names to object lives in the kv db.  The object
> metadata is in a structure we can call an "onode" to avoid confusing it
> with the inodes in the backing file system.  The mapping is simple
> ghobject_t -> onode map; there is no PG collection.  The PG collection
> still exist but really only as ranges of those keys.  We will need to be
> slightly clever with the coll_t to distinguish between "bare" PGs (that
> live in this flat mapping) and the other collections (*_temp and
> metadata), but that should be easy.  This makes PG splitting "free" as far
> as the objects go.
>
> - The onodes are relatively small.  They will contain the xattrs and
> basic metadata like object size.  They will also identify the file name of
> the backing file in the file system (if size > 0).
>
> - The backing file can be a random, short file name.  We can just make a
> one or two level deep set of directories, and let the directories get
> reasonably big... whatever we decide the backing fs can handle
> efficiently.  We can also store a file handle in the onode and use the
> open by handle API; this should let us go directly from onode (in our kv
> db) to the on-disk inode without looking at the directory at all, and fall
> back to using the actual file name only if that fails for some reason
> (say, someone mucked around with the backing files).  The backing file
> need not have any xattrs on it at all (except perhaps some simple id to
> verify it does it fact belong to the referring onode, just as a sanity
> check).
>
> - The name -> onode mapping can live in a disjunct part of the kv
> namespace so that the other kv stuff associated with the file (like omap
> pairs or big xattrs or whatever) don't blow up those parts of the
> db and slow down lookup.
>
> - We can keep a simple LRU of recent onodes in memory and avoid the kv
> lookup for hot objects.
>
> - Previously complicated operations like rename are now trivial: we just
> update the kv db with a transaction.  The backing file never gets renamed,
> ever, and the other object omap data is keyed by a unique (onode) id, not
> the name.
>
> Initially, for simplicity, we can start with the existing data journaling
> behavior.  However, I think there are opportunities to improve the
> situation there.  There is a pending wip-transactions branch in which I
> started to rejigger the ObjectStore::Transaction interface a bit so that
> you identify objects by handle and then operation on them.  Although it
> doesn't change the encoding yet, once it does, we can make the
> implementation take advantage of that, by avoid duplicate name lookups.
> It will also let us do things like clearly identify when an object is
> entirely new; in that case, we might forgo data journaling and instead
> write the data to the (new) file, fsync, and then commit the journal entry
> with the transaction that uses it.  (On remount a simple cleanup process
> can throw out new but unreferenced backing files.)  It would also make it
> easier to track all recently touched files and bulk fsync them instead of
> doing a syncfs (if we decide that is faster).
>
> Anyway, at the end of the day, small writes or overwrites would still be
> journaled, but large writes or large new objects would not, which would (I
> think) be a pretty big improvement.  Overall, I think the design will be
> much simpler to reason about, and there are several potential avenues to
> be clever and make improvements.  I'm not sure we can say the same about
> the FileStore design, which suffers from the fact that it has evolved
> slowly over the last 9 years or so.
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: KeyFileStore ?
  2014-07-31  5:25 KeyFileStore ? Sage Weil
  2014-07-31  5:49 ` Mark Kirkwood
  2014-07-31 13:18 ` Gregory Farnum
@ 2014-07-31 13:56 ` Matt W. Benjamin
  2014-08-01 15:08 ` Guang Yang
  3 siblings, 0 replies; 10+ messages in thread
From: Matt W. Benjamin @ 2014-07-31 13:56 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

+1 
----- "Sage Weil" <sweil@redhat.com> wrote:

> 
> Anyway, at the end of the day, small writes or overwrites would still
> be 
> journaled, but large writes or large new objects would not, which
> would (I 
> think) be a pretty big improvement.

Yes.

> Overall, I think the design will
> be 
> much simpler to reason about, and there are several potential avenues
> to 
> be clever and make improvements.  I'm not sure we can say the same
> about 
> the FileStore design, which suffers from the fact that it has evolved
> 
> slowly over the last 9 years or so.
> 

-- 
Matt Benjamin
The Linux Box
206 South Fifth Ave. Suite 150
Ann Arbor, MI  48104

http://linuxbox.com

tel.  734-761-4689 
fax.  734-769-8938 
cel.  734-216-5309 


* Re: KeyFileStore ?
  2014-07-31 13:18 ` Gregory Farnum
@ 2014-07-31 13:59   ` Mark Nelson
  2014-07-31 15:05   ` Yehuda Sadeh
  1 sibling, 0 replies; 10+ messages in thread
From: Mark Nelson @ 2014-07-31 13:59 UTC (permalink / raw)
  To: Gregory Farnum, Sage Weil; +Cc: ceph-devel

On 07/31/2014 08:18 AM, Gregory Farnum wrote:
> On Thu, Jul 31, 2014 at 1:25 AM, Sage Weil <sweil@redhat.com> wrote:
>> After the latest set of bug fixes to the FileStore file naming code I am
>> newly inspired to replace it with something less complex.  Right now I'm
>> mostly thinking about HDDs, although some of this may map well onto hybrid
>> SSD/HDD as well.  It may or may not make sense for pure flash.
>>
>> Anyway, here are the main flaws with the overall approach that FileStore
>> uses:
>>
>> - It tries to maintain a direct mapping of object names to file names.
>> This is problematic because of 255 character limits, rados namespaces, pg
>> prefixes, and the pg directory hashing we do to allow efficient split, for
>> starters.  It is also problematic because we often want to do things like
>> rename but can't make it happen atomically in combination with the rest of
>> our transaction.
>>
>> - The PG directory hashing (that we do to allow efficient split) can have
>> a big impact on performance, particularly when injesting lots of data.
>> (And when benchmarking.)  It's also complex.
>>
>> - We often overwrite or replace entire objects.  These are "easy"
>> operations to do safely without doing complete data journaling, but the
>> current design is not conducive to doing anything clever (and it's complex
>> enough that I wouldn't want to add any cleverness on top).
>>
>> - Objects may contain only key/value data, but we still have to create an
>> inode for them and look that up first.  This only matters for some
>> workloads (rgw indexes, cephfs directory objects).
>>
>> Instead, I think we should try a hybrid approach that more heavily
>> leverages a key/value db in combination with the file system.  The kv db
>> might be leveldb, rocksdb, LMDB, BDB, or whatever else; for now we just
>> assume it provides transactional key/value storage and efficient range
>> operations.
>
> This all sounds great in theory, but this is a point I'm a little
> worried about. We've already seen cases in the field where leveldb
> lookups (for whatever reason) are noticeably slower than inode
> accesses. We haven't really characterized the circumstances required
> (that I'm aware of, anyway), but if we do a bunch of work to create a
> new (not-yet-tested...) ObjectStore implementation, it's going to be
> very sad if it's slower in practice than our FileStore is. Before
> embarking down this path, we should probably experiment with a few
> different things to figure out what performance characteristics we can
> rely on. (Heck, maybe an embeddable RDBMS is faster for this workload!
> We're talking about an awful lot of overwrites.)
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com

I'm both very in favour of trying it for some of the potential benefits 
Sage mentioned, and also rather frightened by some of the latencies we 
see in key/value stores and what kind of effects that could have given 
that we rely on 100% deterministic data placement.  If we go down this 
path I agree we really need to arm ourselves with a lot of data before 
we get too invested.

On a side note, I've wondered if semi-adaptive data placement below the 
OSD might be one way to help mitigate high latency spikes.  If the 
average case is good but we suffer from occasional high latency 
<cough>compaction</cough> events, this might help, provided we can 
reasonably guarantee that the worst spikes are staggered.


>
>> Here's the basic idea:
>>
>> - The mapping from names to object lives in the kv db.  The object
>> metadata is in a structure we can call an "onode" to avoid confusing it
>> with the inodes in the backing file system.  The mapping is simple
>> ghobject_t -> onode map; there is no PG collection.  The PG collection
>> still exist but really only as ranges of those keys.  We will need to be
>> slightly clever with the coll_t to distinguish between "bare" PGs (that
>> live in this flat mapping) and the other collections (*_temp and
>> metadata), but that should be easy.  This makes PG splitting "free" as far
>> as the objects go.
>>
>> - The onodes are relatively small.  They will contain the xattrs and
>> basic metadata like object size.  They will also identify the file name of
>> the backing file in the file system (if size > 0).
>>
>> - The backing file can be a random, short file name.  We can just make a
>> one or two level deep set of directories, and let the directories get
>> reasonably big... whatever we decide the backing fs can handle
>> efficiently.  We can also store a file handle in the onode and use the
>> open by handle API; this should let us go directly from onode (in our kv
>> db) to the on-disk inode without looking at the directory at all, and fall
>> back to using the actual file name only if that fails for some reason
>> (say, someone mucked around with the backing files).  The backing file
>> need not have any xattrs on it at all (except perhaps some simple id to
>> verify it does it fact belong to the referring onode, just as a sanity
>> check).
>>
>> - The name -> onode mapping can live in a disjunct part of the kv
>> namespace so that the other kv stuff associated with the file (like omap
>> pairs or big xattrs or whatever) don't blow up those parts of the
>> db and slow down lookup.
>>
>> - We can keep a simple LRU of recent onodes in memory and avoid the kv
>> lookup for hot objects.
>>
>> - Previously complicated operations like rename are now trivial: we just
>> update the kv db with a transaction.  The backing file never gets renamed,
>> ever, and the other object omap data is keyed by a unique (onode) id, not
>> the name.
>>
>> Initially, for simplicity, we can start with the existing data journaling
>> behavior.  However, I think there are opportunities to improve the
>> situation there.  There is a pending wip-transactions branch in which I
>> started to rejigger the ObjectStore::Transaction interface a bit so that
>> you identify objects by handle and then operation on them.  Although it
>> doesn't change the encoding yet, once it does, we can make the
>> implementation take advantage of that, by avoid duplicate name lookups.
>> It will also let us do things like clearly identify when an object is
>> entirely new; in that case, we might forgo data journaling and instead
>> write the data to the (new) file, fsync, and then commit the journal entry
>> with the transaction that uses it.  (On remount a simple cleanup process
>> can throw out new but unreferenced backing files.)  It would also make it
>> easier to track all recently touched files and bulk fsync them instead of
>> doing a syncfs (if we decide that is faster).
>>
>> Anyway, at the end of the day, small writes or overwrites would still be
>> journaled, but large writes or large new objects would not, which would (I
>> think) be a pretty big improvement.  Overall, I think the design will be
>> much simpler to reason about, and there are several potential avenues to
>> be clever and make improvements.  I'm not sure we can say the same about
>> the FileStore design, which suffers from the fact that it has evolved
>> slowly over the last 9 years or so.
>>
>> sage
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



* Re: KeyFileStore ?
  2014-07-31 13:18 ` Gregory Farnum
  2014-07-31 13:59   ` Mark Nelson
@ 2014-07-31 15:05   ` Yehuda Sadeh
  1 sibling, 0 replies; 10+ messages in thread
From: Yehuda Sadeh @ 2014-07-31 15:05 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, ceph-devel

On Thu, Jul 31, 2014 at 6:18 AM, Gregory Farnum <greg@inktank.com> wrote:
> On Thu, Jul 31, 2014 at 1:25 AM, Sage Weil <sweil@redhat.com> wrote:
>> After the latest set of bug fixes to the FileStore file naming code I am
>> newly inspired to replace it with something less complex.  Right now I'm
>> mostly thinking about HDDs, although some of this may map well onto hybrid
>> SSD/HDD as well.  It may or may not make sense for pure flash.
>>
>> Anyway, here are the main flaws with the overall approach that FileStore
>> uses:
>>
>> - It tries to maintain a direct mapping of object names to file names.
>> This is problematic because of 255 character limits, rados namespaces, pg
>> prefixes, and the pg directory hashing we do to allow efficient split, for
>> starters.  It is also problematic because we often want to do things like
>> rename but can't make it happen atomically in combination with the rest of
>> our transaction.
>>
>> - The PG directory hashing (that we do to allow efficient split) can have
>> a big impact on performance, particularly when injesting lots of data.
>> (And when benchmarking.)  It's also complex.
>>
>> - We often overwrite or replace entire objects.  These are "easy"
>> operations to do safely without doing complete data journaling, but the
>> current design is not conducive to doing anything clever (and it's complex
>> enough that I wouldn't want to add any cleverness on top).
>>
>> - Objects may contain only key/value data, but we still have to create an
>> inode for them and look that up first.  This only matters for some
>> workloads (rgw indexes, cephfs directory objects).
>>
>> Instead, I think we should try a hybrid approach that more heavily
>> leverages a key/value db in combination with the file system.  The kv db
>> might be leveldb, rocksdb, LMDB, BDB, or whatever else; for now we just
>> assume it provides transactional key/value storage and efficient range
>> operations.
>
> This all sounds great in theory, but this is a point I'm a little
> worried about. We've already seen cases in the field where leveldb
> lookups (for whatever reason) are noticeably slower than inode
> accesses. We haven't really characterized the circumstances required
> (that I'm aware of, anyway), but if we do a bunch of work to create a
> new (not-yet-tested...) ObjectStore implementation, it's going to be
> very sad if it's slower in practice than our FileStore is. Before
> embarking down this path, we should probably experiment with a few
> different things to figure out what performance characteristics we can
> rely on. (Heck, maybe an embeddable RDBMS is faster for this workload!
> We're talking about an awful lot of overwrites.)

Definitely. I think the output of such an experiment should also be
(while we're at it) a set of tools for benchmarking the different
aspects of the system against various store backends. This will be
useful in determining whether the suggested approach is valid in
practice, and in the future for tracking performance issues in new
software. It would also help in testing new dbs (like the work Mark
Nelson did recently on rocksdb).

Yehuda

> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>> Here's the basic idea:
>>
>> - The mapping from names to object lives in the kv db.  The object
>> metadata is in a structure we can call an "onode" to avoid confusing it
>> with the inodes in the backing file system.  The mapping is simple
>> ghobject_t -> onode map; there is no PG collection.  The PG collection
>> still exist but really only as ranges of those keys.  We will need to be
>> slightly clever with the coll_t to distinguish between "bare" PGs (that
>> live in this flat mapping) and the other collections (*_temp and
>> metadata), but that should be easy.  This makes PG splitting "free" as far
>> as the objects go.
>>
>> - The onodes are relatively small.  They will contain the xattrs and
>> basic metadata like object size.  They will also identify the file name of
>> the backing file in the file system (if size > 0).
>>
>> - The backing file can be a random, short file name.  We can just make a
>> one or two level deep set of directories, and let the directories get
>> reasonably big... whatever we decide the backing fs can handle
>> efficiently.  We can also store a file handle in the onode and use the
>> open by handle API; this should let us go directly from onode (in our kv
>> db) to the on-disk inode without looking at the directory at all, and fall
>> back to using the actual file name only if that fails for some reason
>> (say, someone mucked around with the backing files).  The backing file
>> need not have any xattrs on it at all (except perhaps some simple id to
>> verify it does it fact belong to the referring onode, just as a sanity
>> check).
>>
>> - The name -> onode mapping can live in a disjunct part of the kv
>> namespace so that the other kv stuff associated with the file (like omap
>> pairs or big xattrs or whatever) don't blow up those parts of the
>> db and slow down lookup.
>>
>> - We can keep a simple LRU of recent onodes in memory and avoid the kv
>> lookup for hot objects.
>>
>> - Previously complicated operations like rename are now trivial: we just
>> update the kv db with a transaction.  The backing file never gets renamed,
>> ever, and the other object omap data is keyed by a unique (onode) id, not
>> the name.
>>
>> Initially, for simplicity, we can start with the existing data journaling
>> behavior.  However, I think there are opportunities to improve the
>> situation there.  There is a pending wip-transactions branch in which I
>> started to rejigger the ObjectStore::Transaction interface a bit so that
>> you identify objects by handle and then operation on them.  Although it
>> doesn't change the encoding yet, once it does, we can make the
>> implementation take advantage of that, by avoid duplicate name lookups.
>> It will also let us do things like clearly identify when an object is
>> entirely new; in that case, we might forgo data journaling and instead
>> write the data to the (new) file, fsync, and then commit the journal entry
>> with the transaction that uses it.  (On remount a simple cleanup process
>> can throw out new but unreferenced backing files.)  It would also make it
>> easier to track all recently touched files and bulk fsync them instead of
>> doing a syncfs (if we decide that is faster).
>>
>> Anyway, at the end of the day, small writes or overwrites would still be
>> journaled, but large writes or large new objects would not, which would (I
>> think) be a pretty big improvement.  Overall, I think the design will be
>> much simpler to reason about, and there are several potential avenues to
>> be clever and make improvements.  I'm not sure we can say the same about
>> the FileStore design, which suffers from the fact that it has evolved
>> slowly over the last 9 years or so.
>>
>> sage
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: KeyFileStore ?
  2014-07-31  5:25 KeyFileStore ? Sage Weil
                   ` (2 preceding siblings ...)
  2014-07-31 13:56 ` Matt W. Benjamin
@ 2014-08-01 15:08 ` Guang Yang
  2014-08-01 21:34   ` Samuel Just
  3 siblings, 1 reply; 10+ messages in thread
From: Guang Yang @ 2014-08-01 15:08 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

I really like the idea.  One scenario that keeps bothering us is that there are too many small files, which makes file system indexing slow (so a single read request can take more than 10 disk IOs just for the path lookup).

If we pursue this proposal, is there a chance we can take it one step further: instead of storing one physical file for each object, we could allocate a big file (tens of GB) and map each object to a chunk within that big file.  That way all of those big files' descriptors could be cached, avoiding the disk I/O needed to open the file.  At the very least we could keep it flexible, so that if someone would like to implement it that way, there is a chance to leverage the existing framework.
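
For example (purely hypothetical field names), the onode could record an 
extent in a pack file instead of a per-object backing file:

#include <cstdint>

// Instead of a backing file per object, the onode records where the
// object's bytes live inside a large preallocated pack file whose open
// fd / inode description can stay cached.
struct pack_extent_t {
  uint32_t pack_id;   // which big pack file
  uint64_t offset;    // byte offset of this object's data within it
  uint64_t length;    // bytes used by this object
};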

Thanks,
Guang

On Jul 31, 2014, at 1:25 PM, Sage Weil <sweil@redhat.com> wrote:

> After the latest set of bug fixes to the FileStore file naming code I am 
> newly inspired to replace it with something less complex.  Right now I'm 
> mostly thinking about HDDs, although some of this may map well onto hybrid 
> SSD/HDD as well.  It may or may not make sense for pure flash.
> 
> Anyway, here are the main flaws with the overall approach that FileStore 
> uses:
> 
> - It tries to maintain a direct mapping of object names to file names.  
> This is problematic because of 255 character limits, rados namespaces, pg 
> prefixes, and the pg directory hashing we do to allow efficient split, for 
> starters.  It is also problematic because we often want to do things like 
> rename but can't make it happen atomically in combination with the rest of 
> our transaction.
> 
> - The PG directory hashing (that we do to allow efficient split) can have 
> a big impact on performance, particularly when injesting lots of data.  
> (And when benchmarking.)  It's also complex.
> 
> - We often overwrite or replace entire objects.  These are "easy" 
> operations to do safely without doing complete data journaling, but the 
> current design is not conducive to doing anything clever (and it's complex 
> enough that I wouldn't want to add any cleverness on top).
> 
> - Objects may contain only key/value data, but we still have to create an 
> inode for them and look that up first.  This only matters for some 
> workloads (rgw indexes, cephfs directory objects).
> 
> Instead, I think we should try a hybrid approach that more heavily 
> leverages a key/value db in combination with the file system.  The kv db 
> might be leveldb, rocksdb, LMDB, BDB, or whatever else; for now we just 
> assume it provides transactional key/value storage and efficient range 
> operations.  Here's the basic idea:
> 
> - The mapping from names to object lives in the kv db.  The object 
> metadata is in a structure we can call an "onode" to avoid confusing it 
> with the inodes in the backing file system.  The mapping is simple 
> ghobject_t -> onode map; there is no PG collection.  The PG collection 
> still exist but really only as ranges of those keys.  We will need to be 
> slightly clever with the coll_t to distinguish between "bare" PGs (that 
> live in this flat mapping) and the other collections (*_temp and 
> metadata), but that should be easy.  This makes PG splitting "free" as far 
> as the objects go.
> 
> - The onodes are relatively small.  They will contain the xattrs and 
> basic metadata like object size.  They will also identify the file name of 
> the backing file in the file system (if size > 0).
> 
> - The backing file can be a random, short file name.  We can just make a 
> one or two level deep set of directories, and let the directories get 
> reasonably big... whatever we decide the backing fs can handle 
> efficiently.  We can also store a file handle in the onode and use the 
> open by handle API; this should let us go directly from onode (in our kv 
> db) to the on-disk inode without looking at the directory at all, and fall 
> back to using the actual file name only if that fails for some reason 
> (say, someone mucked around with the backing files).  The backing file 
> need not have any xattrs on it at all (except perhaps some simple id to 
> verify it does it fact belong to the referring onode, just as a sanity 
> check).
> 
> - The name -> onode mapping can live in a disjunct part of the kv 
> namespace so that the other kv stuff associated with the file (like omap 
> pairs or big xattrs or whatever) don't blow up those parts of the 
> db and slow down lookup.
> 
> - We can keep a simple LRU of recent onodes in memory and avoid the kv 
> lookup for hot objects.
> 
> - Previously complicated operations like rename are now trivial: we just 
> update the kv db with a transaction.  The backing file never gets renamed, 
> ever, and the other object omap data is keyed by a unique (onode) id, not 
> the name.
> 
> Initially, for simplicity, we can start with the existing data journaling 
> behavior.  However, I think there are opportunities to improve the 
> situation there.  There is a pending wip-transactions branch in which I 
> started to rejigger the ObjectStore::Transaction interface a bit so that 
> you identify objects by handle and then operation on them.  Although it 
> doesn't change the encoding yet, once it does, we can make the 
> implementation take advantage of that, by avoid duplicate name lookups.  
> It will also let us do things like clearly identify when an object is 
> entirely new; in that case, we might forgo data journaling and instead 
> write the data to the (new) file, fsync, and then commit the journal entry 
> with the transaction that uses it.  (On remount a simple cleanup process 
> can throw out new but unreferenced backing files.)  It would also make it 
> easier to track all recently touched files and bulk fsync them instead of 
> doing a syncfs (if we decide that is faster).
> 
> Anyway, at the end of the day, small writes or overwrites would still be 
> journaled, but large writes or large new objects would not, which would (I 
> think) be a pretty big improvement.  Overall, I think the design will be 
> much simpler to reason about, and there are several potential avenues to 
> be clever and make improvements.  I'm not sure we can say the same about 
> the FileStore design, which suffers from the fact that it has evolved 
> slowly over the last 9 years or so.
> 
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



* Re: KeyFileStore ?
  2014-08-01 15:08 ` Guang Yang
@ 2014-08-01 21:34   ` Samuel Just
  2014-08-04 14:27     ` Guang Yang
  0 siblings, 1 reply; 10+ messages in thread
From: Samuel Just @ 2014-08-01 21:34 UTC (permalink / raw)
  To: Guang Yang; +Cc: Sage Weil, ceph-devel

Sage's basic approach sounds about right to me.  I'm fairly skeptical
about the benefits of packing small objects together within larger
files, though.  It seems like for very small objects, we would be
better off stashing the contents opportunistically within the onode.
For somewhat larger objects, it seems like the complexity of
maintaining information about the larger pack objects would be
equivalent to what the filesystem would do anyway.
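
Something like this, say (threshold and names invented for illustration):

#include <cstdint>
#include <string>

// Objects at or below the inline threshold keep their bytes directly in
// the onode (and thus in the kv db); larger objects get a backing file
// as in Sage's proposal.
static const uint64_t INLINE_LIMIT = 4096;   // e.g. 4 KB

struct onode_t {
  uint64_t size = 0;
  std::string inline_data;    // used when size <= INLINE_LIMIT
  std::string backing_path;   // used otherwise; empty when data is inline
};

inline bool data_is_inline(const onode_t &o) { return o.size <= INLINE_LIMIT; }
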
-Sam

On Fri, Aug 1, 2014 at 8:08 AM, Guang Yang <yguang11@outlook.com> wrote:
> I really like the idea, one scenario keeps bothering us is that there are too many small files which make the file system indexing slow (so that a single read request could take more than 10 disk IOs for path lookup).
>
> If we pursuit this proposal, is there a chance we can take one step further, that instead of storing one physical file for each object, we can allocate a big file (tens of GB) and each object only map to a chunk within that big file. So that all those big file’s description could be cached to avoid disk I/O to open the file. At least we keep it flexible that if someone would like to implement in such way, there is a chance to leverage the existing framework.
>
> Thanks,
> Guang
>
> On Jul 31, 2014, at 1:25 PM, Sage Weil <sweil@redhat.com> wrote:
>
>> After the latest set of bug fixes to the FileStore file naming code I am
>> newly inspired to replace it with something less complex.  Right now I'm
>> mostly thinking about HDDs, although some of this may map well onto hybrid
>> SSD/HDD as well.  It may or may not make sense for pure flash.
>>
>> Anyway, here are the main flaws with the overall approach that FileStore
>> uses:
>>
>> - It tries to maintain a direct mapping of object names to file names.
>> This is problematic because of 255 character limits, rados namespaces, pg
>> prefixes, and the pg directory hashing we do to allow efficient split, for
>> starters.  It is also problematic because we often want to do things like
>> rename but can't make it happen atomically in combination with the rest of
>> our transaction.
>>
>> - The PG directory hashing (that we do to allow efficient split) can have
>> a big impact on performance, particularly when injesting lots of data.
>> (And when benchmarking.)  It's also complex.
>>
>> - We often overwrite or replace entire objects.  These are "easy"
>> operations to do safely without doing complete data journaling, but the
>> current design is not conducive to doing anything clever (and it's complex
>> enough that I wouldn't want to add any cleverness on top).
>>
>> - Objects may contain only key/value data, but we still have to create an
>> inode for them and look that up first.  This only matters for some
>> workloads (rgw indexes, cephfs directory objects).
>>
>> Instead, I think we should try a hybrid approach that more heavily
>> leverages a key/value db in combination with the file system.  The kv db
>> might be leveldb, rocksdb, LMDB, BDB, or whatever else; for now we just
>> assume it provides transactional key/value storage and efficient range
>> operations.  Here's the basic idea:
>>
>> - The mapping from names to object lives in the kv db.  The object
>> metadata is in a structure we can call an "onode" to avoid confusing it
>> with the inodes in the backing file system.  The mapping is simple
>> ghobject_t -> onode map; there is no PG collection.  The PG collection
>> still exist but really only as ranges of those keys.  We will need to be
>> slightly clever with the coll_t to distinguish between "bare" PGs (that
>> live in this flat mapping) and the other collections (*_temp and
>> metadata), but that should be easy.  This makes PG splitting "free" as far
>> as the objects go.
>>
>> - The onodes are relatively small.  They will contain the xattrs and
>> basic metadata like object size.  They will also identify the file name of
>> the backing file in the file system (if size > 0).
>>
>> - The backing file can be a random, short file name.  We can just make a
>> one or two level deep set of directories, and let the directories get
>> reasonably big... whatever we decide the backing fs can handle
>> efficiently.  We can also store a file handle in the onode and use the
>> open by handle API; this should let us go directly from onode (in our kv
>> db) to the on-disk inode without looking at the directory at all, and fall
>> back to using the actual file name only if that fails for some reason
>> (say, someone mucked around with the backing files).  The backing file
>> need not have any xattrs on it at all (except perhaps some simple id to
>> verify it does it fact belong to the referring onode, just as a sanity
>> check).
>>
>> - The name -> onode mapping can live in a disjunct part of the kv
>> namespace so that the other kv stuff associated with the file (like omap
>> pairs or big xattrs or whatever) don't blow up those parts of the
>> db and slow down lookup.
>>
>> - We can keep a simple LRU of recent onodes in memory and avoid the kv
>> lookup for hot objects.
>>
>> - Previously complicated operations like rename are now trivial: we just
>> update the kv db with a transaction.  The backing file never gets renamed,
>> ever, and the other object omap data is keyed by a unique (onode) id, not
>> the name.
>>
>> Initially, for simplicity, we can start with the existing data journaling
>> behavior.  However, I think there are opportunities to improve the
>> situation there.  There is a pending wip-transactions branch in which I
>> started to rejigger the ObjectStore::Transaction interface a bit so that
>> you identify objects by handle and then operation on them.  Although it
>> doesn't change the encoding yet, once it does, we can make the
>> implementation take advantage of that, by avoid duplicate name lookups.
>> It will also let us do things like clearly identify when an object is
>> entirely new; in that case, we might forgo data journaling and instead
>> write the data to the (new) file, fsync, and then commit the journal entry
>> with the transaction that uses it.  (On remount a simple cleanup process
>> can throw out new but unreferenced backing files.)  It would also make it
>> easier to track all recently touched files and bulk fsync them instead of
>> doing a syncfs (if we decide that is faster).
>>
>> Anyway, at the end of the day, small writes or overwrites would still be
>> journaled, but large writes or large new objects would not, which would (I
>> think) be a pretty big improvement.  Overall, I think the design will be
>> much simpler to reason about, and there are several potential avenues to
>> be clever and make improvements.  I'm not sure we can say the same about
>> the FileStore design, which suffers from the fact that it has evolved
>> slowly over the last 9 years or so.
>>
>> sage


* Re: KeyFileStore ?
  2014-08-01 21:34   ` Samuel Just
@ 2014-08-04 14:27     ` Guang Yang
  0 siblings, 0 replies; 10+ messages in thread
From: Guang Yang @ 2014-08-04 14:27 UTC (permalink / raw)
  To: Samuel Just; +Cc: Sage Weil, ceph-devel

On Aug 2, 2014, at 5:34 AM, Samuel Just <sam.just@inktank.com> wrote:

> Sage's basic approach sounds about right to me.  I'm fairly skeptical
> about the benefits of packing small objects together within larger
> files, though.  It seems like for very small objects, we would be
> better off stashing the contents opportunistically within the onode.
I really like this idea.  For the radosgw + EC use case there are lots of small physical files generated (multiple KBs each), and once the OSD disk is filled to a certain ratio, each read of one chunk can incur several disk I/Os (path lookup plus data).  Putting the data as part of the onode could boost read performance and, at the same time, decrease the number of physical files.
> For somewhat larger objects, it seems like the complexity of
> maintaining information about the larger pack objects would be
> equivalent to what the filesystem would do anyway.
> -Sam
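To make the inlining idea concrete, a small sketch under an assumed 4 KB cutoff; the threshold and field names are invented for illustration:

  #include <cstddef>
  #include <cstdint>
  #include <string>

  constexpr std::size_t INLINE_LIMIT = 4096;   // arbitrary threshold

  struct small_onode {
    uint64_t size = 0;
    std::string inline_data;     // payload lives here when it is small enough
    std::string backing_file;    // otherwise a normal backing file is used
  };

  bool try_inline_write(small_onode &o, const std::string &data) {
    if (data.size() > INLINE_LIMIT)
      return false;              // too big: fall back to the file path
    o.inline_data = data;        // a read becomes a single kv lookup
    o.size = data.size();
    o.backing_file.clear();      // no inode, no extra seeks
    return true;
  }
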
> 
> On Fri, Aug 1, 2014 at 8:08 AM, Guang Yang <yguang11@outlook.com> wrote:
>> I really like the idea.  One scenario that keeps bothering us is that there are too many small files, which makes file system indexing slow (so a single read request can take more than 10 disk I/Os just for path lookup).
>> 
>> If we pursue this proposal, is there a chance we can take it one step further: instead of storing one physical file for each object, we could allocate a big file (tens of GB) and map each object to a chunk within that big file?  That way the descriptors of those few big files could be cached, avoiding the disk I/O of opening a file per object.  At the very least we could keep it flexible, so that if someone would like to implement it this way, there is a chance to leverage the existing framework.
>> 
>> Thanks,
>> Guang
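
For reference, a rough sketch of the big-file/chunk layout Guang describes above; all names are hypothetical, and the allocation and free-space bookkeeping glossed over in the comments is exactly the part Sam is skeptical of:

  #include <cstdint>
  #include <string>

  struct chunk_ref {              // stored in the onode instead of a file name
    uint64_t pack_id;             // which tens-of-GB pack file
    uint64_t offset;              // byte offset of this object's chunk
    uint32_t length;
  };

  struct pack_file {              // a handful of these stay open per OSD
    uint64_t id = 0;
    int fd = -1;                  // kept open, so reads skip path lookup entirely
    uint64_t write_pos = 0;       // bump allocator; deletes leave holes to compact
  };

  chunk_ref append_chunk(pack_file &p, const std::string &data) {
    chunk_ref ref{p.id, p.write_pos, (uint32_t)data.size()};
    // pwrite(p.fd, data.data(), data.size(), p.write_pos) and fsync omitted;
    // reclaiming space from deleted chunks is the hard part.
    p.write_pos += data.size();
    return ref;
  }
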


Thread overview: 10+ messages
2014-07-31  5:25 KeyFileStore ? Sage Weil
2014-07-31  5:49 ` Mark Kirkwood
2014-07-31  6:07   ` Haomai Wang
2014-07-31 13:18 ` Gregory Farnum
2014-07-31 13:59   ` Mark Nelson
2014-07-31 15:05   ` Yehuda Sadeh
2014-07-31 13:56 ` Matt W. Benjamin
2014-08-01 15:08 ` Guang Yang
2014-08-01 21:34   ` Samuel Just
2014-08-04 14:27     ` Guang Yang
