* A way to reduce compression overhead
@ 2016-11-08 17:28 Igor Fedotov
2016-11-08 20:27 ` Sage Weil
0 siblings, 1 reply; 7+ messages in thread
From: Igor Fedotov @ 2016-11-08 17:28 UTC (permalink / raw)
To: Sage Weil, ceph-devel
Hi Sage, et al.
Let me share some ideas on how to reduce the compression burden on the
cluster.
As we know, block compression is performed at the BlueStore level for
each replica independently. This triples the compression CPU overhead
for the cluster, which looks like a significant waste of CPU resources
IMHO.
We can probably eliminate this overhead by introducing write request
preprocessing, performed synchronously at the ObjectStore level. This
preprocessing parses the transaction, detects write requests and
transforms them into different ones aligned with the current store
allocation unit (AU). At the same time, resulting extents that span
more than a single AU are compressed if needed. I.e. preprocessing does
part of the job currently done in BlueStore::_do_write_data, which
splits a write request into _do_write_small/_do_write_big calls. But
after the split and big blob compression, the preprocessor simply
updates the transaction with the new write requests.
E.g.
with AU = 0x1000
Write Request (1~0xffff) is transformed into the following sequence:
WriteX 1~0xfff (uncompressed)
WriteX 0x1000~0xE000 (compressed if needed)
WriteX 0xf000~0xfff (uncompressed)
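The split in this example can be sketched roughly as follows. This is an illustration in Python, not BlueStore code: `split_write` is a hypothetical helper, and it assumes the request ends at 0xffff (length 0xfffe), since the three pieces listed above add up to 0xfffe. It also reads "span more than a single AU" as "aligned middle longer than one AU".

```python
def split_write(offset, length, au=0x1000):
    """Split a write into (offset, length, compress_candidate) pieces.

    Hypothetical helper: an unaligned head, an AU-aligned middle and an
    unaligned tail. Only the middle is a compression candidate, and only
    when it spans more than one allocation unit.
    """
    end = offset + length
    # Round the start up and the end down to AU boundaries.
    head_end = min(end, -(-offset // au) * au)
    tail_start = max(head_end, (end // au) * au)
    pieces = []
    if head_end > offset:
        pieces.append((offset, head_end - offset, False))
    if tail_start > head_end:
        middle = tail_start - head_end
        pieces.append((head_end, middle, middle > au))
    if end > tail_start:
        pieces.append((tail_start, end - tail_start, False))
    return pieces


for off, length, candidate in split_write(1, 0xfffe):
    note = "(compressed if needed)" if candidate else "(uncompressed)"
    print(f"WriteX {off:#x}~{length:#x} {note}")
```

A write that fits inside one allocation unit comes back as a single uncompressed piece, so small writes would be left untouched by such a preprocessor.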
The updated transaction is then passed to all replicas, including the
master one, using the regular apply_/queue_transaction mechanics.
As a bonus, one gets automatic payload compression when transporting
the request to remote store replicas.
The regular write request path should be preserved for EC pools and
other needs as well.
Please note that almost no latency is introduced for request handling.
Replicas receive the modified transaction later, but they do not spend
time on the split/compress work.
There is a potential conflict with the current garbage collection code
though: we can't perform GC during preprocessing due to a possible race
with preceding unfinished transactions, and consequently we're unable to
merge data and compress the merged result. Well, we can do that when
applying the transaction, but this will produce a sequence like this at
each replica:
decompress original request + decompress data to merge -> compress
merged data.
Probably this limitation isn't that bad - IMHO it's better to have
compressed blobs aligned with the original write requests.
Moreover, I have some ideas on how to get rid of the blob_depth notion,
which would make life a bit easier. Will share shortly.
Any thoughts/comments?
Thanks,
Igor
* Re: A way to reduce compression overhead
2016-11-08 17:28 A way to reduce compression overhead Igor Fedotov
@ 2016-11-08 20:27 ` Sage Weil
2016-11-09 13:35 ` Igor Fedotov
0 siblings, 1 reply; 7+ messages in thread
From: Sage Weil @ 2016-11-08 20:27 UTC (permalink / raw)
To: Igor Fedotov; +Cc: ceph-devel
On Tue, 8 Nov 2016, Igor Fedotov wrote:
> Hi Sage, et al.
>
> Let me share some ideas about possible compression burden reduction on the
> cluster.
>
> As known we perform block compression at BlueStore level for each replica
> independently. This triples compression CPU overhead for the cluster. Looks
> like significant CPU resource waste IMHO.
>
> We can probably eliminate this overhead by introduction write request
> preprocessing performed at ObjectStore level synchronously. This preprocessing
> parses transaction, detects write requests and transforms them into different
> ones aligned with current store allocation unit. At the same time resulting
> extents that span more than single AU are compressed if needed. I.e.
> preprocessing do some of the job performed at BlueStore::_do_write_data that
> splits write request into _do_write_small/_do_write_big calls. But after the
> split and big blob compression preprocessor simply updates the transaction
> with new write requests.
>
> E.g.
>
> with AU = 0x1000
>
> Write Request (1~0xffff) is transformed into the following sequence:
>
> WriteX 1~0xfff (uncompressed)
>
> WriteX 0x1000~E000 (compressed if needed)
>
> WriteX 0xf000~0xfff (uncompressed)
>
> Then updated transaction is passed to all replicas including the master one
> using regular apply_/queue_transaction mechanics.
>
>
> As a bonus one receives automatic payload compression when transporting
> request to remote store replicas.
> Regular write request path should be preserved for EC pools and other needs as
> well.
>
> Please note that almost no latency is introduced for request handling.
> Replicas receive modified transaction later but they do not spend time on
> doing split/compress stuff.
I think this is pretty reasonable! We have a couple options... we could
(1) just expose a compression alignment via ObjectStore, (2) take
compression alignment from a pool property, or (3) have an explicit
per-write call into ObjectStore so that it can chunk it up however it
likes.
Whatever we choose, the tricky bit is that there may be different stores
on different replicas. Or we could let the primary just decide locally,
given that this is primarily an optimization; in the worst case we
compress something on the primary but one replica doesn't support
compression and just decompresses it before doing the write (i.e., we get
on-the-wire compression but no on-disk compression).
I lean toward the simplicity of get_compression_alignment() and
get_compression_alg() (or similar) and just make a local (primary)
decision. Then we just have a simple compatibility write_compressed()
implementation (or helper) that decompresses the payload so that we can do
a normal write.
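A rough sketch of that compatibility path, in Python for brevity. `PlainStore` is a hypothetical store without compression support, the `write_compressed()` signature is assumed, and zlib stands in for whatever algorithm get_compression_alg() would report; none of this is actual ObjectStore code.

```python
import zlib


class PlainStore:
    """Hypothetical store with no on-disk compression support."""

    def __init__(self):
        self.data = bytearray()

    def write(self, offset, payload):
        # Normal write path: grow the object if needed, then overwrite.
        end = offset + len(payload)
        if len(self.data) < end:
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[offset:end] = payload

    def write_compressed(self, offset, compressed):
        # Compatibility helper: inflate the payload and fall back to a
        # normal write. The wire transfer was compressed, the disk is not.
        self.write(offset, zlib.decompress(compressed))


# The primary compresses once; this replica just inflates and writes.
payload = b"abc" * 1000
replica = PlainStore()
replica.write_compressed(0x1000, zlib.compress(payload))
```

A store that does support compression would override `write_compressed()` to keep the blob as received, which is where the CPU saving on replicas would come from.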
Before getting too carried away, though, we should consider whether we're
going to want to take a further step to allow clients to compress data
before it's sent. That isn't necessarily in conflict with this if we go
with pool properties to inform the alignment and compression alg
decision. If we assume that the ObjectStore on the primary gets to decide
everything it will work less well...
> There is a potential conflict with the current garbage collection stuff though
> - we can't perform GC during preprocessing due to possible race with preceding
> unfinished transactions and consequently we're unable to merge and compress
> merged data. Well, we can do that when applying transaction but this will
> produce a sequence like this at each replica:
>
> decompress original request + decompress data to merge -> compress merged
> data.
>
> Probably this limitation isn't that bad - IMHO it's better to have compressed
> blobs aligned with original write requests.
>
> Moreover I have some ideas how to get rid of blob_depth notion that makes life
> a bit easier. Will share shortly.
I'm curious what you have in mind! The blob_depth as currently
implemented is not terribly reliable...
sage
* Re: A way to reduce compression overhead
2016-11-08 20:27 ` Sage Weil
@ 2016-11-09 13:35 ` Igor Fedotov
2016-11-11 23:42 ` Sage Weil
0 siblings, 1 reply; 7+ messages in thread
From: Igor Fedotov @ 2016-11-09 13:35 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
Sage
On 11/8/2016 11:27 PM, Sage Weil wrote:
> On Tue, 8 Nov 2016, Igor Fedotov wrote:
>> Hi Sage, et al.
>>
>> Let me share some ideas about possible compression burden reduction on the
>> cluster.
>>
>> As known we perform block compression at BlueStore level for each replica
>> independently. This triples compression CPU overhead for the cluster. Looks
>> like significant CPU resource waste IMHO.
>>
>> We can probably eliminate this overhead by introduction write request
>> preprocessing performed at ObjectStore level synchronously. This preprocessing
>> parses transaction, detects write requests and transforms them into different
>> ones aligned with current store allocation unit. At the same time resulting
>> extents that span more than single AU are compressed if needed. I.e.
>> preprocessing do some of the job performed at BlueStore::_do_write_data that
>> splits write request into _do_write_small/_do_write_big calls. But after the
>> split and big blob compression preprocessor simply updates the transaction
>> with new write requests.
>>
>> E.g.
>>
>> with AU = 0x1000
>>
>> Write Request (1~0xffff) is transformed into the following sequence:
>>
>> WriteX 1~0xfff (uncompressed)
>>
>> WriteX 0x1000~E000 (compressed if needed)
>>
>> WriteX 0xf000~0xfff (uncompressed)
>>
>> Then updated transaction is passed to all replicas including the master one
>> using regular apply_/queue_transaction mechanics.
>>
>>
>> As a bonus one receives automatic payload compression when transporting
>> request to remote store replicas.
>> Regular write request path should be preserved for EC pools and other needs as
>> well.
>>
>> Please note that almost no latency is introduced for request handling.
>> Replicas receive modified transaction later but they do not spend time on
>> doing split/compress stuff.
> I think this is pretty reasonable! We have a couple options... we could
> (1) just expose a compression alignment via ObjectStore, (2) take
> compression alignment from a pool property, or (3) have an explicit
> per-write call into ObjectStore so that it can chunk it up however it
> likes.
>
> Whatever we choose, the tricky bit is that there may be different stores
> on different replicas. Or we could let the primary just decide locally,
> given that this is primarily an optimization; in the worst case we
> compress something on the primary but one replica doesn't support
> compression and just decompresses it before doing the write (i.e., we get
> on-the-wire compression but no on-disk compression).
IMHO different stores on different replicas is rather a corner case, and
it's better (or simpler) to disable the compression optimization when it
occurs. Doing compression followed by decompression seems a bit ugly
unless we're talking about traffic compression only.
To disable compression preprocessing we can either have a manual switch
in the config or collect remote OSD capabilities at the primary and
disable preprocessing automatically. This needs to be done just once,
hence it wouldn't impact request handling performance.
> I lean toward the simplicity of get_compression_alignment() and
> get_compression_alg() (or similar) and just make a local (primary)
> decision. Then we just have a simple compatibility write_compressed()
> implementation (or helper) that decompresses the payload so that we can do
> a normal write.
As for me, I always stand for better encapsulation of functionality,
hence I'd prefer (3): the store does whatever it can and transparently
passes the results to replicas. This allows us to modify or extend the
logic smoothly, e.g. optimize csum calculation for big chunks etc.
In contrast, in (1) we expose most of this functionality to the store's
client (i.e. the replicated backend code, not a real Ceph client). In
fact, for (1) we'll have 2 potentially evolving APIs:
- compressed (optimized) write request delivery
- store optimization description provided to the client (i.e. the
mentioned algorithm + alignment retrieval initially).
The latter isn't needed for (3).
>
> Before getting to carried away, though, we should consider whether we're
> going to want to take a further step to allow clients to compress data
> before it's sent. That isn't necessarily in conflict with this if we go
> with pool properties to inform the alignment and compression alg
> decision. If we assume that the ObjectStore on the primary gets to decide
> everything it will work less well...
Firstly, let's agree on the terminology. Here we're talking about Ceph
cluster clients, while above it was store clients (PG backends).
Well, this case is a bit different compared to the above. (3) isn't a
viable option here. A Ceph client definitely relies on (1) or (2), if
any (I'm afraid bringing compression to the client will be a headache).
But at the same time, IMHO it might be an argument against having (1)
for the store client. There appear to be three entities that are aware
of the compression optimization: the Ceph client, the store client (PG
backend) and the store itself. Not good...
In the case of (1) + (3) the intermediate layer can probably be relieved
of that awareness - it simply has to pass compressed blocks
transparently from client to store and from primary store to replicas.
>> There is a potential conflict with the current garbage collection stuff though
>> - we can't perform GC during preprocessing due to possible race with preceding
>> unfinished transactions and consequently we're unable to merge and compress
>> merged data. Well, we can do that when applying transaction but this will
>> produce a sequence like this at each replica:
>>
>> decompress original request + decompress data to merge -> compress merged
>> data.
>>
>> Probably this limitation isn't that bad - IMHO it's better to have compressed
>> blobs aligned with original write requests.
>>
>> Moreover I have some ideas how to get rid of blob_depth notion that makes life
>> a bit easier. Will share shortly.
> I'm curious what you have in mind! The blob_depth as currently
> implemented is not terribly reliable...
The general idea is to estimate the allocated vs. stored ratio for the
blob(s) under the extent being written, where stored and allocated are
measured in allocation units and can be calculated using the blob's
ref_map.
If that ratio is greater than 1 (+- some correction), we need to perform
GC for these blobs. Given that we do this after compression
preprocessing, it's expensive to merge the compressed extent being
written with the old shards. Hence those shards are written as
standalone extents, as opposed to the current implementation where we
try to merge both new and existing extents into a single entity. Not a
big drawback IMHO.
Evidently this is valid only for new compressed extents (which are AU
aligned). Uncompressed ones can be merged in any fashion.
This is just a draft, hence comments are highly appreciated.
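One possible reading of the ratio check, as a Python sketch rather than BlueStore code. It assumes the ref_map can be flattened to (offset, length) ranges of still-referenced data, that "stored" means the allocation units needed to hold that data, and that the correction is a simple additive slack. The name `needs_gc` and its parameters are hypothetical.

```python
def needs_gc(ref_map, allocated_bytes, au=0x1000, slack=0.0):
    """Return True when a blob holds more AUs than its live data needs.

    ref_map: iterable of (offset, length) ranges still referenced.
    allocated_bytes: on-disk space currently held by the blob.
    """
    referenced = sum(length for _, length in ref_map)
    stored_aus = -(-referenced // au)          # ceiling division
    allocated_aus = -(-allocated_bytes // au)
    if stored_aus == 0:
        # Nothing is referenced any more; any allocation is pure garbage.
        return allocated_aus > 0
    return allocated_aus / stored_aus > 1.0 + slack
```

For example, a compressed blob that occupies four allocation units but has only two units' worth of data still referenced gives a ratio of 2 and would be flagged, while a well-compressed, fully referenced blob gives a ratio below 1 and would be left alone.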
>
> sage
* Re: A way to reduce compression overhead
2016-11-09 13:35 ` Igor Fedotov
@ 2016-11-11 23:42 ` Sage Weil
2016-11-15 13:14 ` Igor Fedotov
0 siblings, 1 reply; 7+ messages in thread
From: Sage Weil @ 2016-11-11 23:42 UTC (permalink / raw)
To: Igor Fedotov; +Cc: ceph-devel
On Wed, 9 Nov 2016, Igor Fedotov wrote:
> > I think this is pretty reasonable! We have a couple options... we could
> > (1) just expose a compression alignment via ObjectStore, (2) take
> > compression alignment from a pool property, or (3) have an explicit
> > per-write call into ObjectStore so that it can chunk it up however it
> > likes.
> >
> > Whatever we choose, the tricky bit is that there may be different stores
> > on different replicas. Or we could let the primary just decide locally,
> > given that this is primarily an optimization; in the worst case we
> > compress something on the primary but one replica doesn't support
> > compression and just decompresses it before doing the write (i.e., we get
> > on-the-wire compression but no on-disk compression).
> IMHO different stores on different replicas is rather a corner case and it's
> better (or simpler) to disable compression optimization when it takes place.
> Doing compression followed by decompression seems ugly a bit unless we're
> talking about traffic compression only.
> To disable compression preprocessing we can either have a manual switch in the
> config or collect remote OSD capabilities at primary and disable preprocessing
> automatically. This can be made just once hence it wouldn't impact request
> handling performance.
> > I lean toward the simplicity of get_compression_alignment() and
> > get_compression_alg() (or similar) and just make a local (primary)
> > decision. Then we just have a simple compatibility write_compressed()
> > implementation (or helper) that decompresses the payload so that we can do
> > a normal write.
> As for me I always stand for better functionality encapsulation - hence I'd
> prefer (3): store do whatever it can and transparently passes results to
> replicas. This allows to modify or extend the logic smoothly, e.g. optimize
> csum calculation for big chunks etc.
> Contrary in (1) we expose most of this functionality to store's client (i.g.
> replicated backend stuff, not a real Ceph client). In fact for (1) we'll have
> 2 potentially evolving APIs:
> - compressed(optimized) write request delivery
> - store optimization description provided to client ( i.e. mentioned algorithm
> + alignment retrieval initially).
> The latter isn't needed for (3)
The concern I have here is that it probably won't map well onto EC. The
primary can't easily have the local ObjectStore chunking things up and
then "pass it to the replica".. there's an intermediate layer between the
replication code and the ObjectStore (and is getting a bit more
sophisticated with the coming EC changes).
I think the simplest approach here would be to keep it simple. For
example, a min_alloc_size and max compressed chunk size specified for the
pool. The intermediate layer can apply the EC striping parameters, and
then chunk/compress accordingly.
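A minimal sketch of what chunking and compressing with pool-level parameters could look like. It is Python, not the actual intermediate layer. The parameter names, the zlib stand-in and the acceptance rule (keep the compressed form only if it saves at least one min_alloc_size unit) are assumptions, and EC striping is left out entirely.

```python
import zlib


def chunk_and_compress(data, min_alloc_size=0x1000, max_chunk=0x10000):
    """Cut data into max_chunk pieces and compress each when it pays off.

    Returns a list of (offset, raw_length, compressed, payload) tuples.
    """
    out = []
    for pos in range(0, len(data), max_chunk):
        raw = data[pos:pos + max_chunk]
        packed = zlib.compress(raw)
        raw_units = -(-len(raw) // min_alloc_size)       # ceiling division
        packed_units = -(-len(packed) // min_alloc_size)
        if packed_units < raw_units:
            out.append((pos, len(raw), True, packed))
        else:
            # No allocation unit saved, so ship the chunk uncompressed.
            out.append((pos, len(raw), False, raw))
    return out
```

Because both values would come from the pool rather than from one particular ObjectStore, the primary, the replicas and (later) a client could all arrive at the same chunk boundaries independently.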
I agree that worrying about client-side compression seems like a lot at
this stage, but it's going to be the very next thing we ask about, so we
should consider it to make sure we don't put up any major roadblocks.
Either way, though, we should probably wait for the EC overwrite changes
to land...
As for GC,
> > I'm curious what you have in mind! The blob_depth as currently
> > implemented is not terribly reliable...
> General idea is to estimate allocated vs stored ratio for the blob(s) under
> the extent being written.
> Where stored and allocated are measured in allocation units. And can be
> calculated using blobs ref_map.
> If that ratio is greater than 1 (+-some correction) - we need to perform GC
> for these blobs. Given the fact we do that after compression preprocessing
> it's expensive to merge the compressed extent being written and old shards.
> Hence that shards are written as standalone extents as opposed to current
> implementation when we try to merge both new and existing extents into a
> single entity. Not a big drawback IMHO. Evidently this is valid for new
> compressed extents (that are AU aligned) only. Uncompressed ones can be merged
> in any fashion.
> This is just a draft hence comments are highly appreciated.
Yeah, I think this is a more sensible approach (focusing on allocated vs
referenced). It seems like the most straightforward thing to do is
actually look at the old_extents in the wctx--since those are the ref_maps
that will become less referenced than before--in order to identify which
blobs might need rewriting. Avoiding the merge case vastly simplifies it.
There also isn't any persistent metadata that we have to maintain (that
might become incorrect or inconsistent).
We'd probably do the _do_write_data (which will do the various
punch_hole's), then check for any gc work, then do the final
_do_alloc_write and _wctx_finish?
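The candidate selection between _do_write_data and _do_alloc_write might look something like the sketch below. It is Python and purely illustrative: it models old_extents as (blob_id, released_length) pairs and takes per-blob referenced byte counts as an input, neither of which is the real wctx layout.

```python
from collections import defaultdict


def gc_candidates(old_extents, referenced_before):
    """Find blobs that lost references in this write but are still live.

    old_extents: iterable of (blob_id, released_length) punched out by
        the write.
    referenced_before: dict of blob_id -> referenced bytes before the write.
    Returns a sorted list of (blob_id, bytes_still_referenced).
    """
    released = defaultdict(int)
    for blob_id, length in old_extents:
        released[blob_id] += length
    candidates = []
    for blob_id, freed in released.items():
        left = referenced_before[blob_id] - freed
        if left > 0:
            # Partially referenced blob: worth checking for a rewrite.
            candidates.append((blob_id, left))
    return sorted(candidates)
```

Only blobs touched by the current write are examined, so nothing has to be persisted between writes; blobs that end up fully dereferenced are skipped because their space is simply released.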
sage
* Re: A way to reduce compression overhead
2016-11-11 23:42 ` Sage Weil
@ 2016-11-15 13:14 ` Igor Fedotov
2016-11-15 14:31 ` Sage Weil
0 siblings, 1 reply; 7+ messages in thread
From: Igor Fedotov @ 2016-11-15 13:14 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
Sage,
please find my comments inline.
On 12.11.2016 2:42, Sage Weil wrote:
> On Wed, 9 Nov 2016, Igor Fedotov wrote:
>
> The concern I have here is that it probably won't map well onto EC. The
> primary can't easily have the local ObjectStore chunking things up and
> then "pass it to the replica".. there's an intermediate layer between the
> replication code and the ObjectStore (and is getting a bit more
> sophisticated with the coming EC changes).
>
> I think the simplest approach here would be to keep it simple. For
> example, a min_alloc_size and max compressed chunk size specified for the
> pool. The intermediate layer can apply the EC striping parameters, and
> then chunk/compress accordingly.
>
> I agree that worrying about client-side compression seems like a lot at
> this stage, but it's going to be the very next thing we ask about, so we
> should consider it to make sure we don't put up any major roadblocks.
>
> Either way, though, we should probably wait for the EC overwrite changes
> to land...
Got it, thanks. Will start working on POC meanwhile.
>
> As for GC,
>
>>> I'm curious what you have in mind! The blob_depth as currently
>>> implemented is not terribly reliable...
>> General idea is to estimate allocated vs stored ratio for the blob(s) under
>> the extent being written.
>> Where stored and allocated are measured in allocation units. And can be
>> calculated using blobs ref_map.
>> If that ratio is greater than 1 (+-some correction) - we need to perform GC
>> for these blobs. Given the fact we do that after compression preprocessing
>> it's expensive to merge the compressed extent being written and old shards.
>> Hence that shards are written as standalone extents as opposed to current
>> implementation when we try to merge both new and existing extents into a
>> single entity. Not a big drawback IMHO. Evidently this is valid for new
>> compressed extents (that are AU aligned) only. Uncompressed ones can be merged
>> in any fashion.
>> This is just a draft hence comments are highly appreciated.
> Yeah, I think this is a more sensible approach (focusing on allocated vs
> referenced). It seems like the most straightforward thing to do is
> actually look at the old_extents in the wctx--since those are the ref_maps
> that will become less referenced than before--in order to identify which
> blobs might need rewriting. Avoiding the merge case vastly simplifies it.
> That also isn't any persistent metadata that we have to maintain (that
> might become incorrect or inconsistent).
>
> We'd probably do the _do_write_data (which will do the various
> punch_hole's), then check for any gc work, then do the final
> _do_alloc_write and _wctx_finish?
>
Sounds good. Still need a detailed consistent algorithm though - working
on that.
>
* Re: A way to reduce compression overhead
2016-11-15 13:14 ` Igor Fedotov
@ 2016-11-15 14:31 ` Sage Weil
2016-11-15 14:42 ` Igor Fedotov
0 siblings, 1 reply; 7+ messages in thread
From: Sage Weil @ 2016-11-15 14:31 UTC (permalink / raw)
To: Igor Fedotov; +Cc: ceph-devel
On Tue, 15 Nov 2016, Igor Fedotov wrote:
> > > > I'm curious what you have in mind! The blob_depth as currently
> > > > implemented is not terribly reliable...
> > > General idea is to estimate allocated vs stored ratio for the blob(s)
> > > under
> > > the extent being written.
> > > Where stored and allocated are measured in allocation units. And can be
> > > calculated using blobs ref_map.
> > > If that ratio is greater than 1 (+-some correction) - we need to perform
> > > GC
> > > for these blobs. Given the fact we do that after compression preprocessing
> > > it's expensive to merge the compressed extent being written and old
> > > shards.
> > > Hence that shards are written as standalone extents as opposed to current
> > > implementation when we try to merge both new and existing extents into a
> > > single entity. Not a big drawback IMHO. Evidently this is valid for new
> > > compressed extents (that are AU aligned) only. Uncompressed ones can be
> > > merged
> > > in any fashion.
> > > This is just a draft hence comments are highly appreciated.
> > Yeah, I think this is a more sensible approach (focusing on allocated vs
> > referenced). It seems like the most straightforward thing to do is
> > actually look at the old_extents in the wctx--since those are the ref_maps
> > that will become less referenced than before--in order to identify which
> > blobs might need rewriting. Avoiding the merge case vastly simplifies it.
> > That also isn't any persistent metadata that we have to maintain (that
> > might become incorrect or inconsistent).
> >
> > We'd probably do the _do_write_data (which will do the various
> > punch_hole's), then check for any gc work, then do the final
> > _do_alloc_write and _wctx_finish?
> >
> Sounds good. Still need a detailed consistent algorithm though - working on
> that.
In the meantime, perhaps we should remove the blob_depth code for now so
it doesn't end up in the on-disk format.
sage
* Re: A way to reduce compression overhead
2016-11-15 14:31 ` Sage Weil
@ 2016-11-15 14:42 ` Igor Fedotov
0 siblings, 0 replies; 7+ messages in thread
From: Igor Fedotov @ 2016-11-15 14:42 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
You mean removing GC stuff totally, right?
Will do.
On 15.11.2016 17:31, Sage Weil wrote:
>
> In the meantime, perhaps we should remove the blob_depth code for now so
> it doesn't end up in the on-disk format.
>
> sage