* A way to reduce compression overhead
@ 2016-11-08 17:28 Igor Fedotov
  2016-11-08 20:27 ` Sage Weil
  0 siblings, 1 reply; 7+ messages in thread
From: Igor Fedotov @ 2016-11-08 17:28 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

Hi Sage, et al.

Let me share some ideas on how to reduce the compression burden on the
cluster.

As you know, we perform block compression at the BlueStore level on each
replica independently, so with 3x replication the cluster pays the
compression CPU cost three times. That looks like a significant waste of
CPU resources IMHO.

We could probably eliminate this overhead by introducing write request
preprocessing performed synchronously at the ObjectStore level. This
preprocessing parses the transaction, detects write requests and
transforms them into new ones aligned with the store's current allocation
unit. At the same time, resulting extents that span more than a single AU
are compressed if needed. In other words, the preprocessor does some of
the job currently performed in BlueStore::_do_write_data, which splits a
write request into _do_write_small/_do_write_big calls; but after the
split and big-blob compression it simply updates the transaction with the
new write requests.

E.g. with AU = 0x1000, a write request (1~0xfffe) is transformed into the
following sequence:

WriteX 1~0xfff (uncompressed)

WriteX 0x1000~0xe000 (compressed if needed)

WriteX 0xf000~0xfff (uncompressed)

The updated transaction is then passed to all replicas, including the
primary, using the regular apply_/queue_transaction mechanics.
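
To make the split rule concrete, here is a minimal, self-contained C++
sketch of the boundary computation (the Segment struct and function name
are illustrative only; real code would rewrite the ObjectStore::Transaction
ops in place):

#include <cstdint>
#include <cstdio>
#include <vector>

struct Segment {
  uint64_t offset;
  uint64_t length;
  bool compressible;   // true if the piece spans whole allocation units
};

// Split [offset, offset+length) on allocation-unit boundaries so that the
// AU-aligned middle part can be compressed once on the primary.
std::vector<Segment> split_for_compression(uint64_t offset, uint64_t length,
                                           uint64_t au)
{
  std::vector<Segment> out;
  uint64_t end = offset + length;
  uint64_t head_end = (offset + au - 1) / au * au;   // round offset up to AU
  uint64_t tail_start = end / au * au;               // round end down to AU
  if (head_end >= tail_start) {
    out.push_back({offset, length, false});  // no whole AU inside the write
    return out;
  }
  if (offset < head_end)
    out.push_back({offset, head_end - offset, false});      // unaligned head
  out.push_back({head_end, tail_start - head_end, true});   // aligned middle
  if (tail_start < end)
    out.push_back({tail_start, end - tail_start, false});   // unaligned tail
  return out;
}

int main()
{
  // the example above: AU = 0x1000, write request 1~0xfffe
  for (const auto& s : split_for_compression(0x1, 0xfffe, 0x1000))
    std::printf("WriteX 0x%llx~0x%llx %s\n",
                (unsigned long long)s.offset, (unsigned long long)s.length,
                s.compressible ? "(compressed if needed)" : "(uncompressed)");
  return 0;
}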


As a bonus, we get automatic payload compression when transporting the
request to remote store replicas.
The regular write request path should be preserved for EC pools and other
needs as well.

Please note that almost no extra latency is introduced for request
handling: replicas receive the modified transaction a bit later, but they
no longer spend time on the split/compress work.

There is a potential conflict with the current garbage collection logic,
though - we can't perform GC during preprocessing due to a possible race
with preceding unfinished transactions, and consequently we're unable to
merge and compress the merged data. We could do that when applying the
transaction, but this would produce a sequence like this at each replica:

decompress original request + decompress data to merge -> compress
merged data.

Probably this limitation isn't that bad - IMHO it's better to have
compressed blobs aligned with the original write requests anyway.

Moreover, I have some ideas on how to get rid of the blob_depth notion,
which would make life a bit easier. Will share shortly.

Any thoughts/comments?

Thanks,
Igor




* Re: A way to reduce compression overhead
  2016-11-08 17:28 A way to reduce compression overhead Igor Fedotov
@ 2016-11-08 20:27 ` Sage Weil
  2016-11-09 13:35   ` Igor Fedotov
  0 siblings, 1 reply; 7+ messages in thread
From: Sage Weil @ 2016-11-08 20:27 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-devel

On Tue, 8 Nov 2016, Igor Fedotov wrote:
> Hi Sage, et al.
> 
> Let me share some ideas on how to reduce the compression burden on the
> cluster.
> 
> As you know, we perform block compression at the BlueStore level on each
> replica independently, so with 3x replication the cluster pays the
> compression CPU cost three times. That looks like a significant waste of
> CPU resources IMHO.
> 
> We could probably eliminate this overhead by introducing write request
> preprocessing performed synchronously at the ObjectStore level. This
> preprocessing parses the transaction, detects write requests and transforms
> them into new ones aligned with the store's current allocation unit. At the
> same time, resulting extents that span more than a single AU are compressed
> if needed. In other words, the preprocessor does some of the job currently
> performed in BlueStore::_do_write_data, which splits a write request into
> _do_write_small/_do_write_big calls; but after the split and big-blob
> compression it simply updates the transaction with the new write requests.
> 
> E.g. with AU = 0x1000, a write request (1~0xfffe) is transformed into the
> following sequence:
> 
> WriteX 1~0xfff (uncompressed)
> 
> WriteX 0x1000~0xe000 (compressed if needed)
> 
> WriteX 0xf000~0xfff (uncompressed)
> 
> The updated transaction is then passed to all replicas, including the
> primary, using the regular apply_/queue_transaction mechanics.
> 
> 
> As a bonus, we get automatic payload compression when transporting the
> request to remote store replicas.
> The regular write request path should be preserved for EC pools and other
> needs as well.
> 
> Please note that almost no extra latency is introduced for request
> handling: replicas receive the modified transaction a bit later, but they
> no longer spend time on the split/compress work.

I think this is pretty reasonable!  We have a couple options... we could 
(1) just expose a compression alignment via ObjectStore, (2) take 
compression alignment from a pool property, or (3) have an explicit 
per-write call into ObjectStore so that it can chunk it up however it 
likes.  

Whatever we choose, the tricky bit is that there may be different stores 
on different replicas.  Or we could let the primary just decide locally, 
given that this is primarily an optimization; in the worst case we 
compress something on the primary but one replica doesn't support 
compression and just decompresses it before doing the write (i.e., we get 
on-the-wire compression but no on-disk compression).

I lean toward the simplicity of get_compression_alignment() and 
get_compression_alg() (or similar) and just make a local (primary) 
decision.  Then we just have a simple compatibility write_compressed() 
implementation (or helper) that decompresses the payload so that we can do 
a normal write.
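
Something like the following, as a rough sketch (these hooks don't exist
today; the signatures are illustrative rather than a real ObjectStore API,
and the collection/oid arguments are omitted for brevity):

#include <cstdint>
#include <string>

struct bufferlist;  // stand-in for ceph::bufferlist

class ObjectStore {
public:
  virtual ~ObjectStore() = default;

  // Alignment the store wants compressed chunks to respect
  // (e.g. min_alloc_size for BlueStore); 0 = no compression support.
  virtual uint64_t get_compression_alignment() const { return 0; }

  // Preferred compression algorithm; empty string = none.
  virtual std::string get_compression_alg() const { return {}; }

  // Accept a payload that the primary already compressed.  A store that
  // cannot keep it compressed may decompress and do a normal write,
  // giving on-the-wire compression but no on-disk compression.
  virtual int write_compressed(uint64_t offset, uint64_t raw_length,
                               const bufferlist& compressed_payload,
                               const std::string& alg) = 0;
};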

Before getting too carried away, though, we should consider whether we're 
going to want to take a further step to allow clients to compress data 
before it's sent.  That isn't necessarily in conflict with this if we go 
with pool properties to inform the alignment and compression alg 
decision.  If we assume that the ObjectStore on the primary gets to decide 
everything it will work less well...

> There is a potential conflict with the current garbage collection logic,
> though - we can't perform GC during preprocessing due to a possible race
> with preceding unfinished transactions, and consequently we're unable to
> merge and compress the merged data. We could do that when applying the
> transaction, but this would produce a sequence like this at each replica:
> 
> decompress original request + decompress data to merge -> compress merged
> data.
> 
> Probably this limitation isn't that bad - IMHO it's better to have
> compressed blobs aligned with the original write requests anyway.
> 
> Moreover, I have some ideas on how to get rid of the blob_depth notion,
> which would make life a bit easier. Will share shortly.

I'm curious what you have in mind!  The blob_depth as currently 
implemented is not terribly reliable...

sage


* Re: A way to reduce compression overhead
  2016-11-08 20:27 ` Sage Weil
@ 2016-11-09 13:35   ` Igor Fedotov
  2016-11-11 23:42     ` Sage Weil
  0 siblings, 1 reply; 7+ messages in thread
From: Igor Fedotov @ 2016-11-09 13:35 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Sage


On 11/8/2016 11:27 PM, Sage Weil wrote:
> On Tue, 8 Nov 2016, Igor Fedotov wrote:
>> Hi Sage, et al.
>>
>> Let me share some ideas on how to reduce the compression burden on the
>> cluster.
>>
>> As you know, we perform block compression at the BlueStore level on each
>> replica independently, so with 3x replication the cluster pays the
>> compression CPU cost three times. That looks like a significant waste of
>> CPU resources IMHO.
>>
>> We could probably eliminate this overhead by introducing write request
>> preprocessing performed synchronously at the ObjectStore level. This
>> preprocessing parses the transaction, detects write requests and transforms
>> them into new ones aligned with the store's current allocation unit. At the
>> same time, resulting extents that span more than a single AU are compressed
>> if needed. In other words, the preprocessor does some of the job currently
>> performed in BlueStore::_do_write_data, which splits a write request into
>> _do_write_small/_do_write_big calls; but after the split and big-blob
>> compression it simply updates the transaction with the new write requests.
>>
>> E.g. with AU = 0x1000, a write request (1~0xfffe) is transformed into the
>> following sequence:
>>
>> WriteX 1~0xfff (uncompressed)
>>
>> WriteX 0x1000~0xe000 (compressed if needed)
>>
>> WriteX 0xf000~0xfff (uncompressed)
>>
>> The updated transaction is then passed to all replicas, including the
>> primary, using the regular apply_/queue_transaction mechanics.
>>
>>
>> As a bonus, we get automatic payload compression when transporting the
>> request to remote store replicas.
>> The regular write request path should be preserved for EC pools and other
>> needs as well.
>>
>> Please note that almost no extra latency is introduced for request
>> handling: replicas receive the modified transaction a bit later, but they
>> no longer spend time on the split/compress work.
> I think this is pretty reasonable!  We have a couple options... we could
> (1) just expose a compression alignment via ObjectStore, (2) take
> compression alignment from a pool property, or (3) have an explicit
> per-write call into ObjectStore so that it can chunk it up however it
> likes.
>
> Whatever we choose, the tricky bit is that there may be different stores
> on different replicas.  Or we could let the primary just decide locally,
> given that this is primarily an optimization; in the worst case we
> compress something on the primary but one replica doesn't support
> compression and just decompresses it before doing the write (i.e., we get
> on-the-wire compression but no on-disk compression).
IMHO different stores on different replicas is rather a corner case, and
it's better (or simpler) to disable the compression optimization when that
happens. Doing compression followed by decompression seems a bit ugly
unless we're only after traffic compression.
To disable compression preprocessing we can either have a manual switch in
the config, or collect the remote OSDs' capabilities at the primary and
disable preprocessing automatically. This can be done just once, so it
wouldn't impact request handling performance.
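The check itself can be trivial; a rough sketch, assuming a hypothetical
per-OSD feature bit (the flag name below is made up for illustration):

#include <cstdint>
#include <vector>

// Hypothetical feature bit advertised by OSDs that can consume
// preprocessed/compressed transactions.
constexpr uint64_t FEATURE_COMPRESSED_TXN = 1ull << 0;

// Decide once (e.g. when the acting set changes) whether the primary may
// send preprocessed transactions to every replica.
bool can_preprocess(const std::vector<uint64_t>& acting_osd_features)
{
  for (uint64_t f : acting_osd_features)
    if (!(f & FEATURE_COMPRESSED_TXN))
      return false;   // at least one replica can't handle it: fall back
  return true;
}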
> I lean toward the simplicity of get_compression_alignment() and
> get_compression_alg() (or similar) and just make a local (primary)
> decision.  Then we just have a simple compatibility write_compressed()
> implementation (or helper) that decompresses the payload so that we can do
> a normal write.
As for me, I always stand for better encapsulation of functionality -
hence I'd prefer (3): the store does whatever it can and transparently
passes the results to the replicas. This allows us to modify or extend the
logic smoothly, e.g. to optimize csum calculation for big chunks, etc.
With (1), by contrast, we expose most of this functionality to the store's
client (i.e. the replicated backend stuff, not a real Ceph client). In
fact, for (1) we'll have two potentially evolving APIs:
- compressed (optimized) write request delivery;
- a description of the store's optimizations provided to the client (i.e.
the compression algorithm and alignment retrieval mentioned above).
The latter isn't needed for (3).

>
> Before getting too carried away, though, we should consider whether we're
> going to want to take a further step to allow clients to compress data
> before it's sent.  That isn't necessarily in conflict with this if we go
> with pool properties to inform the alignment and compression alg
> decision.  If we assume that the ObjectStore on the primary gets to decide
> everything it will work less well...
First, let's agree on the terminology: here we're talking about Ceph
cluster clients, while above it was store clients (PG backends).
Well, this case is a bit different from the above. (3) isn't a viable
option here; a Ceph client definitely relies on (1) or (2), if anything
(I'm afraid bringing compression to the client will be a headache).
But at the same time, IMHO this might be an argument against having (1)
for the store client: we end up with three entities aware of the
compression optimization - the Ceph client, the store client (PG backend)
and the store itself. Not good...
With (1) + (3), the intermediate layer can probably be unburdened of that
awareness - it simply has to pass compressed blocks transparently from the
client to the store and from the primary store to the replicas.
>> There is a potential conflict with the current garbage collection logic,
>> though - we can't perform GC during preprocessing due to a possible race
>> with preceding unfinished transactions, and consequently we're unable to
>> merge and compress the merged data. We could do that when applying the
>> transaction, but this would produce a sequence like this at each replica:
>>
>> decompress original request + decompress data to merge -> compress merged
>> data.
>>
>> Probably this limitation isn't that bad - IMHO it's better to have
>> compressed blobs aligned with the original write requests anyway.
>>
>> Moreover, I have some ideas on how to get rid of the blob_depth notion,
>> which would make life a bit easier. Will share shortly.
> I'm curious what you have in mind!  The blob_depth as currently
> implemented is not terribly reliable...
The general idea is to estimate the allocated vs. stored ratio for the
blob(s) under the extent being written, where both allocated and stored
are measured in allocation units and can be calculated from the blobs'
ref_maps.
If that ratio is greater than 1 (plus/minus some correction) we need to
perform GC for these blobs. Given that we do this after compression
preprocessing, it's expensive to merge the compressed extent being written
with the old shards. Hence those shards are written as standalone extents,
as opposed to the current implementation where we try to merge both new
and existing extents into a single entity. Not a big drawback IMHO.
Evidently this applies to new compressed extents (which are AU-aligned)
only; uncompressed ones can be merged in any fashion.
This is just a draft, so comments are highly appreciated.
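
As a rough C++ sketch of the heuristic (BlobInfo and the slack value are
made-up simplifications of BlueStore's real blob/ref_map structures):

#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Simplified stand-in for a blob: how many AUs it keeps allocated on disk,
// and which logical ranges of it are still referenced.
struct BlobInfo {
  uint64_t allocated_au;                 // allocation units backing the blob
  std::map<uint64_t, uint64_t> ref_map;  // offset -> length still referenced
};

// Count the distinct allocation units the referenced ranges still touch.
static uint64_t referenced_au(const BlobInfo& b, uint64_t au)
{
  std::set<uint64_t> aus;
  for (const auto& [off, len] : b.ref_map)
    for (uint64_t a = off / au; a < (off + len + au - 1) / au; ++a)
      aus.insert(a);
  return aus.size();
}

// GC the blobs under the written extent when they keep noticeably more
// space allocated than is still referenced (ratio > 1 plus some slack).
bool needs_gc(const std::vector<BlobInfo>& blobs, uint64_t au,
              double slack = 0.25)
{
  uint64_t allocated = 0, referenced = 0;
  for (const auto& b : blobs) {
    allocated += b.allocated_au;
    referenced += referenced_au(b, au);
  }
  if (referenced == 0)
    return allocated > 0;   // nothing referenced at all: pure garbage
  return double(allocated) / double(referenced) > 1.0 + slack;
}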

>
> sage



* Re: A way to reduce compression overhead
  2016-11-09 13:35   ` Igor Fedotov
@ 2016-11-11 23:42     ` Sage Weil
  2016-11-15 13:14       ` Igor Fedotov
  0 siblings, 1 reply; 7+ messages in thread
From: Sage Weil @ 2016-11-11 23:42 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-devel

On Wed, 9 Nov 2016, Igor Fedotov wrote:
> > I think this is pretty reasonable!  We have a couple options... we could
> > (1) just expose a compression alignment via ObjectStore, (2) take
> > compression alignment from a pool property, or (3) have an explicit
> > per-write call into ObjectStore so that it can chunk it up however it
> > likes.
> > 
> > Whatever we choose, the tricky bit is that there may be different stores
> > on different replicas.  Or we could let the primary just decide locally,
> > given that this is primarily an optimization; in the worst case we
> > compress something on the primary but one replica doesn't support
> > compression and just decompresses it before doing the write (i.e., we get
> > on-the-wire compression but no on-disk compression).
> IMHO different stores on different replicas is rather a corner case, and
> it's better (or simpler) to disable the compression optimization when that
> happens. Doing compression followed by decompression seems a bit ugly
> unless we're only after traffic compression.
> To disable compression preprocessing we can either have a manual switch in
> the config, or collect the remote OSDs' capabilities at the primary and
> disable preprocessing automatically. This can be done just once, so it
> wouldn't impact request handling performance.
> > I lean toward the simplicity of get_compression_alignment() and
> > get_compression_alg() (or similar) and just make a local (primary)
> > decision.  Then we just have a simple compatibility write_compressed()
> > implementation (or helper) that decompresses the payload so that we can do
> > a normal write.
> As for me, I always stand for better encapsulation of functionality -
> hence I'd prefer (3): the store does whatever it can and transparently
> passes the results to the replicas. This allows us to modify or extend the
> logic smoothly, e.g. to optimize csum calculation for big chunks, etc.
> With (1), by contrast, we expose most of this functionality to the store's
> client (i.e. the replicated backend stuff, not a real Ceph client). In
> fact, for (1) we'll have two potentially evolving APIs:
> - compressed (optimized) write request delivery;
> - a description of the store's optimizations provided to the client (i.e.
> the compression algorithm and alignment retrieval mentioned above).
> The latter isn't needed for (3).

The concern I have here is that it probably won't map well onto EC.  The
primary can't easily have the local ObjectStore chunking things up and
then "pass it to the replica"... there's an intermediate layer between the
replication code and the ObjectStore (and it is getting a bit more
sophisticated with the coming EC changes).

I think the simplest approach here would be to keep it simple.  For 
example, a min_alloc_size and max compressed chunk size specified for the 
pool.  The intermediate layer can apply the EC striping parameters, and 
then chunk/compress accordingly.
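
Roughly, the chunking driven by those two pool parameters could look like
this (a sketch only; the struct and field names are hypothetical, not
existing pool properties):

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical per-pool knobs the intermediate layer would read.
struct PoolCompressionParams {
  uint64_t min_alloc_size;        // alignment every chunk must respect
  uint64_t max_compressed_chunk;  // cap on a single compressed chunk
};

// Slice an already aligned extent (e.g. one EC stripe unit) into chunks
// that get compressed independently on the primary.
std::vector<std::pair<uint64_t, uint64_t>>  // (offset, length) pairs
chunk_extent(uint64_t offset, uint64_t length,
             const PoolCompressionParams& p)
{
  // round the cap down to a min_alloc_size multiple so chunks stay aligned
  uint64_t cap = p.max_compressed_chunk / p.min_alloc_size * p.min_alloc_size;
  if (cap == 0)
    cap = p.min_alloc_size;
  std::vector<std::pair<uint64_t, uint64_t>> chunks;
  for (uint64_t pos = offset; pos < offset + length; ) {
    uint64_t len = std::min(cap, offset + length - pos);
    chunks.emplace_back(pos, len);
    pos += len;
  }
  return chunks;
}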

I agree that worrying about client-side compression seems like a lot at 
this stage, but it's going to be the very next thing we ask about, so we 
should consider it to make sure we don't put up any major roadblocks.

Either way, though, we should probably wait for the EC overwrite changes 
to land...


As for GC,

> > I'm curious what you have in mind!  The blob_depth as currently
> > implemented is not terribly reliable...
> The general idea is to estimate the allocated vs. stored ratio for the
> blob(s) under the extent being written, where both allocated and stored
> are measured in allocation units and can be calculated from the blobs'
> ref_maps.
> If that ratio is greater than 1 (plus/minus some correction) we need to
> perform GC for these blobs. Given that we do this after compression
> preprocessing, it's expensive to merge the compressed extent being written
> with the old shards. Hence those shards are written as standalone extents,
> as opposed to the current implementation where we try to merge both new
> and existing extents into a single entity. Not a big drawback IMHO.
> Evidently this applies to new compressed extents (which are AU-aligned)
> only; uncompressed ones can be merged in any fashion.
> This is just a draft, so comments are highly appreciated.

Yeah, I think this is a more sensible approach (focusing on allocated vs 
referenced).  It seems like the most straightforward thing to do is 
actually look at the old_extents in the wctx--since those are the ref_maps 
that will become less referenced than before--in order to identify which 
blobs might need rewriting.  Avoiding the merge case vastly simplifies it.  
That also isn't any persistent metadata that we have to maintain (that 
might become incorrect or inconsistent).

We'd probably do the _do_write_data (which will do the various 
punch_hole's), then check for any gc work, then do the final 
_do_alloc_write and _wctx_finish?

sage



* Re: A way to reduce compression overhead
  2016-11-11 23:42     ` Sage Weil
@ 2016-11-15 13:14       ` Igor Fedotov
  2016-11-15 14:31         ` Sage Weil
  0 siblings, 1 reply; 7+ messages in thread
From: Igor Fedotov @ 2016-11-15 13:14 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Sage,
please find inline.
On 12.11.2016 2:42, Sage Weil wrote:
> On Wed, 9 Nov 2016, Igor Fedotov wrote:
>
> The concern I have here is that it probably won't map well onto EC.  The
> primary can't easily have the local ObjectStore chunking things up and
> then "pass it to the replica"... there's an intermediate layer between the
> replication code and the ObjectStore (and it is getting a bit more
> sophisticated with the coming EC changes).
>
> I think the simplest approach here would be to keep it simple.  For
> example, a min_alloc_size and max compressed chunk size specified for the
> pool.  The intermediate layer can apply the EC striping parameters, and
> then chunk/compress accordingly.
>
> I agree that worrying about client-side compression seems like a lot at
> this stage, but it's going to be the very next thing we ask about, so we
> should consider it to make sure we don't put up any major roadblocks.
>
> Either way, though, we should probably wait for the EC overwrite changes
> to land...
Got it, thanks. Will start working on a POC in the meantime.
>
> As for GC,
>
>>> I'm curious what you have in mind!  The blob_depth as currently
>>> implemented is not terribly reliable...
>> The general idea is to estimate the allocated vs. stored ratio for the
>> blob(s) under the extent being written, where both allocated and stored
>> are measured in allocation units and can be calculated from the blobs'
>> ref_maps.
>> If that ratio is greater than 1 (plus/minus some correction) we need to
>> perform GC for these blobs. Given that we do this after compression
>> preprocessing, it's expensive to merge the compressed extent being written
>> with the old shards. Hence those shards are written as standalone extents,
>> as opposed to the current implementation where we try to merge both new
>> and existing extents into a single entity. Not a big drawback IMHO.
>> Evidently this applies to new compressed extents (which are AU-aligned)
>> only; uncompressed ones can be merged in any fashion.
>> This is just a draft, so comments are highly appreciated.
> Yeah, I think this is a more sensible approach (focusing on allocated vs
> referenced).  It seems like the most straightforward thing to do is
> actually look at the old_extents in the wctx--since those are the ref_maps
> that will become less referenced than before--in order to identify which
> blobs might need rewriting.  Avoiding the merge case vastly simplifies it.
> That also isn't any persistent metadata that we have to maintain (that
> might become incorrect or inconsistent).
>
> We'd probably do the _do_write_data (which will do the various
> punch_hole's), then check for any gc work, then do the final
> _do_alloc_write and _wctx_finish?
>
Sounds good. Still need a detailed consistent algorithm though - working 
on that.
>



* Re: A way to reduce compression overhead
  2016-11-15 13:14       ` Igor Fedotov
@ 2016-11-15 14:31         ` Sage Weil
  2016-11-15 14:42           ` Igor Fedotov
  0 siblings, 1 reply; 7+ messages in thread
From: Sage Weil @ 2016-11-15 14:31 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-devel

On Tue, 15 Nov 2016, Igor Fedotov wrote:
> > > > I'm curious what you have in mind!  The blob_depth as currently
> > > > implemented is not terribly reliable...
> > > The general idea is to estimate the allocated vs. stored ratio for the
> > > blob(s) under the extent being written, where both allocated and stored
> > > are measured in allocation units and can be calculated from the blobs'
> > > ref_maps.
> > > If that ratio is greater than 1 (plus/minus some correction) we need to
> > > perform GC for these blobs. Given that we do this after compression
> > > preprocessing, it's expensive to merge the compressed extent being
> > > written with the old shards. Hence those shards are written as
> > > standalone extents, as opposed to the current implementation where we
> > > try to merge both new and existing extents into a single entity. Not a
> > > big drawback IMHO. Evidently this applies to new compressed extents
> > > (which are AU-aligned) only; uncompressed ones can be merged in any
> > > fashion.
> > > This is just a draft, so comments are highly appreciated.
> > Yeah, I think this is a more sensible approach (focusing on allocated vs
> > referenced).  It seems like the most straightforward thing to do is
> > actually look at the old_extents in the wctx--since those are the ref_maps
> > that will become less referenced than before--in order to identify which
> > blobs might need rewriting.  Avoiding the merge case vastly simplifies it.
> > That also isn't any persistent metadata that we have to maintain (that
> > might become incorrect or inconsistent).
> > 
> > We'd probably do the _do_write_data (which will do the various
> > punch_hole's), then check for any gc work, then do the final
> > _do_alloc_write and _wctx_finish?
> > 
> Sounds good. Still need a detailed consistent algorithm though - working on
> that.

In the meantime, perhaps we should remove the blob_depth code for now so 
it doesn't end up in the on-disk format.

sage


* Re: A way to reduce compression overhead
  2016-11-15 14:31         ` Sage Weil
@ 2016-11-15 14:42           ` Igor Fedotov
  0 siblings, 0 replies; 7+ messages in thread
From: Igor Fedotov @ 2016-11-15 14:42 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

You mean removing the GC stuff entirely, right?

Will do.


On 15.11.2016 17:31, Sage Weil wrote:
>
> In the meantime, perhaps we should remove the blob_depth code for now so
> it doesn't end up in the on-disk format.
>
> sage


