* Notes from a discussion of a design to allow EC overwrites
@ 2015-11-12 22:37 Samuel Just
  2015-11-13 13:35 ` Sage Weil
  0 siblings, 1 reply; 6+ messages in thread
From: Samuel Just @ 2015-11-12 22:37 UTC (permalink / raw)
  To: ceph-devel, Samuels, Allen, Durgin, Josh, Farnum, Gregory

I was present for a discussion about allowing EC overwrites and thought it
would be good to summarize it for the list:

Commit Protocol:
1) client sends write to primary
2) primary reads in partial stripes needed for partial stripe
overwrites from replicas
3) primary sends prepares to participating replicas and queues its own
prepare locally
4) once all prepares are complete, primary sends a commit to the client
5) primary sends applies to all participating replicas
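The five steps above can be sketched roughly as follows; this is a minimal illustrative model, not actual Ceph code, and all class and method names are made up (step 2, the partial-stripe read, is elided):

```python
# Hypothetical sketch of the two-phase commit flow described above.

class Replica:
    def __init__(self):
        self.prepared = {}   # txn id -> staged (temp object) data
        self.applied = {}    # txn id -> durably applied data

    def prepare(self, txn, data):
        # step 3: stage the write; ack once it is durable
        self.prepared[txn] = data
        return True

    def apply(self, txn):
        # step 5: move the staged data into the real object
        self.applied[txn] = self.prepared.pop(txn)

class Primary(Replica):
    def __init__(self, replicas):
        super().__init__()
        self.replicas = replicas

    def client_write(self, txn, data):
        participants = [self] + self.replicas
        acks = [r.prepare(txn, data) for r in participants]   # step 3
        if all(acks):                                         # step 4
            for r in participants:                            # step 5
                r.apply(txn)
            return "commit"
```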

When we get the prepare, we write out a temp object with the data to be
written.  On apply, we use an objectstore primitive to atomically move those
extents into the actual object.  The log entry contains the name/id for the
temp object so it can be applied on apply or removed on rollback.
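The temp-object bookkeeping might look like the sketch below: the log entry records the temp object's id so the entry can later be rolled forward (applied) or rolled back (the temp discarded). Everything here is illustrative; these are not the real objectstore or pg_log types.

```python
# Illustrative per-shard store: prepare stages data in a temp object,
# apply stands in for the atomic move-extents primitive, rollback
# discards the temp object and its log entry.

class ShardStore:
    def __init__(self):
        self.objects = {}     # object name -> {offset: data}
        self.temps = {}       # temp id -> (object, offset, data)
        self.log = []         # (version, temp id)

    def prepare(self, version, obj, offset, data):
        temp = "temp_%d" % version
        self.temps[temp] = (obj, offset, data)
        self.log.append((version, temp))
        return temp

    def apply(self, temp):
        # stands in for the objectstore primitive that atomically
        # moves the temp extents into the actual object
        obj, offset, data = self.temps.pop(temp)
        self.objects.setdefault(obj, {})[offset] = data

    def rollback(self, temp):
        # rollback simply discards the temp object and its entry
        self.temps.pop(temp)
        self.log = [(v, t) for v, t in self.log if t != temp]
```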

Each log entry contains a list of the shard ids modified.  During peering, we
use the same protocol for choosing the authoritative log for the existing EC
pool, except that we first take the longest candidate log and use it to extend
shorter logs until they hit an entry they should have witnessed, but didn't.
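That extension rule could be sketched like this (a hedged toy model, not the real peering code): each entry carries the set of shards it modified, so a shorter shard log can be extended from the longest candidate up to the first entry that names the shard but is absent from its log.

```python
# Extend a shorter shard log from the longest candidate log, stopping
# at the first entry the shard "should have witnessed, but didn't".

def extend_log(longest, shorter, shard):
    """Each log is a list of (version, shards_modified), oldest first."""
    tail = shorter[-1][0] if shorter else 0
    out = list(shorter)
    for version, shards in longest:
        if version <= tail:
            continue
        if shard in shards:
            break   # missing an entry this shard participated in: stop
        out.append((version, shards))
    return out
```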

Implicit in the above scheme is the fact that if an object is written, but a
particular shard isn't changed, the osd with that shard will have a copy of the
object with the correct data, but an out-of-date object_info (notably, the
version will be old).  To avoid this messing up the missing set during the log
scan on OSD start, we'll skip log entries we wouldn't have participated in (we
may not even choose to store them, see below).  This does generally pose a
challenge for maintaining prior_version.  It seems like it shouldn't be much of
a problem since rollbacks can only happen on prepared log entries which haven't
been applied, so log merging can never result in a divergent entry causing a
missing object.  I think we can get by without it then?

We can go further with the above and actually never persist a log entry on a
shard which it did not participate in.  As long as peering works correctly, the
union of the logs we got must have all entries.  The primary will need to keep
a complete copy in memory to manage dup op detection and recovery, however.

2) above can also be done much more efficiently.  Some codes will allow the
parity chunks to be reconstructed more efficiently, so some thought will
have to go into restructuring the plugin interface to allow more efficient
parity updates.
-Sam


* Re: Notes from a discussion of a design to allow EC overwrites
  2015-11-12 22:37 Notes from a discussion of a design to allow EC overwrites Samuel Just
@ 2015-11-13 13:35 ` Sage Weil
  2015-11-13 15:38   ` Samuel Just
  0 siblings, 1 reply; 6+ messages in thread
From: Sage Weil @ 2015-11-13 13:35 UTC (permalink / raw)
  To: Samuel Just; +Cc: ceph-devel, Samuels, Allen, Durgin, Josh, Farnum, Gregory

On Thu, 12 Nov 2015, Samuel Just wrote:
> I was present for a discussion about allowing EC overwrites and thought it
> would be good to summarize it for the list:
> 
> Commit Protocol:
> 1) client sends write to primary
> 2) primary reads in partial stripes needed for partial stripe
> overwrites from replicas
> 3) primary sends prepares to participating replicas and queues its own
> prepare locally
> 4) once all prepares are complete, primary sends a commit to the client
> 5) primary sends applies to all participating replicas
> 
> When we get the prepare, we write out a temp object with the data to be
> written.  On apply, we use an objectstore primitive to atomically move those
> extents into the actual object.  The log entry contains the name/id for the
> temp object so it can be applied on apply or removed on rollback.

Currently we assume that temp objects are/can be cleared out on restart.  
This will need to change.  And we'll need to be careful that they get 
cleaned out when peering completes (and the rollforward/rollback decision 
is made).

If the stripes are small, then the objectstore primitive may not actually 
be that efficient.  I'd suggest also hinting that the temp object will be 
swapped later, so that the backend can, if it's small, store it in a cheap 
temporary location in the expectation that it will get rewritten later.  
(In particular, the newstore allocation chunk is currently targeting 
512kb, and this will only be efficient with narrow stripes, so it'll just 
get double-written.  We'll want to keep the temp value in the kv store 
[log, hopefully] and not bother to allocate disk and rewrite it.)

> Each log entry contains a list of the shard ids modified.  During peering, we
> use the same protocol for choosing the authoritative log for the existing EC
> pool, except that we first take the longest candidate log and use it to extend
> shorter logs until they hit an entry they should have witnessed, but didn't.
> 
> Implicit in the above scheme is the fact that if an object is written, but a
> particular shard isn't changed, the osd with that shard will have a copy of the
> object with the correct data, but an out-of-date object_info (notably, the
> version will be old).  To avoid this messing up the missing set during the log
> scan on OSD start, we'll skip log entries we wouldn't have participated in (we
> may not even choose to store them, see below).  This does generally pose a
> challenge for maintaining prior_version.  It seems like it shouldn't be much of
> a problem since rollbacks can only happen on prepared log entries which haven't
> been applied, so log merging can never result in a divergent entry causing a
> missing object.  I think we can get by without it then?
> 
> We can go further with the above and actually never persist a log entry on a
> shard which it did not participate in.  As long as peering works correctly, the
> union of the logs we got must have all entries.  The primary will need to keep
> a complete copy in memory to manage dup op detection and recovery, however.

That sounds more complex to me.  Maybe instead we could lazily persist the 
entries (on the next pg write) so that it is always a contiguous sequence?

> 2) above can also be done much more efficiently.  Some codes will allow the
> parity chunks to be reconstructed more efficiently, so some thought will
> have to go into restructuring the plugin interface to allow more efficient

Hopefully the stripe size is chosen such that most writes will end up 
being full stripe writes (we should figure out if the EC performance 
degrades significantly in that case?).

An alternative would be to do something like

1) client sends write to primary
2) primary sends prepare to the first M+1 shards, who write it in a 
temporary object/location
3) primary acks write once they ack
4) asynchronously, primary recalculates the affected stripes and 
sends an overwrite.

 - step 4 doesn't need to be 2-phase, since we have the original data 
persisted already on >M shards
 - the client-observed latency is bounded by only M+1 OSDs (not 
acting.size())
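A toy sketch of this alternative, with hypothetical shard objects (purely illustrative, and with the asynchrony of step 4 collapsed into a direct call):

```python
# Ack the client once m+1 shards hold the raw write in a temporary
# location, then rewrite the affected stripes without a second phase.

class Shard:
    def __init__(self):
        self.staged = None    # temporary object/location
        self.stripe = None    # encoded stripe contents

    def stage(self, data):
        self.staged = data
        return True

    def overwrite(self, data):
        self.stripe = data
        self.staged = None

def fast_write(shards, m, data):
    journal = shards[: m + 1]                     # step 2
    acked = all(s.stage(data) for s in journal)   # step 3: ack client
    # step 4 (asynchronous in the real proposal): recompute affected
    # stripes and overwrite; safe without 2-phase because the raw data
    # already survives any m shard failures.
    for s in shards:
        s.overwrite(data)
    return acked
```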

I suspect you discussed this option, though, and have other concerns 
around its complexity?

sage


* Re: Notes from a discussion of a design to allow EC overwrites
  2015-11-13 13:35 ` Sage Weil
@ 2015-11-13 15:38   ` Samuel Just
  2015-11-13 16:16     ` Allen Samuels
  0 siblings, 1 reply; 6+ messages in thread
From: Samuel Just @ 2015-11-13 15:38 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, Samuels, Allen, Durgin, Josh, Farnum, Gregory

Lazily persisting the intermediate entries would certainly also work,
but there's an argument that it needlessly adds to the write
transaction.

Actually, we probably want to avoid having small writes be full stripe
writes -- with an 8+3 code the difference between modifying a single
stripelet and modifying the full stripe is 4 writes vs 11 writes.
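The arithmetic behind those counts: a small write touches the modified data shards plus all m parity shards, while a full-stripe write touches every shard.

```python
# Shard writes for a k+m erasure code.

def shard_writes(m, data_shards_touched):
    return data_shards_touched + m

K, M = 8, 3
assert shard_writes(M, 1) == 4    # one stripelet: 1 data + 3 parity
assert shard_writes(M, K) == 11   # full stripe: 8 data + 3 parity
```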

It means that during peering, any log we can find (particularly the
shortest one) from the most recent active interval isn't an upper
bound on writes committed to the client (the (actingset.size() - M -
1)th one is?) -- we'd have to think carefully about the implications
of that.
-Sam

On Fri, Nov 13, 2015 at 5:35 AM, Sage Weil <sweil@redhat.com> wrote:
> On Thu, 12 Nov 2015, Samuel Just wrote:
>> I was present for a discussion about allowing EC overwrites and thought it
>> would be good to summarize it for the list:
>>
>> Commit Protocol:
>> 1) client sends write to primary
>> 2) primary reads in partial stripes needed for partial stripe
>> overwrites from replicas
>> 3) primary sends prepares to participating replicas and queues its own
>> prepare locally
>> 4) once all prepares are complete, primary sends a commit to the client
>> 5) primary sends applies to all participating replicas
>>
>> When we get the prepare, we write out a temp object with the data to be
>> written.  On apply, we use an objectstore primitive to atomically move those
>> extents into the actual object.  The log entry contains the name/id for the
>> temp object so it can be applied on apply or removed on rollback.
>
> Currently we assume that temp objects are/can be cleared out on restart.
> This will need to change.  And we'll need to be careful that they get
> cleaned out when peering completes (and the rollforward/rollback decision
> is made).
>
> If the stripes are small, then the objectstore primitive may not actually
> be that efficient.  I'd suggest also hinting that the temp object will be
> swapped later, so that the backend can, if it's small, store it in a cheap
> temporary location in the expectation that it will get rewritten later.
> (In particular, the newstore allocation chunk is currently targeting
> 512kb, and this will only be efficient with narrow stripes, so it'll just
> get double-written.  We'll want to keep the temp value in the kv store
> [log, hopefully] and not bother to allocate disk and rewrite it.)
>
>> Each log entry contains a list of the shard ids modified.  During peering, we
>> use the same protocol for choosing the authoritative log for the existing EC
>> pool, except that we first take the longest candidate log and use it to extend
>> shorter logs until they hit an entry they should have witnessed, but didn't.
>>
>> Implicit in the above scheme is the fact that if an object is written, but a
>> particular shard isn't changed, the osd with that shard will have a copy of the
>> object with the correct data, but an out-of-date object_info (notably, the
>> version will be old).  To avoid this messing up the missing set during the log
>> scan on OSD start, we'll skip log entries we wouldn't have participated in (we
>> may not even choose to store them, see below).  This does generally pose a
>> challenge for maintaining prior_version.  It seems like it shouldn't be much of
>> a problem since rollbacks can only happen on prepared log entries which haven't
>> been applied, so log merging can never result in a divergent entry causing a
>> missing object.  I think we can get by without it then?
>>
>> We can go further with the above and actually never persist a log entry on a
>> shard which it did not participate in.  As long as peering works correctly, the
>> union of the logs we got must have all entries.  The primary will need to keep
>> a complete copy in memory to manage dup op detection and recovery, however.
>
> That sounds more complex to me.  Maybe instead we could lazily persist the
> entries (on the next pg write) so that it is always a contiguous sequence?
>
>> 2) above can also be done much more efficiently.  Some codes will allow the
>> parity chunks to be reconstructed more efficiently, so some thought will
>> have to go into restructuring the plugin interface to allow more efficient
>
> Hopefully the stripe size is chosen such that most writes will end up
> being full stripe writes (we should figure out if the EC performance
> degrades significantly in that case?).
>
> An alternative would be to do something like
>
> 1) client sends write to primary
> 2) primary sends prepare to the first M+1 shards, who write it in a
> temporary object/location
> 3) primary acks write once they ack
> 4) asynchronously, primary recalculates the affected stripes and
> sends an overwrite.
>
>  - step 4 doesn't need to be 2-phase, since we have the original data
> persisted already on >M shards
>  - the client-observed latency is bounded by only M+1 OSDs (not
> acting.size())
>
> I suspect you discussed this option, though, and have other concerns
> around its complexity?
>
> sage


* RE: Notes from a discussion of a design to allow EC overwrites
  2015-11-13 15:38   ` Samuel Just
@ 2015-11-13 16:16     ` Allen Samuels
  2015-11-13 18:26       ` Tianshan Qu
  0 siblings, 1 reply; 6+ messages in thread
From: Allen Samuels @ 2015-11-13 16:16 UTC (permalink / raw)
  To: Samuel Just, Sage Weil; +Cc: ceph-devel, Durgin, Josh, Farnum, Gregory

This scheme fundamentally relies on the temporary objects "gracefully" transitioning into being portions of full-up long-term durable objects.

This means that if the allocation size for a temporary object significantly mismatches the size of the mutation (partial stripe write) you're creating a problem that's proportional to the mismatch size. 

So either NewStore is able to efficiently allocate small chunks
OR
You have some kind of background cleanup process that reclaims that space (i.e., a "straightener")

The right choices depend on having a shared notion of the operational profile that you're trying to optimize for.

The fundamental question becomes, are you going to optimize for small-block random writes?

In my experience this is a key use-case in virtually every customer's evaluation scenario. I believe we MUST make this case reasonably efficient.

It seems to me that the lowest-complexity "fix" for the problem is to teach NewStore to have two different allocation sizes (big and small :)). 
Naturally the allocator becomes more complex. Worst case, you're now left with the garbage collection problem, which I suspect could be punted to a subsequent release (i.e., I'm out of large blocks, but there's plenty of fragmented available space -- this can happen, but it's a pretty pathological case which becomes rarer and rarer as you scale out).
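A sketch of the two-allocation-size idea, with made-up sizes and a deliberately naive free-block model (a real NewStore allocator would be far more involved):

```python
# Serve small EC overwrites from a small-extent pool and everything
# else from the big pool; the worst case above surfaces as an
# out-of-big-blocks error that a real system would fix with GC.

SMALL = 4 * 1024      # hypothetical small allocation unit
BIG = 512 * 1024      # the newstore chunk size mentioned upthread

class TwoSizeAllocator:
    def __init__(self, small_free, big_free):
        self.free = {SMALL: small_free, BIG: big_free}

    def alloc(self, length):
        unit = SMALL if length <= SMALL else BIG
        if self.free[unit] == 0:
            # out of blocks of this size: would need the background
            # "straightener"/garbage collector to reclaim space
            raise MemoryError("garbage collection needed")
        self.free[unit] -= 1
        return unit
```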

Allen Samuels
Software Architect, Emerging Storage Solutions 

2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


-----Original Message-----
From: Samuel Just [mailto:sjust@redhat.com] 
Sent: Friday, November 13, 2015 7:39 AM
To: Sage Weil <sweil@redhat.com>
Cc: ceph-devel@vger.kernel.org; Allen Samuels <Allen.Samuels@sandisk.com>; Durgin, Josh <jdurgin@redhat.com>; Farnum, Gregory <gfarnum@redhat.com>
Subject: Re: Notes from a discussion of a design to allow EC overwrites

Lazily persisting the intermediate entries would certainly also work, but there's an argument that it needlessly adds to the write transaction.

Actually, we probably want to avoid having small writes be full stripe writes -- with an 8+3 code the difference between modifying a single stripelet and modifying the full stripe is 4 writes vs 11 writes.

It means that during peering, any log we can find (particularly the shortest one) from the most recent active interval isn't an upper bound on writes committed to the client (the (actingset.size() - M - 1)th one is?) -- we'd have to think carefully about the implications of that.
-Sam

On Fri, Nov 13, 2015 at 5:35 AM, Sage Weil <sweil@redhat.com> wrote:
> On Thu, 12 Nov 2015, Samuel Just wrote:
>> I was present for a discussion about allowing EC overwrites and 
>> thought it would be good to summarize it for the list:
>>
>> Commit Protocol:
>> 1) client sends write to primary
>> 2) primary reads in partial stripes needed for partial stripe 
>> overwrites from replicas
>> 3) primary sends prepares to participating replicas and queues its 
>> own prepare locally
>> 4) once all prepares are complete, primary sends a commit to the 
>> client
>> 5) primary sends applies to all participating replicas
>>
>> When we get the prepare, we write out a temp object with the data to 
>> be written.  On apply, we use an objectstore primitive to atomically 
>> move those extents into the actual object.  The log entry contains 
>> the name/id for the temp object so it can be applied on apply or removed on rollback.
>
> Currently we assume that temp objects are/can be cleared out on restart.
> This will need to change.  And we'll need to be careful that they get 
> cleaned out when peering completes (and the rollforward/rollback 
> decision is made).
>
> If the stripes are small, then the objectstore primitive may not 
> actually be that efficient.  I'd suggest also hinting that the temp 
> object will be swapped later, so that the backend can, if it's small, 
> store it in a cheap temporary location in the expectation that it will get rewritten later.
> (In particular, the newstore allocation chunk is currently targeting 
> 512kb, and this will only be efficient with narrow stripes, so it'll 
> just get double-written.  We'll want to keep the temp value in the kv 
> store [log, hopefully] and not bother to allocate disk and rewrite 
> it.)
>
>> Each log entry contains a list of the shard ids modified.  During 
>> peering, we use the same protocol for choosing the authoritative log 
>> for the existing EC pool, except that we first take the longest 
>> candidate log and use it to extend shorter logs until they hit an entry they should have witnessed, but didn't.
>>
>> Implicit in the above scheme is the fact that if an object is 
>> written, but a particular shard isn't changed, the osd with that 
>> shard will have a copy of the object with the correct data, but an 
>> out-of-date object_info (notably, the version will be old).  To avoid 
>> this messing up the missing set during the log scan on OSD start, 
>> we'll skip log entries we wouldn't have participated in (we may not 
>> even choose to store them, see below).  This does generally pose a 
>> challenge for maintaining prior_version.  It seems like it shouldn't 
>> be much of a problem since rollbacks can only happen on prepared log 
>> entries which haven't been applied, so log merging can never result in a divergent entry causing a missing object.  I think we can get by without it then?
>>
>> We can go further with the above and actually never persist a log 
>> entry on a shard which it did not participate in.  As long as peering 
>> works correctly, the union of the logs we got must have all entries.  
>> The primary will need to keep a complete copy in memory to manage dup op detection and recovery, however.
>
> That sounds more complex to me.  Maybe instead we could lazily persist 
> the entries (on the next pg write) so that it is always a contiguous sequence?
>
>> 2) above can also be done much more efficiently.  Some codes will 
>> allow the parity chunks to be reconstructed more efficiently, so some 
>> thought will have to go into restructuring the plugin interface to 
>> allow more efficient
>
> Hopefully the stripe size is chosen such that most writes will end up 
> being full stripe writes (we should figure out if the EC performance 
> degrades significantly in that case?).
>
> An alternative would be to do something like
>
> 1) client sends write to primary
> 2) primary sends prepare to the first M+1 shards, who write it in a 
> temporary object/location
> 3) primary acks write once they ack
> 4) asynchronously, primary recalculates the affected stripes and sends 
> an overwrite.
>
>  - step 4 doesn't need to be 2-phase, since we have the original data 
> persisted already on >M shards
>  - the client-observed latency is bounded by only M+1 OSDs (not
> acting.size())
>
> I suspect you discussed this option, though, and have other concerns 
> around its complexity?
>
> sage


* Re: Notes from a discussion of a design to allow EC overwrites
  2015-11-13 16:16     ` Allen Samuels
@ 2015-11-13 18:26       ` Tianshan Qu
  2015-12-14 21:19         ` Samuel Just
  0 siblings, 1 reply; 6+ messages in thread
From: Tianshan Qu @ 2015-11-13 18:26 UTC (permalink / raw)
  To: Allen Samuels
  Cc: Samuel Just, Sage Weil, ceph-devel, Durgin, Josh, Farnum, Gregory

Hi:

Sorry for the half-sent message.

I've been working on this recently. I think we should distinguish
between the sequential-write case and the small-write case.

For the first, overwrite is enough. For small writes, the typical
optimization is a log-based write, which needs no EC reconstruction and
avoids the extra read of the parity chunks.

The FAST '14 paper (Parity Logging with Reserved Space: Towards
Efficient Updates and Recovery in Erasure-coded Clustered Storage)
proposes a way to improve performance further, by pre-allocating
reserved space after each parity chunk to avoid fragmentation.

I think the challenge in implementing this in Ceph is how to manage the
reserved space, since we may have no direct support for it.

Also, the kv store is another good choice for the log.

For the EC interface, I think we need an extra function to apply parity
changes.
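A rough sketch of that log-based parity-update path, using XOR as a stand-in for the real Galois-field arithmetic (everything here is illustrative, not the Ceph EC plugin API):

```python
# Instead of reconstructing and rewriting a parity chunk in place,
# append the parity delta to reserved space next to the chunk and fold
# the deltas in later (on read or during compaction).

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

class ParityChunk:
    def __init__(self, data):
        self.base = data
        self.deltas = []        # reserved-space log of parity deltas

    def log_update(self, old_data, new_data):
        # for an XOR code, the parity delta is old XOR new
        self.deltas.append(xor(old_data, new_data))

    def materialize(self):
        # fold logged deltas into the base parity
        out = self.base
        for d in self.deltas:
            out = xor(out, d)
        return out
```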



2015-11-14 0:16 GMT+08:00 Allen Samuels <Allen.Samuels@sandisk.com>:
> This scheme fundamentally relies on the temporary objects "gracefully" transitioning into being portions of full-up long-term durable objects.
>
> This means that if the allocation size for a temporary object significantly mismatches the size of the mutation (partial stripe write) you're creating a problem that's proportional to the mismatch size.
>
> So either, NewStore is able to efficiently allocate small chunks
> OR
> You have some kind of background cleanup process that reclaims that space (i.e., a "straightener")
>
> The right choices depend on having a shared notion of the operational profile that you're trying to optimize for.
>
> The fundamental question becomes, are you going to optimize for small-block random writes?
>
> In my experience this is a key use-case in virtually every customer's evaluation scenario. I believe we MUST make this case reasonably efficient.
>
> It seems to me that the lowest-complexity "fix" for the problem is to teach NewStore to have two different allocation sizes (big and small :)).
> Naturally the allocator becomes more complex. Worst case, you're now left with the garbage collection problem, which I suspect could be punted to a subsequent release (i.e., I'm out of large blocks, but there's plenty of fragmented available space -- this can happen, but it's a pretty pathological case which becomes rarer and rarer as you scale out).
>
> Allen Samuels
> Software Architect, Emerging Storage Solutions
>
> 2880 Junction Avenue, Milpitas, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
>
> -----Original Message-----
> From: Samuel Just [mailto:sjust@redhat.com]
> Sent: Friday, November 13, 2015 7:39 AM
> To: Sage Weil <sweil@redhat.com>
> Cc: ceph-devel@vger.kernel.org; Allen Samuels <Allen.Samuels@sandisk.com>; Durgin, Josh <jdurgin@redhat.com>; Farnum, Gregory <gfarnum@redhat.com>
> Subject: Re: Notes from a discussion of a design to allow EC overwrites
>
> Lazily persisting the intermediate entries would certainly also work, but there's an argument that it needlessly adds to the write transaction.
>
> Actually, we probably want to avoid having small writes be full stripe writes -- with an 8+3 code the difference between modifying a single stripelet and modifying the full stripe is 4 writes vs 11 writes.
>
> It means that during peering, any log we can find (particularly the shortest one) from the most recent active interval isn't an upper bound on writes committed to the client (the (actingset.size() - M - 1)th one is?) -- we'd have to think carefully about the implications of that.
> -Sam
>
> On Fri, Nov 13, 2015 at 5:35 AM, Sage Weil <sweil@redhat.com> wrote:
>> On Thu, 12 Nov 2015, Samuel Just wrote:
>>> I was present for a discussion about allowing EC overwrites and
>>> thought it would be good to summarize it for the list:
>>>
>>> Commit Protocol:
>>> 1) client sends write to primary
>>> 2) primary reads in partial stripes needed for partial stripe
>>> overwrites from replicas
>>> 3) primary sends prepares to participating replicas and queues its
>>> own prepare locally
>>> 4) once all prepares are complete, primary sends a commit to the
>>> client
>>> 5) primary sends applies to all participating replicas
>>>
>>> When we get the prepare, we write out a temp object with the data to
>>> be written.  On apply, we use an objectstore primitive to atomically
>>> move those extents into the actual object.  The log entry contains
>>> the name/id for the temp object so it can be applied on apply or removed on rollback.
>>
>> Currently we assume that temp objects are/can be cleared out on restart.
>> This will need to change.  And we'll need to be careful that they get
>> cleaned out when peering completes (and the rollforward/rollback
>> decision is made).
>>
>> If the stripes are small, then the objectstore primitive may not
>> actually be that efficient.  I'd suggest also hinting that the temp
>> object will be swapped later, so that the backend can, if it's small,
>> store it in a cheap temporary location in the expectation that it will get rewritten later.
>> (In particular, the newstore allocation chunk is currently targeting
>> 512kb, and this will only be efficient with narrow stripes, so it'll
>> just get double-written.  We'll want to keep the temp value in the kv
>> store [log, hopefully] and not bother to allocate disk and rewrite
>> it.)
>>
>>> Each log entry contains a list of the shard ids modified.  During
>>> peering, we use the same protocol for choosing the authoritative log
>>> for the existing EC pool, except that we first take the longest
>>> candidate log and use it to extend shorter logs until they hit an entry they should have witnessed, but didn't.
>>>
>>> Implicit in the above scheme is the fact that if an object is
>>> written, but a particular shard isn't changed, the osd with that
>>> shard will have a copy of the object with the correct data, but an
>>> out-of-date object_info (notably, the version will be old).  To avoid
>>> this messing up the missing set during the log scan on OSD start,
>>> we'll skip log entries we wouldn't have participated in (we may not
>>> even choose to store them, see below).  This does generally pose a
>>> challenge for maintaining prior_version.  It seems like it shouldn't
>>> be much of a problem since rollbacks can only happen on prepared log
>>> entries which haven't been applied, so log merging can never result in a divergent entry causing a missing object.  I think we can get by without it then?
>>>
>>> We can go further with the above and actually never persist a log
>>> entry on a shard which it did not participate in.  As long as peering
>>> works correctly, the union of the logs we got must have all entries.
>>> The primary will need to keep a complete copy in memory to manage dup op detection and recovery, however.
>>
>> That sounds more complex to me.  Maybe instead we could lazily persist
>> the entries (on the next pg write) so that it is always a contiguous sequence?
>>
>>> 2) above can also be done much more efficiently.  Some codes will
>>> allow the parity chunks to be reconstructed more efficiently, so some
>>> thought will have to go into restructuring the plugin interface to
>>> allow more efficient
>>
>> Hopefully the stripe size is chosen such that most writes will end up
>> being full stripe writes (we should figure out if the EC performance
>> degrades significantly in that case?).
>>
>> An alternative would be to do something like
>>
>> 1) client sends write to primary
>> 2) primary sends prepare to the first M+1 shards, who write it in a
>> temporary object/location
>> 3) primary acks write once they ack
>> 4) asynchronously, primary recalculates the affected stripes and sends
>> an overwrite.
>>
>>  - step 4 doesn't need to be 2-phase, since we have the original data
>> persisted already on >M shards
>>  - the client-observed latency is bounded by only M+1 OSDs (not
>> acting.size())
>>
>> I suspect you discussed this option, though, and have other concerns
>> around its complexity?
>>
>> sage


* Re: Notes from a discussion of a design to allow EC overwrites
  2015-11-13 18:26       ` Tianshan Qu
@ 2015-12-14 21:19         ` Samuel Just
  0 siblings, 0 replies; 6+ messages in thread
From: Samuel Just @ 2015-12-14 21:19 UTC (permalink / raw)
  To: Tianshan Qu
  Cc: Allen Samuels, Sage Weil, ceph-devel, Durgin, Josh, Farnum, Gregory

I created a set of tickets for what I think needs to happen:
http://tracker.ceph.com/issues/14031.  There's a doc linked from that
ticket with more details on the design.
-Sam

On Fri, Nov 13, 2015 at 10:26 AM, Tianshan Qu <qutianshan@gmail.com> wrote:
> Hi:
>
> Sorry for the half-sent message.
>
> I've been working on this recently. I think we should distinguish
> between the sequential-write case and the small-write case.
>
> For the first, overwrite is enough. For small writes, the typical
> optimization is a log-based write, which needs no EC reconstruction and
> avoids the extra read of the parity chunks.
>
> The FAST '14 paper (Parity Logging with Reserved Space: Towards
> Efficient Updates and Recovery in Erasure-coded Clustered Storage)
> proposes a way to improve performance further, by pre-allocating
> reserved space after each parity chunk to avoid fragmentation.
>
> I think the challenge in implementing this in Ceph is how to manage the
> reserved space, since we may have no direct support for it.
>
> Also, the kv store is another good choice for the log.
>
> For the EC interface, I think we need an extra function to apply parity
> changes.
>
>
>
> 2015-11-14 0:16 GMT+08:00 Allen Samuels <Allen.Samuels@sandisk.com>:
>> This scheme fundamentally relies on the temporary objects "gracefully" transitioning into being portions of full-up long-term durable objects.
>>
>> This means that if the allocation size for a temporary object significantly mismatches the size of the mutation (partial stripe write) you're creating a problem that's proportional to the mismatch size.
>>
>> So either, NewStore is able to efficiently allocate small chunks
>> OR
>> You have some kind of background cleanup process that reclaims that space (i.e., a "straightener")
>>
>> The right choices depend on having a shared notion of the operational profile that you're trying to optimize for.
>>
>> The fundamental question becomes, are you going to optimize for small-block random writes?
>>
>> In my experience this is a key use-case in virtually every customer's evaluation scenario. I believe we MUST make this case reasonably efficient.
>>
>> It seems to me that the lowest-complexity "fix" for the problem is to teach NewStore to have two different allocation sizes (big and small :)).
>> Naturally the allocator becomes more complex. Worst case, you're now left with the garbage collection problem, which I suspect could be punted to a subsequent release (i.e., I'm out of large blocks, but there's plenty of fragmented available space -- this can happen, but it's a pretty pathological case which becomes rarer and rarer as you scale out).
>>
>> Allen Samuels
>> Software Architect, Emerging Storage Solutions
>>
>> 2880 Junction Avenue, Milpitas, CA 95134
>> T: +1 408 801 7030| M: +1 408 780 6416
>> allen.samuels@SanDisk.com
>>
>>
>> -----Original Message-----
>> From: Samuel Just [mailto:sjust@redhat.com]
>> Sent: Friday, November 13, 2015 7:39 AM
>> To: Sage Weil <sweil@redhat.com>
>> Cc: ceph-devel@vger.kernel.org; Allen Samuels <Allen.Samuels@sandisk.com>; Durgin, Josh <jdurgin@redhat.com>; Farnum, Gregory <gfarnum@redhat.com>
>> Subject: Re: Notes from a discussion a design to allow EC overwrites
>>
>> Lazily persisting the intermediate entries would certainly also work, but there's an argument that it needlessly adds to the write transaction.
>>
>> Actually, we probably want to avoid having small writes be full stripe writes -- with an 8+3 code the difference between modifying a single stripelet and modifying the full stripe is 4 writes vs 11 writes.
>>
>> It means that during peering, any log we can find (particularly the shortest one) from the most recent active interval isn't an upper bound on writes committed to the client (the (actingset.size() - M - 1)th one is?) -- we'd have to think carefully about the implications of that.
>> -Sam
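The 4-vs-11 arithmetic above generalizes. A tiny sketch, assuming every write must update all M parity shards (function name is illustrative):

```python
# For a K+M code: a partial-stripe write touches the modified data shards
# plus all M parity shards; a full-stripe write touches all K + M shards.

def shard_writes(k, m, data_shards_modified):
    return data_shards_modified + m  # every write must update all parity shards

K, M = 8, 3
print(shard_writes(K, M, 1))   # 4  (single stripelet modified)
print(shard_writes(K, M, K))   # 11 (full stripe)
```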
>>
>> On Fri, Nov 13, 2015 at 5:35 AM, Sage Weil <sweil@redhat.com> wrote:
>>> On Thu, 12 Nov 2015, Samuel Just wrote:
>>>> I was present for a discussion about allowing EC overwrites and
>>>> thought it would be good to summarize it for the list:
>>>>
>>>> Commit Protocol:
>>>> 1) client sends write to primary
>>>> 2) primary reads in partial stripes needed for partial stripe
>>>> overwrites from replicas
>>>> 3) primary sends prepares to participating replicas and queues its
>>>> own prepare locally
>>>> 4) once all prepares are complete, primary sends a commit to the
>>>> client
>>>> 5) primary sends applies to all participating replicas
>>>>
>>>> When we get the prepare, we write out a temp object with the data to
>>>> be written.  On apply, we use an objectstore primitive to atomically
>>>> move those extents into the actual object.  The log entry contains
>>>> the name/id for the temp object so it can be applied on apply or removed on rollback.
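The prepare/apply flow quoted above can be sketched in-memory. All class and method names here are hypothetical; durability, failure handling, and the partial-stripe reads of step 2 are elided (the real OSD code is asynchronous C++ over durable transactions):

```python
# Minimal sketch of the five-step commit protocol: prepare writes a temp
# object, apply atomically moves its extents into the real object.

class Replica:
    def __init__(self):
        self.obj = {}       # extent offset -> bytes (the durable object)
        self.temp = {}      # txn id -> prepared extents (the temp object)

    def prepare(self, txn, extents):
        self.temp[txn] = extents             # write the temp object durably
        return True                          # ack the prepare

    def apply(self, txn):
        self.obj.update(self.temp.pop(txn))  # atomically move extents in

class Primary(Replica):
    def client_write(self, txn, extents, replicas):
        # step 2 (reading in partial stripes) elided in this sketch
        acks = [r.prepare(txn, extents) for r in replicas + [self]]  # step 3
        assert all(acks)                     # step 4: commit to client here
        for r in replicas + [self]:          # step 5: apply everywhere
            r.apply(txn)
        return "committed"

p, r1, r2 = Primary(), Replica(), Replica()
p.client_write(1, {0: b"new data"}, [r1, r2])
print(r1.obj[0])   # b'new data'
```

Rollback (not shown) would simply drop the temp entry instead of applying it, matching the log-entry name/id described above.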
>>>
>>> Currently we assume that temp objects are/can be cleared out on restart.
>>> This will need to change.  And we'll need to be careful that they get
>>> cleaned out when peering completes (and the rollforward/rollback
>>> decision is made).
>>>
>>> If the stripes are small, then the objectstore primitive may not
>>> actually be that efficient.  I'd suggest also hinting that the temp
>>> object will be swapped later, so that the backend can, if it's small,
>>> store it in a cheap temporary location in the expectation that it will get rewritten later.
>>> (In particular, the newstore allocation chunk is currently targeting
>>> 512kb, and this will only be efficient with narrow stripes, so it'll
>>> just get double-written.  We'll want to keep the temp value in the kv
>>> store [log, hopefully] and not bother to allocate disk and rewrite
>>> it.)
>>>
>>>> Each log entry contains a list of the shard ids modified.  During
>>>> peering, we use the same protocol for choosing the authoritative log
>>>> for the existing EC pool, except that we first take the longest
>>>> candidate log and use it to extend shorter logs until they hit an entry they should have witnessed, but didn't.
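The log-extension rule quoted above can be sketched as follows, assuming each entry carries the set of shard ids it modified (all field and function names are illustrative):

```python
# Extend a shorter shard log from the longest candidate, stopping at the
# first entry the shard should have witnessed (its id is listed) but didn't.

def extend_log(longest, short, shard_id):
    have = {e["version"] for e in short}
    out = list(short)
    for entry in longest:
        if entry["version"] in have:
            continue
        if shard_id in entry["shards"]:
            break            # should have witnessed this one: stop extending
        out.append(entry)    # shard didn't participate, safe to copy forward
    return sorted(out, key=lambda e: e["version"])

longest = [{"version": v, "shards": s} for v, s in
           [(1, {0, 1, 2}), (2, {0, 2}), (3, {0, 1})]]
short = [longest[0]]                      # shard 1's log stops at version 1
# Version 2 didn't involve shard 1, so it's copied forward; version 3 did,
# so extension stops there.
print([e["version"] for e in extend_log(longest, short, 1)])  # [1, 2]
```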
>>>>
>>>> Implicit in the above scheme is the fact that if an object is
>>>> written, but a particular shard isn't changed, the osd with that
>>>> shard will have a copy of the object with the correct data, but an
>>>> out of date object_info (notably, the version will be old).  To avoid
>>>> this messing up the missing set during the log scan on OSD start,
>>>> we'll skip log entries we wouldn't have participated in (we may not
>>>> even choose to store them, see below).  This does generally pose a
>>>> challenge for maintaining prior_version.  It seems like it shouldn't
>>>> be much of a problem since rollbacks can only happen on prepared log
>>>> entries which haven't been applied, so log merging can never result in a divergent entry causing a missing object.  I think we can get by without it then?
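The participation-based skip during the log scan can be sketched with hypothetical field names:

```python
# Build the missing set from the log, skipping entries whose shard list
# doesn't include this shard -- so an out-of-date object_info on an
# unmodified shard doesn't make the object look missing.

def build_missing(log, shard_id, on_disk_versions):
    missing = set()
    for entry in log:
        if shard_id not in entry["shards"]:
            continue  # this shard never participated; its copy is fine
        if on_disk_versions.get(entry["obj"], 0) < entry["version"]:
            missing.add(entry["obj"])
    return missing

log = [{"obj": "A", "version": 2, "shards": {0, 2}},
       {"obj": "A", "version": 3, "shards": {0, 1}}]
# Shard 2's on-disk object_info still says version 2; entry v3 is skipped.
print(build_missing(log, 2, {"A": 2}))   # set()
print(build_missing(log, 0, {"A": 2}))   # {'A'}
```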
>>>>
>>>> We can go further with the above and actually never persist a log
>>>> entry on a shard which it did not participate in.  As long as peering
>>>> works correctly, the union of the logs we got must have all entries.
>>>> The primary will need to keep a complete copy in memory to manage dup op detection and recovery, however.
>>>
>>> That sounds more complex to me.  Maybe instead we could lazily persist
>>> the entries (on the next pg write) so that it is always a contiguous sequence?
>>>
>>>> 2) above can also be done much more efficiently.  Some codes will
>>>> allow the parity chunks to be reconstructed more efficiently, so some
>>>> thought will have to go into restructuring the plugin interface to
>>>> allow more efficient
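The "more efficient parity reconstruction" idea can be illustrated for the simplest linear code, a single XOR parity; real Reed-Solomon plugins admit the analogous per-coefficient update, which is what the restructured plugin interface would need to expose:

```python
# For linear codes, parity can be patched from the data delta alone, without
# reading the other K-1 data shards.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

data = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]   # K = 3 data shards
parity = xor(xor(data[0], data[1]), data[2])     # full-stripe encode

new_shard = b"\xff\x00"
delta = xor(data[1], new_shard)                  # old XOR new
patched = xor(parity, delta)                     # parity update from delta

data[1] = new_shard
recomputed = xor(xor(data[0], data[1]), data[2]) # what a full encode gives
print(patched == recomputed)                     # True
```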
>>>
>>> Hopefully the stripe size is chosen such that most writes will end up
>>> being full stripe writes (we should figure out if the EC performance
>>> degrades significantly in that case?).
>>>
>>> An alternative would be to do something like
>>>
>>> 1) client sends write to primary
>>> 2) primary sends prepare to the first M+1 shards, who write it in a
>>> temporary object/location
>>> 3) primary acks write once they ack
>>> 4) asynchronously, primary recalculates the affected stripes and sends
>>> an overwrite.
>>>
>>>  - step 4 doesn't need to be 2-phase, since we have the original data
>>> persisted already on >M shards
>>>  - the client-observed latency is bounded by only M+1 OSDs (not
>>> acting.size())
>>>
>>> I suspect you discussed this option, though, and have other concerns
>>> around its complexity?
>>>
>>> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-12 22:37 Notes from a discussion a design to allow EC overwrites Samuel Just
2015-11-13 13:35 ` Sage Weil
2015-11-13 15:38   ` Samuel Just
2015-11-13 16:16     ` Allen Samuels
2015-11-13 18:26       ` Tianshan Qu
2015-12-14 21:19         ` Samuel Just
