* RBD journal draft design
       [not found] <1574383603.9391063.1433257824183.JavaMail.zimbra@redhat.com>
@ 2015-06-02 15:11 ` Jason Dillaman
  2015-06-03  0:39   ` Gregory Farnum
  2015-06-03 10:47   ` John Spray
  0 siblings, 2 replies; 12+ messages in thread
From: Jason Dillaman @ 2015-06-02 15:11 UTC (permalink / raw)
  To: Ceph Development

I am posting to get wider review/feedback on this draft design.  In support of the RBD mirroring feature [1], a new client-side journaling class will be developed for use by librbd.  The implementation is designed to carry opaque journal entry payloads so that it can be reused by other applications in the future.  It will also use the librados API for all operations.  At a high level, a single journal will be composed of a journal header to store metadata and multiple journal objects to contain the individual journal entries.

Journal objects will be named "<journal object prefix>.<journal id>.<object number>".  An individual journal object will hold one or more journal entries, appended one after another.  Journal objects will have a configurable soft maximum size.  After the size has been exceeded, a new journal object (numbered current object + number of journal objects) will be created for future journal entries and the header active set will be updated so that other clients know that a new journal object was created.

In contrast to the current journal code used by CephFS, the new journal code will use sequence numbers to identify journal entries, instead of offsets within the journal.  Additionally, a given journal entry will not be striped across multiple journal objects.  Journal entries will be mapped to journal objects using the sequence number: <sequence number> mod <splay count> == <object number> mod <splay count> for active journal objects.
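
As a concrete illustration, here is a minimal sketch of the naming and mapping rules above; the helper names are placeholders and not part of any proposed API.

  #include <cstdint>
  #include <sstream>
  #include <string>
  #include <vector>

  // "<journal object prefix>.<journal id>.<object number>"
  std::string journal_object_name(const std::string &prefix,
                                  const std::string &journal_id,
                                  uint64_t object_number) {
    std::ostringstream oss;
    oss << prefix << "." << journal_id << "." << object_number;
    return oss.str();
  }

  // Pick the active journal object whose number shares the sequence number's
  // residue modulo the splay count.
  uint64_t object_for_sequence(uint64_t sequence_number,
                               const std::vector<uint64_t> &active_objects,
                               uint64_t splay_count) {
    for (uint64_t object_number : active_objects) {
      if (object_number % splay_count == sequence_number % splay_count) {
        return object_number;
      }
    }
    // With a well-formed active set (one object per residue) this is unreachable.
    return active_objects.front();
  }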

The rationale for this difference is to facilitate parallelism for appends, as journal entries will be splayed across a configurable number of journal objects.  The journal API for appending a new journal entry will return a future which can be used to retrieve the assigned sequence number for the submitted journal entry payload once it is committed to disk.  The use of a future allows for asynchronous journal entry submissions by default and can be used to simplify integration with the client-side cache writeback handler (and, as a potential future enhancement, to delay appends to the journal in order to satisfy EC-pool alignment requirements).
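
For illustration only, a toy, self-contained simulation of the futures-based append flow; this is not the proposed journaler API, and the class name and simulated delay are invented just to show the pattern.

  #include <chrono>
  #include <cstdint>
  #include <future>
  #include <iostream>
  #include <string>
  #include <thread>

  class ToyJournaler {
   public:
    // Assigns the next sequence number immediately and completes the future
    // asynchronously once the (simulated) append is safe on disk.  A real
    // implementation would complete it from the librados AIO callback.
    std::future<uint64_t> append(std::string payload) {
      uint64_t seq = next_seq_++;
      return std::async(std::launch::async, [seq, p = std::move(payload)]() {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
        return seq;
      });
    }

   private:
    uint64_t next_seq_ = 0;
  };

  int main() {
    ToyJournaler journaler;
    auto f1 = journaler.append("write A");
    auto f2 = journaler.append("write B");  // issued without waiting on f1
    std::cout << "A -> seq " << f1.get() << ", B -> seq " << f2.get() << std::endl;
    return 0;
  }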

Sequence numbers are treated as monotonically increasing integers for a given journal entry tag value.  This allows multiple clients to use the same journal concurrently (e.g. all RBD disks within a given VM could use the same journal).  This will provide a loose coupling of operations between different clients using the same journal.

A new journal object class method will be used to submit journal entry append requests.  This will act as a gatekeeper for the concurrent client case.  A successful append will indicate whether or not the journal object is now full (larger than the soft max object size), indicating to the client that a new journal object should be used.  If the journal object is already too large, an error code response will alert the client that it needs to redirect the write to the current active journal object.  In practice, the only time the journaler should expect to see such a response would be in the case where multiple clients are using the same journal and the active object update notification has yet to be received.
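
A rough sketch of the check the gatekeeper method could perform (plain C++ standing in for the OSD object class; the -EOVERFLOW code and the "full" flag are assumptions based on the description above):

  #include <cerrno>
  #include <cstdint>

  struct AppendReply {
    int r;             // 0 on success; -EOVERFLOW (assumed) if the object may no longer be appended to
    bool object_full;  // set when this append pushed the object past its soft max size
  };

  AppendReply guarded_append(uint64_t current_object_size,
                             uint64_t entry_size,
                             uint64_t soft_max_size) {
    AppendReply reply{0, false};
    if (current_object_size >= soft_max_size) {
      // Already full before this append: redirect the client to the
      // current active journal object.
      reply.r = -EOVERFLOW;
      return reply;
    }
    // ... append the journal entry here (omitted in this sketch) ...
    reply.object_full = (current_object_size + entry_size >= soft_max_size);
    return reply;
  }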

All the journal objects will be tied together by means of a journal header object, named "<journal header prefix>.<journal id>".  This object will contain the current committed journal entry positions of all registered clients.  In librbd's case, each mirrored copy of an image would be a new registered client.  OSD class methods would be used to create/manipulate the journal header to serialize modifications from multiple clients.

Journal recovery / playback will iterate through each journal entry in the journal, in sequence order.  Journal objects will be prefetched (where possible) up to a configurable amount to reduce latency.  Journal entry playback can use an optional client-specified filter to iterate only over entries with a matching journal entry tag.  The API will need to support the use case of an external client periodically polling the journal for new data.
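
A simplified sketch of the playback path for one prefetched set of journal objects; the Entry type, the merge step, and the pointer-based optional filter are illustrative assumptions, not the final API.

  #include <algorithm>
  #include <cstdint>
  #include <functional>
  #include <string>
  #include <vector>

  struct Entry {
    std::string tag;
    uint64_t sequence_number;
    std::string payload;
  };

  // Entries prefetched from the <splay count> active objects are merged back
  // into sequence order before being handed to the replay handler.  Sequence
  // numbers are monotonic per tag, so the optional tag filter is applied first.
  void replay_object_set(std::vector<Entry> entries,
                         const std::string *tag_filter,  // optional
                         const std::function<void(const Entry &)> &handler) {
    if (tag_filter != nullptr) {
      entries.erase(std::remove_if(entries.begin(), entries.end(),
                                   [&](const Entry &e) { return e.tag != *tag_filter; }),
                    entries.end());
    }
    std::sort(entries.begin(), entries.end(),
              [](const Entry &a, const Entry &b) {
                return a.sequence_number < b.sequence_number;
              });
    for (const Entry &e : entries) {
      handler(e);
    }
  }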
 
Journal trimming will be accomplished by removing a whole journal object.  Only after all registered users of the journal have indicated that they have committed all journal entries within the journal object (via an update to the journal header metadata) will the journal object be deleted and the header updated to indicate the new starting object number.
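
To illustrate the trim rule, a sketch under the assumption that the journaler can determine the newest entry stored in the candidate object for each registered client (per-tag bookkeeping is glossed over, and the field names are placeholders):

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  struct RegisteredClient {
    uint64_t last_committed_sequence;  // as recorded in the journal header omap
  };

  // A journal object may only be removed once every registered client has
  // committed past the newest entry stored in that object.
  bool can_trim_object(uint64_t newest_sequence_in_object,
                       const std::vector<RegisteredClient> &clients) {
    return !clients.empty() &&
           std::all_of(clients.begin(), clients.end(),
                       [newest_sequence_in_object](const RegisteredClient &c) {
                         return c.last_committed_sequence >= newest_sequence_in_object;
                       });
  }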

Since the journal is designed to be append-only, there needs to be support for cases where a journal entry needs to be updated out-of-band (e.g. fixing a corrupt entry, similar to CephFS's current journal recovery tools).  The proposed solution is to append, at the end of the journal, a new journal entry with the same sequence number as the record being replaced (i.e. the last entry for a given sequence number wins).  This also protects against accidental replays of the original append operation.  An alternative suggestion would be to use a compare-and-swap mechanism to update the full journal object with the updated contents.
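
A sketch of the "last entry for a given sequence number wins" rule during replay scanning; Entry is a placeholder type, and keying duplicates per (tag, sequence number) is an assumption.

  #include <cstdint>
  #include <map>
  #include <string>
  #include <utility>
  #include <vector>

  struct Entry {
    std::string tag;
    uint64_t sequence_number;
    std::string payload;
  };

  // Scan entries in the order they appear in the journal object; a later
  // occurrence of the same (tag, sequence number) replaces the earlier one.
  std::map<std::pair<std::string, uint64_t>, Entry>
  resolve_duplicates(const std::vector<Entry> &scanned) {
    std::map<std::pair<std::string, uint64_t>, Entry> resolved;
    for (const Entry &e : scanned) {
      resolved[{e.tag, e.sequence_number}] = e;  // last entry wins
    }
    return resolved;
  }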

Journal Header
~~~~~~~~~~~~~~

omap
* soft max object size
* journal objects splay count
* min object number
* most recent active journal objects (could be out-of-date)
* registered clients
  * client description (i.e. zone)
  * journal entry tag
  * last committed sequence number

Journal Object
~~~~~~~~~~~~~~

Data
* 1..N: <Journal Entry>

Journal Entries
~~~~~~~~~~~~~~~

Header
* version
* tag
* sequence number
* data size

Data
* raw payload

Footer
* CRC of journal entry header + data
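
A rough framing sketch matching the layout above; the field widths, little-endian byte order, and numeric tag are assumptions (a real implementation would use Ceph's encode/decode machinery), and the CRC is taken as an input rather than computed here.

  #include <cstdint>
  #include <string>
  #include <vector>

  struct JournalEntry {
    uint8_t version = 1;
    uint64_t tag = 0;             // placeholder: numeric journal entry tag
    uint64_t sequence_number = 0;
    std::string payload;          // opaque data, e.g. an encoded librbd event
  };

  static void put_le32(std::vector<uint8_t> &out, uint32_t v) {
    for (int i = 0; i < 4; ++i) out.push_back(static_cast<uint8_t>(v >> (8 * i)));
  }

  static void put_le64(std::vector<uint8_t> &out, uint64_t v) {
    for (int i = 0; i < 8; ++i) out.push_back(static_cast<uint8_t>(v >> (8 * i)));
  }

  std::vector<uint8_t> encode_entry(const JournalEntry &e, uint32_t crc_of_header_and_data) {
    std::vector<uint8_t> out;
    out.push_back(e.version);                                   // header: version
    put_le64(out, e.tag);                                       // header: tag
    put_le64(out, e.sequence_number);                           // header: sequence number
    put_le64(out, e.payload.size());                            // header: data size
    out.insert(out.end(), e.payload.begin(), e.payload.end());  // data: raw payload
    put_le32(out, crc_of_header_and_data);                      // footer: CRC over header + data
    return out;
  }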

[1] http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/24929

-- 

Jason Dillaman 
Red Hat 
dillaman@redhat.com 
http://www.redhat.com 




* Re: RBD journal draft design
  2015-06-02 15:11 ` RBD journal draft design Jason Dillaman
@ 2015-06-03  0:39   ` Gregory Farnum
  2015-06-03 16:13     ` Jason Dillaman
  2015-06-03 10:47   ` John Spray
  1 sibling, 1 reply; 12+ messages in thread
From: Gregory Farnum @ 2015-06-03  0:39 UTC (permalink / raw)
  To: Jason Dillaman; +Cc: Ceph Development

On Tue, Jun 2, 2015 at 8:11 AM, Jason Dillaman <dillaman@redhat.com> wrote:
> I am posting to get wider review/feedback on this draft design.  In support of the RBD mirroring feature [1], a new client-side journaling class will be developed for use by librbd.  The implementation is designed to carry opaque journal entry payloads so it will be possible to be re-used in other applications as well in the future.  It will also use the librados API for all operations.  At a high level, a single journal will be composed of a journal header to store metadata and multiple journal objects to contain the individual journal entries.
>
> Journal objects will be named "<journal object prefix>.<journal id>.<object number>".  An individual journal object will hold one or more journal entries, appended one after another.  Journal objects will have a configurable soft maximum size.  After the size has been exceeded, a new journal object (numbered current object + number of journal objects) will be created for future journal entries and the header active set will be updated so that other clients know that a new journal object was created.
>
> In contrast to the current journal code used by CephFS, the new journal code will use sequence numbers to identify journal entries, instead of offsets within the journal.

Am I misremembering what actually got done with our journal v2 format?
I think this is done — or at least we made a move in this direction.

> Additionally, a given journal entry will not be striped across multiple journal objects.  Journal entries will be mapped to journal objects using the sequence number: <sequence number> mod <splay count> == <object number> mod <splay count> for active journal objects.

Okay, that's different.

>
> The rationale for this difference is to facilitate parallelism for appends as journal entries will be splayed across a configurable number of journal objects.  The journal API for appending a new journal entry will return a future which can be used to retrieve the assigned sequence number for the submitted journal entry payload once committed to disk. The use of a future allows for asynchronous journal entry submissions by default and can be used to simplify integration with the client-side cache writeback handler (and as a potential future enhacement to delay appends to the journal in order to satisfy EC-pool alignment requirements).
>
> Sequence numbers are treated as a monotonically increasing integer for a given value of journal entry tag.  This allows for the possibility for multiple clients to concurrently use the same journal (e.g. all RBD disks within a given VM could use the same journal).  This will provide a loose coupling of operations between different clients using the same journal.
>
> A new journal object class method will be used to submit journal entry append requests.  This will act as a gatekeeper for the concurrent client case.

The object class is going to be a big barrier to using EC pools;
unless you want to block the use of EC pools on EC pools supporting
object classes. :(

>A successful append will indicate whether or not the journal is now full (larger than the max object size), indicating to the client that a new journal object should be used.  If the journal is too large, an error code responce would alert the client that it needs to write to the current active journal object.  In practice, the only time the journaler should expect to see such a response would be in the case where multiple clients are using the same journal and the active object update notification has yet to be received.

I'm confused. How does this work with the splay count thing you
mentioned above? Can you define <splay count>?

What happens if users submit sequenced entries substantially out of
order? It sounds like if you have multiple writers (or even just a
misbehaving client) it would not be hard for one of them to grab
sequence value N, for another to fill up one of the journal entry
objects with sequences in the range [N+1]...[N+x] and then for the
user of N to get an error response.

>
> All the journal objects will be tied together by means of a journal header object, named "<journal header prefix>.<journal id>".  This object will contain the current committed journal entry positions of all registered clients.  In librbd's case, each mirrored copy of an image would be a new registered client.  OSD class methods would be used to create/manipulate the journal header to serialize modifications from multiple clients.
>
> Journal recovery / playback will iterate through each journal entry from the journal, in sequence order.  Journal objects will be prefetched (where possible) to a configurable amount to improve the latency.  Journal entry playback can use an optional client-specified filter to only iterate over entries with a matching journal entry tag.  The API will need to support the use case of an external client periodically testing the journal for new data.
>
> Journal trimming will be accomplished by removing a whole journal object.  Only after all registered users of the journal have indicated that they have committed all journal entries within the journal object (via an update to the journal header metadata) will the journal object be deleted and the header updated to indicate the new starting object number.
>
> Since the journal is designed to be append-only, there needs to be support for cases where journal entry needs to be updated out-of-band (e.g. fixing a corrupt entry similar to CephFS's current journal recovery tools).  The proposed solution is to just append a new journal entry with the same sequence number as the record to be replaced to the end of the journal (i.e. last entry for a given sequence number wins).  This also protects against accidental replays of the original append operation.  An alternative suggestion would be to use a compare-and-swap mechanism to update the full journal object with the updated contents.

I'm confused by this bit. It seems to imply that fetching a single
entry requires checking the entire object to make sure there's no
replacement. Certainly if we were doing replay we couldn't just apply
each entry sequentially any more because an overwritten entry might
have its value replaced by a later (by sequence number) entry that
occurs earlier (by offset) in the journal.

I'd also like it if we could organize a single Journal implementation
within the Ceph project, or at least have a blessed one going forward
that we use for new stuff and might plausibly migrate existing users
to. The big things I see different from osdc/Journaler are:

1) (design) class-based
2) (design) uses librados instead of Objecter (hurray)
3) (need) should allow multiple writers
4) (fallout of other choices?) does not stripe entries across multiple objects

Using librados instead of the Objecter might make this tough to use in
the MDS, but we've already got journaling happening in a separate
thread and it's one of the more isolated bits of code so we might be
able to handle it. I'm not sure if we'd want to stripe across objects
or not, but the possibility does appeal to me.

>
> Journal Header
> ~~~~~~~~~~~~~~
>
> omap
> * soft max object size
> * journal objects splay count
> * min object number
> * most recent active journal objects (could be out-of-date)
> * registered clients
>   * client description (i.e. zone)
>   * journal entry tag
>   * last committed sequence number

omap definitely doesn't go in EC pools — I'm not sure how blue-sky you
were thinking when you mentioned those. :)

More generally the naive client implementation would be pretty slow to
commit something (go to header for sequence number, write data out).
Do you expect to always have a queue of sequence numbers available in
case you need to do an immediate commit of data? What makes the single
header sequence assignment be not a bottleneck on its own for multiple
clients? It will need to do a write each time...
-Greg


* Re: RBD journal draft design
  2015-06-02 15:11 ` RBD journal draft design Jason Dillaman
  2015-06-03  0:39   ` Gregory Farnum
@ 2015-06-03 10:47   ` John Spray
  2015-06-03 16:24     ` Jason Dillaman
  1 sibling, 1 reply; 12+ messages in thread
From: John Spray @ 2015-06-03 10:47 UTC (permalink / raw)
  To: Jason Dillaman, Ceph Development



On 02/06/2015 16:11, Jason Dillaman wrote:
> I am posting to get wider review/feedback on this draft design.  In support of the RBD mirroring feature [1], a new client-side journaling class will be developed for use by librbd.  The implementation is designed to carry opaque journal entry payloads so it will be possible to be re-used in other applications as well in the future.  It will also use the librados API for all operations.  At a high level, a single journal will be composed of a journal header to store metadata and multiple journal objects to contain the individual journal entries.
>
> ...
> A new journal object class method will be used to submit journal entry append requests.  This will act as a gatekeeper for the concurrent client case.  A successful append will indicate whether or not the journal is now full (larger than the max object size), indicating to the client that a new journal object should be used.  If the journal is too large, an error code responce would alert the client that it needs to write to the current active journal object.  In practice, the only time the journaler should expect to see such a response would be in the case where multiple clients are using the same journal and the active object update notification has yet to be received.

Can you clarify the procedure when a client write gets a "I'm full" 
return code from a journal object?  The key part I'm not clear on is 
whether the client will first update the header to add an object to the 
active set (and then write it) or whether it goes ahead and writes 
objects and then lazily updates the header.
* If it's object first, header later, what bounds how far ahead of the 
active set we have to scan when doing recovery?
* If it's header first, object later, that's an uncomfortable bit of 
latency whenever we cross an object boundary

Nothing intractable about mitigating either case, just wondering what 
the idea is in this design.


> In contrast to the current journal code used by CephFS, the new journal code will use sequence numbers to identify journal entries, instead of offsets within the journal.  Additionally, a given journal entry will not be striped across multiple journal objects.  Journal entries will be mapped to journal objects using the sequence number: <sequence number> mod <splay count> == <object number> mod <splay count> for active journal objects.
>
> The rationale for this difference is to facilitate parallelism for appends as journal entries will be splayed across a configurable number of journal objects.  The journal API for appending a new journal entry will return a future which can be used to retrieve the assigned sequence number for the submitted journal entry payload once committed to disk. The use of a future allows for asynchronous journal entry submissions by default and can be used to simplify integration with the client-side cache writeback handler (and as a potential future enhacement to delay appends to the journal in order to satisfy EC-pool alignment requirements).

When two clients are both doing splayed writes, and they both send writes in parallel, it seems like the per-object fullness check via the object class could result in the writes getting staggered across different objects.  E.g. if we have two objects that both only have one slot left, then A could end up taking the slot in one (call it 1) and B could end up taking the slot in the other (call it 2).  Then when B's write lands at object 1, it gets an "I'm full" response and has to send the entry... where?  I guess to some arbitrarily-higher-numbered journal object depending on how many other writes landed in the meantime.

This potentially leads to the stripes (splays?) of a given journal entry being separated arbitrarily far across different journal objects, which would be fine as long as everything was well formed, but will make detecting issues during replay harder (would have to remember partially-read entries when looking for their remaining stripes through rest of journal).

You could apply the object class behaviour only to the object containing the 0th splay, but then you'd have to wait for the write there to complete before writing to the rest of the splays, so the latency benefit would go away.  Or it's equally possible that there's a trick in the design that has gone over my head :-)

Cheers,
John



* Re: RBD journal draft design
  2015-06-03  0:39   ` Gregory Farnum
@ 2015-06-03 16:13     ` Jason Dillaman
  2015-06-04  0:01       ` Gregory Farnum
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Dillaman @ 2015-06-03 16:13 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Ceph Development

> > In contrast to the current journal code used by CephFS, the new journal
> > code will use sequence numbers to identify journal entries, instead of
> > offsets within the journal.
> 
> Am I misremembering what actually got done with our journal v2 format?
> I think this is done — or at least we made a move in this direction.

Assuming journal v2 is the code in osdc/Journaler.cc, there is a new "resilient" format that helps in detecting corruption, but it appears to be still largely based upon offsets and using the Filer/Striper for I/O.  This does remind me that I probably want to include a magic preamble value at the start of each journal entry to facilitate recovery.
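
For illustration, a small sketch of how such a preamble could be used to resynchronize during recovery; the magic value is a placeholder, not a defined constant.

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  constexpr uint32_t kEntryPreamble = 0x4a524e4c;  // placeholder value ("JRNL")

  // Returns the offset of the next candidate entry at or after 'start', or
  // the buffer size if no preamble is found.
  size_t find_next_preamble(const std::vector<uint8_t> &buf, size_t start) {
    for (size_t off = start; off + 4 <= buf.size(); ++off) {
      uint32_t v = static_cast<uint32_t>(buf[off]) |
                   (static_cast<uint32_t>(buf[off + 1]) << 8) |
                   (static_cast<uint32_t>(buf[off + 2]) << 16) |
                   (static_cast<uint32_t>(buf[off + 3]) << 24);
      if (v == kEntryPreamble) {
        return off;
      }
    }
    return buf.size();
  }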

> > A new journal object class method will be used to submit journal entry
> > append requests.  This will act as a gatekeeper for the concurrent client
> > case.
> 
> The object class is going to be a big barrier to using EC pools;
> unless you want to block the use of EC pools on EC pools supporting
> object classes. :(

Josh mentioned (via Sam) that reads were not currently supported by object classes on EC pools.  Are appends not supported either?

> >A successful append will indicate whether or not the journal is now full
> >(larger than the max object size), indicating to the client that a new
> >journal object should be used.  If the journal is too large, an error code
> >responce would alert the client that it needs to write to the current
> >active journal object.  In practice, the only time the journaler should
> >expect to see such a response would be in the case where multiple clients
> >are using the same journal and the active object update notification has
> >yet to be received.
> 
> I'm confused. How does this work with the splay count thing you
> mentioned above? Can you define <splay count>?

Similar to the stripe width.

> What happens if users submit sequenced entries substantially out of
> order? It sounds like if you have multiple writers (or even just a
> misbehaving client) it would not be hard for one of them to grab
> sequence value N, for another to fill up one of the journal entry
> objects with sequences in the range [N+1]...[N+x] and then for the
> user of N to get an error response.

I was thinking that when a client submits their journal entry payload, the journaler will allocate the next available sequence number, compute which active journal object that sequence should be submitted to, and start an AIO append op to write the journal entry.  The next journal entry to be appended to the same journal object would be <splay count/width> entries later.  This does bring up a good point that if you are generating journal entries fast enough, the delayed response saying the object is full could cause multiple later journal entry ops to need to be resent to the new (non-full) object.  Given that, it might be best to scrap the hard error when the journal object gets full and just let the journaler eventually switch to a new object when it receives a response saying the object is now full.

> >
> > Since the journal is designed to be append-only, there needs to be support
> > for cases where journal entry needs to be updated out-of-band (e.g. fixing
> > a corrupt entry similar to CephFS's current journal recovery tools).  The
> > proposed solution is to just append a new journal entry with the same
> > sequence number as the record to be replaced to the end of the journal
> > (i.e. last entry for a given sequence number wins).  This also protects
> > against accidental replays of the original append operation.  An
> > alternative suggestion would be to use a compare-and-swap mechanism to
> > update the full journal object with the updated contents.
> 
> I'm confused by this bit. It seems to imply that fetching a single
> entry requires checking the entire object to make sure there's no
> replacement. Certainly if we were doing replay we couldn't just apply
> each entry sequentially any more because an overwritten entry might
> have its value replaced by a later (by sequence number) entry that
> occurs earlier (by offset) in the journal.

The goal would be to use prefetching on the replay.  Since the whole object is already in-memory, scanning for duplicates would be fairly trivial.  If there is a way to prevent the OSDs from potentially replaying a duplicate append journal entry message, the CAS update technique could be used.

> I'd also like it if we could organize a single Journal implementation
> within the Ceph project, or at least have a blessed one going forward
> that we use for new stuff and might plausibly migrate existing users
> to. The big things I see different from osdc/Journaler are:

Agreed.  While librbd will be the first user of this, I wasn't planning to locate it within the librbd library.

> 1) (design) class-based
> 2) (design) uses librados instead of Objecter (hurray)
> 3) (need) should allow multiple writers
> 4) (fallout of other choices?) does not stripe entries across multiple
> objects

For striping, I assume this is a function of how large MDS journal entries are expected to be.  The largest RBD journal entries would be block write operations, so in the low kilobytes.  It would be possible to add a higher layer to this design that could break-up large client journal entries into multiple, smaller entries.

> Using librados instead of the Objecter might make this tough to use in
> the MDS, but we've already got journaling happening in a separate
> thread and it's one of the more isolated bits of code so we might be
> able to handle it. I'm not sure if we'd want to stripe across objects
> or not, but the possibility does appeal to me.
> 
> >
> > Journal Header
> > ~~~~~~~~~~~~~~
> >
> > omap
> > * soft max object size
> > * journal objects splay count
> > * min object number
> > * most recent active journal objects (could be out-of-date)
> > * registered clients
> >   * client description (i.e. zone)
> >   * journal entry tag
> >   * last committed sequence number
> 
> omap definitely doesn't go in EC pools — I'm not sure how blue-sky you
> were thinking when you mentioned those. :)

Did not realize that.  Good to know.

> More generally the naive client implementation would be pretty slow to
> commit something (go to header for sequence number, write data out).
> Do you expect to always have a queue of sequence numbers available in
> case you need to do an immediate commit of data? What makes the single
> header sequence assignment be not a bottleneck on its own for multiple
> clients? It will need to do a write each time...

There is no need to go to the header for a sequence number.  Multiple (out-of-process) writers to the same journal would need to use a different tag so that they would have their own sequence number set.
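
In other words, sequence allocation is purely local to the writer.  A minimal sketch, assuming each writer holds its own tag (names are illustrative only):

  #include <cstdint>
  #include <map>
  #include <string>

  // Each tag owns an independent, monotonically increasing sequence, so no
  // round trip to the journal header is needed to assign one.
  class SequenceAllocator {
   public:
    uint64_t allocate(const std::string &tag) {
      return next_seq_[tag]++;
    }

   private:
    std::map<std::string, uint64_t> next_seq_;
  };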

> -Greg
>


* Re: RBD journal draft design
  2015-06-03 10:47   ` John Spray
@ 2015-06-03 16:24     ` Jason Dillaman
  0 siblings, 0 replies; 12+ messages in thread
From: Jason Dillaman @ 2015-06-03 16:24 UTC (permalink / raw)
  To: John Spray; +Cc: Ceph Development

> > A new journal object class method will be used to submit journal entry
> > append requests.  This will act as a gatekeeper for the concurrent client
> > case.  A successful append will indicate whether or not the journal is now
> > full (larger than the max object size), indicating to the client that a
> > new journal object should be used.  If the journal is too large, an error
> > code responce would alert the client that it needs to write to the current
> > active journal object.  In practice, the only time the journaler should
> > expect to see such a response would be in the case where multiple clients
> > are using the same journal and the active object update notification has
> > yet to be received.
> 
> Can you clarify the procedure when a client write gets a "I'm full"
> return code from a journal object?  The key part I'm not clear on is
> whether the client will first update the header to add an object to the
> active set (and then write it) or whether it goes ahead and writes
> objects and then lazily updates the header.
> * If it's object first, header later, what bounds how far ahead of the
> active set we have to scan when doing recovery?
> * If it's header first, object later, thats an uncomfortable bit of
> latency whenever we cross and object bound
> 
> Nothing intractable about mitigating either case, just wondering what
> the idea is in this design.

I was thinking object first, header later.  As I mentioned in my response to Greg, I now think this "I'm full" should only be used as a guide to kick future (un-submitted) requests over to a new journal object.  For example, if you submitted 16 4K AIO journal entry append requests, it's possible that the first request filled the journal -- so now your soft max size will include an extra 15 4K journal entries before the response to the first request indicates that the journal object is full and future requests should use a new journal object.

> > The rationale for this difference is to facilitate parallelism for appends
> > as journal entries will be splayed across a configurable number of journal
> > objects.  The journal API for appending a new journal entry will return a
> > future which can be used to retrieve the assigned sequence number for the
> > submitted journal entry payload once committed to disk. The use of a
> > future allows for asynchronous journal entry submissions by default and
> > can be used to simplify integration with the client-side cache writeback
> > handler (and as a potential future enhacement to delay appends to the
> > journal in order to satisfy EC-pool alignment requirements).
> 
> When two clients are both doing splayed writes, and they both send writes in
> parallel, it seems like the per-object fullness check via the object class
> could result in the writes getting staggered across different objects.  E.g.
> if we have two objects that both only have one slot left, then A could end
> up taking the slot in one (call it 1) and B could end up taking the slot in
> the other (call it 2).  Then when B's write lands at to object 1, it gets a
> "I'm full" response and has to send the entry... where?  I guess to some
> arbitrarily-higher-numbered journal object depending on how many other
> writes landed in the meantime.

In this case, assuming B sent the request to journal object 0, it would send the re-request to journal object 0 + <splay width> since the request <sequence number> mod <splay width> must equal <object number> mod <splay width>.  However, at this point I think it would be better to eliminate the "I'm full" error code and stick with "extra" soft max object size.

> This potentially leads to the stripes (splays?) of a given journal entry
> being separated arbitrarily far across different journal objects, which
> would be fine as long as everything was well formed, but will make detecting
> issues during replay harder (would have to remember partially-read entries
> when looking for their remaining stripes through rest of journal).
> 
> You could apply the object class behaviour only to the object containing the
> 0th splay, but then you'd have to wait for the write there to complete
> before writing to the rest of the splays, so the latency benefit would go
> away.  Or its equally possible that there's a trick in the design that has
> gone over my head :-)

I'm probably missing something here.  A journal entry won't be partially striped across multiple journal objects.  The journal entry in its entirety would be written to one of the <splay width> active journal objects.


* Re: RBD journal draft design
  2015-06-03 16:13     ` Jason Dillaman
@ 2015-06-04  0:01       ` Gregory Farnum
  2015-06-04 15:08         ` Jason Dillaman
  0 siblings, 1 reply; 12+ messages in thread
From: Gregory Farnum @ 2015-06-04  0:01 UTC (permalink / raw)
  To: Jason Dillaman; +Cc: Ceph Development

On Wed, Jun 3, 2015 at 9:13 AM, Jason Dillaman <dillaman@redhat.com> wrote:
>> > In contrast to the current journal code used by CephFS, the new journal
>> > code will use sequence numbers to identify journal entries, instead of
>> > offsets within the journal.
>>
>> Am I misremembering what actually got done with our journal v2 format?
>> I think this is done — or at least we made a move in this direction.
>
> Assuming journal v2 is the code in osdc/Journaler.cc, there is a new "resilient" format that helps in detecting corruption, but it appears to be still largely based upon offsets and using the Filer/Striper for I/O.  This does remind me that I probably want to include a magic preamble value at the start of each journal entry to facilitate recovery.

Ah yeah, I was confusing the changes we did there and in our MDLog
wrapper bits. Ignore me on this bit.

>
>> > A new journal object class method will be used to submit journal entry
>> > append requests.  This will act as a gatekeeper for the concurrent client
>> > case.
>>
>> The object class is going to be a big barrier to using EC pools;
>> unless you want to block the use of EC pools on EC pools supporting
>> object classes. :(
>
> Josh mentioned (via Sam) that reads were not currently supported by object classes on EC pools.  Are appends not supported either?

We discussed this briefly and certain object class functions might
work "by mistake" on EC pools, but you should assume nothing does (is
my recollection of the conclusions). For instance, even if it's
technically possible, the append thing is really hard for this sort of
write; I think I mentioned in Josh's thread about needing to have an
entire stripe at a time (and the smallest you could even think about
doing reasonably is 4KB * N, and really that's not big enough given
metadata overheads).

>
>> >A successful append will indicate whether or not the journal is now full
>> >(larger than the max object size), indicating to the client that a new
>> >journal object should be used.  If the journal is too large, an error code
>> >responce would alert the client that it needs to write to the current
>> >active journal object.  In practice, the only time the journaler should
>> >expect to see such a response would be in the case where multiple clients
>> >are using the same journal and the active object update notification has
>> >yet to be received.
>>
>> I'm confused. How does this work with the splay count thing you
>> mentioned above? Can you define <splay count>?
>
> Similar to the stripe width.

Okay, that sort of makes sense but I don't see how you could legally
be writing to different "sets" so why not just make it an explicit
striping thing and move all journal entries for that "set" at once?

...Actually, doesn't *not* forcing a coordinated move from one object
set to another mean that you don't actually have an ordering guarantee
across tags if you replay the journal objects in order?


>
>> What happens if users submit sequenced entries substantially out of
>> order? It sounds like if you have multiple writers (or even just a
>> misbehaving client) it would not be hard for one of them to grab
>> sequence value N, for another to fill up one of the journal entry
>> objects with sequences in the range [N+1]...[N+x] and then for the
>> user of N to get an error response.
>
> I was thinking that when a client submits their journal entry payload, the journaler will allocate the next available sequence number, compute which active journal object that sequence should be submitted to, and start an AIO append op to write the journal entry.  The next journal entry to be appended to the same journal object would be <splay count/width> entries later.  This does bring up a good point that if you are generating journal entries fast enough, the delayed response saying the object is full could cause multiple later journal entry ops to need to be resent to the new (non-full) object.  Given that, it might be best to scrap the hard error when the journal object gets full and just let the journaler eventually switch to a new object when it receives a response saying the object is now full.

I was misunderstanding where the seqs came from and that they were
associated with the tag, not the journal. So this shouldn't be such a
problem.

>
>> >
>> > Since the journal is designed to be append-only, there needs to be support
>> > for cases where journal entry needs to be updated out-of-band (e.g. fixing
>> > a corrupt entry similar to CephFS's current journal recovery tools).  The
>> > proposed solution is to just append a new journal entry with the same
>> > sequence number as the record to be replaced to the end of the journal
>> > (i.e. last entry for a given sequence number wins).  This also protects
>> > against accidental replays of the original append operation.  An
>> > alternative suggestion would be to use a compare-and-swap mechanism to
>> > update the full journal object with the updated contents.
>>
>> I'm confused by this bit. It seems to imply that fetching a single
>> entry requires checking the entire object to make sure there's no
>> replacement. Certainly if we were doing replay we couldn't just apply
>> each entry sequentially any more because an overwritten entry might
>> have its value replaced by a later (by sequence number) entry that
>> occurs earlier (by offset) in the journal.
>
> The goal would be to use prefetching on the replay.  Since the whole object is already in-memory, scanning for duplicates would be fairly trivial.  If there is a way to prevent the OSDs from potentially replaying a duplicate append journal entry message, the CAS update technique could be used.

Actually don't you need to keep <splay count> objects prefetched in
memory, because the ops round-robin across them?

>
>> I'd also like it if we could organize a single Journal implementation
>> within the Ceph project, or at least have a blessed one going forward
>> that we use for new stuff and might plausibly migrate existing users
>> to. The big things I see different from osdc/Journaler are:
>
> Agreed.  While librbd will be the first user of this, I wasn't planning to locate it within the librbd library.
>
>> 1) (design) class-based
>> 2) (design) uses librados instead of Objecter (hurray)
>> 3) (need) should allow multiple writers
>> 4) (fallout of other choices?) does not stripe entries across multiple
>> objects
>
> For striping, I assume this is a function of how large MDS journal entries are expected to be.  The largest RBD journal entries would be block write operations, so in the low kilobytes.  It would be possible to add a higher layer to this design that could break-up large client journal entries into multiple, smaller entries.

Really we just picked up the striping for free by making use of the
Filer to handle our data layout. ;) We don't ever enable it ourselves
and I don't think it matters.
-Greg


* Re: RBD journal draft design
  2015-06-04  0:01       ` Gregory Farnum
@ 2015-06-04 15:08         ` Jason Dillaman
  2015-06-04 20:25           ` Gregory Farnum
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Dillaman @ 2015-06-04 15:08 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Ceph Development

> >> >A successful append will indicate whether or not the journal is now full
> >> >(larger than the max object size), indicating to the client that a new
> >> >journal object should be used.  If the journal is too large, an error
> >> >code
> >> >responce would alert the client that it needs to write to the current
> >> >active journal object.  In practice, the only time the journaler should
> >> >expect to see such a response would be in the case where multiple clients
> >> >are using the same journal and the active object update notification has
> >> >yet to be received.
> >>
> >> I'm confused. How does this work with the splay count thing you
> >> mentioned above? Can you define <splay count>?
> >
> > Similar to the stripe width.
> 
> Okay, that sort of makes sense but I don't see how you could legally
> be writing to different "sets" so why not just make it an explicit
> striping thing and move all journal entries for that "set" at once?
> 
> ...Actually, doesn't *not* forcing a coordinated move from one object
> set to another mean that you don't actually have an ordering guarantee 
> across tags if you replay the journal objects in order?

The ordering between tags was meant to be a soft ordering guarantee (since any number of delays could throw off the actual order as delivered from the OS).  In the case of a VM using multiple RBD images sharing the same journal, this provides an ordering guarantee per device but not between devices.

This is no worse than the case of each RBD image using its own journal instead of sharing a journal and the behavior doesn't seem too different from a non-RBD case when submitting requests to two different physical devices (e.g. a SSD device and a NAS device will commit data at different latencies). Without the forced coordinated move, the potential gap in request orders between two devices would increase by the latency of the notify message roundtrip time, but it prevents the need for potentially resending journal entries to a new journal object.


* Re: RBD journal draft design
  2015-06-04 15:08         ` Jason Dillaman
@ 2015-06-04 20:25           ` Gregory Farnum
  2015-06-05  0:36             ` Jason Dillaman
  0 siblings, 1 reply; 12+ messages in thread
From: Gregory Farnum @ 2015-06-04 20:25 UTC (permalink / raw)
  To: Jason Dillaman; +Cc: Ceph Development

On Thu, Jun 4, 2015 at 8:08 AM, Jason Dillaman <dillaman@redhat.com> wrote:
>> >> >A successful append will indicate whether or not the journal is now full
>> >> >(larger than the max object size), indicating to the client that a new
>> >> >journal object should be used.  If the journal is too large, an error
>> >> >code
>> >> >responce would alert the client that it needs to write to the current
>> >> >active journal object.  In practice, the only time the journaler should
>> >> >expect to see such a response would be in the case where multiple clients
>> >> >are using the same journal and the active object update notification has
>> >> >yet to be received.
>> >>
>> >> I'm confused. How does this work with the splay count thing you
>> >> mentioned above? Can you define <splay count>?
>> >
>> > Similar to the stripe width.
>>
>> Okay, that sort of makes sense but I don't see how you could legally
>> be writing to different "sets" so why not just make it an explicit
>> striping thing and move all journal entries for that "set" at once?
>>
>> ...Actually, doesn't *not* forcing a coordinated move from one object
>> set to another mean that you don't actually have an ordering guarantee
>> across tags if you replay the journal objects in order?
>
> The ordering between tags was meant to be a soft ordering guarantee (since any number of delays could throw off the actual order as delivered from the OS).  In the case of a VM using multiple RBD images sharing the same journal, this provides an ordering guarantee per device but not between devices.
>
> This is no worse than the case of each RBD image using its own journal instead of sharing a journal and the behavior doesn't seem too different from a non-RBD case when submitting requests to two different physical devices (e.g. a SSD device and a NAS device will commit data at different latencies).

Yes, it's exactly the same. But I thought the point was that if you
commingle the journals then you actually have the appropriate ordering
across clients/disks (if there's enough ordering and synchronization)
that you can stream the journal off-site and know that if there's any
kind of disaster you are always at least crash-consistent. If there's
arbitrary re-ordering of different volume writes at object boundaries
then I don't see what benefit there is to having a commingled journal
at all.

I think there's a thing called a "consistency group" in various
storage platforms that is sort of similar to this, where you can take
a snapshot of a related group of volumes at once. I presume the
commingled journal is an attempt at basically having an ongoing
snapshot of the whole consistency group.
-Greg


* Re: RBD journal draft design
  2015-06-04 20:25           ` Gregory Farnum
@ 2015-06-05  0:36             ` Jason Dillaman
  2015-06-09 18:32               ` Gregory Farnum
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Dillaman @ 2015-06-05  0:36 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Ceph Development

> >> ...Actually, doesn't *not* forcing a coordinated move from one object
> >> set to another mean that you don't actually have an ordering guarantee
> >> across tags if you replay the journal objects in order?
> >
> > The ordering between tags was meant to be a soft ordering guarantee (since
> > any number of delays could throw off the actual order as delivered from
> > the OS).  In the case of a VM using multiple RBD images sharing the same
> > journal, this provides an ordering guarantee per device but not between
> > devices.
> >
> > This is no worse than the case of each RBD image using its own journal
> > instead of sharing a journal and the behavior doesn't seem too different
> > from a non-RBD case when submitting requests to two different physical
> > devices (e.g. a SSD device and a NAS device will commit data at different
> > latencies).
> 
> Yes, it's exactly the same. But I thought the point was that if you
> commingle the journals then you actually have the appropriate ordering
> across clients/disks (if there's enough ordering and synchronization)
> that you can stream the journal off-site and know that if there's any
> kind of disaster you are always at least crash-consistent. If there's
> arbitrary re-ordering of different volume writes at object boundaries
> then I don't see what benefit there is to having a commingled journal
> at all.
> 
> I think there's a thing called a "consistency group" in various
> storage platforms that is sort of similar to this, where you can take
> a snapshot of a related group of volumes at once. I presume the
> commingled journal is an attempt at basically having an ongoing
> snapshot of the whole consistency group.

Seems like even with a SAN-type consistency group, you could still have temporal ordering issues between volume writes unless it synchronized with the client OSes to flush out all volumes at a consistent place so that the snapshot could take place.

I suppose you could provide much tighter QEMU inter-volume ordering guarantees if you modified the RBD block device so that each individual RBD image instance was provided a mechanism to coordinate the allocation of the sequence number between the images.  Right now, each image is opened in its own context w/ no knowledge of one another and no way to coordinate.  The current proposed tag + sequence number approach could be used to provide the soft inter-volume ordering guarantees until QEMU / librbd could be modified to support volume groupings.


* Re: RBD journal draft design
  2015-06-05  0:36             ` Jason Dillaman
@ 2015-06-09 18:32               ` Gregory Farnum
  2015-06-09 19:08                 ` Jason Dillaman
  0 siblings, 1 reply; 12+ messages in thread
From: Gregory Farnum @ 2015-06-09 18:32 UTC (permalink / raw)
  To: Jason Dillaman; +Cc: Ceph Development

On Thu, Jun 4, 2015 at 5:36 PM, Jason Dillaman <dillaman@redhat.com> wrote:
>> >> ...Actually, doesn't *not* forcing a coordinated move from one object
>> >> set to another mean that you don't actually have an ordering guarantee
>> >> across tags if you replay the journal objects in order?
>> >
>> > The ordering between tags was meant to be a soft ordering guarantee (since
>> > any number of delays could throw off the actual order as delivered from
>> > the OS).  In the case of a VM using multiple RBD images sharing the same
>> > journal, this provides an ordering guarantee per device but not between
>> > devices.
>> >
>> > This is no worse than the case of each RBD image using its own journal
>> > instead of sharing a journal and the behavior doesn't seem too different
>> > from a non-RBD case when submitting requests to two different physical
>> > devices (e.g. a SSD device and a NAS device will commit data at different
>> > latencies).
>>
>> Yes, it's exactly the same. But I thought the point was that if you
>> commingle the journals then you actually have the appropriate ordering
>> across clients/disks (if there's enough ordering and synchronization)
>> that you can stream the journal off-site and know that if there's any
>> kind of disaster you are always at least crash-consistent. If there's
>> arbitrary re-ordering of different volume writes at object boundaries
>> then I don't see what benefit there is to having a commingled journal
>> at all.
>>
>> I think there's a thing called a "consistency group" in various
>> storage platforms that is sort of similar to this, where you can take
>> a snapshot of a related group of volumes at once. I presume the
>> commingled journal is an attempt at basically having an ongoing
>> snapshot of the whole consistency group.
>
> Seems like even with a SAN-type consistency group, you could still have temporal ordering issues between volume writes unless it synchronized with the client OSes to flush out all volumes at a consistent place so that the snapshot could take place.
>
> I suppose you could provide much tighter QEMU inter-volume ordering guarantees if you modified the RBD block device so that each individual RBD image instance was provided a mechanism to coordinate the allocation of the sequence number between the images.  Right now, each image is opened in its own context w/ no knowledge of one another and no way to coordinate.  The current proposed tag + sequence number approach could be used to provide the soft inter-volume ordering guarantees until QEMU / librbd could be modified to support volume groupings.

I must not be being clear. Tell me if this scenario is possible:

* Client A writes to file foo many times and it is journaled to object set 1.
* Client B writes to file bar many times and it starts journaling to
object set 1, but hits the end and moves on to object set 2.
* Client A hits a synchronization point in its higher-level logic.
* Client A fsyncs file foo to object set 1 and then
* Client B hits the synchronization point, fsyncs file bar to object
set 2, and sends data back to Client A.
* Client A fsyncs the receipt of its data stream to object set 1, and
only then gets sent on to object set 2.
* The journal copier runs and migrates object set 1 to a remote data
center, then the data center explodes.
* In the remote data center they fail over and client A thinks it has
reached a synchronization point and gotten an acknowledgement that
client B has never heard of.

Does that being a problem make sense? I don't think handling it is
overly complicated and it's kind of important.
-Greg


* Re: RBD journal draft design
  2015-06-09 18:32               ` Gregory Farnum
@ 2015-06-09 19:08                 ` Jason Dillaman
  2015-06-09 22:30                   ` Gregory Farnum
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Dillaman @ 2015-06-09 19:08 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Ceph Development

> I must not be being clear. Tell me if this scenario is possible:
> 
> * Client A writes to file foo many times and it is journaled to object set 1.
> * Client B writes to file bar many times and it starts journaling to
> object set 1, but hits the end and moves on to object set 2.
> * Client A hits a synchronization point in its higher-level logic.
> * Client A fsyncs file foo to object set 1 and then
> * Client B hits the synchronization point, fsyncs file bar to object
> set 2, and sends data back to Client A.
> * Client A fsyncs the receipt of its data stream to object set 1, and
> only then gets sent on to object set 2.
> * The journal copier runs and migrates object set 1 to a remote data
> center, then the data center explodes.
> * In the remote data center they fail over and client A thinks it has
> reached a synchronization point and gotten an acknowledgement that
> client B has never heard of.
> 
> Does that being a problem make sense? I don't think handling it is
> overly complicated and it's kind of important.
> -Greg

Seems this case is solved if you delay the completion of client B's flush (fsync) until the "active set updated" notification is successfully delivered.  In that case, client A would know that it needs to re-read the active set collection and thus needs to now write to object set 2.  Thoughts?


* Re: RBD journal draft design
  2015-06-09 19:08                 ` Jason Dillaman
@ 2015-06-09 22:30                   ` Gregory Farnum
  0 siblings, 0 replies; 12+ messages in thread
From: Gregory Farnum @ 2015-06-09 22:30 UTC (permalink / raw)
  To: Jason Dillaman; +Cc: Ceph Development

On Tue, Jun 9, 2015 at 12:08 PM, Jason Dillaman <dillaman@redhat.com> wrote:
>> I must not be being clear. Tell me if this scenario is possible:
>>
>> * Client A writes to file foo many times and it is journaled to object set 1.
>> * Client B writes to file bar many times and it starts journaling to
>> object set 1, but hits the end and moves on to object set 2.
>> * Client A hits a synchronization point in its higher-level logic.
>> * Client A fsyncs file foo to object set 1 and then
>> * Client B hits the synchronization point, fsyncs file bar to object
>> set 2, and sends data back to Client A.
>> * Client A fsyncs the receipt of its data stream to object set 1, and
>> only then gets sent on to object set 2.
>> * The journal copier runs and migrates object set 1 to a remote data
>> center, then the data center explodes.
>> * In the remote data center they fail over and client A thinks it has
>> reached a synchronization point and gotten an acknowledgement that
>> client B has never heard of.
>>
>> Does that being a problem make sense? I don't think handling it is
>> overly complicated and it's kind of important.
>> -Greg
>
> Seems this case is solved if you delay the completion of client B's flush (fsync) until the "active set updated" notification is successfully delivered.  In that case, client A would know that it needs to re-read the active set collection and thus needs to now write to object set 2.  Thoughts?

Honestly at this point my head's a little wrapped around itself and
I'm not sure. :) I think that however it's set up we want to switch
from one object set to the next coherently (ie, no writing to object
set 2 for write N and object set 1 for write N+1) and that we force
each client to switch at the same point. I guess in general the
penalty for having to re-send ops when we find out late that the
object set is full probably wouldn't be a big deal? But I'm not sure
if sending notifies on the objects is the best option or if
something else is.
-Greg



Thread overview: 12+ messages
     [not found] <1574383603.9391063.1433257824183.JavaMail.zimbra@redhat.com>
2015-06-02 15:11 ` RBD journal draft design Jason Dillaman
2015-06-03  0:39   ` Gregory Farnum
2015-06-03 16:13     ` Jason Dillaman
2015-06-04  0:01       ` Gregory Farnum
2015-06-04 15:08         ` Jason Dillaman
2015-06-04 20:25           ` Gregory Farnum
2015-06-05  0:36             ` Jason Dillaman
2015-06-09 18:32               ` Gregory Farnum
2015-06-09 19:08                 ` Jason Dillaman
2015-06-09 22:30                   ` Gregory Farnum
2015-06-03 10:47   ` John Spray
2015-06-03 16:24     ` Jason Dillaman
