* RBD mirroring design draft
From: Josh Durgin @ 2015-05-13  0:42 UTC
  To: ceph-devel

We've talked about this a bit at ceph developer summits, but haven't
gone through several parts of the design thoroughly. I'd like to post
this to a wider audience and get feedback on this draft of a design.

The journal parts are more defined, but the failover/failback workflow
and general configuration need more fleshing out. Enabling/disabling
journaling on existing images isn't described yet, though it's
something that should be supported.

=============
RBD Mirroring
=============

The goal of rbd mirroring is to provide disaster recovery for rbd
images.

This includes:

1) maintaining a crash-consistent view of an image
2) streaming updates from one site to another
3) failover/failback

I'll refer to different (cluster, pool1, [pool2, ... poolN]) combinations
where rbd images are stored as "zones" here, which would be a new
abstraction introduced for easier configuration of mirroring. This is
the same term used by radosgw replication.

Crash consistency
-----------------

This is the basic level of consistency block devices can provide with
no higher-level hooks, like qemu's guest agent. For replaying a stream
of block device writes, higher-level hooks could make sense, but these
could be added later as points in a stream of writes. For crash
consistency, rbd just needs to maintain the order of writes. There are
a few ways to do this:

a) snapshots

   rbd has supported differential snapshots for a while now, and these
   are great for performing backups. They don't work as well for
   providing a stream of consistent updates, since there is overhead in
   space and I/O load to creating and deleting rados snapshots. For
   backend filesystems like xfs and ext4, frequent snapshots would turn
   many small writes into copies of 4MB and a small write, wasting
space. Deleting snapshots is also expensive if there are hundreds or
   thousands happening all the time. Rados snapshots were not designed
   for this kind of load. In addition, diffing snapshots does not tell
   us the order in which writes were done, so a partially applied
   diff would be inconsistent and likely unusable.

b) log-structured rbd

   The simplest way to keep writes in order is to only write them in
   order, by appending to a log of rados objects. This is great for
   mirroring, but vastly complicates everything else. This would
   require all the usual bells and whistles of a log-structured
   filesystem, including garbage collection, reference tracking, a new
   rbd-level snapshot mechanism, and more. Custom fsck-like tools for
   consistency checking and repair would be needed, and the I/O paths
   would be much more complex. This is a good research project, but
   it would take a long time to develop and stabilize.

c) journaling

   Journaling is an intermediate step between snapshots and log
   structured rbd. The idea is that each image has a log of all writes
   (including data) and metadata changes, like resize, snapshot
   create/delete, etc. This journal is stored as a series of rados
   objects, similar to cephfs' journal. A write would first be appended
   to the journal, acked to the librbd user at that point, and later
   written out to the usual rbd data objects. Extending rbd's existing
   client-side cache to track this allows reads of data written to the
   journal but not the data objects to be satisfied from the cache, and
   avoids issues of stale reads. This data needs to be kept in memory
   anyway, so it makes sense to keep it in the cache, where it can be
   useful.

Structure
^^^^^^^^^

The journal could be stored in a separate pool from the image, such as
one backed by ssds to improve write performance. Since it is
append-only, the journal's data could be stored in an EC pool to save
space.

It will need some metadata regarding positions in the journal. These
could be stored as omap values in a 'journal header' object in a
replicated pool; for rbd, perhaps the same pool as the image for
simplicity. The header would contain at least the following fields
(sketched below):

* pool_id - where journal data is stored
* journal_object_prefix - unique prefix for journal data objects
* positions - (zone, purpose, object num, offset) tuples indexed by zone
* object_size - approximate size of each data object
* object_num_begin - current earliest object in the log
* object_num_end - max potential object in the log
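
Roughly, in Python (a sketch; the field names are illustrative, not a
committed on-disk format):

    from dataclasses import dataclass, field

    @dataclass
    class Position:
        zone: str         # zone this position belongs to
        purpose: str      # e.g. "flushed" or "mirrored"
        object_num: int   # journal data object this consumer has reached
        offset: int       # byte offset within that object

    @dataclass
    class JournalHeader:
        pool_id: int                # pool where journal data is stored
        journal_object_prefix: str  # unique prefix for data object names
        object_size: int            # approximate size of each data object
        object_num_begin: int       # current earliest object in the log
        object_num_end: int         # max potential object in the log
        # positions indexed by zone; stored as omap values in practice
        positions: dict = field(default_factory=dict)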

Similar to rbd images, journal data would be stored in objects named
after the journal_object_prefix and their object number. To avoid
issues of padding or splitting journal entries, and to make it simpler
to keep append-only, it's easier to let the objects grow to near
object_size before moving to the next object number rather than
enforcing an exact object size.
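
The naming and roll-over rule might look like this sketch (the exact
name format is illustrative):

    def data_object_name(prefix, object_num):
        # journal data objects are named by prefix plus object number,
        # like rbd data objects
        return "%s.%d" % (prefix, object_num)

    def append_target(object_size, object_num, offset, entry_len):
        # Roll over to the next object once the current one is near
        # object_size, rather than padding or splitting entries to hit
        # an exact size; objects end up approximately object_size long.
        if offset > 0 and offset + entry_len > object_size:
            return object_num + 1, 0
        return object_num, offset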

Ideally this underlying structure could be used for both rbd and
cephfs. Variable-sized objects are different from the existing cephfs
journal, which uses fixed-size objects for striping. The default is
still 4MB chunks though. How important is striping the journal to
cephfs? For rbd it seems unlikely to help much, since updates need to
be batched up by the client cache anyway.

Usage
^^^^^

When an rbd image with journaling enabled is opened, the journal
metadata would be read and the last part of the journal would be
replayed if necessary.

In general, a write would first go to the journal, return to the
client, and then be written to the underlying rbd image. Once a
threshold of journal-entry bytes has been flushed, or a time period
has elapsed with some journal entries flushed, a position with purpose
"flushed" for the zone the rbd image is in would be updated in the
journal metadata.
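
A sketch of that write path, with stubs standing in for the real
librbd internals and made-up threshold knobs:

    import time

    class JournaledWritePath:
        FLUSH_BYTES = 8 << 20   # assumed byte threshold (illustrative)
        FLUSH_SECS = 5.0        # assumed time threshold (illustrative)

        def __init__(self, zone):
            self.zone = zone
            self.journal = []   # stub for the journal data objects
            self.image = {}     # stub for the rbd data objects
            self.unflushed = 0
            self.last_mark = time.monotonic()

        def write(self, offset, data):
            # 1) append to the journal; the librbd user is acked here
            self.journal.append((offset, data))
            pos = len(self.journal)
            # 2) write out to the usual rbd data objects (asynchronously
            #    in reality, served from the client cache until then)
            self.image[offset] = data
            # 3) periodically record how far this zone has flushed
            self.unflushed += len(data)
            now = time.monotonic()
            if (self.unflushed >= self.FLUSH_BYTES
                    or now - self.last_mark >= self.FLUSH_SECS):
                self.update_position(self.zone, "flushed", pos)
                self.unflushed = 0
                self.last_mark = now

        def update_position(self, zone, purpose, pos):
            # stand-in for updating this (zone, purpose) tuple in the
            # journal header's omap
            print("position (%s, %s) -> %d" % (zone, purpose, pos))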

Trimming old entries from the journal would be allowed up to the
minimum of all the positions stored in its metadata. This would be an
asynchronous operation executed by the consumers of the journal.
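
Reusing the Position sketch above, the trim point is just:

    def trim_point(positions):
        # Entries older than the minimum position recorded in the header
        # are needed by no consumer and can be trimmed asynchronously.
        return min((p.object_num, p.offset) for p in positions.values())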

There would be a new feature bit for rbd images to enable
journaling. As a first step it could only be set when an image is
created.

One way to enable it dynamically would be to take a snapshot at the
same time to serve as a base for mirroring further changes.  This
could be added as a journal entry for snapshot creation with a special
'internal' flag, and the snapshot could be deleted by the process that
trims this journal entry.

Deleting an image would delete its journal, regardless of any mirroring
in progress, since mirroring is not backup.

Streaming Updates
-----------------

This is a complex area with many trade-offs. I expect we'll need some
iteration to find good general solutions here. I'll describe a simple
initial step, some potential optimizations, and issues to address in
future versions.

In general, there will be a new daemon (tentatively called rbd-mirror
here) that reads journal entries from images in one zone and replays
them in different zones. An initial implementation might connect to
ceph clusters in all zones, and replay writes and metadata changes to
images in other zones directly via librbd. To simplify failover, it
would be better to run these in follower zones rather than the leader
zone.

There are a couple of improvements on this we'd probably want to make
early:

* using multiple threads to mirror many images at once
* using multiple processes to scale across machines, so one node is
not a bottleneck

Some other possible optimizations:
* reading a large window of the journal to coalesce overlapping writes
* decoupling reading from the leader zone and writing to follower zones,
to allow optimizations like compression of the journal or other
transforms as data is sent, and relaxing the requirement for one node
to be directly connected to more than one ceph cluster

Noticing updates
^^^^^^^^^^^^^^^^

There are two kinds of changes that rbd-mirror needs to be aware of:

1) journaled image creation/deletion

The features of an image are only stored in the image's header right
now. To get updates of these more easily, we need an index of some
sort. This could take the form of an additional index in the
rbd_directory object, which already contains all images. Creating or
deleting an image with the journal feature bit could send a rados
notify on the rbd_directory object, and rbd-mirror could watch
rbd_directory for these notifications. The notifications could contain
information about the image (at least its features), but if
rbd-mirror's watch times out it could simply re-read the features of
all images in a pool that it cares about (more on this later).

Dynamically enabling/disabling features would work the same way. The
image header would be updated as usual, and the rbd_directory index
would be updated as well. If the journaling feature bit changed, a
notify on the rbd_directory object would be sent.

Since we'd be storing the features in two places, to keep them in sync
we could use an approach like:

a) set a new updated_features field on image header
b) set features on rbd_directory
c) clear updated_features and set features on image header

This is all through the lock holder, so we don't need to worry about
concurrent updates - header operations are prefixed by an assertion
that the lock is still held for extra safety.
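
A minimal sketch of that sequence, using in-memory dicts as stand-ins
for the image header and the rbd_directory index (in reality each step
would be a separate rados op guarded by the lock assertion):

    def update_features(header, directory, image_id, features):
        # a) record the intent on the header, so an interrupted update
        #    can be detected and replayed
        header["updated_features"] = features
        # b) update the index in the rbd_directory object
        directory[image_id] = features
        # c) commit the value on the header and clear the intent marker
        header["features"] = features
        del header["updated_features"]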

2) journal updates for a particular image

Generally rbd-mirror can keep reading the journal until it hits the
end, detected by -ENOENT on the next object or by an object shorter
than the journal's target object size.

Once it reaches the end, it can poll for new content periodically, or
use watch/notify on the journal header to be notified when the max
journal object number changes. I don't think polling in this case is
very expensive, especially if it uses exponential backoff up to a
configurable max time it can lag behind the journal.
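
The reader loop might look like this sketch, where read_entries() and
replay() are hypothetical stand-ins for fetching the next batch of
journal entries and applying them via librbd:

    import time

    def follow_journal(read_entries, replay, max_backoff=30.0):
        delay = 0.1
        while True:
            entries = read_entries()  # empty at the end of the journal
            if entries:
                for entry in entries:
                    replay(entry)
                delay = 0.1           # reset backoff while busy
            else:
                # poll with exponential backoff, capping how far this
                # reader is allowed to fall behind the journal
                time.sleep(delay)
                delay = min(delay * 2, max_backoff)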

Clones
^^^^^^

Cloning is currently the only way images can be related. Mirroring
should preserve these relationships so mirrored zones behave the same
as the original zone.

In order for clones with non-zero overlap to be useful, their parent
snapshot must be present in the zone already. A simple approach is to
avoid mirroring clones until their parent snapshot is mirrored.

Clones refer to parents by pool id, image id, and snapshot id. These
are all generated automatically when each is created, so they will be
different in different zones. Since pools and images can be renamed,
we'll need a way to make sure we keep the correct mappings in mirrored
zones. A simple way to do this is to record a leader zone ->
follower zone mapping for pool and image ids. When a pool or image
is created in a follower zone, its mapping to the ids in the leader
zone would be stored in the destination zone.
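
For example (a sketch; where and how the maps are stored is not
settled):

    # Maps recorded in each follower zone as mirrored pools, images,
    # and snapshots are created there, since ids are generated
    # independently in each cluster.
    pool_ids = {}    # leader pool id  -> follower pool id
    image_ids = {}   # leader image id -> follower image id
    snap_ids = {}    # leader snap id  -> follower snap id

    def resolve_parent(leader_pool, leader_image, leader_snap):
        # translate a clone's parent spec into this zone's ids before
        # creating the mirrored clone
        return (pool_ids[leader_pool],
                image_ids[leader_image],
                snap_ids[leader_snap])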

Parallelism
^^^^^^^^^^^

Mirroring many images is embarrassingly parallel. A simple unit of
work is an image (more specifically a journal, if e.g. a group of
images shared a journal as part of a consistency group in the future).

Spreading this work across threads within a single process is
relatively simple. For HA, and to avoid a single NIC becoming a
bottleneck, we'll want to spread out the work across multiple
processes (and probably multiple hosts). rbd-mirror should have no
local state, so we just need a mechanism to coordinate the division of
work across multiple processes.

One way to do this would be layering on top of watch/notify. Each
rbd-mirror process in a zone could watch the same object, and shard
the set of images to mirror based on a hash of image ids onto the
current set of rbd-mirror processes sorted by client gid. The set of
rbd-mirror processes could be determined by listing watchers.
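
A sketch of that sharding (the particular hash is illustrative; any
stable hash of the image id would do):

    import hashlib

    def images_for_me(image_ids, watcher_gids, my_gid):
        # watcher_gids comes from listing watchers on the shared object;
        # shard images onto the sorted set of processes by hashed id
        procs = sorted(watcher_gids)
        def owner(image_id):
            h = int(hashlib.sha1(image_id.encode()).hexdigest(), 16)
            return procs[h % len(procs)]
        return [i for i in image_ids if owner(i) == my_gid]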

Failover
--------

Watch/notify could also be used (via a predetermined object) to
communicate with rbd-mirror processes to get sync status from each,
and for managing failover.

Failing over means preventing changes in the original leader zone, and
making the new leader zone writeable. The state of a zone (read-only vs
writeable) could be stored in a zone's metadata in rados to represent
this, and images with the journal feature bit could check this before
being opened read/write for safety. To make it race-proof, the zone
state can be a tri-state - read-only, read-write, or changing.

In the original leader zone, if it is still running, the zone would be
set to read-only mode and all clients could be blacklisted to avoid
creating too much divergent history to roll back later.

In the new leader zone, the zone's state would be set to 'changing',
and rbd-mirror processes would be told to stop copying from the
original leader and close the images they were mirroring to.  New
rbd-mirror processes should refuse to start mirroring when the zone is
not read-only. Once the mirroring processes have stopped, the zone
could be set to read-write, and begin normal usage.
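
Putting the workflow together, as a sketch with hypothetical zone
handles (the real mechanism would act through zone metadata in rados
and watch/notify to the rbd-mirror processes):

    READ_ONLY, CHANGING, READ_WRITE = "read-only", "changing", "read-write"

    def fail_over(old_leader, new_leader):
        # 1) fence the old leader, if it is still reachable
        if old_leader.reachable():
            old_leader.set_state(READ_ONLY)
            old_leader.blacklist_clients()  # limit divergent history
        # 2) quiesce mirroring into the new leader
        new_leader.set_state(CHANGING)
        new_leader.stop_mirroring()         # told via watch/notify
        # 3) allow normal usage
        new_leader.set_state(READ_WRITE)

    def check_open(zone_state, read_only):
        # images with the journal feature bit would check this before
        # being opened read/write
        if not read_only and zone_state != READ_WRITE:
            raise RuntimeError("zone is %s, refusing read/write open"
                               % zone_state)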

Failback
^^^^^^^^

In this scenario, after failing over, the original leader zone (A)
starts running again, but needs to catch up to the current leader
(B). At a high level, this involves syncing up the image by rolling
back the updates in A past the point B synced to, as noted in the
image's journal in A, and mirroring all the changes since then from
B.

This would need to be an offline operation, since at some point
B would need to go read-only before A goes read-write. Making this
transition online is outside the scope of mirroring for now, since it
would require another level of indirection for rbd users like QEMU.



* Re: RBD mirroring design draft
From: Haomai Wang @ 2015-05-13  8:07 UTC
  To: Josh Durgin; +Cc: ceph-devel

On Wed, May 13, 2015 at 8:42 AM, Josh Durgin <jdurgin@redhat.com> wrote:
> [...]
>
> Some other possible optimizations:
> * reading a large window of the journal to coalesce overlapping writes
> * decoupling reading from the leader zone and writing to follower zones,
> to allow optimizations like compression of the journal or other
> transforms as data is sent, and relaxing the requirement for one node
> to be directly connected to more than one ceph cluster

Maybe we could add separate NIC/network support used only to write
journaling data to the journaling pool? To my mind, a multi-site
cluster always needs another low-latency fiber link.

> [...]
>
> Failback
> ^^^^^^^^
>
> In this scenario, after failing over, the original leader zone (A)
> starts running again, but needs to catch up to the current leader
> (B). At a high level, this involves syncing up the image by rolling
> back the updates in A past the point B synced to, as noted in the
> image's journal in A, and mirroring all the changes since then from
> B.
>
> This would need to be an offline operation, since at some point
> B would need to go read-only before A goes read-write. Making this
> transition online is outside the scope of mirroring for now, since it
> would require another level of indirection for rbd users like QEMU.

So do you mean that when the primary zone fails, we need to switch the
primary zone offline by hand?


-- 
Best Regards,

Wheat


* Re: RBD mirroring design draft
From: Josh Durgin @ 2015-05-14  4:21 UTC
  To: Haomai Wang; +Cc: ceph-devel

On 05/13/2015 01:07 AM, Haomai Wang wrote:
> On Wed, May 13, 2015 at 8:42 AM, Josh Durgin <jdurgin@redhat.com> wrote:
>> Some other possible optimizations:
>> * reading a large window of the journal to coalesce overlapping writes
>> * decoupling reading from the leader zone and writing to follower zones,
>> to allow optimizations like compression of the journal or other
>> transforms as data is sent, and relaxing the requirement for one node
>> to be directly connected to more than one ceph cluster
>
> Maybe we could add separate NIC/network support used only to write
> journaling data to the journaling pool? To my mind, a multi-site
> cluster always needs another low-latency fiber link.

Yeah, this seems desirable. It seems like it'd be possible based on the
way the NICs and routing tables are set up, without needing any special
configuration from ceph, or am I missing something?

>> [...]
>>
>> Failback
>> ^^^^^^^^
>>
>> [...]
>>
>> This would need to be an offline operation, since at some point
>> B would need to go read-only before A goes read-write. Making this
>> transition online is outside the scope of mirroring for now, since it
>> would require another level of indirection for rbd users like QEMU.
>
> So do you mean that when the primary zone fails, we need to switch the
> primary zone offline by hand?

I think we'd want to have some higher-level script controlling it, with
a pluggable trigger that could be based on user-defined monitoring.

This is something I'm less sure of, though; it'd be good to get more
feedback on what users are interested in here. Would ceph detecting
failure based on e.g. rbd-mirror timing out reads from the leader zone
be good enough for most users?


* Re: RBD mirroring design draft
From: Josh Durgin @ 2015-05-20 21:30 UTC
  To: Chris H; +Cc: Haomai Wang, ceph-devel

On 05/18/2015 09:22 AM, Chris H wrote:
> I am actually working on something very similar to this for another
> project. Writing very small sequential I/O groups with flushes to the
> "cloud" is very slow. The structure I am working on is nearly identical
> as well (I originally did padding, but it might not be necessary). My
> updating structure is a bit different: the server, not the client, will
> know what is in the log and what is in the "cloud". It will be complex
> to organize reads and writes to multiple chunks that reside in memory or
> the cloud, but doable.

Yeah, it's a lot simpler when the only client using the data is the one
keeping track of whether it's written back yet.

> Some ideas I had further for this project (and directly related to this
> thread) are to incorporate some sort of HA setup too. It would have
> various checks to see if certain servers are up and certain daemons are
> running (and working correctly). Also, a single NIC port would be
> dedicated to writing to its partner's log and to receiving its partner's
> log. This is to ensure there's no stale data upon a failure/crash, to
> eliminate a single point of failure.
>
> How long are the rbd-mirror timeouts usually? The reason I ask is that
> our potential use case is a parallel FS on top of RBD. I'd love to
> continue this discussion further.

The timeout would be configurable, but perhaps 30s by default? Ideally
other checks would be done too, so you don't fail over just because the
connection between sites temporarily went away while each site is still
operating correctly on its own. This kind of higher-level monitoring
info for each site's health could perhaps come from calamari.

Josh


* Re: RBD mirroring design draft
From: Josh Durgin @ 2015-05-21 15:34 UTC
  To: Chris H; +Cc: Haomai Wang, ceph-devel

On 05/21/2015 07:56 AM, Chris H wrote:
> I am assuming your clients do proper syncs and flushes? What about
> applications that traditionally write to a BBU-backed RAID card and
> don't explicitly call sync and flushes? I just ask because the only way
> I can think of handling this is to force a sync/flush to a log device of
> some sort before returning that the write was successful. I know this is
> probably a new use case, but this is the best time to address these sorts
> of concerns.

Yeah, that's exactly what we do by default with our existing rbd client
side cache. It's writethrough until it sees a flush from the user. For
mirroring this wouldn't change, and we could ack writes to the client
after only writing them to the rbd journal. This would still be safe
since we'd replay any unfinished work in the journal the next time
the image was used. Optionally, we could wait for the write to the
journal and the regular image data in case the journal is stored in a
pool with lower fault tolerance.




* Re: RBD mirroring design draft
From: Gregory Farnum @ 2015-05-28  5:37 UTC
  To: Josh Durgin, John Spray; +Cc: ceph-devel

On Tue, May 12, 2015 at 5:42 PM, Josh Durgin <jdurgin@redhat.com> wrote:
> [...]
>
> Structure
> ^^^^^^^^^
>
> The journal could be stored in a separate pool from the image, such as
> one backed by ssds to improve write performance. Since it is
> append-only, the journal's data could be stored in an EC pool to save
> space.

This is a lot trickier than it sounds. Remember you need to append to
an EC pool in the appropriate append block size — the smallest such
size that is really feasible is going to be 4KB * M. I'm not sure what
sizes our RBD writes usually come down in, but I can see them being
rather smaller...
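
Back-of-the-envelope numbers, reading the M above as the number of data
chunks and assuming each append must be padded to a whole stripe (the
4KB granularity is from the paragraph above; everything else here is
assumed):

# Padding cost if each journal append is rounded up to a full EC stripe.
CHUNK = 4096

def stored_bytes(write_size: int, k: int) -> int:
    stripe = CHUNK * k                        # smallest feasible append unit
    return -(-write_size // stripe) * stripe  # round up to a whole stripe

for size in (512, 4096, 16384):
    for k in (2, 4):
        s = stored_bytes(size, k)
        print(f"{size:6d}B write, k={k}: appends {s}B ({s / size:.1f}x)")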

>
> It will need some metadata regarding positions in the journal. These
> could be stored as omap values in a 'journal header' object in a
> replicated pool, for rbd perhaps the same pool as the image for
> simplicity. The header would contain at least:
>
> * pool_id - where journal data is stored
> * journal_object_prefix - unique prefix for journal data objects
> * positions - (zone, purpose, object num, offset) tuples indexed by zone
> * object_size - approximate size of each data object
> * object_num_begin - current earliest object in the log
> * object_num_end - max potential object in the log
>
> Similar to rbd images, journal data would be stored in objects named
> after the journal_object_prefix and their object number. To avoid
> issues of padding or splitting journal entries, and to make it simpler
> to keep append-only, it's easier to allow the objects to be near
> object_size before moving to the next object number instead of
> sticking with an exact object size.
>
> Ideally this underlying structure could be used for both rbd and
> cephfs. Variable sized objects are different from the existing cephfs
> journal, which uses fixed-size objects for striping. The default is
> still 4MB chunks though. How important is striping the journal to
> cephfs? For rbd it seems unlikely to help much, since updates need to
> be batched up by the client cache anyway.

I think the journaling v2 stuff that John did actually made objects
variably-sized as you've described here. We've never done any sort of
striping on the MDS journal, although I think it was
possible previously.
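
For concreteness, the header fields quoted above might map onto a
structure like this (a sketch only; the types and the object-name
separator are assumptions):

from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class JournalHeader:
    pool_id: int                  # where journal data is stored
    journal_object_prefix: str    # unique prefix for journal data objects
    object_size: int              # approximate size of each data object
    object_num_begin: int         # current earliest object in the log
    object_num_end: int           # max potential object in the log
    # (object num, offset) per (zone, purpose), e.g. ("site-b", "flushed")
    positions: Dict[Tuple[str, str], Tuple[int, int]] = field(default_factory=dict)

    def data_object_name(self, num: int) -> str:
        # Data objects are named after the prefix plus object number.
        return f"{self.journal_object_prefix}.{num}"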

>
> Usage
> ^^^^^
>
> When an rbd image with journaling enabled is opened, the journal
> metadata would be read and the last part of the journal would be
> replayed if necessary.
>
> In general, a write would first go to the journal, return to the
> client, and then be written to the underlying rbd image. Once a
> threshold of journal-entry bytes has been flushed, or a time period has
> elapsed and some journal entries were flushed, a position with purpose
> "flushed" for the zone the rbd image is in would be updated in the
> journal metadata.

Can you expand on this a bit? I think you're here referring to the
up-to-date tags on each zone, but I'm not quite sure — and it brings
up questions like how the local image is aware of which zones are
mirroring it (possibly addressed later; I haven't finished reading
yet).
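
One plausible reading, sketched with invented names and an invented
threshold, and folding in the trim rule from the next quoted paragraph:

FLUSH_BYTES_THRESHOLD = 8 << 20   # update the position every ~8 MiB

class PositionTracker:
    def __init__(self):
        # (zone, purpose) -> (object num, offset), as in the journal header
        self.positions = {}
        self.bytes_since_update = 0

    def on_flushed(self, zone, obj_num, offset, nbytes):
        # Called as journal entries are flushed to the rbd data objects.
        self.bytes_since_update += nbytes
        if self.bytes_since_update >= FLUSH_BYTES_THRESHOLD:
            self.positions[(zone, "flushed")] = (obj_num, offset)
            self.bytes_since_update = 0

    def trim_bound(self):
        # The journal may only be trimmed up to the minimum position
        # across all recorded consumers.
        return min(self.positions.values(), default=None)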

>
> Trimming old entries from the journal would be allowed up to the
> minimum of all the positions stored in its metadata. This would be an
> asynchronous operation executed by the consumers of the journal.
>
> There would be a new feature bit for rbd images to enable
> journaling. As a first step it could only be set when an image is
> created.
>
> One way to enable it dynamically would be to take a snapshot at the
> same time to serve as a base for mirroring further changes.  This
> could be added as a journal entry for snapshot creation with a special
> 'internal' flag, and the snapshot could be deleted by the process that
> trims this journal entry.
>
> Deleting an image would delete its journal, despite any mirroring in
> progress, since mirroring is not backup.

I suspect that this is going to make some users sad. We've already got
people doing stuff like taking snapshots of RGW pools (don't, it
doesn't work right!) in order to protect against accidental deletions.

>
> Streaming Updates
> -----------------
>
> This is a complex area with many trade-offs. I expect we'll need some
> iteration to find good general solutions here. I'll describe a simple
> initial step, and some potential optimizations, and issues to address
> in future versions.
>
> In general, there will be a new daemon (tentatively called rbd-mirror
> here) that reads journal entries from images in one zone and replays
> them in different zones. An initial implementation might connect to
> ceph clusters in all zones, and replay writes and metadata changes to
> images in other zones directly via librbd. To simplify failover, it
> would be better to run these in follower zones rather than the leader
> zone.
>
> There are a couple of improvements on this we'd probably want to make
> early:
>
> * using multiple threads to mirror many images at once
> * using multiple processes to scale across machines, so one node is
> not a bottleneck
>
> Some other possible optimizations:
> * reading a large window of the journal to coalesce overlapping writes
> * decoupling reading from the leader zone and writing to follower zones,
> to allow optimizations like compression of the journal or other
> transforms as data is sent, and relaxing the requirement for one node
> to be directly connected to more than one ceph cluster

Yeah. We actually already have formats in use by the ceph-objectstore
tool and some of the CephFS metadata dump commands for dumping out
data; we might want to start out by basing the transfer on these. I'm
not sure we really get much by making the replay daemons a unified
system to begin with versus having a reader generating a stream and a
writer replaying it. And if they start off separate they're much
easier to optimize in ways like you've already discussed.
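
A minimal sketch of that split, with the serialized stream as the only
coupling point (the format and all names here are invented):

import gzip
import json

# Reader half: runs against the leader zone and serializes journal
# entries into a portable stream; gzip stands in for "compression or
# other transforms as data is sent".
def dump_stream(entries, path):
    with gzip.open(path, "wt") as f:
        for entry in entries:          # entry: one journal entry as a dict
            f.write(json.dumps(entry) + "\n")

# Writer half: runs in a follower zone with no connection to the leader
# cluster, replaying entries in order.
def replay_stream(path, apply_entry):
    with gzip.open(path, "rt") as f:
        for line in f:
            apply_entry(json.loads(line))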

>
> Noticing updates
> ^^^^^^^^^^^^^^^^
>
> There are two kinds of changes that rbd-mirror needs to be aware of:
>
> 1) journaled image creation/deletion
>
> The features of an image are only stored in the image's header right
> now. To get updates of these more easily, we need an index of some
> sort. This could take the form of an additional index in the
> rbd_directory object, which already contains all images. Creating or
> deleting an image with the journal feature bit could send a rados
> notify on the rbd_directory object, and rbd-mirror could watch
> rbd_directory for these notifications. The notifications could contain
> information about the image (at least its features), but if
> rbd-mirror's watch times out it could simply re-read the features of
> all images in a pool that it cares about (more on this later).

Do we actually need to store the features in the rbd_directory,
instead of simply having the mirror reader daemon check each new
image?
Do you plan to have any sort of separate database of the mirror
daemon's internal state so that on restart it can behave in a vaguely
efficient fashion instead of an order-N pass?
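
Roughly, the notify-plus-rescan fallback could look like this (a
self-contained sketch; DirectoryEvent and the event feed stand in for a
rados watch on rbd_directory):

from dataclasses import dataclass
from typing import Optional

@dataclass
class DirectoryEvent:
    kind: str                       # "notify" or "watch_timeout"
    image_id: Optional[str] = None
    features: int = 0               # hypothetical feature bits
    deleted: bool = False

def run_discovery(events, rescan):
    # 'rescan' re-reads every image's features: the order-N fallback.
    journaled = rescan()
    for event in events:
        if event.kind == "notify":
            if event.deleted:
                journaled.pop(event.image_id, None)
            else:
                journaled[event.image_id] = event.features
        elif event.kind == "watch_timeout":
            journaled = rescan()
    return journaled

# Tiny demo with canned events:
evts = [DirectoryEvent("notify", "img1", features=0x1),
        DirectoryEvent("watch_timeout")]
print(run_discovery(evts, rescan=lambda: {"img0": 0x1}))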

>
> Dynamically enabling/disabling features would work the same way. The
> image header would be updated as usual, and the rbd_directory index
> would be updated as well. If the journaling feature bit changed, a
> notify on the rbd_directory object would be sent.
>
> Since we'd be storing the features in two places, to keep them in sync
> we could use an approach like:
>
> a) set a new updated_features field on image header
> b) set features on rbd_directory
> c) clear updated_features and set features on image header
>
> This is all through the lock holder, so we don't need to worry about
> concurrent updates - header operations are prefixed by an assertion
> that the lock is still held for extra safety.
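
Spelled out with dict-backed stand-ins for the two omaps (all names
invented), including what crash recovery at each step would do:

# Stand-ins for the image header omap and the rbd_directory index.
header, directory = {}, {}

def set_features(image_id, new_features):
    # a) record the intent on the image header
    header[(image_id, "updated_features")] = new_features
    # b) publish to the rbd_directory index (would also send a notify)
    directory[image_id] = new_features
    # c) commit on the header and clear the intent marker
    header[(image_id, "features")] = new_features
    header.pop((image_id, "updated_features"), None)

def recover(image_id):
    # A crash between (a) and (c) leaves updated_features set, so redo
    # (b) and (c); both are idempotent. After (c) there is nothing to do.
    pending = header.get((image_id, "updated_features"))
    if pending is not None:
        directory[image_id] = pending
        header[(image_id, "features")] = pending
        del header[(image_id, "updated_features")]
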
>
> 2) journal updates for a particular image
>
> Generally rbd-mirror can keep reading the journal until it hits the
> end, detected by -ENOENT on an object or by reading less than the
> journal's target object size.
>
> Once it reaches the end, it can poll for new content periodically, or
> use notifications like watch/notify on the journal header for the max
> journal object number to change. I don't think polling in this case is
> very expensive, especially if it uses exponential backoff to a
> configurable max time it can be behind the journal.
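
A polling loop along those lines (a sketch; the backoff constants are
placeholders for the configurable maximum mentioned above):

import time

def follow_journal(read_next_entries, replay,
                   min_delay=0.1, max_delay=30.0):
    # read_next_entries() is a stand-in that returns [] at the journal's
    # end (-ENOENT on the next object, or a short final object).
    delay = min_delay
    while True:
        entries = read_next_entries()
        if entries:
            for entry in entries:
                replay(entry)
            delay = min_delay                  # reset backoff on progress
        else:
            time.sleep(delay)                  # poll, backing off
            delay = min(delay * 2, max_delay)  # exponentially, capped
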
>
> Clones
> ^^^^^^
>
> Cloning is currently the only way images can be related. Mirroring
> should preserve these relationships so mirrored zones behave the same
> as the original zone.
>
> In order for clones with non-zero overlap to be useful, their parent
> snapshot must be present in the zone already. A simple approach is to
> avoid mirroring clones until their parent snapshot is mirrored.
>
> Clones refer to parents by pool id, image id, and snapshot id. These
> are all generated automatically when each is created, so they will be
> different in different zones. Since pools and images can be renamed,
> we'll need a way to make sure we keep the correct mappings in mirrored
> zones. A simple way to do this is to record a leader zone ->
> follower zone mapping for pool and image ids. When a pool or image
> is created in follower zones, their mapping to the ids in the leader
> zone would be stored in the destination zone.
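
The mapping could be as simple as a set of per-zone lookup tables
(illustrative only):

# Follower-zone tables mapping leader ids to local ids, so a clone's
# parent reference survives renames and independently generated ids.
leader_to_local = {
    "pool":  {},   # leader pool id  -> local pool id
    "image": {},   # leader image id -> local image id
    "snap":  {},   # leader snap id  -> local snap id
}

def map_parent(pool_id, image_id, snap_id):
    # Raises KeyError if the parent snapshot has not been mirrored yet,
    # matching the "wait for the parent" rule above.
    return (leader_to_local["pool"][pool_id],
            leader_to_local["image"][image_id],
            leader_to_local["snap"][snap_id])
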
>
> Parallelism
> ^^^^^^^^^^^
>
> Mirroring many images is embarrassingly parallel. A simple unit of
> work is an image (more specifically a journal, if e.g. a group of
> images shared a journal as part of a consistency group in the future).
>
> Spreading this work across threads within a single process is
> relatively simple. For HA, and to avoid a single NIC becoming a
> bottleneck, we'll want to spread out the work across multiple
> processes (and probably multiple hosts). rbd-mirror should have no
> local state, so we just need a mechanism to coordinate the division of
> work across multiple processes.
>
> One way to do this would be layering on top of watch/notify. Each
> rbd-mirror process in a zone could watch the same object, and shard
> the set of images to mirror based on a hash of image ids onto the
> current set of rbd-mirror processes sorted by client gid. The set of
> rbd-mirror processes could be determined by listing watchers.

You're going to have some tricky cases here when reassigning authority
as watchers come and go, but I think it should be doable.
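
A sketch of that sharding rule (the hash choice here is arbitrary):

import hashlib

def owner_of(image_id: str, watcher_gids: list) -> int:
    # Hash each image id onto the sorted list of live rbd-mirror
    # watchers; every process computes the same answer independently.
    ranked = sorted(watcher_gids)
    h = int(hashlib.sha1(image_id.encode()).hexdigest(), 16)
    return ranked[h % len(ranked)]

# Each process mirrors the images it owns; when the watcher list
# changes, everyone recomputes, which is exactly the reassignment
# window flagged above.
print(owner_of("image-123", [4101, 4205, 4350]))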

>
> Failover
> --------
>
> Watch/notify could also be used (via a predetermined object) to
> communicate with rbd-mirror processes to get sync status from each,
> and for managing failover.
>
> Failing over means preventing changes in the original leader zone, and
> making the new leader zone writeable. The state of a zone (read-only vs
> writeable) could be stored in a zone's metadata in rados to represent
> this, and images with the journal feature bit could check this before
> being opened read/write for safety. To make it race-proof, the zone
> state can be a tri-state - read-only, read-write, or changing.

If you want these states to be authoritative you're going to have a
bit of a tricky time — what happens when you do the failover and then
the old leader zone comes back up and thinks it's still the leader?
How do rbd-mirror processes elsewhere react?
If nothing else you'll definitely need leader epochs, and possibly
more. Our experience with RGW DR has taught us that these
follower-leader paradigms are really difficult to get right. :(
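
Folding leader epochs into the tri-state check might look like this
(entirely hypothetical):

from enum import Enum

class ZoneState(Enum):
    READ_ONLY = "read-only"
    CHANGING = "changing"
    READ_WRITE = "read-write"

def may_open_read_write(state: ZoneState, zone_epoch: int,
                        highest_epoch_seen: int) -> bool:
    # A journaled image checks its zone before opening read/write. The
    # epoch guards against a resurrected old leader: its recorded epoch
    # predates the one minted at failover, so its writes are refused.
    return state is ZoneState.READ_WRITE and zone_epoch >= highest_epoch_seen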

>
> In the original leader zone, if it is still running, the zone would be
> set to read-only mode and all clients could be blacklisted to avoid
> creating too much divergent history to rollback later.
>
> In the new leader zone, the zone's state would be set to 'changing',
> and rbd-mirror processes would be told to stop copying from the
> original leader and close the images they were mirroring to.  New
> rbd-mirror processes should refuse to start mirroring when the zone is
> not read-only. Once the mirroring processes have stopped, the zone
> could be set to read-write, and begin normal usage.
>
> Failback
> ^^^^^^^^
>
> In this scenario, after failing over, the original leader zone (A)
> starts running again, but needs to catch up to the current leader
> (B). At a high level, this involves syncing up the image by rolling
> back the updates in A past the point B synced to as noted in an
> image's journal in A, and mirroring all the changes since then from
> B.

How do you envision this rollback happening? I don't see how it's
feasible — you can't possibly wait on writeback of data until it's
mirrored to all zones, and once it's written to the backing image
there's no undoing it.
-Greg


* Re: RBD mirroring design draft
  2015-05-28  5:37 ` Gregory Farnum
@ 2015-05-28 10:42   ` John Spray
  2015-05-28 14:07     ` Gregory Farnum
  0 siblings, 1 reply; 9+ messages in thread
From: John Spray @ 2015-05-28 10:42 UTC (permalink / raw)
  To: Gregory Farnum, Josh Durgin; +Cc: ceph-devel



On 28/05/2015 06:37, Gregory Farnum wrote:
> On Tue, May 12, 2015 at 5:42 PM, Josh Durgin <jdurgin@redhat.com> wrote:
>> It will need some metadata regarding positions in the journal. These
>> could be stored as omap values in a 'journal header' object in a
>> replicated pool, for rbd perhaps the same pool as the image for
>> simplicity. The header would contain at least:
>>
>> * pool_id - where journal data is stored
>> * journal_object_prefix - unique prefix for journal data objects
>> * positions - (zone, purpose, object num, offset) tuples indexed by zone
>> * object_size - approximate size of each data object
>> * object_num_begin - current earliest object in the log
>> * object_num_end - max potential object in the log
>>
>> Similar to rbd images, journal data would be stored in objects named
>> after the journal_object_prefix and their object number. To avoid
>> issues of padding or splitting journal entries, and to make it simpler
>> to keep append-only, it's easier to allow the objects to be near
>> object_size before moving to the next object number instead of
>> sticking with an exact object size.
>>
>> Ideally this underlying structure could be used for both rbd and
>> cephfs. Variable sized objects are different from the existing cephfs
>> journal, which uses fixed-size objects for striping. The default is
>> still 4MB chunks though. How important is striping the journal to
>> cephfs? For rbd it seems unlikely to help much, since updates need to
>> be batched up by the client cache anyway.
> I think the journaling v2 stuff that John did actually made objects
> variably-sized as you've described here. We've never done any sort of
> striping on the MDS journal, although I think it was
> possible previously.

The objects are still fixed size: we talked about changing it so that 
journal events would never span an object boundary, but didn't do it -- 
it still uses Filer.

>
>>
>> Parallelism
>> ^^^^^^^^^^^
>>
>> Mirroring many images is embarrassingly parallel. A simple unit of
>> work is an image (more specifically a journal, if e.g. a group of
>> images shared a journal as part of a consistency group in the future).
>>
>> Spreading this work across threads within a single process is
>> relatively simple. For HA, and to avoid a single NIC becoming a
>> bottleneck, we'll want to spread out the work across multiple
>> processes (and probably multiple hosts). rbd-mirror should have no
>> local state, so we just need a mechanism to coordinate the division of
>> work across multiple processes.
>>
>> One way to do this would be layering on top of watch/notify. Each
>> rbd-mirror process in a zone could watch the same object, and shard
>> the set of images to mirror based on a hash of image ids onto the
>> current set of rbd-mirror processes sorted by client gid. The set of
>> rbd-mirror processes could be determined by listing watchers.
> You're going to have some tricky cases here when reassigning authority
> as watchers come and go, but I think it should be doable.

I've been fantasizing about something similar to this for CephFS
backward scrub/recovery.  My current code supports parallelism, but
relies on the user to script the distribution of workers across client
nodes.

I had been thinking more of a master/slaves model, where one guy
would get to be the master by e.g. taking the lock on an object, and he
would then hand out work to everyone else that was a watch/notify
subscriber to the magic object.  It seems like that could be simpler
than having workers have to work out independently what their workload
should be, and it would have the added bonus of providing a command-like
mechanism in addition to continuous operation.
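
A rough sketch of that election-by-lock model, with everything here
invented (the lock handle, notify, and the handout policy):

def master_loop(lock, images, worker_gids, notify):
    # Whoever acquires the lock on the magic object is the master.
    if not lock.try_acquire():
        return                  # not master: wait for notifies instead
    for i, image in enumerate(sorted(images)):
        # Simple round-robin handout; a real master could rebalance as
        # subscribers come and go.
        notify(worker_gids[i % len(worker_gids)], image)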

Cheers,
John


* Re: RBD mirroring design draft
  2015-05-28 10:42   ` John Spray
@ 2015-05-28 14:07     ` Gregory Farnum
  0 siblings, 0 replies; 9+ messages in thread
From: Gregory Farnum @ 2015-05-28 14:07 UTC (permalink / raw)
  To: John Spray; +Cc: Josh Durgin, ceph-devel

On Thu, May 28, 2015 at 3:42 AM, John Spray <john.spray@redhat.com> wrote:
>
>
> On 28/05/2015 06:37, Gregory Farnum wrote:
>>
>> On Tue, May 12, 2015 at 5:42 PM, Josh Durgin <jdurgin@redhat.com> wrote:
>>> Parallelism
>>> ^^^^^^^^^^^
>>>
>>> Mirroring many images is embarrassingly parallel. A simple unit of
>>> work is an image (more specifically a journal, if e.g. a group of
>>> images shared a journal as part of a consistency group in the future).
>>>
>>> Spreading this work across threads within a single process is
>>> relatively simple. For HA, and to avoid a single NIC becoming a
>>> bottleneck, we'll want to spread out the work across multiple
>>> processes (and probably multiple hosts). rbd-mirror should have no
>>> local state, so we just need a mechanism to coordinate the division of
>>> work across multiple processes.
>>>
>>> One way to do this would be layering on top of watch/notify. Each
>>> rbd-mirror process in a zone could watch the same object, and shard
>>> the set of images to mirror based on a hash of image ids onto the
>>> current set of rbd-mirror processes sorted by client gid. The set of
>>> rbd-mirror processes could be determined by listing watchers.
>>
>> You're going to have some tricky cases here when reassigning authority
>> as watchers come and go, but I think it should be doable.
>
>
> I've been fantasizing about something similar to this for CephFS backward
> scrub/recovery.  My current code supports parallelism, but relies on the
> user to script their population of workers across client nodes.
>
> I had been thinking of more of a master/slaves model, where one guy would
> get to be the master by e.g. taking the lock on an object, and he would then
> hand out work to everyone else that was a watch/notify subscriber to the
> magic object.  It seems like that could be simpler than having workers have
> to work out independently what their workload should be, and have the added
> bonus of providing a command-like mechanism in addition to continuous
> operation.

Heh. This could be the method but I caution people that it's a
brand-new use case for watch-notify and I'm not too sure how it'd
perform. I suspect we'd need to keep the chunks of work pretty large
in order to avoid the watch-notify cycle latencies being a limiting
factor. ;)

Speaking more generally, unless a peer-based model turns out to be
infeasible I much prefer that — the systems are sometimes more
complicated but generally much more resilient to failures, and tend to
be better-designed for recovery than when everything is residing in
the master's memory and then has to get reconstructed.
-Greg


end of thread

Thread overview: 9+ messages
2015-05-13  0:42 RBD mirroring design draft Josh Durgin
2015-05-13  7:48 ` Haomai Wang
2015-05-13  8:07 ` Haomai Wang
2015-05-14  4:21   ` Josh Durgin
     [not found]     ` <CAAW3nmh+XxB8K2XsWgnD_cWWPZGw=VpsuomodMM1SNad8LmZAQ@mail.gmail.com>
2015-05-20 21:30       ` Josh Durgin
     [not found]         ` <CAAW3nmjWQTOOhym5t6LQ8E0P8AsHnD0c0MkfbF2zre_oUJFudw@mail.gmail.com>
2015-05-21 15:34           ` Josh Durgin
2015-05-28  5:37 ` Gregory Farnum
2015-05-28 10:42   ` John Spray
2015-05-28 14:07     ` Gregory Farnum
