* [PATCH] reinstate ceph cluster_snap support
@ 2013-08-22  9:10 Alexandre Oliva
  2013-08-24  0:17 ` Sage Weil
  0 siblings, 1 reply; 15+ messages in thread
From: Alexandre Oliva @ 2013-08-22  9:10 UTC (permalink / raw)
  To: ceph-devel

This patch brings back and updates (for dumpling) the code originally
introduced to support “ceph osd cluster_snap <snap>”, that was disabled
and partially removed before cuttlefish.

Some minimal testing appears to indicate this even works: the modified
mon actually generated an osdmap with the cluster_snap request, and
starting a modified osd that was down and letting it catch up caused
the osd to take the requested snapshot.  I see no reason why it
wouldn't have taken it if it was up and running, so...  Why was this
feature disabled in the first place?

Signed-off-by: Alexandre Oliva <oliva@gnu.org>
---
 src/mon/MonCommands.h |    6 ++++--
 src/mon/OSDMonitor.cc |   11 +++++++----
 src/osd/OSD.cc        |   13 +++++++++++++
 3 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/src/mon/MonCommands.h b/src/mon/MonCommands.h
index 8e9c2bb..225c687 100644
--- a/src/mon/MonCommands.h
+++ b/src/mon/MonCommands.h
@@ -431,8 +431,10 @@ COMMAND("osd set " \
 COMMAND("osd unset " \
 	"name=key,type=CephChoices,strings=pause|noup|nodown|noout|noin|nobackfill|norecover|noscrub|nodeep-scrub", \
 	"unset <key>", "osd", "rw", "cli,rest")
-COMMAND("osd cluster_snap", "take cluster snapshot (disabled)", \
-	"osd", "r", "")
+COMMAND("osd cluster_snap " \
+	"name=snap,type=CephString", \
+	"take cluster snapshot",	\
+	"osd", "r", "cli")
 COMMAND("osd down " \
 	"type=CephString,name=ids,n=N", \
 	"set osd(s) <id> [<id>...] down", "osd", "rw", "cli,rest")
diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
index 07022ae..9bf9511 100644
--- a/src/mon/OSDMonitor.cc
+++ b/src/mon/OSDMonitor.cc
@@ -3099,10 +3099,13 @@ bool OSDMonitor::prepare_command(MMonCommand *m)
       return prepare_unset_flag(m, CEPH_OSDMAP_NODEEP_SCRUB);
 
   } else if (prefix == "osd cluster_snap") {
-    // ** DISABLE THIS FOR NOW **
-    ss << "cluster snapshot currently disabled (broken implementation)";
-    // ** DISABLE THIS FOR NOW **
-
+    string snap;
+    cmd_getval(g_ceph_context, cmdmap, "snap", snap);
+    pending_inc.cluster_snapshot = snap;
+    ss << "creating cluster snap " << snap;
+    getline(ss, rs);
+    wait_for_finished_proposal(new Monitor::C_Command(mon, m, 0, rs, get_last_committed()));
+    return true;
   } else if (prefix == "osd down" ||
 	     prefix == "osd out" ||
 	     prefix == "osd in" ||
diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 1a77dae..e41a6b3 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -5022,6 +5022,19 @@ void OSD::handle_osd_map(MOSDMap *m)
     assert(0 == "MOSDMap lied about what maps it had?");
   }
 
+  // check for cluster snapshots
+  for (epoch_t cur = superblock.current_epoch + 1; cur <= m->get_last(); cur++) {
+    OSDMapRef newmap = get_map(cur);
+    string cluster_snap = newmap->get_cluster_snapshot();
+    if (cluster_snap.length() == 0)
+      continue;
+
+    dout(0) << "creating cluster snapshot '" << cluster_snap << "'" << dendl;
+    int r = store->snapshot(cluster_snap);
+    if (r)
+      dout(0) << "failed to create cluster snapshot: " << cpp_strerror(r) << dendl;
+  }
+
   if (superblock.oldest_map) {
     int num = 0;
     epoch_t min(

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer


* Re: [PATCH] reinstate ceph cluster_snap support
  2013-08-22  9:10 [PATCH] reinstate ceph cluster_snap support Alexandre Oliva
@ 2013-08-24  0:17 ` Sage Weil
  2013-08-24 14:56   ` Alexandre Oliva
  0 siblings, 1 reply; 15+ messages in thread
From: Sage Weil @ 2013-08-24  0:17 UTC (permalink / raw)
  To: Alexandre Oliva, sam.just; +Cc: ceph-devel

Sam, do you remember why we disabled this?

I think it happened when the pg threading stuff went in, but I'm not sure 
why we can't just take a blanket snapshot of current/.

FWIW Alexandre, this feature was never really complete.  For it to work, 
we also need to snapshot the monitors, and roll them back as well.

sage


On Thu, 22 Aug 2013, Alexandre Oliva wrote:

> This patch brings back and updates (for dumpling) the code originally
> introduced to support “ceph osd cluster_snap <snap>”, that was disabled
> and partially removed before cuttlefish.
> 
> Some minimal testing appears to indicate this even works: the modified
> mon actually generated an osdmap with the cluster_snap request, and
> starting a modified osd that was down and letting it catch up caused
> the osd to take the requested snapshot.  I see no reason why it
> wouldn't have taken it if it was up and running, so...  Why was this
> feature disabled in the first place?
> 
> Signed-off-by: Alexandre Oliva <oliva@gnu.org>
> ---
>  src/mon/MonCommands.h |    6 ++++--
>  src/mon/OSDMonitor.cc |   11 +++++++----
>  src/osd/OSD.cc        |   13 +++++++++++++
>  3 files changed, 24 insertions(+), 6 deletions(-)
> 
> diff --git a/src/mon/MonCommands.h b/src/mon/MonCommands.h
> index 8e9c2bb..225c687 100644
> --- a/src/mon/MonCommands.h
> +++ b/src/mon/MonCommands.h
> @@ -431,8 +431,10 @@ COMMAND("osd set " \
>  COMMAND("osd unset " \
>  	"name=key,type=CephChoices,strings=pause|noup|nodown|noout|noin|nobackfill|norecover|noscrub|nodeep-scrub", \
>  	"unset <key>", "osd", "rw", "cli,rest")
> -COMMAND("osd cluster_snap", "take cluster snapshot (disabled)", \
> -	"osd", "r", "")
> +COMMAND("osd cluster_snap " \
> +	"name=snap,type=CephString", \
> +	"take cluster snapshot",	\
> +	"osd", "r", "cli")
>  COMMAND("osd down " \
>  	"type=CephString,name=ids,n=N", \
>  	"set osd(s) <id> [<id>...] down", "osd", "rw", "cli,rest")
> diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
> index 07022ae..9bf9511 100644
> --- a/src/mon/OSDMonitor.cc
> +++ b/src/mon/OSDMonitor.cc
> @@ -3099,10 +3099,13 @@ bool OSDMonitor::prepare_command(MMonCommand *m)
>        return prepare_unset_flag(m, CEPH_OSDMAP_NODEEP_SCRUB);
>  
>    } else if (prefix == "osd cluster_snap") {
> -    // ** DISABLE THIS FOR NOW **
> -    ss << "cluster snapshot currently disabled (broken implementation)";
> -    // ** DISABLE THIS FOR NOW **
> -
> +    string snap;
> +    cmd_getval(g_ceph_context, cmdmap, "snap", snap);
> +    pending_inc.cluster_snapshot = snap;
> +    ss << "creating cluster snap " << snap;
> +    getline(ss, rs);
> +    wait_for_finished_proposal(new Monitor::C_Command(mon, m, 0, rs, get_last_committed()));
> +    return true;
>    } else if (prefix == "osd down" ||
>  	     prefix == "osd out" ||
>  	     prefix == "osd in" ||
> diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
> index 1a77dae..e41a6b3 100644
> --- a/src/osd/OSD.cc
> +++ b/src/osd/OSD.cc
> @@ -5022,6 +5022,19 @@ void OSD::handle_osd_map(MOSDMap *m)
>      assert(0 == "MOSDMap lied about what maps it had?");
>    }
>  
> +  // check for cluster snapshots
> +  for (epoch_t cur = superblock.current_epoch + 1; cur <= m->get_last(); cur++) {
> +    OSDMapRef newmap = get_map(cur);
> +    string cluster_snap = newmap->get_cluster_snapshot();
> +    if (cluster_snap.length() == 0)
> +      continue;
> +
> +    dout(0) << "creating cluster snapshot '" << cluster_snap << "'" << dendl;
> +    int r = store->snapshot(cluster_snap);
> +    if (r)
> +      dout(0) << "failed to create cluster snapshot: " << cpp_strerror(r) << dendl;
> +  }
> +
>    if (superblock.oldest_map) {
>      int num = 0;
>      epoch_t min(
> 
> -- 
> Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
> You must be the change you wish to see in the world. -- Gandhi
> Be Free! -- http://FSFLA.org/   FSF Latin America board member
> Free Software Evangelist      Red Hat Brazil Compiler Engineer


* Re: [PATCH] reinstate ceph cluster_snap support
  2013-08-24  0:17 ` Sage Weil
@ 2013-08-24 14:56   ` Alexandre Oliva
  2013-08-27 22:21     ` Sage Weil
  0 siblings, 1 reply; 15+ messages in thread
From: Alexandre Oliva @ 2013-08-24 14:56 UTC (permalink / raw)
  To: Sage Weil; +Cc: sam.just, ceph-devel

[-- Attachment #1: Type: text/plain, Size: 1899 bytes --]

On Aug 23, 2013, Sage Weil <sage@inktank.com> wrote:

> FWIW Alexandre, this feature was never really complete.  For it to work, 
> we also need to snapshot the monitors, and roll them back as well.

That depends on what's expected from the feature, actually.

One use is to roll back a single osd, and for that, the feature works
just fine.  Of course, for that one doesn't need the multi-osd snapshots
to be mutually consistent, but it's still convenient to be able to take
a global snapshot with a single command.

Another use is to roll back the entire cluster to an earlier state, and
for that, you *probably* want to roll back the monitors too, although it
doesn't seem like this is strictly necessary, unless some significant
configuration changes occurred in the cluster since the snapshot was
taken, and you want to roll back those too.

In my experience, rolling back only osds has worked just fine, with the
exception of cases in which the snapshot is much too old, and the mons
have already expired osdmaps after the last one the osd got when the
snapshot was taken.  For this one case, I have a patch that enables the
osd to rejoin the cluster in spite of the expired osdmaps, which has
always worked for me, but I understand there may be exceptional cluster
reconfigurations in which this wouldn't have worked.


As for snapshotting monitors...  I suppose the way to go is to start a
store.db dump in background, instead of taking a btrfs snapshot, since
the store.db is not created as a subvolume.  That said, it would make
some sense to make it so, to make it trivially snapshottable.
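
(If store.db were a subvolume, the snapshot itself would boil down to a
btrfs ioctl, roughly along these lines; this is just an untested sketch,
with the paths and the function name made up for illustration:)

  // untested sketch: snapshot mon_data/store.db (assuming it is a btrfs
  // subvolume) into mon_data/<name>.  Needs <fcntl.h>, <sys/ioctl.h>,
  // <cstring>, <cerrno> and os/btrfs_ioctl.h.
  int snapshot_store_db(const std::string& mon_data, const std::string& name)
  {
    int mondirfd = ::open(mon_data.c_str(), O_RDONLY);  // destination dir
    int storefd = ::open((mon_data + "/store.db").c_str(), O_RDONLY); // source
    if (mondirfd < 0 || storefd < 0) {
      if (mondirfd >= 0) ::close(mondirfd);
      if (storefd >= 0) ::close(storefd);
      return -errno;
    }

    struct btrfs_ioctl_vol_args vol_args;
    memset(&vol_args, 0, sizeof(vol_args));
    vol_args.fd = storefd;                              // subvolume to snapshot
    strncpy(vol_args.name, name.c_str(), sizeof(vol_args.name) - 1);

    int r = ::ioctl(mondirfd, BTRFS_IOC_SNAP_CREATE, &vol_args);
    ::close(storefd);
    ::close(mondirfd);
    return r < 0 ? -errno : 0;
  }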


Anyway, I found a problem in the earlier patch: when I added a new disk
to my cluster this morning, it tried to iterate over osdmaps that were
not available (e.g. the long-gone osdmap 1), and crashed.

Here's a fixed version, that makes sure we don't start the iteration
before m->get_first().


[-- Attachment #2: mon-osd-reinstate-cluster-snap.patch --]
[-- Type: text/x-diff, Size: 3342 bytes --]

reinstate ceph cluster_snap support

From: Alexandre Oliva <oliva@gnu.org>

This patch brings back and updates (for dumpling) the code originally
introduced to support “ceph osd cluster_snap <snap>”, that was
disabled and partially removed before cuttlefish.

Some minimal testing appears to indicate this even works: the modified
mon actually generated an osdmap with the cluster_snap request, and
starting a modified osd that was down and letting it catch up caused
the osd to take the requested snapshot.  I see no reason why it
wouldn't have taken it if it was up and running, so...  Why was this
feature disabled in the first place?

Signed-off-by: Alexandre Oliva <oliva@gnu.org>
---
 src/mon/MonCommands.h |    6 ++++--
 src/mon/OSDMonitor.cc |   11 +++++++----
 src/osd/OSD.cc        |   14 ++++++++++++++
 3 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/src/mon/MonCommands.h b/src/mon/MonCommands.h
index 8e9c2bb..225c687 100644
--- a/src/mon/MonCommands.h
+++ b/src/mon/MonCommands.h
@@ -431,8 +431,10 @@ COMMAND("osd set " \
 COMMAND("osd unset " \
 	"name=key,type=CephChoices,strings=pause|noup|nodown|noout|noin|nobackfill|norecover|noscrub|nodeep-scrub", \
 	"unset <key>", "osd", "rw", "cli,rest")
-COMMAND("osd cluster_snap", "take cluster snapshot (disabled)", \
-	"osd", "r", "")
+COMMAND("osd cluster_snap " \
+	"name=snap,type=CephString", \
+	"take cluster snapshot",	\
+	"osd", "r", "cli")
 COMMAND("osd down " \
 	"type=CephString,name=ids,n=N", \
 	"set osd(s) <id> [<id>...] down", "osd", "rw", "cli,rest")
diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
index 07022ae..9bf9511 100644
--- a/src/mon/OSDMonitor.cc
+++ b/src/mon/OSDMonitor.cc
@@ -3099,10 +3099,13 @@ bool OSDMonitor::prepare_command(MMonCommand *m)
       return prepare_unset_flag(m, CEPH_OSDMAP_NODEEP_SCRUB);
 
   } else if (prefix == "osd cluster_snap") {
-    // ** DISABLE THIS FOR NOW **
-    ss << "cluster snapshot currently disabled (broken implementation)";
-    // ** DISABLE THIS FOR NOW **
-
+    string snap;
+    cmd_getval(g_ceph_context, cmdmap, "snap", snap);
+    pending_inc.cluster_snapshot = snap;
+    ss << "creating cluster snap " << snap;
+    getline(ss, rs);
+    wait_for_finished_proposal(new Monitor::C_Command(mon, m, 0, rs, get_last_committed()));
+    return true;
   } else if (prefix == "osd down" ||
 	     prefix == "osd out" ||
 	     prefix == "osd in" ||
diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 1a77dae..01527a6 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -5022,6 +5022,20 @@ void OSD::handle_osd_map(MOSDMap *m)
     assert(0 == "MOSDMap lied about what maps it had?");
   }
 
+  // check for cluster snapshots
+  for (epoch_t cur = MAX(superblock.current_epoch + 1, m->get_first());
+       cur <= m->get_last(); cur++) {
+    OSDMapRef newmap = get_map(cur);
+    string cluster_snap = newmap->get_cluster_snapshot();
+    if (cluster_snap.length() == 0)
+      continue;
+
+    dout(0) << "creating cluster snapshot '" << cluster_snap << "'" << dendl;
+    int r = store->snapshot(cluster_snap);
+    if (r)
+      dout(0) << "failed to create cluster snapshot: " << cpp_strerror(r) << dendl;
+  }
+
   if (superblock.oldest_map) {
     int num = 0;
     epoch_t min(

[-- Attachment #3: Type: text/plain, Size: 258 bytes --]



-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer


* Re: [PATCH] reinstate ceph cluster_snap support
  2013-08-24 14:56   ` Alexandre Oliva
@ 2013-08-27 22:21     ` Sage Weil
  2013-08-28  0:54       ` Yan, Zheng
                         ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Sage Weil @ 2013-08-27 22:21 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: sam.just, ceph-devel

Hi,

On Sat, 24 Aug 2013, Alexandre Oliva wrote:
> On Aug 23, 2013, Sage Weil <sage@inktank.com> wrote:
> 
> > FWIW Alexandre, this feature was never really complete.  For it to work, 
> > we also need to snapshot the monitors, and roll them back as well.
> 
> That depends on what's expected from the feature, actually.
> 
> One use is to roll back a single osd, and for that, the feature works
> just fine.  Of course, for that one doesn't need the multi-osd snapshots
> to be mutually consistent, but it's still convenient to be able to take
> a global snapshot with a single command.
> 
> Another use is to roll back the entire cluster to an earlier state, and
> for that, you *probably* want to roll back the monitors too, although it
> doesn't seem like this is strictly necessary, unless some significant
>> configuration changes occurred in the cluster since the snapshot was
> taken, and you want to roll back those too.
> 
> In my experience, rolling back only osds has worked just fine, with the
> exception of cases in which the snapshot is much too old, and the mons
> have already expired osdmaps after the last one the osd got when the
> snapshot was taken.  For this one case, I have a patch that enables the
> osd to rejoin the cluster in spite of the expired osdmaps, which has
> always worked for me, but I understand there may be exceptional cluster
> reconfigurations in which this wouldn't have worked.
> 
> 
> As for snapshotting monitors...  I suppose the way to go is to start a
> store.db dump in background, instead of taking a btrfs snapshot, since
> the store.db is not created as a subvolume.  That said, it would make
> some sense to make it so, to make it trivially snapshottable.
> 
> 
> Anyway, I found a problem in the earlier patch: when I added a new disk
> to my cluster this morning, it tried to iterate over osdmaps that were
> not available (e.g. the long-gone osdmap 1), and crashed.
> 
> Here's a fixed version, that makes sure we don't start the iteration
> before m->get_first().

In principle, we can add this back in.  I think it needs a few changes, 
though.

First, FileStore::snapshot() needs to pause and drain the workqueue before 
taking the snapshot, similar to what is done with the sync sequence.  
Otherwise it isn't a transactionally consistent snapshot and may tear some 
update.  Because it is draining the work queue, it *might* also need to 
drop some locks, but I'm hopeful that that isn't necessary.
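
Roughly the ordering I have in mind, as a sketch only (the member names
below are stand-ins, not necessarily what the tree actually calls them):

  int FileStore::snapshot(const string& name)
  {
    // stop dispatching new transactions and wait for the in-flight ones,
    // so the checkpoint cannot tear a half-applied update
    op_tp.pause();        // stand-in for the op thread pool
    op_wq.drain();        // stand-in for the op work queue

    sync_and_flush();     // push applied state down to the filesystem

    // take the actual checkpoint (e.g. a btrfs snapshot of current/)
    int r = create_checkpoint(name);

    op_tp.unpause();
    return r;
  }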

Second, the call in handle_osd_map() should probably go in the loop a bit 
further down that is consuming maps.  It probably won't matter most of the 
time, but I'm paranoid about corner conditions.  It also avoids iterating 
over the new OSDMaps multiple times in the common case where there is no 
cluster_snap (a minor win).

Finally, eventually we should make this do a checkpoint on the mons too.  
We can add the osd snapping back in first, but before this can/should 
really be used the mons need to be snapshotted as well.  Probably that's 
just adding in a snapshot() method to MonitorStore.h and doing either a 
leveldb snap or making a full copy of store.db... I forget what leveldb is 
capable of here.

sage


* Re: [PATCH] reinstate ceph cluster_snap support
  2013-08-27 22:21     ` Sage Weil
@ 2013-08-28  0:54       ` Yan, Zheng
  2013-08-28  4:34         ` Sage Weil
  2013-12-17 12:14       ` Alexandre Oliva
  2014-10-21  2:49       ` Alexandre Oliva
  2 siblings, 1 reply; 15+ messages in thread
From: Yan, Zheng @ 2013-08-28  0:54 UTC (permalink / raw)
  To: Sage Weil; +Cc: Alexandre Oliva, sam.just, ceph-devel

On Wed, Aug 28, 2013 at 6:21 AM, Sage Weil <sage@inktank.com> wrote:
> Hi,
>
> On Sat, 24 Aug 2013, Alexandre Oliva wrote:
>> On Aug 23, 2013, Sage Weil <sage@inktank.com> wrote:
>>
>> > FWIW Alexandre, this feature was never really complete.  For it to work,
>> > we also need to snapshot the monitors, and roll them back as well.
>>
>> That depends on what's expected from the feature, actually.
>>
>> One use is to roll back a single osd, and for that, the feature works
>> just fine.  Of course, for that one doesn't need the multi-osd snapshots
>> to be mutually consistent, but it's still convenient to be able to take
>> a global snapshot with a single command.
>>
>> Another use is to roll back the entire cluster to an earlier state, and
>> for that, you *probably* want to roll back the monitors too, although it
>> doesn't seem like this is strictly necessary, unless some significant
>> configuration changes occurred in the cluster since the snapshot was
>> taken, and you want to roll back those too.
>>
>> In my experience, rolling back only osds has worked just fine, with the
>> exception of cases in which the snapshot is much too old, and the mons
>> have already expired osdmaps after the last one the osd got when the
>> snapshot was taken.  For this one case, I have a patch that enables the
>> osd to rejoin the cluster in spite of the expired osdmaps, which has
>> always worked for me, but I understand there may be exceptional cluster
>> reconfigurations in which this wouldn't have worked.
>>
>>
>> As for snapshotting monitors...  I suppose the way to go is to start a
>> store.db dump in background, instead of taking a btrfs snapshot, since
>> the store.db is not created as a subvolume.  That said, it would make
>> some sense to make it so, to make it trivially snapshottable.
>>
>>
>> Anyway, I found a problem in the earlier patch: when I added a new disk
>> to my cluster this morning, it tried to iterate over osdmaps that were
>> not available (e.g. the long-gone osdmap 1), and crashed.
>>
>> Here's a fixed version, that makes sure we don't start the iteration
>> before m->get_first().
>
> In principle, we can add this back in.  I think it needs a few changes,
> though.
>
> First, FileStore::snapshot() needs to pause and drain the workqueue before
> taking the snapshot, similar to what is done with the sync sequence.
> Otherwise it isn't a transactionally consistent snapshot and may tear some
> update.  Because it is draining the work queue, it *might* also need to
> drop some locks, but I'm hopeful that that isn't necessary.
>
> Second, the call in handle_osd_map() should probably go in the loop a bit
> further down that is consuming maps.  It probably won't matter most of the
> time, but I'm paranoid about corner conditions.  It also avoids iterating
> over the new OSDMaps multiple times in the common case where there is no
> cluster_snap (a minor win).
>
> Finally, eventually we should make this do a checkpoint on the mons too.
> We can add the osd snapping back in first, but before this can/should
> really be used the mons need to be snapshotted as well.  Probably that's
> just adding in a snapshot() method to MonitorStore.h and doing either a
> leveldb snap or making a full copy of store.db... I forget what leveldb is
> capable of here.
>

I think we also need to snapshot the osd journal

Regards
Yan, Zheng


* Re: [PATCH] reinstate ceph cluster_snap support
  2013-08-28  0:54       ` Yan, Zheng
@ 2013-08-28  4:34         ` Sage Weil
  0 siblings, 0 replies; 15+ messages in thread
From: Sage Weil @ 2013-08-28  4:34 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: Alexandre Oliva, sam.just, ceph-devel

On Wed, 28 Aug 2013, Yan, Zheng wrote:
> On Wed, Aug 28, 2013 at 6:21 AM, Sage Weil <sage@inktank.com> wrote:
> > Hi,
> >
> > On Sat, 24 Aug 2013, Alexandre Oliva wrote:
> >> On Aug 23, 2013, Sage Weil <sage@inktank.com> wrote:
> >>
> >> > FWIW Alexandre, this feature was never really complete.  For it to work,
> >> > we also need to snapshot the monitors, and roll them back as well.
> >>
> >> That depends on what's expected from the feature, actually.
> >>
> >> One use is to roll back a single osd, and for that, the feature works
> >> just fine.  Of course, for that one doesn't need the multi-osd snapshots
> >> to be mutually consistent, but it's still convenient to be able to take
> >> a global snapshot with a single command.
> >>
> >> Another use is to roll back the entire cluster to an earlier state, and
> >> for that, you *probably* want to roll back the monitors too, although it
> >> doesn't seem like this is strictly necessary, unless some significant
> >> configuration changes occurred in the cluster since the snapshot was
> >> taken, and you want to roll back those too.
> >>
> >> In my experience, rolling back only osds has worked just fine, with the
> >> exception of cases in which the snapshot is much too old, and the mons
> >> have already expired osdmaps after the last one the osd got when the
> >> snapshot was taken.  For this one case, I have a patch that enables the
> >> osd to rejoin the cluster in spite of the expired osdmaps, which has
> >> always worked for me, but I understand there may be exceptional cluster
> >> reconfigurations in which this wouldn't have worked.
> >>
> >>
> >> As for snapshotting monitors...  I suppose the way to go is to start a
> >> store.db dump in background, instead of taking a btrfs snapshot, since
> >> the store.db is not created as a subvolume.  That said, it would make
> >> some sense to make it so, to make it trivially snapshottable.
> >>
> >>
> >> Anyway, I found a problem in the earlier patch: when I added a new disk
> >> to my cluster this morning, it tried to iterate over osdmaps that were
> >> not available (e.g. the long-gone osdmap 1), and crashed.
> >>
> >> Here's a fixed version, that makes sure we don't start the iteration
> >> before m->get_first().
> >
> > In principle, we can add this back in.  I think it needs a few changes,
> > though.
> >
> > First, FileStore::snapshot() needs to pause and drain the workqueue before
> > taking the snapshot, similar to what is done with the sync sequence.
> > Otherwise it isn't a transactionally consistent snapshot and may tear some
> > update.  Because it is draining the work queue, it *might* also need to
> > drop some locks, but I'm hopeful that that isn't necessary.
> >
> > Second, the call in handle_osd_map() should probably go in the loop a bit
> > further down that is consuming maps.  It probably won't matter most of the
> > time, but I'm paranoid about corner conditions.  It also avoids iterating
> > over the new OSDMaps multiple times in the common case where there is no
> > cluster_snap (a minor win).
> >
> > Finally, eventually we should make this do a checkpoint on the mons too.
> > We can add the osd snapping back in first, but before this can/should
> > really be used the mons need to be snapshotted as well.  Probably that's
> > just adding in a snapshot() method to MonitorStore.h and doing either a
> > leveldb snap or making a full copy of store.db... I forget what leveldb is
> > capable of here.
> >
> 
> I think we also need to snapshot the osd journal

If the snapshot does a sync (drain op_tp before doing the snap), that puts 
the file subvol in a consistent state.  To actually use it, ceph-osd rolls 
back to that point on startup.  I didn't check that code, but I think what 
it should do is ignore/reset the journal then.

This is annoying code to test, unfortunately.

sage


* Re: [PATCH] reinstate ceph cluster_snap support
  2013-08-27 22:21     ` Sage Weil
  2013-08-28  0:54       ` Yan, Zheng
@ 2013-12-17 12:14       ` Alexandre Oliva
  2013-12-17 13:50         ` Alexandre Oliva
  2014-10-21  2:49       ` Alexandre Oliva
  2 siblings, 1 reply; 15+ messages in thread
From: Alexandre Oliva @ 2013-12-17 12:14 UTC (permalink / raw)
  To: Sage Weil; +Cc: sam.just, ceph-devel

[-- Attachment #1: Type: text/plain, Size: 2963 bytes --]

On Aug 27, 2013, Sage Weil <sage@inktank.com> wrote:

> Hi,
> On Sat, 24 Aug 2013, Alexandre Oliva wrote:
>> On Aug 23, 2013, Sage Weil <sage@inktank.com> wrote:
>> 
>> > FWIW Alexandre, this feature was never really complete.  For it to work, 
>> > we also need to snapshot the monitors, and roll them back as well.
>> 
>> That depends on what's expected from the feature, actually.
>> 
>> One use is to roll back a single osd, and for that, the feature works
>> just fine.  Of course, for that one doesn't need the multi-osd snapshots
>> to be mutually consistent, but it's still convenient to be able to take
>> a global snapshot with a single command.

> In principle, we can add this back in.  I think it needs a few changes, 
> though.

> First, FileStore::snapshot() needs to pause and drain the workqueue before 
> taking the snapshot, similar to what is done with the sync sequence.  
> Otherwise it isn't a transactionally consistent snapshot and may tear some 
> update.  Because it is draining the work queue, it *might* also need to 
> drop some locks, but I'm hopeful that that isn't necessary.

Hmm...  I don't quite get this.  The Filestore implementation of
snapshot already performs a sync_and_flush before calling the backend's
create_checkpoint.  Shouldn't that be enough?  FWIW, the code I brought
in from argonaut didn't do any such thing; it did drop locks, but that
doesn't seem to be necessary any more:

  // flush here so that the peering code can re-read any pg data off
  // disk that it needs to... say for backlog generation.  (hmm, is
  // this really needed?)
  osd_lock.Unlock();
  if (cluster_snap.length()) {
    dout(0) << "creating cluster snapshot '" << cluster_snap << "'" << dendl;
    int r = store->snapshot(cluster_snap);
    if (r)
      dout(0) << "failed to create cluster snapshot: " << cpp_strerror(r) << dendl;
  } else {
    store->flush();
  }
  osd_lock.Lock();

> Second, the call in handle_osd_map() should probably go in the loop a bit 
> further down that is consuming maps.  It probably won't matter most of the 
> time, but I'm paranoid about corner conditions.  It also avoids iterating 
> over the new OSDMaps multiple times in the common case where there is no 
> cluster_snap (a minor win).

I've just moved the cluster creation down to the loop I think you're
speaking of above.  Here's the revised patch, so far untested, just for
reference so that you don't have to refer to the archives to locate the
earlier patch and make sense of the comments in this old thread.

> Finally, eventually we should make this do a checkpoint on the mons too.  
> We can add the osd snapping back in first, but before this can/should 
> really be used the mons need to be snapshotted as well.  Probably that's 
> just adding in a snapshot() method to MonitorStore.h and doing either a 
> leveldb snap or making a full copy of store.db... I forget what leveldb is 
> capable of here.

I haven't looked into this yet.



[-- Attachment #2: mon-osd-reinstate-cluster-snap.patch --]
[-- Type: text/x-diff, Size: 3077 bytes --]

reinstate ceph cluster_snap support

From: Alexandre Oliva <oliva@gnu.org>

This patch brings back and updates (for dumpling) the code originally
introduced to support “ceph osd cluster_snap <snap>”, that was
disabled and partially removed before cuttlefish.

Some minimal testing appears to indicate this even works: the modified
mon actually generated an osdmap with the cluster_snap request, and
starting a modified osd that was down and letting it catch up caused
the osd to take the requested snapshot.  I see no reason why it
wouldn't have taken it if it was up and running, so...  Why was this
feature disabled in the first place?

Signed-off-by: Alexandre Oliva <oliva@gnu.org>
---
 src/mon/MonCommands.h |    6 ++++--
 src/mon/OSDMonitor.cc |   11 +++++++----
 src/osd/OSD.cc        |    8 ++++++++
 3 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/src/mon/MonCommands.h b/src/mon/MonCommands.h
index 5a6ca6a..8977f29 100644
--- a/src/mon/MonCommands.h
+++ b/src/mon/MonCommands.h
@@ -445,8 +445,10 @@ COMMAND("osd set " \
 COMMAND("osd unset " \
 	"name=key,type=CephChoices,strings=pause|noup|nodown|noout|noin|nobackfill|norecover|noscrub|nodeep-scrub", \
 	"unset <key>", "osd", "rw", "cli,rest")
-COMMAND("osd cluster_snap", "take cluster snapshot (disabled)", \
-	"osd", "r", "")
+COMMAND("osd cluster_snap " \
+	"name=snap,type=CephString", \
+	"take cluster snapshot",	\
+	"osd", "r", "cli")
 COMMAND("osd down " \
 	"type=CephString,name=ids,n=N", \
 	"set osd(s) <id> [<id>...] down", "osd", "rw", "cli,rest")
diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
index 07775fc..9a46978 100644
--- a/src/mon/OSDMonitor.cc
+++ b/src/mon/OSDMonitor.cc
@@ -3428,10 +3428,13 @@ bool OSDMonitor::prepare_command(MMonCommand *m)
       return prepare_unset_flag(m, CEPH_OSDMAP_NODEEP_SCRUB);
 
   } else if (prefix == "osd cluster_snap") {
-    // ** DISABLE THIS FOR NOW **
-    ss << "cluster snapshot currently disabled (broken implementation)";
-    // ** DISABLE THIS FOR NOW **
-
+    string snap;
+    cmd_getval(g_ceph_context, cmdmap, "snap", snap);
+    pending_inc.cluster_snapshot = snap;
+    ss << "creating cluster snap " << snap;
+    getline(ss, rs);
+    wait_for_finished_proposal(new Monitor::C_Command(mon, m, 0, rs, get_last_committed()));
+    return true;
   } else if (prefix == "osd down" ||
 	     prefix == "osd out" ||
 	     prefix == "osd in" ||
diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 8da4d96..b5720f7 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -5169,6 +5169,14 @@ void OSD::handle_osd_map(MOSDMap *m)
       }
     }
     
+    string cluster_snap = newmap->get_cluster_snapshot();
+    if (cluster_snap.length()) {
+      dout(0) << "creating cluster snapshot '" << cluster_snap << "'" << dendl;
+      int r = store->snapshot(cluster_snap);
+      if (r)
+	dout(0) << "failed to create cluster snapshot: " << cpp_strerror(r) << dendl;
+    }
+
     osdmap = newmap;
 
     superblock.current_epoch = cur;

[-- Attachment #3: Type: text/plain, Size: 258 bytes --]



-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer


* Re: [PATCH] reinstate ceph cluster_snap support
  2013-12-17 12:14       ` Alexandre Oliva
@ 2013-12-17 13:50         ` Alexandre Oliva
  2013-12-17 14:22           ` Alexandre Oliva
  0 siblings, 1 reply; 15+ messages in thread
From: Alexandre Oliva @ 2013-12-17 13:50 UTC (permalink / raw)
  To: Sage Weil; +Cc: sam.just, ceph-devel

On Dec 17, 2013, Alexandre Oliva <oliva@gnu.org> wrote:

>> Finally, eventually we should make this do a checkpoint on the mons too.  
>> We can add the osd snapping back in first, but before this can/should 
>> really be used the mons need to be snapshotted as well.  Probably that's 
>> just adding in a snapshot() method to MonitorStore.h and doing either a 
>> leveldb snap or making a full copy of store.db... I forget what leveldb is 
>> capable of here.

> I haven't looked into this yet.

I looked a bit at the leveldb interface.  It offers a facility to create
Snapshots, but they only last for the duration of one session of the
database.  It can be used to create multiple iterators at once state of
the db, or to read multiple values from the same state of the db, but
not to roll back to a state you had at an earlier session, e.g., after a
monitor restart.  So they won't help us.

I thus see a few possibilities (all of them to be done between taking
note of the request for the new snapshot and returning a response to the
requestor that the request was satisfied):

1. take a snapshot, create an iterator out of the snapshot, create a new
database named after the cluster_snap key, and go over all key/value
pairs tha the iterator can see, adding each one to this new database.

2. close the database, create a dir named after the cluster_snap key,
create hardlinks to all files in the database tree in the cluster_snap
dir, and then reopen the database

3. flush the leveldb (how?  will a write with sync=true do?  must we
close it?) and take a btrfs snapshot of the store.db tree, named after
the cluster_snap key, and then reopen the database

None of these are particularly appealing; (1) wastes disk space and cpu
cycles; (2) relies on leveldb internal implementation details such as
the fact that files are never modified after they're first closed, and
(3) requires a btrfs subvol for the store.db.  My favorite choice would
be 3, but can we just fail mon snaps when this requirement is not met?
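
For reference, (1) would amount to something like this (untested sketch
using the KeyValueDB wrapper; snap_path is a placeholder, and everything
goes into a single transaction, which may not be wise memory-wise):

  // sketch of (1): copy every (prefix,key,value) pair into a fresh db
  KeyValueDB *snapdb = KeyValueDB::create(g_ceph_context, "leveldb", snap_path);
  snapdb->init();
  ostringstream err;
  int r = snapdb->create_and_open(err);
  if (!r) {
    KeyValueDB::Transaction t = snapdb->get_transaction();
    KeyValueDB::WholeSpaceIterator it = db->get_snapshot_iterator();
    for (it->seek_to_first(); it->valid(); it->next()) {
      pair<string,string> k = it->raw_key();
      t->set(k.first, k.second, it->value());
    }
    r = snapdb->submit_transaction_sync(t);
  }
  delete snapdb;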

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer


* Re: [PATCH] reinstate ceph cluster_snap support
  2013-12-17 13:50         ` Alexandre Oliva
@ 2013-12-17 14:22           ` Alexandre Oliva
  2013-12-18 19:35             ` Gregory Farnum
  0 siblings, 1 reply; 15+ messages in thread
From: Alexandre Oliva @ 2013-12-17 14:22 UTC (permalink / raw)
  To: Sage Weil; +Cc: sam.just, ceph-devel

On Dec 17, 2013, Alexandre Oliva <oliva@gnu.org> wrote:

> On Dec 17, 2013, Alexandre Oliva <oliva@gnu.org> wrote:
>>> Finally, eventually we should make this do a checkpoint on the mons too.  
>>> We can add the osd snapping back in first, but before this can/should 
>>> really be used the mons need to be snapshotted as well.  Probably that's 
>>> just adding in a snapshot() method to MonitorStore.h and doing either a 
>>> leveldb snap or making a full copy of store.db... I forget what leveldb is 
>>> capable of here.

>> I haven't looked into this yet.

> None of these are particularly appealing; (1) wastes disk space and cpu
> cycles; (2) relies on leveldb internal implementation details such as
> the fact that files are never modified after they're first closed, and
> (3) requires a btrfs subvol for the store.db.  My favorite choice would
> be 3, but can we just fail mon snaps when this requirement is not met?

Another aspect that needs to be considered is whether to take a snapshot
of the leader only, or of all monitors in the quorum.  The fact that the
snapshot operation may take a while to complete (particularly (1)), and
monitors may not make progress while taking the snapshot (which might
cause the client and other monitors to assume other monitors have
failed), makes the whole thing quite a bit more complex than what I'd
have hoped for.

Another point that may affect the decision is the amount of information
in store.db that may have to be retained.  E.g., if it's just a small
amount of information, creating a separate database makes far more sense
than taking a complete copy of the entire database, and it might even
make sense for the leader to include the full snapshot data in the
snapshot-taking message shared with other monitors, so that they all
take exactly the same snapshot, even if they're not in the quorum and
receive the update at a later time.  Of course this wouldn't work if the
amount of snapshotted monitor data was more than reasonable for a
monitor message.

Anyway, this is probably more than what I'd be able to undertake myself,
at least in part because, although I can see one place to add the
snapshot-taking code to the leader (assuming it's ok to take the
snapshot just before or right after all monitors agree on it), I have no
idea of where to plug the snapshot-taking behavior into peon and
recovering monitors.  Absent a two-phase protocol, it seems to me that
all monitors ought to take snapshots tentatively when they issue or
acknowledge the snapshot-taking proposal, so as to make sure that if it
succeeds we'll have a quorum of snapshots, but if the proposal doesn't
succeed at first, I don't know how to deal with retries (overwrite
existing snapshots?  discard the snapshot when its proposal fails?) or
cancellation (say, the client doesn't get confirmation from the leader,
the leader changes, it retries a few times, and eventually it gives
up, but some monitors have already tentatively taken the snapshot in the
meantime).

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer


* Re: [PATCH] reinstate ceph cluster_snap support
  2013-12-17 14:22           ` Alexandre Oliva
@ 2013-12-18 19:35             ` Gregory Farnum
  2013-12-19  8:22               ` Alexandre Oliva
  0 siblings, 1 reply; 15+ messages in thread
From: Gregory Farnum @ 2013-12-18 19:35 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: Sage Weil, Samuel Just, ceph-devel

On Tue, Dec 17, 2013 at 4:14 AM, Alexandre Oliva <oliva@gnu.org> wrote:
> On Aug 27, 2013, Sage Weil <sage@inktank.com> wrote:
>
>> Hi,
>> On Sat, 24 Aug 2013, Alexandre Oliva wrote:
>>> On Aug 23, 2013, Sage Weil <sage@inktank.com> wrote:
>>>
>>> > FWIW Alexandre, this feature was never really complete.  For it to work,
>>> > we also need to snapshot the monitors, and roll them back as well.
>>>
>>> That depends on what's expected from the feature, actually.
>>>
>>> One use is to roll back a single osd, and for that, the feature works
>>> just fine.  Of course, for that one doesn't need the multi-osd snapshots
>>> to be mutually consistent, but it's still convenient to be able to take
>>> a global snapshot with a single command.
>
>> In principle, we can add this back in.  I think it needs a few changes,
>> though.
>
>> First, FileStore::snapshot() needs to pause and drain the workqueue before
>> taking the snapshot, similar to what is done with the sync sequence.
>> Otherwise it isn't a transactionally consistent snapshot and may tear some
>> update.  Because it is draining the work queue, it *might* also need to
>> drop some locks, but I'm hopeful that that isn't necessary.
>
> Hmm...  I don't quite get this.  The Filestore implementation of
> snapshot already performs a sync_and_flush before calling the backend's
> create_checkpoint.  Shouldn't that be enough?  FWIW, the code I brought
> in from argonaut didn't do any such thing; it did drop locks, but that
> doesn't seem to be necessary any more:

From a quick skim I think you're right about that. The more serious
concern in the OSDs (which motivated removing the cluster snap) is
what Sage mentioned: we used to be able to take a snapshot for which
all PGs were at the same epoch, and we can't do that now. It's
possible that's okay, but it makes the semantics even weirder than
they used to be (you've never been getting a real point-in-time
snapshot, although as long as you didn't use external communication
channels you could at least be sure it contained a causal cut).

And of course that's nothing compared to snapshotting the monitors, as
you've noticed — but making it actually be a cluster snapshot (instead
of something you could basically do by taking a btrfs snapshot
yourself) is something I would want to see before we bring the feature
back into mainline.


On Tue, Dec 17, 2013 at 6:22 AM, Alexandre Oliva <oliva@gnu.org> wrote:
> On Dec 17, 2013, Alexandre Oliva <oliva@gnu.org> wrote:
>
>> On Dec 17, 2013, Alexandre Oliva <oliva@gnu.org> wrote:
>>>> Finally, eventually we should make this do a checkpoint on the mons too.
>>>> We can add the osd snapping back in first, but before this can/should
>>>> really be used the mons need to be snapshotted as well.  Probably that's
>>>> just adding in a snapshot() method to MonitorStore.h and doing either a
>>>> leveldb snap or making a full copy of store.db... I forget what leveldb is
>>>> capable of here.
>
>>> I haven't looked into this yet.
>
>> None of these are particularly appealing; (1) wastes disk space and cpu
>> cycles; (2) relies on leveldb internal implementation details such as
>> the fact that files are never modified after they're first closed, and
>> (3) requires a btrfs subvol for the store.db.  My favorite choice would
>> be 3, but can we just fail mon snaps when this requirement is not met?
>
> Another aspect that needs to be considered is whether to take a snapshot
> of the leader only, or of all monitors in the quorum.  The fact that the
> snapshot operation may take a while to complete (particularly (1)), and
> monitors may not make progress while taking the snapshot (which might
> cause the client and other monitors to assume other monitors have
> failed), make the whole thing quite more complex than what I'd have
> hoped for.
>
> Another point that may affect the decision is the amount of information
> in store.db that may have to be retained.  E.g., if it's just a small
> amount of information, creating a separate database makes far more sense
> than taking a complete copy of the entire database, and it might even
> make sense for the leader to include the full snapshot data in the
> snapshot-taking message shared with other monitors, so that they all
> take exactly the same snapshot, even if they're not in the quorum and
> receive the update at a later time.  Of course this wouldn't work if the
> amount of snapshotted monitor data was more than reasonable for a
> monitor message.
>
> Anyway, this is probably more than what I'd be able to undertake myself,
> at least in part because, although I can see one place to add the
> snapshot-taking code to the leader (assuming it's ok to take the
> snapshot just before or right after all monitors agree on it), I have no
> idea of where to plug the snapshot-taking behavior into peon and
> recovering monitors.  Absent a two-phase protocol, it seems to me that
> all monitors ought to take snapshots tentatively when they issue or
> acknowledge the snapshot-taking proposal, so as to make sure that if it
> succeeds we'll have a quorum of snapshots, but if the proposal doesn't
> succeed at first, I don't know how to deal with retries (overwrite
> existing snapshots?  discard the snapshot when its proposal fails?) or
> cancellation (say, the client doesn't get confirmation from the leader,
> the leader changes, it retries that some times, and eventually it gives
> up, but some monitors have already tentatively taken the snapshot in the
> mean time).

The best way I can think of in a short time to solve these problems
would be to make snapshots first-class citizens in the monitor. We
could extend the monitor store to handle multiple leveldb instances,
and then a snapshot would be an async operation which does a
leveldb snapshot inline and spins off a thread to clone that data into
a new leveldb instance. When all the up monitors complete, the user
gets a report saying the snapshot was successful and it gets marked
complete in some snapshot map. Any monitors which have to get a full
store sync would also sync any snapshots they don't already have. If
the monitors can't complete a snapshot (all failing at once for some
reason) then they could block the user from doing anything except
deleting them.
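
Very roughly, something like this is what I'm picturing (handwavy
sketch, names made up; the copy loop would be the same one Alexandre
already described):

  // handwavy sketch: inline leveldb snapshot, clone in a background thread
  void MonitorDBStore::snapshot_async(const string& name, Context *onfinish)
  {
    // grab a point-in-time view of the store right away...
    KeyValueDB::WholeSpaceIterator it = db->get_snapshot_iterator();
    // ...then let a worker copy it into a new instance and report back
    std::thread([=]() {
      int r = clone_into_new_instance(name, it);  // made-up helper
      onfinish->complete(r);  // e.g. mark the snapshot-map entry complete
    }).detach();
  }
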
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


* Re: [PATCH] reinstate ceph cluster_snap support
  2013-12-18 19:35             ` Gregory Farnum
@ 2013-12-19  8:22               ` Alexandre Oliva
  0 siblings, 0 replies; 15+ messages in thread
From: Alexandre Oliva @ 2013-12-19  8:22 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, Samuel Just, ceph-devel

On Dec 18, 2013, Gregory Farnum <greg@inktank.com> wrote:

> (you've never been getting a real point-in-time
> snapshot, although as long as you didn't use external communication
> channels you could at least be sure it contained a causal cut).

I never expected more than a causal cut, really (my wife got a PhD in
consistent checkpointing of distributed systems, so my expectations may
be somewhat better informed than those a random user might have ;-),
although even now I've seldom got a snapshot in which the osd data
differs across replicas (I actually check for that; that's part of my
reason for taking the snapshots in the first place), even when I fail to
explicitly make the cluster quiescent.  But that's probably just “luck”,
as my cluster usually isn't busy when I take such snapshots ;-)

> And of course that's nothing compared to snapshotting the monitors, as
> you've noticed

I've given it some more thought, and it occurred to me that, if we make
mons take the snapshot when the snapshot-taking request is committed to
the cluster history, we should have the snapshots taking at the right
time and without the need for rolling them back and taking them again.

The idea is that, if the snapshot-taking is committed, eventually we'll
have a quorum carrying that commit, and thus each of the quorum members
will have taken a snapshot as soon as they got that commit, even if they
did so during recovery, or if they took so long to take the snapshot
that they got kicked out of the quorum for a while.  If they get
actually restarted, they will get the commit again and take the snapshot
from the beginning.  If all mons in the quorum that accepted the commit
get restarted so that none of them actually records the commit request,
and it doesn't get propagated to other mons that attempt to rejoin,
well, it's as if the request had never been committed.  OTOH, if it did
get to other mons, or if any of them survives, the committed request
will make it to a quorum and eventually to all monitors, each one taking
its snapshot at the time it gets the commit.

This should work as long as all mons apply recovery info in the same
order, i.e., they never commit to their database any information that
happens-after the snapshot commit before the snapshot commit itself, nor
miss any information that happened-before the snapshot commit by the
time they process it.  That said, having little idea of
the inner workings of the monitors, I can't tell whether they actually
meet this “as long as” condition ;-(

> — but making it actually be a cluster snapshot (instead
> of something you could basically do by taking a btrfs snapshot
> yourself)

Taking btrfs snapshots manually over several osds on several hosts is
hardly a way to get a causal cut (but you already knew that ;-)

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer


* Re: [PATCH] reinstate ceph cluster_snap support
  2013-08-27 22:21     ` Sage Weil
  2013-08-28  0:54       ` Yan, Zheng
  2013-12-17 12:14       ` Alexandre Oliva
@ 2014-10-21  2:49       ` Alexandre Oliva
  2014-10-27 21:00         ` Sage Weil
  2 siblings, 1 reply; 15+ messages in thread
From: Alexandre Oliva @ 2014-10-21  2:49 UTC (permalink / raw)
  To: Sage Weil; +Cc: sam.just, ceph-devel

[-- Attachment #1: Type: text/plain, Size: 1567 bytes --]

On Aug 27, 2013, Sage Weil <sage@inktank.com> wrote:

> Finally, eventually we should make this do a checkpoint on the mons too.  
> We can add the osd snapping back in first, but before this can/should 
> really be used the mons need to be snapshotted as well.  Probably that's 
> just adding in a snapshot() method to MonitorStore.h and doing either a 
> leveldb snap or making a full copy of store.db... I forget what leveldb is 
> capable of here.

I suppose it might be a bit too late for Giant, but I finally got 'round
to implementing this.  I attach the patch that implements it, to be
applied on top of the updated version of the patch I posted before, also
attached.

I have a backport to Firefly too, if there's interest.

I have tested both methods: btrfs snapshotting of store.db (I've
manually turned store.db into a btrfs subvolume), and creating a new db
with all (prefix,key,value) triples.  I'm undecided about inserting
multiple transaction commits for the latter case; the mon mem use grew
up a lot as it was, and in a few tests the snapshotting ran twice, but
in the end a dump of all the data in the database created by btrfs
snapshotting was identical to that created by explicit copying.  So, the
former is preferred, since it's so much more efficient.  I also
considered hardlinking all files in store.db into a separate tree, but I
didn't like the idea of coding that in C+-, :-) and I figured it might
not work with other db backends, and maybe even not be guaranteed to
work with leveldb.  It's probably not worth much more effort.


[-- Attachment #2: mon-cluster-snap.patch --]
[-- Type: text/x-diff, Size: 6324 bytes --]

mon: take mon snapshots

From: Alexandre Oliva <oliva@gnu.org>

Extend the machinery that takes osd snapshots to take mon snapshots
too.  Taking a btrfs snapshot of store.db is the preferred method, but
if that fails, the database is copied one tuple at a time.  Unlike osd
snapshots, existing mon snapshots are clobbered by the new snapshots,
mainly because a mon may attempt to take the same snapshot multiple
times, especially if a first attempt takes very long.  It shouldn't be
a problem: the most important property is that the osd snapshots are
taken before the updated osdmap is integrated, whereas the mon
snapshots are taken after that, so that osd snapshots happen-before
mon ones, otherwise in case of rollback osds might have an osdmap that
mons don't know about.

There's no guarantee that all monitors will be completely up to date
at the time the snapshot is taken.  It might be that monitors that
were lagging behind take the snapshot at a later time, and before they
get all the monitor state of the quorum set at the time of the
snapshot.  So, when rolling back the entire cluster (which would have
to be done by hand, as there's no command to do that for mons), it is
advisable to roll back and bring up all monitors, so that it's likely
that more than one monitor has the complete monitor state to propagate
to others.

Signed-off-by: Alexandre Oliva <oliva@gnu.org>
---
 src/mon/Monitor.cc       |   11 +++++
 src/mon/Monitor.h        |    2 +
 src/mon/MonitorDBStore.h |   97 +++++++++++++++++++++++++++++++++++++++++++---
 src/mon/OSDMonitor.cc    |    5 ++
 4 files changed, 107 insertions(+), 8 deletions(-)

diff --git a/src/mon/Monitor.cc b/src/mon/Monitor.cc
index 1536b2e..dacda09 100644
--- a/src/mon/Monitor.cc
+++ b/src/mon/Monitor.cc
@@ -4025,6 +4025,17 @@ void Monitor::scrub_reset()
 
 
 
+int Monitor::store_snapshot(const string& name) {
+  while (paxos->is_writing() || paxos->is_writing_previous()) {
+    lock.Unlock();
+    store->flush();
+    lock.Lock();
+  }
+
+  return store->snapshot(name);
+}
+
+
 /************ TICK ***************/
 
 class C_Mon_Tick : public Context {
diff --git a/src/mon/Monitor.h b/src/mon/Monitor.h
index 63423f2..6a0bef5 100644
--- a/src/mon/Monitor.h
+++ b/src/mon/Monitor.h
@@ -835,6 +835,8 @@ public:
 
   void handle_signal(int sig);
 
+  int store_snapshot(const string& name);
+
   int mkfs(bufferlist& osdmapbl);
 
   /**
diff --git a/src/mon/MonitorDBStore.h b/src/mon/MonitorDBStore.h
index a0c82b7..bbe0011 100644
--- a/src/mon/MonitorDBStore.h
+++ b/src/mon/MonitorDBStore.h
@@ -27,6 +27,16 @@
 #include "common/Finisher.h"
 #include "common/errno.h"
 
+#include <unistd.h>
+#include <fcntl.h>
+#include <errno.h>
+#include <stdlib.h>
+#include <sys/ioctl.h>
+
+#ifndef __CYGWIN__
+#include "os/btrfs_ioctl.h"
+#endif
+
 class MonitorDBStore
 {
   boost::scoped_ptr<KeyValueDB> db;
@@ -587,12 +597,10 @@ class MonitorDBStore
     return db->get_estimated_size(extras);
   }
 
-  MonitorDBStore(const string& path)
-    : db(0),
-      do_dump(false),
-      dump_fd(-1),
-      io_work(g_ceph_context, "monstore"),
-      is_open(false) {
+  static string store_path(const string *pathp = NULL,
+			   const string& name = "store.db") {
+    string path = pathp ? *pathp : g_conf->mon_data;
+
     string::const_reverse_iterator rit;
     int pos = 0;
     for (rit = path.rbegin(); rit != path.rend(); ++rit, ++pos) {
@@ -600,9 +608,84 @@ class MonitorDBStore
 	break;
     }
     ostringstream os;
-    os << path.substr(0, path.size() - pos) << "/store.db";
+    os << path.substr(0, path.size() - pos) << "/" << name;
     string full_path = os.str();
 
+    return full_path;
+  }
+
+  int snapshot(const string& name) {
+    int r = -ENOTSUP;
+
+#ifdef BTRFS_IOC_SNAP_CREATE
+    {
+      string snap = store_path(0, name);
+      string store = store_path();
+
+      int mondirfd = ::open(g_conf->mon_data.c_str(), 0);
+      int storefd = ::open(store.c_str(), O_RDONLY);
+
+      if (storefd >= 0 && mondirfd >= 0) {
+	struct btrfs_ioctl_vol_args vol_args;
+	memset(&vol_args, 0, sizeof(vol_args));
+
+	vol_args.fd = storefd;
+	strcpy(vol_args.name, name.c_str());
+	(void) ::ioctl(mondirfd, BTRFS_IOC_SNAP_DESTROY, &vol_args);
+	r = ::ioctl(mondirfd, BTRFS_IOC_SNAP_CREATE, &vol_args);
+      }
+
+      ::close(storefd);
+      ::close(mondirfd);
+    }
+#endif
+
+    if (r) {
+      string snap = store_path (0, name);
+      KeyValueDB* snapdb = KeyValueDB::create(g_ceph_context,
+					      g_conf->mon_keyvaluedb,
+					      snap);
+      if (!snapdb)
+	r = -errno;
+      else {
+	snapdb->init();
+	ostringstream os;
+	r = snapdb->create_and_open(os);
+	if (!r) {
+	  KeyValueDB::Transaction dbt = snapdb->get_transaction();
+	  KeyValueDB::WholeSpaceIterator it = snapdb->get_iterator();
+	  for (it->seek_to_first(); it->valid(); it->next()) {
+	    pair<string,string> k = it->raw_key();
+	    dbt->rmkey(k.first, k.second);
+	  }
+	  r = snapdb->submit_transaction(dbt);
+
+	  if (!r) {
+	    dbt = snapdb->get_transaction();
+	    it = db->get_snapshot_iterator();
+	    for (it->seek_to_first(); it->valid(); it->next()) {
+	      pair<string,string> k = it->raw_key();
+	      dbt->set(k.first, k.second, it->value());
+	    }
+	    r = snapdb->submit_transaction_sync(dbt);
+	  }
+
+	  delete snapdb;
+	}
+      }
+    }
+
+    return r;
+  }
+
+  MonitorDBStore(const string& path)
+    : db(0),
+      do_dump(false),
+      dump_fd(-1),
+      io_work(g_ceph_context, "monstore"),
+      is_open(false) {
+    string full_path = store_path (&path);
+
     KeyValueDB *db_ptr = KeyValueDB::create(g_ceph_context,
 					    g_conf->mon_keyvaluedb,
 					    full_path);
diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
index b237846..068abbe 100644
--- a/src/mon/OSDMonitor.cc
+++ b/src/mon/OSDMonitor.cc
@@ -230,8 +230,11 @@ void OSDMonitor::update_from_paxos(bool *need_bootstrap)
       t->erase("mkfs", "osdmap");
     }
 
-    if (tx_size > g_conf->mon_sync_max_payload_size*2) {
+    if (tx_size > g_conf->mon_sync_max_payload_size*2 ||
+	osdmap.cluster_snapshot_epoch) {
       mon->store->apply_transaction(t);
+      if (osdmap.cluster_snapshot_epoch)
+	mon->store_snapshot("clustersnap_" + osdmap.cluster_snapshot);
       t = MonitorDBStore::TransactionRef();
       tx_size = 0;
     }

[-- Attachment #3: mon-osd-reinstate-cluster-snap.patch --]
[-- Type: text/x-diff, Size: 3038 bytes --]

reinstate ceph cluster_snap support

From: Alexandre Oliva <oliva@gnu.org>

This patch brings back and updates (for dumpling) the code originally
introduced to support “ceph osd cluster_snap <snap>”, that was
disabled and partially removed before cuttlefish.

Some minimal testing appears to indicate this even works: the modified
mon actually generated an osdmap with the cluster_snap request, and
starting a modified osd that was down and letting it catch up caused
the osd to take the requested snapshot.  I see no reason why it
wouldn't have taken it if it was up and running, so...  Why was this
feature disabled in the first place?

Signed-off-by: Alexandre Oliva <oliva@gnu.org>
---
 src/mon/MonCommands.h |    6 ++++--
 src/mon/OSDMonitor.cc |   11 +++++++----
 src/osd/OSD.cc        |    8 ++++++++
 3 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/src/mon/MonCommands.h b/src/mon/MonCommands.h
index d702615..8f468f4 100644
--- a/src/mon/MonCommands.h
+++ b/src/mon/MonCommands.h
@@ -499,8 +499,10 @@ COMMAND("osd set " \
 COMMAND("osd unset " \
 	"name=key,type=CephChoices,strings=pause|noup|nodown|noout|noin|nobackfill|norecover|noscrub|nodeep-scrub|notieragent", \
 	"unset <key>", "osd", "rw", "cli,rest")
-COMMAND("osd cluster_snap", "take cluster snapshot (disabled)", \
-	"osd", "r", "")
+COMMAND("osd cluster_snap " \
+	"name=snap,type=CephString", \
+	"take cluster snapshot",	\
+	"osd", "r", "cli")
 COMMAND("osd down " \
 	"type=CephString,name=ids,n=N", \
 	"set osd(s) <id> [<id>...] down", "osd", "rw", "cli,rest")
diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
index bfcc09e..b237846 100644
--- a/src/mon/OSDMonitor.cc
+++ b/src/mon/OSDMonitor.cc
@@ -4766,10 +4766,13 @@ bool OSDMonitor::prepare_command_impl(MMonCommand *m,
     }
 
   } else if (prefix == "osd cluster_snap") {
-    // ** DISABLE THIS FOR NOW **
-    ss << "cluster snapshot currently disabled (broken implementation)";
-    // ** DISABLE THIS FOR NOW **
-
+    string snap;
+    cmd_getval(g_ceph_context, cmdmap, "snap", snap);
+    pending_inc.cluster_snapshot = snap;
+    ss << "creating cluster snap " << snap;
+    getline(ss, rs);
+    wait_for_finished_proposal(new Monitor::C_Command(mon, m, 0, rs, get_last_committed()));
+    return true;
   } else if (prefix == "osd down" ||
 	     prefix == "osd out" ||
 	     prefix == "osd in" ||
diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index f2f5df5..eb4f246 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -6310,6 +6310,14 @@ void OSD::handle_osd_map(MOSDMap *m)
       }
     }
     
+    string cluster_snap = newmap->get_cluster_snapshot();
+    if (cluster_snap.length()) {
+      dout(0) << "creating cluster snapshot '" << cluster_snap << "'" << dendl;
+      int r = store->snapshot(cluster_snap);
+      if (r)
+	dout(0) << "failed to create cluster snapshot: " << cpp_strerror(r) << dendl;
+    }
+
     osdmap = newmap;
 
     superblock.current_epoch = cur;

[-- Attachment #4: Type: text/plain, Size: 257 bytes --]


-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] reinstate ceph cluster_snap support
  2014-10-21  2:49       ` Alexandre Oliva
@ 2014-10-27 21:00         ` Sage Weil
  2014-11-03 19:57           ` Alexandre Oliva
  0 siblings, 1 reply; 15+ messages in thread
From: Sage Weil @ 2014-10-27 21:00 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: sam.just, ceph-devel

On Tue, 21 Oct 2014, Alexandre Oliva wrote:
> On Aug 27, 2013, Sage Weil <sage@inktank.com> wrote:
> 
> > Finally, eventually we should make this do a checkpoint on the mons too.  
> > We can add the osd snapping back in first, but before this can/should 
> > really be used the mons need to be snapshotted as well.  Probably that's 
> > just adding in a snapshot() method to MonitorStore.h and doing either a 
> > leveldb snap or making a full copy of store.db... I forget what leveldb is 
> > capable of here.
> 
> I suppose it might be a bit too late for Giant, but I finally got 'round
> to implementing this.  I attach the patch that implements it, to be
> applied on top of the updated version of the patch I posted before, also
> attached.
> 
> I have a backport to Firefly too, if there's interest.
> 
> I have tested both methods: btrfs snapshotting of store.db (I've
> manually turned store.db into a btrfs subvolume), and creating a new db
> with all (prefix,key,value) triples.  I'm undecided about inserting
> multiple transaction commits for the latter case; the mon mem use grew
> up a lot as it was, and in a few tests the snapshotting ran twice, but
> in the end a dump of all the data in the database created by btrfs
> snapshotting was identical to that created by explicit copying.  So, the
> former is preferred, since it's so incredibly more efficient.  I also
> considered hardlinking all files in store.db into a separate tree, but I
> didn't like the idea of coding that in C+-, :-) and I figured it might
> not work with other db backends, and maybe even not be guaranteed to
> work with leveldb.  It's probably not worth much more effort.

This looks pretty reasonable!

I think we definitely need to limit the size of the transaction when doing 
the snap.  The attached patch seems to try to do it all in one go, which 
is not going to work for large clusters.  Either re-use an existing 
tunable like the sync chunk size or add a new one?
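
Roughly what I have in mind, inside MonitorDBStore::snapshot() (a
sketch only, reusing mon_sync_max_payload_size as the chunk size, with
very approximate per-key size accounting):

  int r = 0;
  int left = g_conf->mon_sync_max_payload_size;
  KeyValueDB::Transaction dbt = snapdb->get_transaction();
  KeyValueDB::WholeSpaceIterator it = db->get_snapshot_iterator();
  for (it->seek_to_first(); it->valid(); it->next()) {
    pair<string,string> k = it->raw_key();
    int sz = k.first.length() + k.second.length() + it->value().length();
    if (left < sz) {
      // flush this chunk and start a new transaction
      r = snapdb->submit_transaction(dbt);
      if (r)
        break;
      dbt = snapdb->get_transaction();
      left = g_conf->mon_sync_max_payload_size;
    }
    dbt->set(k.first, k.second, it->value());
    left -= sz;
  }
  if (!r)
    r = snapdb->submit_transaction_sync(dbt);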

Thanks!
sage


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] reinstate ceph cluster_snap support
  2014-10-27 21:00         ` Sage Weil
@ 2014-11-03 19:57           ` Alexandre Oliva
  2014-11-13 18:02             ` Sage Weil
  0 siblings, 1 reply; 15+ messages in thread
From: Alexandre Oliva @ 2014-11-03 19:57 UTC (permalink / raw)
  To: Sage Weil; +Cc: sam.just, ceph-devel

[-- Attachment #1: Type: text/plain, Size: 4224 bytes --]

On Oct 27, 2014, Sage Weil <sage@newdream.net> wrote:

> On Tue, 21 Oct 2014, Alexandre Oliva wrote:

>> I have tested both methods: btrfs snapshotting of store.db (I've
>> manually turned store.db into a btrfs subvolume), and creating a new db
>> with all (prefix,key,value) triples.  I'm undecided about inserting
>> multiple transaction commits for the latter case; the mon mem use grew
>> up a lot as it was, and in a few tests the snapshotting ran twice, but
>> in the end a dump of all the data in the database created by btrfs
>> snapshotting was identical to that created by explicit copying.  So, the
>> former is preferred, since it's so incredibly more efficient.  I also
>> considered hardlinking all files in store.db into a separate tree, but I
>> didn't like the idea of coding that in C+-, :-) and I figured it might
>> not work with other db backends, and maybe even not be guaranteed to
>> work with leveldb.  It's probably not worth much more effort.

> This looks pretty reasonable!

> I think we definitely need to limit the size of the transaction when doing 
> the snap.  The attached patch seems to try to do it all in one go, which 
> is not going to work for large clusters.  Either re-use an existing 
> tunable like the sync chunk size or add a new one?

Ok, I changed it so that it reuses the same tunable used for mon sync
transaction sizes.  Tested on top of 0.87 (flawless upgrade, whee!),
with both btrfs snapshots and leveldb copying.  I noticed slight
differences between the databases at each mon with the leveldb copying,
presumably having to do with at least one round of monitor-internal
retrying as the first copy in each mon took too long, but each copy
appeared to be consistent, and their aggregate is presumably usable to
get the cluster rolled back to an earlier consistent state.

I wish there were a better way to avoid the retry, though.  I'm thinking
maybe not performing the copy when the snapshot dir already exists (like
we do for osd snapshots), but I'm not sure this would guarantee a
consistent monitor quorum snapshot.  But then, I'm not sure the current
approach does either.  Plus, if we were to fail because the dir already
existed, we'd need to somehow collect the success/fail status of the
first try so that we don't misreport a failure just because the internal
retry failed, and then, we'd need to distinguish an internal retry that
ought to pass if the first attempt passed from a user-commanded attempt
to create a new snapshot with the same name, which should ideally fail
if any cluster component fails to take the snapshot.  Not that we
currently send any success/failure status back...
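
For concreteness, the skip-if-present variant I'm thinking of would be
little more than an early return at the top of
MonitorDBStore::snapshot(), something like the sketch below; whether it
should report 0 or -EEXIST is exactly the open question above:

  int snapshot(const string& name) {
    string snap = store_path(0, name);

    // don't redo (or clobber) a snapshot we already took
    if (::access(snap.c_str(), F_OK) == 0)
      return 0;   // or -EEXIST, depending on the semantics we settle on

    // ... fall through to the btrfs snapshot / tuple-by-tuple copy ...
  }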

Thoughts?


If we stick with re-copying (rather than dropping the request on the
floor if the dir exists), I might improve the overwriting from “remove
all entries, then add them all back” to “compare entries in source and
destination, and copy only those that changed”.  I have code to do just
that, that I could borrow from a leveldbcmp programlet I wrote, but it
would take some significant rewriting to refit it into the KeyValueDB
interface, so I didn't jump through this hoop.  Let me know if it is
desirable, and I'll try to schedule some time to work on it.
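
Roughly, the merge walk would look like the sketch below (untested
against the KeyValueDB interface; it assumes both iterators hand keys
back in the same (prefix,key) order, and the chunked transaction
submission from the patch below is omitted for brevity):

  // walk the live db snapshot (src) and the previous snapshot copy (dst)
  // in lockstep, touching only the keys that actually differ
  KeyValueDB::WholeSpaceIterator src = db->get_snapshot_iterator();
  KeyValueDB::WholeSpaceIterator dst = snapdb->get_iterator();
  src->seek_to_first();
  dst->seek_to_first();
  while (src->valid() || dst->valid()) {
    if (!dst->valid() ||
        (src->valid() && src->raw_key() < dst->raw_key())) {
      // key only in source: add it
      dbt->set(src->raw_key().first, src->raw_key().second, src->value());
      src->next();
    } else if (!src->valid() || dst->raw_key() < src->raw_key()) {
      // key only in destination: remove it
      dbt->rmkey(dst->raw_key().first, dst->raw_key().second);
      dst->next();
    } else {
      // key in both: copy only if the value changed
      bufferlist sv = src->value(), dv = dst->value();
      if (!sv.contents_equal(dv))
        dbt->set(src->raw_key().first, src->raw_key().second, sv);
      src->next();
      dst->next();
    }
  }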

Another issue I'm somewhat unhappy about is that, while we perform the
copying (which can take a few tens of seconds), the cluster comes to a
halt because the mons won't make progress.  I wish we could do this
copying in background, but I have no clue as to how to go about making
the operation proceed asynchronously while returning a success at the
end, rather than, say, because we successfully *started* the copying.  I
could use a clue or two to do better than that ;-)


Another nit I'm slightly unhappy about is that the command is still
“ceph osd cluster_snap <tag>”, whereas it now snapshots mons too, and it
creates directories named clustersnap_<tag>, without the underscore
separating cluster and snap as in the command.  I'd like to spell them
out the same.  Any preferences?


Here's the patch I've tested, that currently builds upon the other patch
upthread, that reinstates osd snapshotting.



[-- Attachment #2: mon-cluster-snap.patch --]
[-- Type: text/x-diff, Size: 6950 bytes --]

mon: take mon snapshots

From: Alexandre Oliva <oliva@gnu.org>

Extend the machinery that takes osd snapshots to take mon snapshots
too.  Taking a btrfs snapshot of store.db is the preferred method, but
if that fails, the database is copied one tuple at a time.  Unlike osd
snapshots, existing mon snapshots are clobbered by the new snapshots,
mainly because a mon may attempt to take the same snapshot multiple
times, especially if a first attempt takes very long.  It shouldn't be
a problem: the most important property is that the osd snapshots are
taken before the updated osdmap is integrated, whereas the mon
snapshots are taken after that, so that osd snapshots happen-before
mon ones, otherwise in case of rollback osds might have an osdmap that
mons don't know about.

There's no guarantee that all monitors will be completely up to date
at the time the snapshot is taken.  It might be that monitors that
were lagging behind take the snapshot at a later time, and before they
get all the monitor state of the quorum set at the time of the
snapshot.  So, when rolling back the entire cluster (which would have
to be done by hand, as there's no command to do that for mons), it is
advisable to roll back and bring up all monitors, so that it's likely
that more than one monitor has the complete monitor state to propagate
to others.

Signed-off-by: Alexandre Oliva <oliva@gnu.org>
---
 src/mon/Monitor.cc       |   11 ++++
 src/mon/Monitor.h        |    2 +
 src/mon/MonitorDBStore.h |  125 +++++++++++++++++++++++++++++++++++++++++++---
 src/mon/OSDMonitor.cc    |    5 +-
 4 files changed, 135 insertions(+), 8 deletions(-)

diff --git a/src/mon/Monitor.cc b/src/mon/Monitor.cc
index 52df6f0..bbdb5b2 100644
--- a/src/mon/Monitor.cc
+++ b/src/mon/Monitor.cc
@@ -4094,6 +4094,17 @@ void Monitor::scrub_reset()
 
 
 
+int Monitor::store_snapshot(const string& name) {
+  while (paxos->is_writing() || paxos->is_writing_previous()) {
+    lock.Unlock();
+    store->flush();
+    lock.Lock();
+  }
+
+  return store->snapshot(name);
+}
+
+
 /************ TICK ***************/
 
 class C_Mon_Tick : public Context {
diff --git a/src/mon/Monitor.h b/src/mon/Monitor.h
index 91c524a..bc54e27 100644
--- a/src/mon/Monitor.h
+++ b/src/mon/Monitor.h
@@ -836,6 +836,8 @@ public:
 
   void handle_signal(int sig);
 
+  int store_snapshot(const string& name);
+
   int mkfs(bufferlist& osdmapbl);
 
   /**
diff --git a/src/mon/MonitorDBStore.h b/src/mon/MonitorDBStore.h
index a0c82b7..3c170cb 100644
--- a/src/mon/MonitorDBStore.h
+++ b/src/mon/MonitorDBStore.h
@@ -27,6 +27,16 @@
 #include "common/Finisher.h"
 #include "common/errno.h"
 
+#include <unistd.h>
+#include <fcntl.h>
+#include <errno.h>
+#include <stdlib.h>
+#include <sys/ioctl.h>
+
+#ifndef __CYGWIN__
+#include "os/btrfs_ioctl.h"
+#endif
+
 class MonitorDBStore
 {
   boost::scoped_ptr<KeyValueDB> db;
@@ -587,12 +597,10 @@ class MonitorDBStore
     return db->get_estimated_size(extras);
   }
 
-  MonitorDBStore(const string& path)
-    : db(0),
-      do_dump(false),
-      dump_fd(-1),
-      io_work(g_ceph_context, "monstore"),
-      is_open(false) {
+  static string store_path(const string *pathp = NULL,
+			   const string& name = "store.db") {
+    string path = pathp ? *pathp : g_conf->mon_data;
+
     string::const_reverse_iterator rit;
     int pos = 0;
     for (rit = path.rbegin(); rit != path.rend(); ++rit, ++pos) {
@@ -600,9 +608,112 @@ class MonitorDBStore
 	break;
     }
     ostringstream os;
-    os << path.substr(0, path.size() - pos) << "/store.db";
+    os << path.substr(0, path.size() - pos) << "/" << name;
     string full_path = os.str();
 
+    return full_path;
+  }
+
+  int snapshot(const string& name) {
+    int r = -ENOTSUP;
+
+#ifdef BTRFS_IOC_SNAP_CREATE
+    {
+      string snap = store_path(0, name);
+      string store = store_path();
+
+      int mondirfd = ::open(g_conf->mon_data.c_str(), 0);
+      int storefd = ::open(store.c_str(), O_RDONLY);
+
+      if (storefd >= 0 && mondirfd >= 0) {
+	struct btrfs_ioctl_vol_args vol_args;
+	memset(&vol_args, 0, sizeof(vol_args));
+
+	vol_args.fd = storefd;
+	strcpy(vol_args.name, name.c_str());
+	(void) ::ioctl(mondirfd, BTRFS_IOC_SNAP_DESTROY, &vol_args);
+	r = ::ioctl(mondirfd, BTRFS_IOC_SNAP_CREATE, &vol_args);
+      }
+
+      ::close(storefd);
+      ::close(mondirfd);
+    }
+#endif
+
+    if (r) {
+      string snap = store_path (0, name);
+      KeyValueDB* snapdb = KeyValueDB::create(g_ceph_context,
+					      g_conf->mon_keyvaluedb,
+					      snap);
+      if (!snapdb)
+	r = -errno;
+      else {
+	snapdb->init();
+	ostringstream os;
+	r = snapdb->create_and_open(os);
+	if (!r) {
+	  int const txsize = g_conf->mon_sync_max_payload_size;
+	  int left = txsize;
+	  KeyValueDB::Transaction dbt = snapdb->get_transaction();
+	  KeyValueDB::WholeSpaceIterator it = snapdb->get_iterator();
+	  for (it->seek_to_first(); it->valid(); it->next()) {
+	    pair<string,string> k = it->raw_key();
+
+	    int thissize = 10 + k.first.length() + k.second.length();
+	    if (left < thissize) {
+	      r = snapdb->submit_transaction(dbt);
+	      if (r)
+		break;
+	      dbt = snapdb->get_transaction();
+	      if (!dbt)
+		break;
+	      left = txsize;
+	    }
+
+	    dbt->rmkey(k.first, k.second);
+	    left -= thissize;
+	  }
+	  if (!r && dbt) {
+	    it = db->get_snapshot_iterator();
+	    for (it->seek_to_first(); it->valid(); it->next()) {
+	      pair<string,string> k = it->raw_key();
+
+	      int thissize = 10 + k.first.length() + k.second.length() +
+		it->value().length();
+	      if (left < thissize) {
+		r = snapdb->submit_transaction(dbt);
+		if (r)
+		  break;
+		dbt = snapdb->get_transaction();
+		if (!dbt)
+		  break;
+		left = txsize;
+	      }
+
+	      dbt->set(k.first, k.second, it->value());
+	      left -= thissize;
+	    }
+
+	    if (!r && dbt)
+	      r = snapdb->submit_transaction_sync(dbt);
+	  }
+
+	  delete snapdb;
+	}
+      }
+    }
+
+    return r;
+  }
+
+  MonitorDBStore(const string& path)
+    : db(0),
+      do_dump(false),
+      dump_fd(-1),
+      io_work(g_ceph_context, "monstore"),
+      is_open(false) {
+    string full_path = store_path (&path);
+
     KeyValueDB *db_ptr = KeyValueDB::create(g_ceph_context,
 					    g_conf->mon_keyvaluedb,
 					    full_path);
diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
index b237846..068abbe 100644
--- a/src/mon/OSDMonitor.cc
+++ b/src/mon/OSDMonitor.cc
@@ -230,8 +230,11 @@ void OSDMonitor::update_from_paxos(bool *need_bootstrap)
       t->erase("mkfs", "osdmap");
     }
 
-    if (tx_size > g_conf->mon_sync_max_payload_size*2) {
+    if (tx_size > g_conf->mon_sync_max_payload_size*2 ||
+	osdmap.cluster_snapshot_epoch) {
       mon->store->apply_transaction(t);
+      if (osdmap.cluster_snapshot_epoch)
+	mon->store_snapshot("clustersnap_" + osdmap.cluster_snapshot);
       t = MonitorDBStore::TransactionRef();
       tx_size = 0;
     }

[-- Attachment #3: Type: text/plain, Size: 258 bytes --]



-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] reinstate ceph cluster_snap support
  2014-11-03 19:57           ` Alexandre Oliva
@ 2014-11-13 18:02             ` Sage Weil
  0 siblings, 0 replies; 15+ messages in thread
From: Sage Weil @ 2014-11-13 18:02 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: sam.just, ceph-devel

On Mon, 3 Nov 2014, Alexandre Oliva wrote:
> On Oct 27, 2014, Sage Weil <sage@newdream.net> wrote:
> 
> > On Tue, 21 Oct 2014, Alexandre Oliva wrote:
> 
> >> I have tested both methods: btrfs snapshotting of store.db (I've
> >> manually turned store.db into a btrfs subvolume), and creating a new db
> >> with all (prefix,key,value) triples.  I'm undecided about inserting
> >> multiple transaction commits for the latter case; the mon mem use grew
> >> up a lot as it was, and in a few tests the snapshotting ran twice, but
> >> in the end a dump of all the data in the database created by btrfs
> >> snapshotting was identical to that created by explicit copying.  So, the
> >> former is preferred, since it's so incredibly more efficient.  I also
> >> considered hardlinking all files in store.db into a separate tree, but I
> >> didn't like the idea of coding that in C+-, :-) and I figured it might
> >> not work with other db backends, and maybe even not be guaranteed to
> >> work with leveldb.  It's probably not worth much more effort.
> 
> > This looks pretty reasonable!
> 
> > I think we definitely need to limit the size of the transaction when doing 
> > the snap.  The attached patch seems to try to do it all in one go, which 
> > is not going to work for large clusters.  Either re-use an existing 
> > tunable like the sync chunk size or add a new one?
> 
> Ok, I changed it so that it reuses the same tunable used for mon sync
> transaction sizes.  Tested on top of 0.87 (flawless upgrade, whee!),
> with both btrfs snapshots and leveldb copying.  I noticed slight
> differences between the databases at each mon with the leveldb copying,
> presumably having to do with at least one round of monitor-internal
> retrying as the first copy in each mon took too long, but each copy
> appeared to be consistent, and their aggregate is presumably usable to
> get the cluster rolled back to an earlier consistent state.

Yeah, I don't think it matters that the mons are in sync when the snapshot 
happens since they're well-prepared to handle that.  As long as the 
snapshot happens at the same logical time (triggered by a broadcast from 
the leader).

> I wish there were a better way to avoid the retry, though.  I'm thinking
> maybe not performing the copy when the snapshot dir already exists (like
> we do for osd snapshots), but I'm not sure this would guarantee a
> consistent monitor quorum snapshot.  But then, I'm not sure the current
> approach does either.  Plus, if we were to fail because the dir already
> existed, we'd need to somehow collect the success/fail status of the
> first try so that we don't misreport a failure just because the internal
> retry failed, and then, we'd need to distinguish an internal retry that
> ought to pass if the first attempt passed from a user-commanded attempt
> to created a new snapshot with the same name, which should ideally fail
> if any cluster component fails to take the snapshot.  Not that we
> currently send any success/failure status back...

A failure, and not overwriting, would be ideal, I think.  Barring the 
error handling, I'd just send something to the cluster log that it failed 
so that the operator has a clue it didn't work.
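
On the OSD side that could just mean pairing the existing dout with a
cluster log message, along the lines of the sketch below (the exact
clog accessor may differ between branches, so take it as approximate):

      int r = store->snapshot(cluster_snap);
      if (r) {
        dout(0) << "failed to create cluster snapshot: "
                << cpp_strerror(r) << dendl;
        clog->error() << "failed to create cluster snapshot '"
                      << cluster_snap << "': " << cpp_strerror(r);
      }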

> If we stick with re-copying (rather than dropping the request on the
> floor if the dir exists), I might improve the overwriting from “remove
> all entries, then add them all back” to “compare entries in source and
> destination, and copy only those that changed”.  I have code to do just
> that, that I could borrow from a leveldbcmp programlet I wrote, but it
> would take some significant rewriting to refit it into the KeyValueDB
> interface, so I didn't jump through this hoop.  Let me know if it is
> desirable, and I'll try to schedule some time to work on it.

Meh, simpler to avoid dup names in the first place, I think?

> Another issue I'm somewhat unhappy about is that, while we perform the
> copying (which can take a few tens of seconds), the cluster comes to a
> halt because the mons won't make progress.  I wish we could do this
> copying in background, but I have no clue as to how to go about making
> the operation proceed asynchronously while returning a success at the
> end, rather than, say, because we successfully *started* the copying.  I
> could use a clue or two to do better than that ;-)

Technically leveldb can do it but other backends we plug in later may not.  
I'm fine with this stalling for now since this is really about disaster 
recovery and isn't something we're ready to have everyone use.

> Another nit I'm slightly unhappy about is that the command is still
> ?ceph osd cluster_snap <tag>?, whereas it now snapshots mons too, and it
> creates directories named clustersnap_<tag>, without the underscore
> separating cluster and snap as in the command.  I'd like to spell them
> out the same.  Any preferences?

I'm all for picking new names.  No need to preserve any sort of 
compatibility here.  Maybe just 'ceph cluster_snap <name>'?

Anyway, if you're going to be using this and will be giving it some 
testing, happy to pull these into the tree.  Not ready to advertise the 
feature until it has more robust testing in the test suite, though.  We 
should make the usage/help text for the cluster_snap command indicate it 
is experimental.
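
That could be as simple as amending the help string in the new
MonCommands.h entry, something like:

COMMAND("osd cluster_snap " \
        "name=snap,type=CephString", \
        "take cluster snapshot (EXPERIMENTAL, for disaster recovery only)", \
        "osd", "r", "cli")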

sage



> Here's the patch I've tested, that currently builds upon the other patch
> upthread, that reinstates osd snapshotting.
> 
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2014-11-13 18:02 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-22  9:10 [PATCH] reinstate ceph cluster_snap support Alexandre Oliva
2013-08-24  0:17 ` Sage Weil
2013-08-24 14:56   ` Alexandre Oliva
2013-08-27 22:21     ` Sage Weil
2013-08-28  0:54       ` Yan, Zheng
2013-08-28  4:34         ` Sage Weil
2013-12-17 12:14       ` Alexandre Oliva
2013-12-17 13:50         ` Alexandre Oliva
2013-12-17 14:22           ` Alexandre Oliva
2013-12-18 19:35             ` Gregory Farnum
2013-12-19  8:22               ` Alexandre Oliva
2014-10-21  2:49       ` Alexandre Oliva
2014-10-27 21:00         ` Sage Weil
2014-11-03 19:57           ` Alexandre Oliva
2014-11-13 18:02             ` Sage Weil
