From: Sage Weil <sage@newdream.net>
To: Gregory Farnum <gfarnum@redhat.com>
Cc: Wido den Hollander <wido@42on.com>,
	ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: Supplying ID to ceph-disk when creating OSD
Date: Wed, 15 Feb 2017 17:34:49 +0000 (UTC)
Message-ID: <alpine.DEB.2.11.1702151723350.7782@piezo.novalocal>
In-Reply-To: <CAJ4mKGZPfWg+WCfYdaobuRj0BgqSatQ11UNTFhvxqNiG2Fh-vw@mail.gmail.com>

On Wed, 15 Feb 2017, Gregory Farnum wrote:
> On Wed, Feb 15, 2017 at 8:59 AM, Wido den Hollander <wido@42on.com> wrote:
> > Hi,
> >
> > Currently we can supply an OSD UUID to 'ceph-disk prepare', but we can't provide an OSD ID.
> >
> > With BlueStore coming I think the use-case for this is becoming very valid:
> >
> > 1. Stop OSD
> > 2. Zap disk
> > 3. Re-create OSD with same ID and UUID (with BlueStore)
> > 4. Start OSD
> >
> > This allows for an in-place update of the OSD without modifying the CRUSHMap. From the cluster's point of view the OSD goes down and comes back up empty.
> >
> > There were some drawbacks around this and some dangers, so before I start working on a PR for this, are there any gotchas which might be a problem?
> 
> Yes. Unfortunately they are subtle and I don't remember them. :p
> 
> I'd recommend going back and finding the historical discussions about
> this to be sure. I *think* there were two main issues which prompted
> us to remove that:
> 1) people creating very large IDs, needlessly exploding OSDMap size
> because it's all array-based,

Working in terms of uuids should avoid this (i.e., users can't
force a large osd id without significant effort).

> 2) issues reusing the ID of lost OSDs versus PGs recognizing that the
> OSD didn't have the data they wanted.
> 
> 1 is still a bit of a problem, though if anybody has a good UX way of
> handling it that's the real issue. 2 has hopefully been fixed over the
> course of various refactors and improvements, but it's not something
> I'd count on without checking very carefully.

Oh yeah, this is the one that worries me.  I think the scenario we want 
to help users avoid is that osd N exists (but might be down at the moment) 
and a new, empty version of that same OSD is created and started.  
Peering will reasonably conclude that PG instances don't exist and may 
end up concluding that writes didn't happen.

I think we want some sort of safety check so that the user has to say "this
osd is dead" before they're allowed to create a new one in its image.  I 
think the simplest thing is to use the existing 'ceph osd lost ...' 
command for this.  I.e., the mon won't let a blank OSD start with a given 
uuid/id unless it is either a new osd rank or the rank is marked 
lost.
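
A minimal sketch of that check, to make the rule concrete. This is not
mon code: OsdEntry, the dict standing in for the osdmap, and
may_start_blank_osd are all hypothetical names, and it assumes the only
state that matters is whether the rank exists and whether it has been
marked lost.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class OsdEntry:
    """Hypothetical stand-in for what the map knows about one osd rank."""
    uuid: str
    lost: bool = False  # assumed to be set by something like 'ceph osd lost'


def may_start_blank_osd(osdmap: Dict[int, OsdEntry], osd_id: int) -> bool:
    """Return True if a blank OSD may start with this id.

    Allowed when the rank is new (not in the map) or when the existing
    rank has explicitly been marked lost; refused otherwise.
    """
    entry = osdmap.get(osd_id)
    if entry is None:
        return True  # new osd rank
    return entry.lost  # existing rank: only if the user declared it dead


osdmap = {3: OsdEntry(uuid="old-uuid")}
print(may_start_blank_osd(osdmap, 7))  # new rank
print(may_start_blank_osd(osdmap, 3))  # exists, not marked lost
osdmap[3].lost = True
print(may_start_blank_osd(osdmap, 3))  # exists, marked lost
```

The point of the explicit lost flag is that peering never has to guess
whether an empty osd N is the same instance that once held the data.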

My main lingering doubt here is whether it's a bad idea to reuse a uuid; 
it seems like the whole point is that uuids are unique.  Perhaps instead 
the ceph-disk prepare --replace-oid NN command should replace the old uuid 
in the map with the new one as part of this process.  Probably something 
like 'ceph osd replace newuuid olduuid' to make the whole thing 
idempotent...
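
A rough sketch of why naming both uuids would make the operation
idempotent. Nothing here is an existing command or API: replace_uuid and
the id-to-uuid dict are hypothetical, and the lost check discussed above
is deliberately left out to keep the example short.

```python
from typing import Dict


def replace_uuid(uuid_by_id: Dict[int, str], new_uuid: str, old_uuid: str) -> int:
    """Swap old_uuid for new_uuid in the map and return the osd id.

    Running it a second time with the same arguments is a no-op that
    returns the same id, because the new uuid is looked up first.
    """
    # Already applied: the new uuid is in the map, so just report its id.
    for osd_id, uuid in uuid_by_id.items():
        if uuid == new_uuid:
            return osd_id
    # First application: find the old uuid and replace it in place.
    for osd_id, uuid in uuid_by_id.items():
        if uuid == old_uuid:
            uuid_by_id[osd_id] = new_uuid
            return osd_id
    raise KeyError(f"neither {new_uuid!r} nor {old_uuid!r} is in the map")


uuids = {5: "old-uuid"}
first = replace_uuid(uuids, "new-uuid", "old-uuid")
second = replace_uuid(uuids, "new-uuid", "old-uuid")  # retry changes nothing
print(first, second, uuids)
```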

sage



> -Greg
> 
> >
> > The idea is that users have a very simple way to re-format an OSD in-place while keeping the same CRUSH location, ID and UUID.
> >
> > Wido
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

