* upgrade procedure to Luminous
@ 2017-07-14 14:01 Joao Eduardo Luis
       [not found] ` <fc6ff947-e806-84c8-4eae-099a465482c6-l3A5Bk7waGM@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Joao Eduardo Luis @ 2017-07-14 14:01 UTC (permalink / raw)
  To: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel-u79uwXL29TY76Z2rM5mHXA
  Cc: Sage Weil

Dear all,


The current upgrade procedure to jewel, as stated by the RC's release 
notes, can be boiled down to

- upgrade all monitors first
- upgrade OSDs only after we have a **full** quorum of luminous monitors, 
comprised of all the monitors in the monmap (i.e., once we have the 
'luminous' feature enabled in the monmap); see the example commands below.
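
For illustration, that order maps to roughly the following commands (a 
sketch only, assuming systemd-managed daemons, the Luminous CLI, and mon 
ids matching the short hostname; adapt to your own tooling):

   # on each monitor node, one after the other
   systemctl restart ceph-mon@$(hostname -s)

   # before touching any OSD, confirm every monitor has rejoined and the
   # monmap carries the 'luminous' persistent feature
   ceph quorum_status
   ceph mon feature ls

   # only then start restarting OSDs, host by host
   systemctl restart ceph-osd.target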

While this is a reasonable idea in principle -- it cuts down the possible 
upgrade testing combinations considerably, and it is a simple enough 
procedure from Ceph's point of view -- it does not seem to be a widespread 
upgrade procedure in practice.

As far as I can tell, it's not uncommon for users to take this 
maintenance window to perform system-wide upgrades, including the kernel 
and glibc for instance, and to finish the upgrade with a reboot.

The problem with our current upgrade procedure is that once the first 
server reboots, the osds in that server will be unable to boot, as the 
monitor quorum is not yet 'luminous'.

The only way to minimize potential downtime is to upgrade and restart 
all the nodes at the same time, which can be daunting and basically 
defeats the purpose of a rolling upgrade. And in this scenario there is 
an expectation of downtime, something Ceph is built to prevent.

Additionally, requiring the `luminous` feature to be enabled in the 
quorum becomes even less realistic in the face of possible failures. God 
forbid that in the middle of upgrading, the last remaining monitor 
server dies a horrible death - e.g., power failure, network outage. We'll 
still be left with a 'not-luminous' quorum, and a bunch of OSDs waiting 
for this flag to be flipped. And now it's a race to either get that 
monitor back up, or remove it from the monmap.
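
(For the record, the 'remove it from the monmap' escape hatch would look 
something like this -- a sketch, with a made-up mon id, and assuming the 
surviving monitors can still form a quorum once the dead one is dropped:

   ceph mon remove c      # drop the dead monitor, 'c' here, from the monmap
   ceph quorum_status     # confirm the remaining mons form a quorum again

If quorum is already lost, it gets considerably messier than that.)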

Even if one were to decide to upgrade only the system packages, reboot, 
and only then upgrade the Ceph packages, there is the unfortunate 
possibility that library interdependencies would require Ceph's binaries 
to be updated anyway, so this may be a show-stopper as well.

Alternatively, if one simply upgrades the system without rebooting, and 
then proceeds with the Ceph upgrade procedure, one is still in a fragile 
position: if, for some reason, one of the nodes reboots, we're in the 
same precarious situation as before.

Personally, I can see a few ways out of this, at different points on 
the reasonability spectrum:

1. add temporary monitor nodes to the cluster, be they on VMs or bare 
hardware, already running Luminous, and then remove the same number of 
monitors from the cluster. This leaves us with a single monitor node left 
to upgrade (a rough sketch of the swap follows after option 3). The 
drawback is that folks may not have spare nodes to run the monitors on, 
or may have to run them on VMs -- which can affect monitor performance 
during the upgrade window, and adds complexity in terms of firewall and 
routing rules.

2. migrate/upgrade all nodes on which monitors are located first, and 
only restart them once all nodes have been upgraded. If anything goes 
wrong, one can hurry through this step or fall back to 3.

3. Reducing the monitor quorum to 1. This pains me to even think about, 
and it bothers me to bits that I'm finding myself even considering this 
as a reasonable possibility. It shouldn't, because it isn't. But it's a 
lot more realistic than expecting OSD downtime during an upgrade procedure.
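
As an illustration of option 1, the swap would be along these lines (a 
sketch only, with made-up mon ids and documentation addresses; it also 
glosses over bootstrapping the new ceph-mon daemons -- keyring, 
'ceph-mon --mkfs', starting the service -- on the temporary hosts):

   # add temporary monitors that already run Luminous
   ceph mon add tmp-a 192.0.2.11
   ceph mon add tmp-b 192.0.2.12

   # once they have joined the quorum, remove the same number of old monitors
   ceph mon remove a
   ceph mon remove b

   # upgrade and restart the single remaining original monitor, then
   # reverse the swap if the temporary nodes are not meant to stay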

On top of all this, I found during my tests that any OSD running 
luminous before the quorum itself is luminous will need to be restarted 
before it can properly boot into the cluster. I'm guessing this is a bug 
rather than a feature, though.

Any thoughts on how to mitigate this, or on whether I got this all wrong 
and am missing a crucial detail that blows this wall of text away, 
please let me know.


   -Joao


* Re: upgrade procedure to Luminous
       [not found] ` <fc6ff947-e806-84c8-4eae-099a465482c6-l3A5Bk7waGM@public.gmane.org>
@ 2017-07-14 14:12   ` Sage Weil
       [not found]     ` <alpine.DEB.2.11.1707141407300.27271-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
                       ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Sage Weil @ 2017-07-14 14:12 UTC (permalink / raw)
  To: Joao Eduardo Luis
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel-u79uwXL29TY76Z2rM5mHXA

On Fri, 14 Jul 2017, Joao Eduardo Luis wrote:
> Dear all,
> 
> 
> The current upgrade procedure to jewel, as stated by the RC's release notes,

You mean (jewel or kraken) -> luminous, I assume...

> can be boiled down to
> 
> - upgrade all monitors first
> - upgrade osds only after we have a **full** quorum, comprised of all the
> monitors in the monmap, of luminous monitors (i.e., once we have the
> 'luminous' feature enabled in the monmap).
> 
> While this is a reasonable idea in principle, reducing a lot of the possible
> upgrade testing combinations, and a simple enough procedure from Ceph's
> point-of-view, it seems it's not a widespread upgrade procedure.
> 
> As far as I can tell, it's not uncommon for users to take this maintenance
> window to perform system-wide upgrades, including kernel and glibc for
> instance, and finishing the upgrade with a reboot.
> 
> The problem with our current upgrade procedure is that once the first server
> reboots, the osds in that server will be unable to boot, as the monitor quorum
> is not yet 'luminous'.
> 
> The only way to minimize potential downtime is to upgrade and restart all the
> nodes at the same time, which can be daunting and it basically defeats the
> purpose of a rolling upgrade. And in this scenario, there is an expectation of
> downtime, something Ceph is built to prevent.
> 
> Additionally, requiring the `luminous` feature to be enabled in the quorum
> becomes even less realistic in the face of possible failures. God forbid that
> in the middle of upgrading, the last remaining monitor server dies a horrible
> death - e.g., power, network. We'll be left with still a 'not-luminous'
> quorum, and a bunch of OSDs waiting for this flag to be flipped. And now it's
> a race to either get that monitor up, or remove it from the monmap.
> 
> Even if one were to make the decision of only upgrading system packages,
> reboot, and then upgrade Ceph packages, there is the unfortunate possibility
> that library interdependencies would require Ceph's binaries to be updated, so
> this may be a show-stopper as well.
> 
> Alternatively, if one is to simply upgrade the system and not reboot, and then
> proceed to perform the upgrade procedure, one would still be in a fragile
> position: if, for some reason, one of the nodes reboots, we're in the same
> precarious situation as before.
> 
> Personally, I can see a few ways out of this, at different points on the
> reasonability spectrum:
> 
> 1. add temporary monitor nodes to the cluster, may they be on VMs or bare
> hardware, already running Luminous, and then remove the same amount of
> monitors from the cluster. This leaves us to upgrade a single monitor node.
> This has the drawback of folks not having spare nodes to run the monitors on,
> or running monitors on VMs -- which may affect their performance during the
> upgrade window, and increase complexity in terms of firewall and routing
> rules.
> 
> 2. migrate/upgrade all nodes on which Monitors are located first, then only
> restart them after we've gotten all nodes upgraded. If anything goes wrong,
> one can hurry through this step or fall-back to 3.
> 
> 3. Reducing the monitor quorum to 1. This pains me to even think about, and it
> bothers me to bits that I'm finding myself even considering this as a
> reasonable possibility. It shouldn't, because it isn't. But it's a lot more
> realistic than expecting OSD downtime during an upgrade procedure.
> 
> On top of this all, I found during my tests that any OSD, running luminous
> prior to the luminous quorum, will need to be restarted before it can properly
> boot into the cluster. I'm guessing this is a bug rather than a feature
> though.

That sounds like a bug.. probably didn't subscribe to map updates from 
_start_boot() or something.  Can you open an immediate ticket?

> Any thoughts on how to mitigate this, or on whether I got this all wrong and
> am missing a crucial detail that blows this wall of text away, please let me
> know.

I don't know; the requirement that mons be upgraded before OSDs doesn't 
seem that unreasonable to me.  That might be slightly more painful in a 
hyperconverged scenario (osds and mons on the same host), but it should 
just require some admin TLC (restart mon daemons instead of 
rebooting).

Also, for large clusters, users often have mons on dedicated hosts.  And 
for small clusters even the sloppy "just reboot" approach will have a 
smaller impact.

Is there something in some distros that *requires* a reboot in order to 
upgrade packages?

Also, this only seems like it will affect users that are getting their 
ceph packages from the distro itself and not from a ceph.com channel or a 
special subscription/product channel (this is how the RHEL stuff works, I 
think).

sage


* Re: upgrade procedure to Luminous
       [not found]     ` <alpine.DEB.2.11.1707141407300.27271-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
@ 2017-07-14 14:12       ` Joao Eduardo Luis
  0 siblings, 0 replies; 10+ messages in thread
From: Joao Eduardo Luis @ 2017-07-14 14:12 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel-u79uwXL29TY76Z2rM5mHXA

On 07/14/2017 03:12 PM, Sage Weil wrote:
> On Fri, 14 Jul 2017, Joao Eduardo Luis wrote:
>> Dear all,
>>
>>
>> The current upgrade procedure to jewel, as stated by the RC's release notes,
> 
> You mean (jewel or kraken) -> luminous, I assume...

Yeah. *sigh*

   -Joao


* Re: upgrade procedure to Luminous
  2017-07-14 14:12   ` Sage Weil
       [not found]     ` <alpine.DEB.2.11.1707141407300.27271-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
@ 2017-07-14 14:27     ` Lars Marowsky-Bree
  2017-07-14 14:34       ` Mike Lowe
  2017-07-14 15:18       ` Sage Weil
  2017-07-14 14:43     ` Joao Eduardo Luis
  2 siblings, 2 replies; 10+ messages in thread
From: Lars Marowsky-Bree @ 2017-07-14 14:27 UTC (permalink / raw)
  To: Sage Weil; +Cc: Joao Eduardo Luis, ceph-users, ceph-devel, Josh Durgin

On 2017-07-14T14:12:08, Sage Weil <sage@newdream.net> wrote:

> > Any thoughts on how to mitigate this, or on whether I got this all wrong and
> > am missing a crucial detail that blows this wall of text away, please let me
> > know.
> I don't know; the requirement that mons be upgraded before OSDs doesn't 
> seem that unreasonable to me.  That might be slightly more painful in a 
> hyperconverged scenario (osds and mons on the same host), but it should 
> just require some admin TLC (restart mon daemons instead of 
> rebooting).

I think it's quite unreasonable, to be honest. Collocating MONs with 
OSDs is very typical for smaller cluster environments.

> Is there something in some distros that *requires* a reboot in order to 
> upgrade packages?

Not necessarily.

*But* once we've upgraded the packages, a failure or reboot might
trigger this.

And customers don't always upgrade all nodes at once in a short period
(the benefit of a supposed rolling upgrade cycle), increasing the risk.

I wish we were already fully containerized so the MONs would be truly
independent of everything else going on in the cluster, but ...

> Also, this only seems like it will affect users that are getting their 
> ceph packages from the distro itself and not from a ceph.com channel or a 
> special subscription/product channel (this is how the RHEL stuff works, I 
> think).

Even there, upgrading only the MON daemons and not the OSDs is tricky?




-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



* Re: upgrade procedure to Luminous
  2017-07-14 14:27     ` Lars Marowsky-Bree
@ 2017-07-14 14:34       ` Mike Lowe
  2017-07-14 14:39         ` [ceph-users] " Lars Marowsky-Bree
  2017-07-14 15:18       ` Sage Weil
  1 sibling, 1 reply; 10+ messages in thread
From: Mike Lowe @ 2017-07-14 14:34 UTC (permalink / raw)
  To: Lars Marowsky-Bree
  Cc: Sage Weil, Joao Eduardo Luis, ceph-users, ceph-devel, Josh Durgin

Having run ceph clusters in production for the past six years and upgrading from every stable release starting with argonaut to the next, I can honestly say being careful about order of operations has not been a problem.

> On Jul 14, 2017, at 10:27 AM, Lars Marowsky-Bree <lmb@suse.com> wrote:
> 
> On 2017-07-14T14:12:08, Sage Weil <sage@newdream.net> wrote:
> 
>>> Any thoughts on how to mitigate this, or on whether I got this all wrong and
>>> am missing a crucial detail that blows this wall of text away, please let me
>>> know.
>> I don't know; the requirement that mons be upgraded before OSDs doesn't 
>> seem that unreasonable to me.  That might be slightly more painful in a 
>> hyperconverged scenario (osds and mons on the same host), but it should 
>> just require some admin TLC (restart mon daemons instead of 
>> rebooting).
> 
> I think it's quite unreasonable, to be quite honest. Collocated MONs
> with OSDs is very typical for smaller cluster environments.
> 
>> Is there something in some distros that *requires* a reboot in order to 
>> upgrade packages?
> 
> Not necessarily.
> 
> *But* once we've upgraded the packages, a failure or reboot might
> trigger this.
> 
> And customers don't always upgrade all nodes at once in a short period
> (the benefit of a supposed rolling upgrade cycle), increasing the risk.
> 
> I wish we'd already be fully containerized so indeed the MONs were truly
> independent of everything else going on on the cluster, but ...
> 
>> Also, this only seems like it will affect users that are getting their 
>> ceph packages from the distro itself and not from a ceph.com channel or a 
>> special subscription/product channel (this is how the RHEL stuff works, I 
>> think).
> 
> Even there, upgrading only the MON daemons and not the OSDs is tricky?
> 
> 
> 
> 
> -- 
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
> 



* Re: [ceph-users] upgrade procedure to Luminous
  2017-07-14 14:34       ` Mike Lowe
@ 2017-07-14 14:39         ` Lars Marowsky-Bree
       [not found]           ` <20170714143953.5fyirnegj4466ljr-IBi9RG/b67k@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Lars Marowsky-Bree @ 2017-07-14 14:39 UTC (permalink / raw)
  To: Mike Lowe; +Cc: ceph-devel, ceph-users

On 2017-07-14T10:34:35, Mike Lowe <j.michael.lowe@gmail.com> wrote:

> Having run ceph clusters in production for the past six years and upgrading from every stable release starting with argonaut to the next, I can honestly say being careful about order of operations has not been a problem.

This requirement did not exist as a mandatory one for previous releases.

The problem is not the sunshine-all-is-good path. It's about what to do
in case of failures during the upgrade process.



-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



* Re: upgrade procedure to Luminous
  2017-07-14 14:12   ` Sage Weil
       [not found]     ` <alpine.DEB.2.11.1707141407300.27271-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
  2017-07-14 14:27     ` Lars Marowsky-Bree
@ 2017-07-14 14:43     ` Joao Eduardo Luis
  2 siblings, 0 replies; 10+ messages in thread
From: Joao Eduardo Luis @ 2017-07-14 14:43 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-users, ceph-devel, Josh Durgin

On 07/14/2017 03:12 PM, Sage Weil wrote:
> On Fri, 14 Jul 2017, Joao Eduardo Luis wrote:
>> On top of this all, I found during my tests that any OSD, running luminous
>> prior to the luminous quorum, will need to be restarted before it can properly
>> boot into the cluster. I'm guessing this is a bug rather than a feature
>> though.
> 
> That sounds like a bug.. probably didn't subscribe to map updates from
> _start_boot() or something.  Can you open an immediate ticket?

http://tracker.ceph.com/issues/20631

   -Joao


* Re: upgrade procedure to Luminous
       [not found]           ` <20170714143953.5fyirnegj4466ljr-IBi9RG/b67k@public.gmane.org>
@ 2017-07-14 15:14             ` Mike Lowe
  0 siblings, 0 replies; 10+ messages in thread
From: Mike Lowe @ 2017-07-14 15:14 UTC (permalink / raw)
  To: Lars Marowsky-Bree
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users-idqoXFIVOFJgJs9I8MT0rw

It was required for Bobtail to Cuttlefish and Cuttlefish to Dumpling.  

Exactly how many mons do you have such that you are concerned about failure?  If you have, let's say, 3 mons, you update all the bits, and then it shouldn't take you more than 2 minutes to restart the mons one by one.  You can take your time updating/restarting the OSDs.

I generally consider it bad practice to save your system updates for a major ceph upgrade.  How exactly can you parse the difference between a ceph bug and a kernel regression if you do them all at once?  You have a resilient system; why wouldn't you take advantage of that property to change one thing at a time?

So what we are really talking about here is a hardware failure in the short period it takes to restart the mon services, because you shouldn't be rebooting.  If the ceph mon doesn't come back from a restart, then you have a bug which in all likelihood will show up on the first mon, and at that point you have the option to roll back or to run with degraded mons until Sage et al. put out a fix.  My only significant downtime was due to a bug in a new release having to do with pg splitting; 8 hours later I had my fix.

> On Jul 14, 2017, at 10:39 AM, Lars Marowsky-Bree <lmb@suse.com> wrote:
> 
> On 2017-07-14T10:34:35, Mike Lowe <j.michael.lowe@gmail.com> wrote:
> 
>> Having run ceph clusters in production for the past six years and upgrading from every stable release starting with argonaut to the next, I can honestly say being careful about order of operations has not been a problem.
> 
> This requirement did not exist as a mandatory one for previous releases.
> 
> The problem is not the sunshine-all-is-good path. It's about what to do
> in case of failures during the upgrade process.
> 
> 
> 
> -- 
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
> 



* Re: upgrade procedure to Luminous
  2017-07-14 14:27     ` Lars Marowsky-Bree
  2017-07-14 14:34       ` Mike Lowe
@ 2017-07-14 15:18       ` Sage Weil
  2017-07-17 12:30         ` Lars Marowsky-Bree
  1 sibling, 1 reply; 10+ messages in thread
From: Sage Weil @ 2017-07-14 15:18 UTC (permalink / raw)
  To: Lars Marowsky-Bree; +Cc: Joao Eduardo Luis, ceph-users, ceph-devel, Josh Durgin

On Fri, 14 Jul 2017, Lars Marowsky-Bree wrote:
> On 2017-07-14T14:12:08, Sage Weil <sage@newdream.net> wrote:
> 
> > > Any thoughts on how to mitigate this, or on whether I got this all wrong and
> > > am missing a crucial detail that blows this wall of text away, please let me
> > > know.
> > I don't know; the requirement that mons be upgraded before OSDs doesn't 
> > seem that unreasonable to me.  That might be slightly more painful in a 
> > hyperconverged scenario (osds and mons on the same host), but it should 
> > just require some admin TLC (restart mon daemons instead of 
> > rebooting).
> 
> I think it's quite unreasonable, to be quite honest. Collocated MONs
> with OSDs is very typical for smaller cluster environments.

Yes, but how many of those clusters can only upgrade by updating the 
packages and rebooting?  Our documented procedures have always recommended 
upgrading the packages, then restarting either mons or osds first and to 
my recollection nobody has complained.  TBH my first encounter with the 
"reboot on upgrade" procedure in the Linux world was with Fedora (which I 
just recently switched to for my desktop)--and FWIW it felt very 
anachronistic.

But regardless, the real issue is that this is a trade-off between the 
testing and software complexity burden on one side and user flexibility 
on the other.  Enforcing an upgrade 
order means we have less to test and have greater confidence the user 
won't see something we haven't.  It also means, in this case, that we can 
rip out a ton of legacy code in luminous without having to keep 
compatibility workarounds in place for another whole LTS cycle (a year!).  
That reduces code complexity, improves quality, and improves velocity.  
The downside is that the upgrade procedure has to be done in a particular 
order.

Honestly, though, I think it is a good idea for operators to be 
careful with their upgrades anyway.  They should upgrade just the mons, 
let the cluster stabilize, and make sure things are okay (e.g., no new 
health warnings saying they have to 'ceph osd set sortbitwise') before 
continuing.
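
Concretely, that mid-upgrade sanity check can be as simple as the 
following (a sketch, not an official checklist; the sortbitwise warning 
is just one example of what might show up):

   ceph -s                      # overall status once all mons are restarted
   ceph health detail           # anything new to address before touching OSDs?
   ceph osd set sortbitwise     # only if health actually complains about it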

Also, although I think it's a good idea to do the mon upgrade relatively 
quickly (one after the other until they are upgraded), the OSD upgrade can 
be stretched out longer.  (We do pretty thorough thrashing tests with 
mixed-version OSD clusters, but go through the mon upgrades pretty 
quickly.)
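
The stretched-out OSD half then follows the usual host-at-a-time pattern, 
something like this (again just a sketch, assuming systemd units):

   ceph osd set noout                  # avoid rebalancing during the restarts
   # then, per OSD host, at whatever pace suits you:
   systemctl restart ceph-osd.target
   ceph -s                             # wait for PGs to go active+clean again
   # once every host is done:
   ceph osd unset noout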
 
> > Is there something in some distros that *requires* a reboot in order to 
> > upgrade packages?
> 
> Not necessarily.
> 
> *But* once we've upgraded the packages, a failure or reboot might
> trigger this.

True, but this is rare, and even so the worst that can happen in this 
case is the OSDs don't come up until the other mons are upgraded.  If the 
admin plans to upgrade the mons in succession, without lingering on 
mixed-version mons, the worst-case downtime window is very small--and only 
kicks in if *more than one* of the mon nodes fails (taking out OSDs in 
more than one failure domain).

> And customers don't always upgrade all nodes at once in a short period
> (the benefit of a supposed rolling upgrade cycle), increasing the risk.

I think they should plan to do this for the mons.  We can make a note 
stating as much in the upgrade procedure docs?
 
> I wish we'd already be fully containerized so indeed the MONs were truly
> independent of everything else going on on the cluster, but ...

Indeed!  Next time around...

> > Also, this only seems like it will affect users that are getting their 
> > ceph packages from the distro itself and not from a ceph.com channel or a 
> > special subscription/product channel (this is how the RHEL stuff works, I 
> > think).
> 
> Even there, upgrading only the MON daemons and not the OSDs is tricky?

I mean you would upgrade all of the packages, but only restart the mon 
daemons.  The deb packages have skipped the auto-restart in the postinst 
(or whatever) stage for years.  I'm pretty sure the rpms do the same?
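
On a Debian-based host carrying both a mon and OSDs, that would look 
roughly like this (a sketch; the package names and the no-auto-restart 
behaviour of the packaging are as described above -- verify on your 
distro of choice):

   apt-get update && apt-get install ceph ceph-mon ceph-osd
   # the running daemons keep the old binaries until restarted
   systemctl restart ceph-mon@$(hostname -s)   # restart only the monitor
   # leave the ceph-osd@* units alone until every mon is on luminous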

Anyway, does that make sense?  Yes, it means that you can't just reboot in 
succession if your mons are mixed with OSDs.  But this time adding that 
restriction let us do the SnapSet and snapdir conversion in a single 
release, which is a *huge* win and will let us rip out a bunch of ugly OSD 
code.  We might not have a need for it next time around (and can try to 
avoid it), but I'm guessing something will come up and it will again be a 
hard call to make balancing between sloppy/easy upgrades vs simpler 
code...

sage




* Re: upgrade procedure to Luminous
  2017-07-14 15:18       ` Sage Weil
@ 2017-07-17 12:30         ` Lars Marowsky-Bree
  0 siblings, 0 replies; 10+ messages in thread
From: Lars Marowsky-Bree @ 2017-07-17 12:30 UTC (permalink / raw)
  To: Sage Weil; +Cc: Joao Eduardo Luis, ceph-users, ceph-devel, Josh Durgin

On 2017-07-14T15:18:54, Sage Weil <sage@newdream.net> wrote:

> Yes, but how many of those clusters can only upgrade by updating the 
> packages and rebooting?  Our documented procedures have always recommended 
> upgrading the packages, then restarting either mons or osds first and to 
> my recollection nobody has complained.  TBH my first encounter with the 
> "reboot on upgrade" procedure in the Linux world was with Fedora (which I 
> just recently switched to for my desktop)--and FWIW it felt very 
> anachronistic.

Admittedly, it is. This is my main reason for hoping for containers.

My main issue is not that they must be rebooted. In most cases, ceph-mon
can be restarted. My fear is that they *might* be rebooted by a failure
during that time, and it'd have been my expectation that normal
operation does not expose Ceph to such degraded scenarios. Ceph is,
after all, supposedly at least tolerant of one fault at a time.

And I'd obviously have considered upgrades a normal operation, not a
critical phase.

If one considers upgrades an operation that degrades redundancy, sure,
the current behaviour is in line.

> won't see something we haven't.  It also means, in this case, that we can 
> rip out a ton of legacy code in luminous without having to keep 
> compatibility workarounds in place for another whole LTS cycle (a year!).  

Seriously, welcome to the world of enterprise software and customer
expectations ;-) 1 year! I wish! ;-)

> True, but this is rare, and even so the worst that can happen in this 
> case is the OSDs don't come up until the other mons are upgraded.  If the 
> admin plans to upgrade the mons in succession without lingering with 
> mixed-version mons, the worst-case downtime window is very small--and only 
> kicks in if *more than one* of the mon nodes fails (taking out OSDs in 
> more than one failure domain).

This is an interesting design philosophy in a fault tolerant distributed
system.

> > And customers don't always upgrade all nodes at once in a short period
> > (the benefit of a supposed rolling upgrade cycle), increasing the risk.
> I think they should plan to do this for the mons.  We can make a note 
> stating as much in the upgrade procedure docs?

Yes, we'll have to orchestrate this accordingly.

Upgrade all MONs; restart all MONs (while warning users that this is a
critical time period); start rebooting for the kernel/glibc updates.

> Anyway, does that make sense?  Yes, it means that you can't just reboot in 
> succession if your mons are mixed with OSDs.  But this time adding that 
> restriction let us do the SnapSet and snapdir conversion in a single 
> release, which is a *huge* win and will let us rip out a bunch of ugly OSD 
> code.  We might not have a need for it next time around (and can try to 
> avoid it), but I'm guessing something will come up and it will again be a 
> hard call to make balancing between sloppy/easy upgrades vs simpler 
> code...

The next major transition probably will be from non-containerized L to
fully-containerized N(autilus?). That'll be a fascinating can of worms
anyway. But that would *really* benefit from nodes being easy to 
redeploy, rather than just restarting daemon processes.

Thanks, at least now we know this is intentional. That was helpful, at
least!


-- 
Architect SDS
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde


