Re: Nautilus 14.2.19 mon 100% CPU

ceph-devel.vger.kernel.org archive mirror

* Re: Nautilus 14.2.19 mon 100% CPU
       [not found] <CAANLjFpjRLtV+GR4WV15iXXCvkig6tJAr_G=_bZpZ=jKnYfvTQ@mail.gmail.com>
@ 2021-04-08 17:24 ` Robert LeBlanc
  2021-04-08 19:11 ` [ceph-users] " Stefan Kooman
  1 sibling, 0 replies; 8+ messages in thread
From: Robert LeBlanc @ 2021-04-08 17:24 UTC (permalink / raw)
  To: ceph-devel, ceph-users

On Thu, Apr 8, 2021 at 10:22 AM Robert LeBlanc <robert@leblancnet.us> wrote:
>
> I upgraded our Luminous cluster to Nautilus a couple of weeks ago and converted the last batch of FileStore OSDs to BlueStore about 36 hours ago. Yesterday our monitor cluster went nuts and started constantly calling elections because the monitor nodes were at 100% CPU and wouldn't respond to heartbeats. I reduced the monitor cluster to a single monitor to prevent the constant elections, and that let the system limp along until the backfills finished. There are long stretches where ceph commands hang while the CPU is at 100%; when the CPU drops, I see a lot of work getting done in the monitor logs, which stops as soon as the CPU is at 100% again.
>
> I ran `perf top` on the node to see what's taking all the time, and it appears to be in the rocksdb code path. I've set `mon_compact_on_start = true` in ceph.conf, but that does not appear to help. The `/var/lib/ceph/mon/` directory is 311 MB, down from 3.0 GB while the backfills were going on. I've tried adding a second monitor, but it goes back to the constant elections. I tried restarting all the services without luck. I also pulled the monitor from the network and tried restarting the mon service in isolation (this helped a couple of weeks ago, when `ceph -s` would cause 100% CPU and lock up the service much worse than this) and didn't see the high CPU load. So I'm guessing it's triggered by some external source.
>
> I'm happy to provide more info, just let me know what would be helpful.

Sent this to the dev list, but forgot it needed to be plain text. Here
is text output of the `perf top` taken a bit later, so not exactly the
same as the screenshot earlier.

Samples: 20M of event 'cycles', 4000 Hz, Event count (approx.): 61966526527 lost: 0/0 drop: 0/0
Overhead  Shared Object         Symbol
 11.52%  ceph-mon              [.] rocksdb::MemTable::KeyComparator::operator()
  6.80%  ceph-mon              [.] rocksdb::MemTable::KeyComparator::operator()
  4.75%  ceph-mon              [.] rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::FindGreaterOrEqual
  2.89%  libc-2.27.so          [.] vfprintf
  2.54%  libtcmalloc.so.4.3.0  [.] tc_deletearray_nothrow
  2.31%  ceph-mon              [.] TLS init function for rocksdb::perf_context
  2.14%  ceph-mon              [.] rocksdb::DBImpl::GetImpl
  1.53%  libc-2.27.so          [.] 0x000000000018acf8
  1.44%  libc-2.27.so          [.] _IO_default_xsputn
  1.34%  ceph-mon              [.] memcmp@plt
  1.32%  libtcmalloc.so.4.3.0  [.] tc_malloc
  1.28%  ceph-mon              [.] rocksdb::Version::Get
  1.27%  libc-2.27.so          [.] 0x000000000018abf4
  1.17%  ceph-mon              [.] RocksDBStore::get
  1.08%  ceph-mon              [.] 0x0000000000639a33
  1.04%  ceph-mon              [.] 0x0000000000639a0e
  0.89%  ceph-mon              [.] 0x0000000000639a46
  0.86%  ceph-mon              [.] rocksdb::TableCache::Get
  0.72%  libc-2.27.so          [.] 0x000000000018abfe
  0.68%  libceph-common.so.0   [.] ceph_str_hash_rjenkins
  0.66%  ceph-mon              [.] rocksdb::Hash
  0.63%  ceph-mon              [.] rocksdb::MemTable::Get
  0.62%  ceph-mon              [.] 0x00000000006399ff
  0.57%  libc-2.27.so          [.] 0x000000000018abf0
  0.57%  ceph-mon              [.] rocksdb::GetContext::GetContext
  0.57%  ceph-mon              [.] rocksdb::BlockBasedTable::Get
  0.57%  ceph-mon              [.] rocksdb::BlockBasedTable::GetFilter
  0.55%  [vdso]                [.] __vdso_clock_gettime
  0.54%  ceph-mon              [.] 0x00000000005afa17
  0.53%  ceph-mgr              [.] std::_Rb_tree<pg_t, pg_t, std::_Identity<pg_t>, std::less<pg_t>, std::allocator<pg_t> >::equal_range
  0.51%  libceph-common.so.0   [.] PerfCounters::tinc
  0.50%  ceph-mon              [.] OSDMonitor::make_snap_epoch_key[abi:cxx11]
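
To put a number on "appears to be in the rocksdb code path", the rows can be totalled per shared object and for symbols in the `rocksdb::` namespace. This is only a rough sketch: it assumes each row sits on one line (wrapped symbols rejoined), the names `PERF_ROWS`, `ROW`, and `summarize` are made up for illustration, and only a handful of the rows from the output above are included.

```python
import re
from collections import defaultdict

# A few rows copied from the perf top output, with wrapped symbol names rejoined.
PERF_ROWS = """\
 11.52%  ceph-mon              [.] rocksdb::MemTable::KeyComparator::operator()
  6.80%  ceph-mon              [.] rocksdb::MemTable::KeyComparator::operator()
  4.75%  ceph-mon              [.] rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::FindGreaterOrEqual
  2.89%  libc-2.27.so          [.] vfprintf
  2.54%  libtcmalloc.so.4.3.0  [.] tc_deletearray_nothrow
  2.14%  ceph-mon              [.] rocksdb::DBImpl::GetImpl
  1.17%  ceph-mon              [.] RocksDBStore::get
"""

# overhead percentage, shared object, "[.]" marker, symbol
ROW = re.compile(r"^\s*(\d+(?:\.\d+)?)%\s+(\S+)\s+\[.\]\s+(.*)$")


def summarize(text):
    """Return (overhead per shared object, total overhead of rocksdb:: symbols)."""
    by_object = defaultdict(float)
    rocksdb_total = 0.0
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # skip headers and anything that is not a sample row
        pct = float(match.group(1))
        by_object[match.group(2)] += pct
        # Only symbols inside the rocksdb namespace; Ceph's own
        # RocksDBStore wrapper is deliberately not counted here.
        if match.group(3).startswith("rocksdb::"):
            rocksdb_total += pct
    return dict(by_object), rocksdb_total


by_object, rocksdb_total = summarize(PERF_ROWS)
for obj, pct in sorted(by_object.items(), key=lambda kv: -kv[1]):
    print(f"{obj:24s} {pct:6.2f}%")
print(f"{'rocksdb:: symbols':24s} {rocksdb_total:6.2f}%")
```

Feeding the full listing through `summarize` would give the complete picture; the percentages are per-symbol samples from `perf top`, so the totals are only as good as that one snapshot.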


* Re: [ceph-users] Nautilus 14.2.19 mon 100% CPU
       [not found] <CAANLjFpjRLtV+GR4WV15iXXCvkig6tJAr_G=_bZpZ=jKnYfvTQ@mail.gmail.com>
  2021-04-08 17:24 ` Nautilus 14.2.19 mon 100% CPU Robert LeBlanc
@ 2021-04-08 19:11 ` Stefan Kooman
  2021-04-08 20:26   ` Robert LeBlanc
  1 sibling, 1 reply; 8+ messages in thread
From: Stefan Kooman @ 2021-04-08 19:11 UTC (permalink / raw)
  To: Robert LeBlanc, ceph-devel, ceph-users

On 4/8/21 6:22 PM, Robert LeBlanc wrote:
> I upgraded our Luminous cluster to Nautilus a couple of weeks ago and 
> converted the last batch of FileStore OSDs to BlueStore about 36 hours 
> ago. Yesterday our monitor cluster went nuts and started constantly 
> calling elections because monitor nodes were at 100% and wouldn't 
> respond to heartbeats. I reduced the monitor cluster to one to prevent 
> the constant elections and that let the system limp along until the 
> backfills finished. There are long stretches where ceph commands 
> hang while the CPU is at 100%; when the CPU drops, I see a lot of work 
> getting done in the monitor logs, which stops as soon as the CPU is at 
> 100% again.


Try reducing mon_sync_max_payload_size to 4096. I have seen Frank Schilder 
advise this several times because of monitor issues. Also recently for a 
cluster that got upgraded from Luminous -> Mimic -> Nautilus.

Worth a shot.

Otherwise I'll try to look in depth and see if I can come up with 
something smart (for now I need to go catch some sleep).

Gr. Stefan


* Re: [ceph-users] Nautilus 14.2.19 mon 100% CPU
  2021-04-08 19:11 ` [ceph-users] " Stefan Kooman
@ 2021-04-08 20:26   ` Robert LeBlanc
       [not found]     ` <CAKTRiELqxD+0LtRXan9gMzot3y4A4M4x=km-MB2aET6wP_5mQg@mail.gmail.com>
  0 siblings, 1 reply; 8+ messages in thread
From: Robert LeBlanc @ 2021-04-08 20:26 UTC (permalink / raw)
  To: Stefan Kooman; +Cc: ceph-devel, ceph-users

I found this thread that matches a lot of what I'm seeing. I see the
ms_dispatch thread going to 100%, but I'm at a single MON, the
recovery is done and the rocksdb MON database is ~300MB. I've tried
all the settings mentioned in that thread with no noticeable
improvement. I was hoping that once the recovery was done (backfills
to reformatted OSDs) that it would clear up, but not yet. So any other
ideas would be really helpful. Our MDS is functioning, but stalls a
lot because the mons miss heartbeats.

mon_compact_on_start = true
rocksdb_cache_size = 1342177280
mon_lease = 30
mon_osd_cache_size = 200000
mon_sync_max_payload_size = 4096

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
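
For anyone wanting to try the same set of options, here is a minimal sketch that renders them either as a ceph.conf section or as `ceph config set` commands for the centralized config store. The values are the ones listed in this message; putting them under `[mon]` is an assumption (the message does not say which section was used), and `MON_SETTINGS`, `ceph_conf_section`, and `config_set_commands` are illustrative names only. Some of these options may only take effect after a monitor restart.

```python
import shlex

# Values copied from the message above (rocksdb_cache_size is 1280 MiB).
MON_SETTINGS = {
    "mon_compact_on_start": "true",
    "rocksdb_cache_size": "1342177280",
    "mon_lease": "30",
    "mon_osd_cache_size": "200000",
    "mon_sync_max_payload_size": "4096",
}


def ceph_conf_section(settings, section="mon"):
    """Render the settings as an ini-style ceph.conf section.

    The section name is an assumption, not something stated in the thread.
    """
    lines = [f"[{section}]"]
    lines += [f"{key} = {value}" for key, value in settings.items()]
    return "\n".join(lines)


def config_set_commands(settings, who="mon"):
    """Build equivalent `ceph config set <who> <option> <value>` command lines."""
    return [
        shlex.join(["ceph", "config", "set", who, key, value])
        for key, value in settings.items()
    ]


print(ceph_conf_section(MON_SETTINGS))
print()
for command in config_set_commands(MON_SETTINGS):
    print(command)
```

The script only prints text; nothing is applied to a cluster unless the printed commands are run by hand.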


[parent not found: <CAKTRiELqxD+0LtRXan9gMzot3y4A4M4x=km-MB2aET6wP_5mQg@mail.gmail.com>]

* Re: [ceph-users] Nautilus 14.2.19 mon 100% CPU
       [not found]     ` <CAKTRiELqxD+0LtRXan9gMzot3y4A4M4x=km-MB2aET6wP_5mQg@mail.gmail.com>
@ 2021-04-09  3:48       ` Robert LeBlanc
  2021-04-09 13:40         ` Robert LeBlanc
  0 siblings, 1 reply; 8+ messages in thread
From: Robert LeBlanc @ 2021-04-09  3:48 UTC (permalink / raw)
  To: Zizon Qiu; +Cc: Stefan Kooman, ceph-devel, ceph-users

Good thought. The storage for the monitor data is a RAID-0 over three
NVMe devices. Watching iostat, they are completely idle, maybe 0.8% to
1.4% for a second every minute or so.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Thu, Apr 8, 2021 at 7:48 PM Zizon Qiu <zzdtsv@gmail.com> wrote:
>
> Could this be related to some kind of disk issue on the host where that mon is located, which may occasionally
> slow down IO and, in turn, rocksdb?

* Re: [ceph-users] Nautilus 14.2.19 mon 100% CPU
  2021-04-09  3:48       ` Robert LeBlanc
@ 2021-04-09 13:40         ` Robert LeBlanc
  2021-04-09 15:25           ` [ceph-users] " Stefan Kooman
  0 siblings, 1 reply; 8+ messages in thread
From: Robert LeBlanc @ 2021-04-09 13:40 UTC (permalink / raw)
  To: ceph-devel, ceph-users

I'm attempting to deep scrub all the PGs to see if that helps clear up
some accounting issues, but that's going to take a really long time on
2PB of data.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


* Re: [ceph-users] Re: Nautilus 14.2.19 mon 100% CPU
  2021-04-09 13:40         ` Robert LeBlanc
@ 2021-04-09 15:25           ` Stefan Kooman
  2021-04-09 16:41             ` Robert LeBlanc
  0 siblings, 1 reply; 8+ messages in thread
From: Stefan Kooman @ 2021-04-09 15:25 UTC (permalink / raw)
  To: Robert LeBlanc, ceph-devel, ceph-users

On 4/9/21 3:40 PM, Robert LeBlanc wrote:
> I'm attempting to deep scrub all the PGs to see if that helps clear up
> some accounting issues, but that's going to take a really long time on
> 2PB of data.

Are you running with 1 mon now? Have you tried adding mons from scratch? 
So with a fresh database? And then maybe after they have joined, kill 
the donor mon and start from scratch.

Are you sure you have not missed a step during the upgrade (just 
checking), e.g. `ceph osd require-osd-release nautilus`?

Gr. Stefan
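
One way to double-check that step, as a sketch: the JSON form of `ceph osd dump` carries a `require_osd_release` field, which can be read back rather than trusting memory. `SAMPLE_DUMP` below is a made-up, heavily trimmed stand-in for real output, and the helper names are hypothetical; `fetch_osd_dump` needs a reachable cluster and an admin keyring, so it is defined but not called here.

```python
import json
import subprocess


def require_osd_release(osd_dump_json):
    """Return the require_osd_release value from `ceph osd dump` JSON, or None."""
    return json.loads(osd_dump_json).get("require_osd_release")


def fetch_osd_dump():
    """Run `ceph osd dump --format json` and return its stdout.

    Only works on a host that can reach the cluster.
    """
    result = subprocess.run(
        ["ceph", "osd", "dump", "--format", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Hypothetical, trimmed example of what the dump might contain.
SAMPLE_DUMP = '{"epoch": 1, "require_osd_release": "nautilus"}'

release = require_osd_release(SAMPLE_DUMP)
if release == "nautilus":
    print("require_osd_release is set to nautilus")
else:
    print(f"require_osd_release is {release!r}; the upgrade step may be missing")
```

On a live cluster, `require_osd_release(fetch_osd_dump())` would perform the same check against real data.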


* Re: [ceph-users] Re: Nautilus 14.2.19 mon 100% CPU
  2021-04-09 15:25           ` [ceph-users] " Stefan Kooman
@ 2021-04-09 16:41             ` Robert LeBlanc
  2021-04-09 17:01               ` Robert LeBlanc
  0 siblings, 1 reply; 8+ messages in thread
From: Robert LeBlanc @ 2021-04-09 16:41 UTC (permalink / raw)
  To: Stefan Kooman; +Cc: ceph-devel, ceph-users

On Fri, Apr 9, 2021 at 9:25 AM Stefan Kooman <stefan@bit.nl> wrote:
> Are you running with 1 mon now? Have you tried adding mons from scratch?
> So with a fresh database? And then maybe after they have joined, kill
> the donor mon and start from scratch.
>
> Are you sure you have not missed a step during the upgrade (just
> checking), e.g. `ceph osd require-osd-release nautilus`?

I have tried adding one of the other monitors by removing its data
directory and starting from scratch, but it would go back to the
monitor elections, and I didn't feel confident that it was in sync
enough to fail over to, so I took it back out. I did run `ceph osd
require-osd-release nautilus` after upgrading all the OSDs. I'll go
back and double-check all the steps, but I think I got them all.

Thank you,
Robert LeBlanc


* Re: [ceph-users] Re: Nautilus 14.2.19 mon 100% CPU
  2021-04-09 16:41             ` Robert LeBlanc
@ 2021-04-09 17:01               ` Robert LeBlanc
  0 siblings, 0 replies; 8+ messages in thread
From: Robert LeBlanc @ 2021-04-09 17:01 UTC (permalink / raw)
  To: Stefan Kooman; +Cc: ceph-devel, ceph-users

The only step not yet taken was the move to straw2. That was going to
be our next step.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
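
To see how many CRUSH buckets are still on the legacy straw algorithm before making that move, the JSON printed by `ceph osd crush dump` can be tallied by each bucket's `alg` field. A minimal sketch: `SAMPLE_CRUSH` is an invented two-bucket example, not data from this cluster, and the function names are illustrative. The conversion itself is normally done with `ceph osd crush set-all-straw-buckets-to-straw2`, which can cause some data movement.

```python
import json
from collections import Counter


def bucket_algorithms(crush_dump_json):
    """Count CRUSH buckets per algorithm (e.g. straw, straw2)."""
    crush = json.loads(crush_dump_json)
    return Counter(bucket.get("alg", "unknown") for bucket in crush.get("buckets", []))


def legacy_straw_buckets(crush_dump_json):
    """Return the names of buckets still using the original straw algorithm."""
    crush = json.loads(crush_dump_json)
    return [
        bucket["name"]
        for bucket in crush.get("buckets", [])
        if bucket.get("alg") == "straw"
    ]


# Hypothetical, trimmed stand-in for `ceph osd crush dump` output.
SAMPLE_CRUSH = json.dumps(
    {
        "buckets": [
            {"id": -1, "name": "default", "alg": "straw2"},
            {"id": -2, "name": "host-a", "alg": "straw"},
        ]
    }
)

print(dict(bucket_algorithms(SAMPLE_CRUSH)))
print("still on straw:", legacy_straw_buckets(SAMPLE_CRUSH))
```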

