* Re: Extremely high OSD memory utilization on Kraken 11.2.0 (with XFS -or- bluestore)
       [not found] ` <CAFFcurqEctQ2fHHDcGYfy3YCuaq9DxZr0VU4e8dNVACNVLDmqA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-04-17 15:15   ` Sage Weil
       [not found]     ` <alpine.DEB.2.11.1704171457320.10661-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
  2017-04-19 23:22     ` Aaron Ten Clay
  0 siblings, 2 replies; 8+ messages in thread
From: Sage Weil @ 2017-04-17 15:15 UTC (permalink / raw)
  To: Aaron Ten Clay
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel-u79uwXL29TY76Z2rM5mHXA


On Sat, 15 Apr 2017, Aaron Ten Clay wrote:
> Hi all,
> 
> Our cluster is experiencing a very odd issue and I'm hoping for some
> guidance on troubleshooting steps and/or suggestions to mitigate the issue.
> tl;dr: Individual ceph-osd processes try to allocate > 90GiB of RAM and are
> eventually nuked by oom_killer.

My guess is that there is a bug in a decoding path and it's trying to 
allocate some huge amount of memory.  Can you try setting a memory ulimit 
to something like 40gb and then enabling core dumps so you can get a 
core?  Something like

ulimit -c unlimited
ulimit -m 20000000

or whatever the corresponding systemd unit file options are...
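
For the systemd units that would be a drop-in for ceph-osd@.service, roughly 
along these lines (an untested sketch; the path and values are just examples, 
and note that 'ulimit -m' / RLIMIT_RSS isn't actually enforced on recent 
kernels, so an address-space cap is more likely to trip):

  # /etc/systemd/system/ceph-osd@.service.d/coredump.conf
  [Service]
  LimitCORE=infinity     # allow full core dumps
  LimitAS=40G            # address-space cap; older systemd may want plain bytes

  systemctl daemon-reload
  systemctl restart ceph-osd@<id>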

Once we have a core file it will hopefully be clear who is 
doing the bad allocation...

sage



> 
> I'll try to explain the situation in detail:
> 
> We have twenty-four 4TB bluestore HDD OSDs and four 600GB SSD OSDs. The SSD OSDs are in
> a different CRUSH "root", used as a cache tier for the main storage pools,
> which are erasure coded and used for cephfs. The OSDs are spread across two
> identical machines with 128GiB of RAM each, and there are three monitor
> nodes on different hardware.
> 
> Several times we've encountered crippling bugs with previous Ceph releases
> when we were on RC or betas, or using non-recommended configurations, so in
> January we abandoned all previous Ceph usage, deployed LTS Ubuntu 16.04, and
> went with stable Kraken 11.2.0 with the configuration mentioned above.
> Everything was fine until the end of March, when one day we found all but a
> couple of OSDs "down" inexplicably. Investigation revealed that oom_killer
> had come along and nuked almost all the ceph-osd processes.
> 
> We've gone through a bunch of iterations of restarting the OSDs, trying to
> bring them up one at a time gradually, all at once, various configuration
> settings to reduce cache size as suggested in this ticket:
> http://tracker.ceph.com/issues/18924...
> 
> I don't know if that ticket really pertains to our situation or not, I have
> no experience with memory allocation debugging. I'd be willing to try if
> someone can point me to a guide or walk me through the process.
> 
> I've even tried, just to see if the situation was transitory, adding over
> 300GiB of swap to both OSD machines. The OSD procs managed to consume, in a
> matter of 5-10 minutes, more than 300GiB of RAM and swap and became
> oom_killer victims once again.
> 
> No software or hardware changes took place around the time this problem
> started, and no significant data changes occurred either. We added about
> 40GiB of ~1GiB files a week or so before the problem started and that's the
> last time data was written.
> 
> I can only assume we've found another crippling bug of some kind; this level
> of memory usage is entirely unprecedented. What can we do?
> 
> Thanks in advance for any suggestions.
> -Aaron
> 
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Extremely high OSD memory utilization on Kraken 11.2.0 (with XFS -or- bluestore)
       [not found]     ` <alpine.DEB.2.11.1704171457320.10661-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
@ 2017-04-19 23:18       ` Aaron Ten Clay
       [not found]         ` <CAFFcurrA8BKF0a+9gdGAsTDbE78ci9X8dwEaWEycoF4DNQN8uw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Aaron Ten Clay @ 2017-04-19 23:18 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel-u79uwXL29TY76Z2rM5mHXA



I'm new to doing this all via systemd and systemd-coredump, but I appear to
have gotten cores from two OSD processes. Compressed with xz they are < 2 MiB
each, but I threw them on my webserver to avoid polluting the mailing list.
This seems oddly small, so if I've botched the process somehow let me know :)
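
For reference, roughly the steps I used (assuming I drove the tooling
correctly; this is just a sketch of my process):

  coredumpctl list ceph-osd
  ls -lh /var/lib/systemd/coredump/

and then I copied the core.ceph-osd.*.xz files straight out of that
directory, since systemd-coredump already stores them xz-compressed there.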

https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.6742.1492634493000000000000.xz
https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.7202.1492634508000000000000.xz

And for reference:
root@osd001:/var/lib/systemd/coredump# ceph -v
ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)


I am also investigating sysdig as recommended.

Thanks!
-Aaron


On Mon, Apr 17, 2017 at 8:15 AM, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:

> [... quoted message trimmed ...]



-- 
Aaron Ten Clay
https://aarontc.com


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [ceph-users] Extremely high OSD memory utilization on Kraken 11.2.0 (with XFS -or- bluestore)
  2017-04-17 15:15   ` Extremely high OSD memory utilization on Kraken 11.2.0 (with XFS -or- bluestore) Sage Weil
       [not found]     ` <alpine.DEB.2.11.1704171457320.10661-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
@ 2017-04-19 23:22     ` Aaron Ten Clay
  1 sibling, 0 replies; 8+ messages in thread
From: Aaron Ten Clay @ 2017-04-19 23:22 UTC (permalink / raw)
  Cc: ceph-devel

(Re-sending this as plaintext to satisfy vger.kernel.org!)

I'm new to doing this all via systemd and systemd-coredump, but I
appear to have gotten cores from two OSD processes. Compressed with xz
they are < 2 MiB each, but I threw them on my webserver to avoid
polluting the mailing list. This seems oddly small, so if I've botched
the process somehow let me know :)

https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.6742.1492634493000000000000.xz
https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.7202.1492634508000000000000.xz

And for reference:
root@osd001:/var/lib/systemd/coredump# ceph -v
ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)


I am also investigating sysdig as recommended.

Thanks!

On Mon, Apr 17, 2017 at 8:15 AM, Sage Weil <sage@newdream.net> wrote:
> [... quoted message trimmed ...]



-- 
Aaron Ten Clay
https://aarontc.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Extremely high OSD memory utilization on Kraken 11.2.0 (with XFS -or- bluestore)
       [not found]         ` <CAFFcurrA8BKF0a+9gdGAsTDbE78ci9X8dwEaWEycoF4DNQN8uw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-05-04 20:57           ` Aaron Ten Clay
       [not found]             ` <CAFFcuroumvWGZA+KC4V7wOiF9T0y7k9v+Ms=GgOh+bGm8gP__g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Aaron Ten Clay @ 2017-05-04 20:57 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel-u79uwXL29TY76Z2rM5mHXA

Were the core dumps we obtained not useful? Is there anything else we
can try to get the OSDs up again?

On Wed, Apr 19, 2017 at 4:18 PM, Aaron Ten Clay <aarontc-q67U1YB0R7xBDgjK7y7TUQ@public.gmane.org> wrote:
> [... quoted message with links to the core dumps trimmed ...]



-- 
Aaron Ten Clay
https://aarontc.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Extremely high OSD memory utilization on Kraken 11.2.0 (with XFS -or- bluestore)
       [not found]             ` <CAFFcuroumvWGZA+KC4V7wOiF9T0y7k9v+Ms=GgOh+bGm8gP__g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-05-04 21:25               ` Sage Weil
       [not found]                 ` <alpine.DEB.2.11.1705042124210.3646-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2017-05-04 21:25 UTC (permalink / raw)
  To: Aaron Ten Clay
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel-u79uwXL29TY76Z2rM5mHXA

Hi Aaron-

Sorry, lost track of this one.  In order to get backtraces out of the core 
you need the matching executables.  Can you make sure the ceph-osd-dbg or 
ceph-debuginfo package is installed on the machine (depending on whether it's 
deb or rpm), then run 'gdb ceph-osd corefile' and 'thr app all bt'?
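
Concretely, something like this (untested; adjust the package name and core 
path for your setup, and unxz the core first if you're using the .xz file 
straight out of /var/lib/systemd/coredump):

  apt-get install ceph-osd-dbg       # deb; on rpm: yum install ceph-debuginfo
  gdb /usr/bin/ceph-osd /path/to/corefile
  (gdb) set pagination off
  (gdb) set logging file backtrace.txt
  (gdb) set logging on
  (gdb) thread apply all bt
  (gdb) quit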

Thanks!
sage


On Thu, 4 May 2017, Aaron Ten Clay wrote:

> Were the core dumps we obtained not useful? Is there anything else we
> can try to get the OSDs up again?
> 
> [... rest of quoted thread trimmed ...]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Extremely high OSD memory utilization on Kraken 11.2.0 (with XFS -or- bluestore)
       [not found]                 ` <alpine.DEB.2.11.1705042124210.3646-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
@ 2017-05-15 23:01                   ` Aaron Ten Clay
  2017-05-16  1:35                     ` [ceph-users] " Sage Weil
  0 siblings, 1 reply; 8+ messages in thread
From: Aaron Ten Clay @ 2017-05-15 23:01 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel-u79uwXL29TY76Z2rM5mHXA

Hi Sage,

No problem. I thought this would take a lot longer to resolve, so I waited
until I had a good chunk of time free, but then it only took a few minutes!

Here are the respective backtrace outputs from gdb:

https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.6742.1492634493000000000000.backtrace.txt
https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.7202.1492634508000000000000.backtrace.txt

Hope that helps!

-Aaron


On Thu, May 4, 2017 at 2:25 PM, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> [... quoted message trimmed ...]



-- 
Aaron Ten Clay
https://aarontc.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [ceph-users] Extremely high OSD memory utilization on Kraken 11.2.0 (with XFS -or- bluestore)
  2017-05-15 23:01                   ` Aaron Ten Clay
@ 2017-05-16  1:35                     ` Sage Weil
  2017-06-02 21:56                       ` Aaron Ten Clay
  0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2017-05-16  1:35 UTC (permalink / raw)
  To: Aaron Ten Clay; +Cc: ceph-devel

On Mon, 15 May 2017, Aaron Ten Clay wrote:
> Hi Sage,
> 
> No problem. I thought this would take a lot longer to resolve so I
> waited to find a good chunk of time, then it only took a few minutes!
> 
> Here are the respective backtrace outputs from gdb:
> 
> https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.6742.1492634493000000000000.backtrace.txt
> https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.7202.1492634508000000000000.backtrace.txt

Looks like it's in BlueFS replay.  Can you reproduce with 'log max recent 
= 1' and 'debug bluefs = 20'?
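
That is, something along these lines in ceph.conf on that OSD's host (just a 
sketch; adjust to taste), then restart the OSD so the options are already in 
effect during the BlueFS mount/replay:

  [osd]
      debug bluefs = 20
      log max recent = 1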

It's weird... the symptom is eating RAM, but it's hitting an assert during 
replay on mount...

Thanks!
sage



> [... rest of quoted message trimmed ...]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [ceph-users] Extremely high OSD memory utilization on Kraken 11.2.0 (with XFS -or- bluestore)
  2017-05-16  1:35                     ` [ceph-users] " Sage Weil
@ 2017-06-02 21:56                       ` Aaron Ten Clay
  0 siblings, 0 replies; 8+ messages in thread
From: Aaron Ten Clay @ 2017-06-02 21:56 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Mon, May 15, 2017 at 6:35 PM, Sage Weil <sage@newdream.net> wrote:
> [... quoted backtrace links trimmed ...]
>
> Looks like it's in BlueFS replay.  Can you reproduce with 'log max
> recent = 1' and 'debug bluefs = 20'?
>
> It's weird... the symptom is eating RAM, but it's hitting an assert
> during replay on mount...
>
> Thanks!
> sage
>

Sage:

Here's the log from osd.1: https://aarontc.com/ceph/ceph-osd.1.log.bz2

I'm not entirely sure the issue was reproduced: the symptom of the OSD
running away with all the RAM still happens, but there is so much in this
log that I'm not sure it captures what you're looking for.
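
In case it helps narrow things down, I've mostly been skimming it with
something like this (the pattern is just my guess at the relevant lines):

  bzgrep -E 'bluefs|assert' ceph-osd.1.log.bz2 | tail -n 200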

-Aaron

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2017-06-02 21:57 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAFFcurqEctQ2fHHDcGYfy3YCuaq9DxZr0VU4e8dNVACNVLDmqA@mail.gmail.com>
     [not found] ` <CAFFcurqEctQ2fHHDcGYfy3YCuaq9DxZr0VU4e8dNVACNVLDmqA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-04-17 15:15   ` Extremely high OSD memory utilization on Kraken 11.2.0 (with XFS -or- bluestore) Sage Weil
     [not found]     ` <alpine.DEB.2.11.1704171457320.10661-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
2017-04-19 23:18       ` Aaron Ten Clay
     [not found]         ` <CAFFcurrA8BKF0a+9gdGAsTDbE78ci9X8dwEaWEycoF4DNQN8uw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-05-04 20:57           ` Aaron Ten Clay
     [not found]             ` <CAFFcuroumvWGZA+KC4V7wOiF9T0y7k9v+Ms=GgOh+bGm8gP__g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-05-04 21:25               ` Sage Weil
     [not found]                 ` <alpine.DEB.2.11.1705042124210.3646-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
2017-05-15 23:01                   ` Aaron Ten Clay
2017-05-16  1:35                     ` [ceph-users] " Sage Weil
2017-06-02 21:56                       ` Aaron Ten Clay
2017-04-19 23:22     ` Aaron Ten Clay
