I'm new to doing this all via systemd and systemd-coredump, but I appear to
have gotten cores from two OSD processes. When xzipped they are < 2 MiB each,
but I threw them on my webserver to avoid polluting the mailing list. This
seems oddly small, so if I've botched the process somehow let me know :)

https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.6742.1492634493000000000000.xz
https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.7202.1492634508000000000000.xz

And for reference:

root@osd001:/var/lib/systemd/coredump# ceph -v
ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)

I am also investigating sysdig as recommended.

Thanks!
-Aaron

On Mon, Apr 17, 2017 at 8:15 AM, Sage Weil wrote:
> On Sat, 15 Apr 2017, Aaron Ten Clay wrote:
> > Hi all,
> >
> > Our cluster is experiencing a very odd issue and I'm hoping for some
> > guidance on troubleshooting steps and/or suggestions to mitigate the
> > issue. tl;dr: Individual ceph-osd processes try to allocate > 90GiB of
> > RAM and are eventually nuked by oom_killer.
>
> My guess is that there is a bug in a decoding path and it's trying to
> allocate some huge amount of memory. Can you try setting a memory ulimit
> to something like 40gb and then enabling core dumps so you can get a
> core? Something like
>
>   ulimit -c unlimited
>   ulimit -m 20000000
>
> or whatever the corresponding systemd unit file options are...
>
> Once we have a core file it will hopefully be clear who is doing the bad
> allocation...
>
> sage
>
> > I'll try to explain the situation in detail:
> >
> > We have 24 4TB bluestore HDD OSDs and 4 600GB SSD OSDs. The SSD OSDs
> > are in a different CRUSH "root", used as a cache tier for the main
> > storage pools, which are erasure coded and used for cephfs. The OSDs
> > are spread across two identical machines with 128GiB of RAM each, and
> > there are three monitor nodes on different hardware.
> >
> > Several times we've encountered crippling bugs with previous Ceph
> > releases when we were on RCs or betas, or using non-recommended
> > configurations, so in January we abandoned all previous Ceph usage,
> > deployed LTS Ubuntu 16.04, and went with stable Kraken 11.2.0 with the
> > configuration mentioned above. Everything was fine until the end of
> > March, when one day we found all but a couple of OSDs "down",
> > inexplicably. Investigation revealed that oom_killer had come along and
> > nuked almost all the ceph-osd processes.
> >
> > We've gone through a bunch of iterations of restarting the OSDs, trying
> > to bring them up one at a time, gradually, or all at once, and trying
> > various configuration settings to reduce cache size as suggested in
> > this ticket:
> > http://tracker.ceph.com/issues/18924...
> >
> > I don't know whether that ticket really pertains to our situation; I
> > have no experience with memory allocation debugging. I'd be willing to
> > try if someone can point me to a guide or walk me through the process.
> >
> > I've even tried, just to see if the situation was transitory, adding
> > over 300GiB of swap to both OSD machines. The OSD procs managed to
> > allocate, in a matter of 5-10 minutes, more than 300GiB of RAM pressure
> > and became oom_killer victims once again.
> >
> > No software or hardware changes took place around the time this problem
> > started, and no significant data changes occurred either. We added
> > about 40GiB of ~1GiB files a week or so before the problem started, and
> > that's the last time data was written.
> >
> > I can only assume we've found another crippling bug of some kind; this
> > level of memory usage is entirely unprecedented. What can we do?
> >
> > Thanks in advance for any suggestions.
> > -Aaron

-- 
Aaron Ten Clay
https://aarontc.com
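For readers following along, here is one way Sage's ulimit suggestion could be
expressed as systemd unit options, plus the coredumpctl calls for pulling the
resulting cores out of /var/lib/systemd/coredump. This is only a minimal
sketch, not the poster's actual setup: it assumes the OSDs run under the
stock ceph-osd@.service template, the drop-in filename and the ~40 GB values
are placeholder examples, and <id> stands for whatever OSD instance you are
restarting.

  # /etc/systemd/system/ceph-osd@.service.d/coredump.conf  (hypothetical drop-in)
  [Service]
  # Equivalent of "ulimit -c unlimited": allow full-size core dumps.
  LimitCORE=infinity
  # Equivalent of "ulimit -m": cap resident set size (value in bytes).
  # Note that RLIMIT_RSS is not enforced by recent Linux kernels; LimitAS=
  # ("ulimit -v", address space) is the limit that actually stops a runaway
  # allocation.
  LimitRSS=40000000000
  LimitAS=40000000000

  # Apply the drop-in and restart the affected OSD:
  systemctl daemon-reload
  systemctl restart ceph-osd@<id>

  # After the next OOM/abort, locate and extract the core (6742 is one of
  # the crashed PIDs from the dump filenames above):
  coredumpctl list ceph-osd
  coredumpctl -o core.ceph-osd.6742 dump 6742
  coredumpctl gdb 6742    # needs the ceph debug symbol packages installed

The systemd-coredump storage path in the shell prompt above suggests cores are
already being captured there, so the main addition is LimitCORE plus whichever
memory cap you choose.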