From mboxrd@z Thu Jan 1 00:00:00 1970
From: Aaron Ten Clay
Subject: Re: [ceph-users] Extremely high OSD memory utilization on Kraken 11.2.0 (with XFS -or- bluestore)
Date: Wed, 19 Apr 2017 16:22:30 -0700
Message-ID:
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Return-path:
Received: from mail-io0-f193.google.com ([209.85.223.193]:34624 "EHLO mail-io0-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S964890AbdDSXWc (ORCPT ); Wed, 19 Apr 2017 19:22:32 -0400
Received: by mail-io0-f193.google.com with SMTP id h41so8747996ioi.1 for ; Wed, 19 Apr 2017 16:22:31 -0700 (PDT)
In-Reply-To:
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
Cc: ceph-devel@vger.kernel.org

(Re-sending this as plaintext to satisfy vger.kernel.org!)

I'm new to doing this all via systemd and systemd-coredump, but I appear
to have gotten cores from two OSD processes. When xz-compressed they are
< 2 MiB each, but I threw them on my webserver to avoid polluting the
mailing list. This seems oddly small, so if I've botched the process
somehow, let me know :) (A sketch of the systemd side of this is at the
bottom of this mail.)

https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.6742.1492634493000000000000.xz
https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.7202.1492634508000000000000.xz

And for reference:

root@osd001:/var/lib/systemd/coredump# ceph -v
ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)

I am also investigating sysdig as recommended.

Thanks!

On Mon, Apr 17, 2017 at 8:15 AM, Sage Weil wrote:
> On Sat, 15 Apr 2017, Aaron Ten Clay wrote:
>> Hi all,
>>
>> Our cluster is experiencing a very odd issue and I'm hoping for some
>> guidance on troubleshooting steps and/or suggestions to mitigate the
>> issue. tl;dr: Individual ceph-osd processes try to allocate > 90 GiB
>> of RAM and are eventually nuked by oom_killer.
>
> My guess is that there is a bug in a decoding path and it's trying to
> allocate some huge amount of memory. Can you try setting a memory
> ulimit to something like 40gb and then enabling core dumps so you can
> get a core? Something like
>
>   ulimit -c unlimited
>   ulimit -m 20000000
>
> or whatever the corresponding systemd unit file options are...
>
> Once we have a core file it will hopefully be clear who is doing the
> bad allocation...
>
> sage
>
>> I'll try to explain the situation in detail:
>>
>> We have 24 x 4 TB bluestore HDD OSDs and 4 x 600 GB SSD OSDs. The SSD
>> OSDs are in a different CRUSH "root", used as a cache tier for the
>> main storage pools, which are erasure-coded and used for CephFS. The
>> OSDs are spread across two identical machines with 128 GiB of RAM
>> each, and there are three monitor nodes on different hardware.
>>
>> Several times we've encountered crippling bugs with previous Ceph
>> releases when we were on RCs or betas, or using non-recommended
>> configurations, so in January we abandoned all previous Ceph usage,
>> deployed LTS Ubuntu 16.04, and went with stable Kraken 11.2.0 with
>> the configuration mentioned above. Everything was fine until the end
>> of March, when one day we found all but a couple of OSDs "down"
>> inexplicably. Investigation revealed oom_killer had come along and
>> nuked almost all the ceph-osd processes.
>>
>> We've gone through a bunch of iterations of restarting the OSDs,
>> trying to bring them up one at a time, all at once, and with various
>> configuration settings to reduce cache size as suggested in this
>> ticket: http://tracker.ceph.com/issues/18924...
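
For anyone following along: the cache settings referred to there are
ceph.conf overrides in the [osd] section. Treat the option names and
values below as an illustrative sketch rather than the exact overrides we
ran; the bluestore cache options have varied between releases, so it's
worth confirming what 11.2.0 actually exposes with
"ceph daemon osd.N config show" before copying anything.

  [osd]
  # Shrink the OSD map cache; long-standing option, value is just an example.
  osd map cache size = 50
  # Bluestore cache-size knob; the name and default may differ on 11.2.0,
  # so double-check before relying on it. The value (~100 MiB) is illustrative.
  bluestore cache size = 104857600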
>>
>> I don't know if that ticket really pertains to our situation or not; I
>> have no experience with memory allocation debugging. I'd be willing to
>> try if someone can point me to a guide or walk me through the process.
>>
>> I've even tried, just to see if the situation was transitory, adding
>> over 300 GiB of swap to both OSD machines. The OSD processes managed to
>> allocate more than 300 GiB within 5-10 minutes and became oom_killer
>> victims once again.
>>
>> No software or hardware changes took place around the time this problem
>> started, and no significant data changes occurred either. We added
>> about 40 GiB of ~1 GiB files a week or so before the problem started,
>> and that's the last time data was written.
>>
>> I can only assume we've found another crippling bug of some kind; this
>> level of memory usage is entirely unprecedented. What can we do?
>>
>> Thanks in advance for any suggestions.
>> -Aaron
>>
>> --
>> Aaron Ten Clay
>> https://aarontc.com
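
P.S. Since Sage mentioned "whatever the corresponding systemd unit file
options are": my understanding is that the ulimit lines translate into a
drop-in for the OSD unit, roughly as sketched below. The path and values
are illustrative rather than a tested recipe, so corrections are welcome
if I've mapped the directives wrong.

  # /etc/systemd/system/ceph-osd@.service.d/override.conf  (illustrative path)
  [Service]
  # Equivalent of "ulimit -c unlimited": allow cores of any size.
  LimitCORE=infinity
  # "ulimit -m" corresponds to LimitRSS=, which current kernels don't
  # enforce; capping the address space (the "ulimit -v" equivalent) is what
  # actually makes the oversized allocation fail and abort. The value is
  # 40 GiB in bytes and is only an example.
  LimitAS=42949672960

After editing the drop-in, reload and restart an OSD, then pull the core
back out of systemd-coredump once it aborts:

  systemctl daemon-reload
  systemctl restart ceph-osd@<id>
  coredumpctl list ceph-osd
  coredumpctl dump <PID> -o /tmp/ceph-osd.core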