From: Sage Weil
Subject: Re: [ceph-users] Extremely high OSD memory utilization on Kraken 11.2.0 (with XFS -or- bluestore)
Date: Tue, 16 May 2017 01:35:51 +0000 (UTC)
To: Aaron Ten Clay
Cc: ceph-devel@vger.kernel.org

On Mon, 15 May 2017, Aaron Ten Clay wrote:
> Hi Sage,
>
> No problem. I thought this would take a lot longer to resolve so I
> waited to find a good chunk of time, then it only took a few minutes!
>
> Here are the respective backtrace outputs from gdb:
>
> https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.6742.1492634493000000000000.backtrace.txt
> https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.7202.1492634508000000000000.backtrace.txt

Looks like it's in BlueFS replay. Can you reproduce with 'log max recent = 1'
and 'debug bluefs = 20'?

It's weird... the symptom is eating RAM, but it's hitting an assert
during replay on mount...
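In case it's useful, a minimal sketch of where those two options could go,
assuming they are added to ceph.conf on the OSD host and the daemon is then
restarted so it goes through the mount-time replay again (the [osd] section
placement and the restart command are assumptions; the OSD id is a
placeholder):

    [osd]
        debug bluefs = 20
        log max recent = 1

    # restart the affected OSD to trigger the mount/replay path again
    systemctl restart ceph-osd@<id>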
Thanks!
sage

>
> Hope that helps!
>
> -Aaron
>
>
> On Thu, May 4, 2017 at 2:25 PM, Sage Weil wrote:
> > Hi Aaron-
> >
> > Sorry, lost track of this one. In order to get backtraces out of the core
> > you need the matching executables. Can you make sure the ceph-osd-dbg or
> > ceph-debuginfo package is installed on the machine (depending on whether
> > it's deb or rpm), and then run gdb ceph-osd corefile and 'thr app all bt'?
> >
> > Thanks!
> > sage
> >
> >
> > On Thu, 4 May 2017, Aaron Ten Clay wrote:
> >
> >> Were the backtraces we obtained not useful? Is there anything else we
> >> can try to get the OSDs up again?
> >>
> >> On Wed, Apr 19, 2017 at 4:18 PM, Aaron Ten Clay wrote:
> >> > I'm new to doing this all via systemd and systemd-coredump, but I appear
> >> > to have gotten cores from two OSD processes. When xzipped they are
> >> > < 2 MiB each, but I threw them on my webserver to avoid polluting the
> >> > mailing list. This seems oddly small, so if I've botched the process
> >> > somehow let me know :)
> >> >
> >> > https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.6742.1492634493000000000000.xz
> >> > https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.7202.1492634508000000000000.xz
> >> >
> >> > And for reference:
> >> > root@osd001:/var/lib/systemd/coredump# ceph -v
> >> > ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)
> >> >
> >> > I am also investigating sysdig as recommended.
> >> >
> >> > Thanks!
> >> > -Aaron
> >> >
> >> > On Mon, Apr 17, 2017 at 8:15 AM, Sage Weil wrote:
> >> >>
> >> >> On Sat, 15 Apr 2017, Aaron Ten Clay wrote:
> >> >> > Hi all,
> >> >> >
> >> >> > Our cluster is experiencing a very odd issue and I'm hoping for some
> >> >> > guidance on troubleshooting steps and/or suggestions to mitigate the
> >> >> > issue. tl;dr: Individual ceph-osd processes try to allocate > 90GiB
> >> >> > of RAM and are eventually nuked by oom_killer.
> >> >>
> >> >> My guess is that there is a bug in a decoding path and it's trying to
> >> >> allocate some huge amount of memory. Can you try setting a memory
> >> >> ulimit to something like 40gb and then enabling core dumps so you can
> >> >> get a core? Something like
> >> >>
> >> >> ulimit -c unlimited
> >> >> ulimit -m 20000000
> >> >>
> >> >> or whatever the corresponding systemd unit file options are...
> >> >>
> >> >> Once we have a core file it will hopefully be clear who is doing the
> >> >> bad allocation...
> >> >>
> >> >> sage
> >> >>
> >> >> >
> >> >> > I'll try to explain the situation in detail:
> >> >> >
> >> >> > We have 24x 4TB bluestore HDD OSDs and 4x 600GB SSD OSDs. The SSD
> >> >> > OSDs are in a different CRUSH "root", used as a cache tier for the
> >> >> > main storage pools, which are erasure coded and used for cephfs. The
> >> >> > OSDs are spread across two identical machines with 128GiB of RAM
> >> >> > each, and there are three monitor nodes on different hardware.
> >> >> >
> >> >> > Several times we've encountered crippling bugs with previous Ceph
> >> >> > releases when we were on RCs or betas, or using non-recommended
> >> >> > configurations, so in January we abandoned all previous Ceph usage,
> >> >> > deployed LTS Ubuntu 16.04, and went with stable Kraken 11.2.0 with
> >> >> > the configuration mentioned above. Everything was fine until the end
> >> >> > of March, when one day we found all but a couple of OSDs "down"
> >> >> > inexplicably. Investigation revealed that oom_killer had come along
> >> >> > and nuked almost all the ceph-osd processes.
> >> >> >
> >> >> > We've gone through a bunch of iterations of restarting the OSDs:
> >> >> > trying to bring them up one at a time gradually, all at once, and
> >> >> > with various configuration settings to reduce cache size, as
> >> >> > suggested in this ticket: http://tracker.ceph.com/issues/18924...
> >> >> >
> >> >> > I don't know if that ticket really pertains to our situation or not;
> >> >> > I have no experience with memory allocation debugging. I'd be willing
> >> >> > to try if someone can point me to a guide or walk me through the
> >> >> > process.
> >> >> >
> >> >> > I've even tried, just to see if the situation was transitory, adding
> >> >> > over 300GiB of swap to both OSD machines. The OSD procs managed, in a
> >> >> > matter of 5-10 minutes, to create more than 300GiB of memory pressure
> >> >> > and became oom_killer victims once again.
> >> >> >
> >> >> > No software or hardware changes took place around the time this
> >> >> > problem started, and no significant data changes occurred either. We
> >> >> > added about 40GiB of ~1GiB files a week or so before the problem
> >> >> > started, and that's the last time data was written.
> >> >> >
> >> >> > I can only assume we've found another crippling bug of some kind;
> >> >> > this level of memory usage is entirely unprecedented. What can we do?
> >> >> >
> >> >> > Thanks in advance for any suggestions.
> >> >> > -Aaron
> >> >
> >> > --
> >> > Aaron Ten Clay
> >> > https://aarontc.com
> >>
> >> --
> >> Aaron Ten Clay
> >> https://aarontc.com
>
> --
> Aaron Ten Clay
> https://aarontc.com
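On the "corresponding systemd unit file options" mentioned earlier in the
thread: a rough sketch of a drop-in override, assuming the stock
ceph-osd@.service unit. The file path, unit name, and the 40G figure are
illustrative only, and note that MemoryLimit= is a cgroup cap, so hitting it
gets the process OOM-killed rather than producing a failed allocation:

    # /etc/systemd/system/ceph-osd@.service.d/override.conf  (illustrative path)
    [Service]
    # allow core dumps (equivalent of 'ulimit -c unlimited')
    LimitCORE=infinity
    # cap memory so a runaway OSD is stopped before it takes the whole box down
    MemoryLimit=40G

    # then reload systemd and restart the affected OSD:
    systemctl daemon-reload
    systemctl restart ceph-osd@<id>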