From mboxrd@z Thu Jan 1 00:00:00 1970 From: Luis Henriques Subject: Re: =?utf-8?Q?Re=C2=A0=3A?= [ceph-users] general protection fault: 0000 [#1] SMP Date: Thu, 12 Oct 2017 14:58:47 +0100 Message-ID: <87lgkgcvs8.fsf@suse.com> References: <1507732840.2812.10.camel@daevel.fr> <1507793194.26826.8.camel@daevel.fr> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8BIT Return-path: Received: from mx2.suse.de ([195.135.220.15]:40648 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751138AbdJLN6u (ORCPT ); Thu, 12 Oct 2017 09:58:50 -0400 In-Reply-To: <1507793194.26826.8.camel@daevel.fr> (Olivier Bonvalet's message of "Thu, 12 Oct 2017 09:26:34 +0200") Sender: ceph-devel-owner@vger.kernel.org List-ID: To: Olivier Bonvalet Cc: Jeff Layton , Ilya Dryomov , Ceph Users , Ceph Development Olivier Bonvalet writes: > Le jeudi 12 octobre 2017 à 09:12 +0200, Ilya Dryomov a écrit : >> It's a crash in memcpy() in skb_copy_ubufs(). It's not in ceph, but >> ceph-induced, it looks like. I don't remember seeing anything >> similar >> in the context of krbd. >> >> This is a Xen dom0 kernel, right? What did the workload look like? >> Can you provide dmesg before the crash? > > Hi, > > yes it's a Xen dom0 kernel. Linux 4.13.3, Xen 4.8.2, with an old > 0.94.10 Ceph (so, Hammer). > > Before this error, I add this in logs : > > Oct 11 16:00:41 lorunde kernel: [310548.899082] libceph: read_partial_message ffff88021a910200 data crc 2306836368 != exp. 2215155875 > Oct 11 16:00:41 lorunde kernel: [310548.899841] libceph: osd117 10.0.0.31:6804 bad crc/signature > Oct 11 16:02:25 lorunde kernel: [310652.695015] libceph: read_partial_message ffff880220b10100 data crc 842840543 != exp. 2657161714 > Oct 11 16:02:25 lorunde kernel: [310652.695731] libceph: osd3 10.0.0.26:6804 bad crc/signature > Oct 11 16:07:24 lorunde kernel: [310952.485202] libceph: read_partial_message ffff88025d1aa400 data crc 938978341 != exp. 4154366769 > Oct 11 16:07:24 lorunde kernel: [310952.485870] libceph: osd117 10.0.0.31:6804 bad crc/signature > Oct 11 16:10:44 lorunde kernel: [311151.841812] libceph: read_partial_message ffff880260300400 data crc 2988747958 != exp. 319958859 > Oct 11 16:10:44 lorunde kernel: [311151.842672] libceph: osd9 10.0.0.51:6802 bad crc/signature > Oct 11 16:10:57 lorunde kernel: [311165.211412] libceph: read_partial_message ffff8802208b8300 data crc 369498361 != exp. 906022772 > Oct 11 16:10:57 lorunde kernel: [311165.212135] libceph: osd87 10.0.0.5:6800 bad crc/signature > Oct 11 16:12:27 lorunde kernel: [311254.635767] libceph: read_partial_message ffff880236f9a000 data crc 2586662963 != exp. 2886241494 > Oct 11 16:12:27 lorunde kernel: [311254.636493] libceph: osd90 10.0.0.5:6814 bad crc/signature > Oct 11 16:14:31 lorunde kernel: [311378.808191] libceph: read_partial_message ffff88027e633c00 data crc 1102363051 != exp. 679243837 > Oct 11 16:14:31 lorunde kernel: [311378.808889] libceph: osd13 10.0.0.21:6804 bad crc/signature > Oct 11 16:15:01 lorunde kernel: [311409.431034] libceph: read_partial_message ffff88024ce0a800 data crc 2467415342 != exp. 1753860323 > Oct 11 16:15:01 lorunde kernel: [311409.431718] libceph: osd111 10.0.0.30:6804 bad crc/signature > Oct 11 16:15:11 lorunde kernel: [311418.891238] general protection fault: 0000 [#1] SMP > > > We had to switch to TCP Cubic (instead of badly configured TCP BBR, without FQ), to reduce the data crc errors. > But since we still had some errors, last night we rebooted all the OSD nodes in Linux 4.4.91, instead of Linux 4.9.47 & 4.9.53. > > Since the last 7 hours, we haven't got any data crc errors from OSD, but we had one from a MON. Without hang/crash. Since there are a bunch of errors before the GPF I suspect this bug is related to some error paths that haven't been thoroughly tested (as it is the case for error paths in general I guess). My initial guess was a race in ceph_con_workfn: - An error returned from try_read() would cause a delayed retry (in function con_fault()) - con_fault_finish() would then trigger a ceph_con_close/ceph_con_open in osd_fault. - the delayed retry kicks-in and the above close+open, which includes releasing con->in_msg and con->out_msg, could cause this GPF. Unfortunately, I wasn't yet able to find any race there (probably because there's none), but maybe there's a small window where this could occur. I wonder if this occurred only once, or if this is something that is easily triggerable. Cheers, -- Luis