* general protection fault: 0000 [#1] SMP
@ 2017-10-11 14:40 Olivier Bonvalet
  2017-10-12  7:12 ` [ceph-users] " Ilya Dryomov
  0 siblings, 1 reply; 6+ messages in thread
From: Olivier Bonvalet @ 2017-10-11 14:40 UTC (permalink / raw)
To: Ceph Users, ceph-devel-u79uwXL29TY76Z2rM5mHXA

Hi,

I had a "general protection fault: 0000" with the Ceph RBD kernel client.
Not sure how to read the call trace; is it Ceph related?

Oct 11 16:15:11 lorunde kernel: [311418.891238] general protection fault: 0000 [#1] SMP
Oct 11 16:15:11 lorunde kernel: [311418.891855] Modules linked in: cpuid binfmt_misc nls_iso8859_1 nls_cp437 vfat fat tcp_diag inet_diag xt_physdev br_netfilter iptable_filter xen_netback loop xen_blkback cbc rbd libceph xen_gntdev xen_evtchn xenfs xen_privcmd ipmi_ssif intel_rapl iosf_mbi sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul ghash_clmulni_intel iTCO_wdt pcbc iTCO_vendor_support mxm_wmi aesni_intel aes_x86_64 crypto_simd glue_helper cryptd mgag200 i2c_algo_bit drm_kms_helper intel_rapl_perf ttm drm syscopyarea sysfillrect efi_pstore sysimgblt fb_sys_fops lpc_ich efivars mfd_core evdev ioatdma shpchp acpi_power_meter ipmi_si wmi button ipmi_devintf ipmi_msghandler bridge efivarfs ip_tables x_tables autofs4 dm_mod dax raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq
Oct 11 16:15:11 lorunde kernel: [311418.895403] libcrc32c raid1 raid0 multipath linear md_mod hid_generic usbhid i2c_i801 crc32c_intel i2c_core xhci_pci ahci ixgbe xhci_hcd libahci ehci_pci ehci_hcd libata usbcore dca ptp usb_common pps_core mdio
Oct 11 16:15:11 lorunde kernel: [311418.896551] CPU: 1 PID: 4916 Comm: kworker/1:0 Not tainted 4.13-dae-dom0 #2
Oct 11 16:15:11 lorunde kernel: [311418.897134] Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5C610.86B.01.01.0019.101220160604 10/12/2016
Oct 11 16:15:11 lorunde kernel: [311418.897745] Workqueue: ceph-msgr ceph_con_workfn [libceph]
Oct 11 16:15:11 lorunde kernel: [311418.898355] task: ffff8801ce434280 task.stack: ffffc900151bc000
Oct 11 16:15:11 lorunde kernel: [311418.899007] RIP: e030:memcpy_erms+0x6/0x10
Oct 11 16:15:11 lorunde kernel: [311418.899616] RSP: e02b:ffffc900151bfac0 EFLAGS: 00010202
Oct 11 16:15:11 lorunde kernel: [311418.900228] RAX: ffff8801b63df000 RBX: ffff88021b41be00 RCX: 0000000004df0000
Oct 11 16:15:11 lorunde kernel: [311418.900848] RDX: 0000000004df0000 RSI: 4450736e24806564 RDI: ffff8801b63df000
Oct 11 16:15:11 lorunde kernel: [311418.901479] RBP: ffffea0005fdd8c8 R08: ffff88028545d618 R09: 0000000000000010
Oct 11 16:15:11 lorunde kernel: [311418.902104] R10: 0000000000000000 R11: ffff880215815000 R12: 0000000000000000
Oct 11 16:15:11 lorunde kernel: [311418.902723] R13: ffff8802158156c0 R14: 0000000000000000 R15: ffff8801ce434280
Oct 11 16:15:11 lorunde kernel: [311418.903359] FS:  0000000000000000(0000) GS:ffff880285440000(0000) knlGS:ffff880285440000
Oct 11 16:15:11 lorunde kernel: [311418.903994] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 11 16:15:11 lorunde kernel: [311418.904627] CR2: 000055a8461cfc20 CR3: 0000000001809000 CR4: 0000000000042660
Oct 11 16:15:11 lorunde kernel: [311418.905271] Call Trace:
Oct 11 16:15:11 lorunde kernel: [311418.905909]  ? skb_copy_ubufs+0xef/0x290
Oct 11 16:15:11 lorunde kernel: [311418.906548]  ? skb_clone+0x82/0x90
Oct 11 16:15:11 lorunde kernel: [311418.907225]  ? tcp_transmit_skb+0x74/0x930
Oct 11 16:15:11 lorunde kernel: [311418.907858]  ? tcp_write_xmit+0x1bd/0xfb0
Oct 11 16:15:11 lorunde kernel: [311418.908490]  ? __sk_mem_raise_allocated+0x4e/0x220
Oct 11 16:15:11 lorunde kernel: [311418.909122]  ? __tcp_push_pending_frames+0x28/0x90
Oct 11 16:15:11 lorunde kernel: [311418.909755]  ? do_tcp_sendpages+0x4fc/0x590
Oct 11 16:15:11 lorunde kernel: [311418.910386]  ? tcp_sendpage+0x7c/0xa0
Oct 11 16:15:11 lorunde kernel: [311418.911026]  ? inet_sendpage+0x37/0xe0
Oct 11 16:15:11 lorunde kernel: [311418.911655]  ? kernel_sendpage+0x12/0x20
Oct 11 16:15:11 lorunde kernel: [311418.912297]  ? ceph_tcp_sendpage+0x5c/0xc0 [libceph]
Oct 11 16:15:11 lorunde kernel: [311418.912926]  ? ceph_tcp_recvmsg+0x53/0x70 [libceph]
Oct 11 16:15:11 lorunde kernel: [311418.913553]  ? ceph_con_workfn+0xd08/0x22a0 [libceph]
Oct 11 16:15:11 lorunde kernel: [311418.914179]  ? ceph_osdc_start_request+0x23/0x30 [libceph]
Oct 11 16:15:11 lorunde kernel: [311418.914807]  ? rbd_img_obj_request_submit+0x1ac/0x3c0 [rbd]
Oct 11 16:15:11 lorunde kernel: [311418.915458]  ? process_one_work+0x1ad/0x340
Oct 11 16:15:11 lorunde kernel: [311418.916083]  ? worker_thread+0x45/0x3f0
Oct 11 16:15:11 lorunde kernel: [311418.916706]  ? kthread+0xf2/0x130
Oct 11 16:15:11 lorunde kernel: [311418.917327]  ? process_one_work+0x340/0x340
Oct 11 16:15:11 lorunde kernel: [311418.917946]  ? kthread_create_on_node+0x40/0x40
Oct 11 16:15:11 lorunde kernel: [311418.918565]  ? do_group_exit+0x35/0xa0
Oct 11 16:15:11 lorunde kernel: [311418.919215]  ? ret_from_fork+0x25/0x30
Oct 11 16:15:11 lorunde kernel: [311418.919826] Code: 43 4e 5b eb ec eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38
Oct 11 16:15:11 lorunde kernel: [311418.921094] RIP: memcpy_erms+0x6/0x10 RSP: ffffc900151bfac0
Oct 11 16:15:11 lorunde kernel: [311418.921970] ---[ end trace 904278a63cb49fca ]---
^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [ceph-users] general protection fault: 0000 [#1] SMP
  2017-10-11 14:40 general protection fault: 0000 [#1] SMP Olivier Bonvalet
@ 2017-10-12  7:12 ` Ilya Dryomov
  [not found]   ` <CAOi1vP--q8y696g5W_AUmR9Yxe5Xop3BH3xjEQG6_pmQmXO6kA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-10-12 10:23   ` Jeff Layton
  0 siblings, 2 replies; 6+ messages in thread
From: Ilya Dryomov @ 2017-10-12 7:12 UTC (permalink / raw)
To: Olivier Bonvalet; +Cc: Ceph Users, Ceph Development

On Wed, Oct 11, 2017 at 4:40 PM, Olivier Bonvalet <ceph.list@daevel.fr> wrote:
> Hi,
>
> I had a "general protection fault: 0000" with Ceph RBD kernel client.
> Not sure how to read the call, is it Ceph related ?
>
> [full oops and call trace snipped; quoted in full in the first message of the thread]

It's a crash in memcpy() in skb_copy_ubufs().  It's not in ceph, but
ceph-induced, it looks like.  I don't remember seeing anything similar
in the context of krbd.

This is a Xen dom0 kernel, right?  What did the workload look like?
Can you provide dmesg before the crash?

Thanks,

                Ilya
^ permalink raw reply	[flat|nested] 6+ messages in thread
[parent not found: <CAOi1vP--q8y696g5W_AUmR9Yxe5Xop3BH3xjEQG6_pmQmXO6kA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re : general protection fault: 0000 [#1] SMP
  [not found] ` <CAOi1vP--q8y696g5W_AUmR9Yxe5Xop3BH3xjEQG6_pmQmXO6kA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-10-12  7:26   ` Olivier Bonvalet
  2017-10-12 13:58     ` Re : [ceph-users] " Luis Henriques
  0 siblings, 1 reply; 6+ messages in thread
From: Olivier Bonvalet @ 2017-10-12 7:26 UTC (permalink / raw)
To: Ilya Dryomov; +Cc: Ceph Users, Ceph Development

Le jeudi 12 octobre 2017 à 09:12 +0200, Ilya Dryomov a écrit :
> It's a crash in memcpy() in skb_copy_ubufs().  It's not in ceph, but
> ceph-induced, it looks like.  I don't remember seeing anything similar
> in the context of krbd.
>
> This is a Xen dom0 kernel, right?  What did the workload look like?
> Can you provide dmesg before the crash?

Hi,

yes, it's a Xen dom0 kernel: Linux 4.13.3, Xen 4.8.2, with an old
0.94.10 Ceph (so, Hammer).

Before this error, I had this in the logs:

Oct 11 16:00:41 lorunde kernel: [310548.899082] libceph: read_partial_message ffff88021a910200 data crc 2306836368 != exp. 2215155875
Oct 11 16:00:41 lorunde kernel: [310548.899841] libceph: osd117 10.0.0.31:6804 bad crc/signature
Oct 11 16:02:25 lorunde kernel: [310652.695015] libceph: read_partial_message ffff880220b10100 data crc 842840543 != exp. 2657161714
Oct 11 16:02:25 lorunde kernel: [310652.695731] libceph: osd3 10.0.0.26:6804 bad crc/signature
Oct 11 16:07:24 lorunde kernel: [310952.485202] libceph: read_partial_message ffff88025d1aa400 data crc 938978341 != exp. 4154366769
Oct 11 16:07:24 lorunde kernel: [310952.485870] libceph: osd117 10.0.0.31:6804 bad crc/signature
Oct 11 16:10:44 lorunde kernel: [311151.841812] libceph: read_partial_message ffff880260300400 data crc 2988747958 != exp. 319958859
Oct 11 16:10:44 lorunde kernel: [311151.842672] libceph: osd9 10.0.0.51:6802 bad crc/signature
Oct 11 16:10:57 lorunde kernel: [311165.211412] libceph: read_partial_message ffff8802208b8300 data crc 369498361 != exp. 906022772
Oct 11 16:10:57 lorunde kernel: [311165.212135] libceph: osd87 10.0.0.5:6800 bad crc/signature
Oct 11 16:12:27 lorunde kernel: [311254.635767] libceph: read_partial_message ffff880236f9a000 data crc 2586662963 != exp. 2886241494
Oct 11 16:12:27 lorunde kernel: [311254.636493] libceph: osd90 10.0.0.5:6814 bad crc/signature
Oct 11 16:14:31 lorunde kernel: [311378.808191] libceph: read_partial_message ffff88027e633c00 data crc 1102363051 != exp. 679243837
Oct 11 16:14:31 lorunde kernel: [311378.808889] libceph: osd13 10.0.0.21:6804 bad crc/signature
Oct 11 16:15:01 lorunde kernel: [311409.431034] libceph: read_partial_message ffff88024ce0a800 data crc 2467415342 != exp. 1753860323
Oct 11 16:15:01 lorunde kernel: [311409.431718] libceph: osd111 10.0.0.30:6804 bad crc/signature
Oct 11 16:15:11 lorunde kernel: [311418.891238] general protection fault: 0000 [#1] SMP

We had to switch to TCP Cubic (instead of a badly configured TCP BBR,
without FQ) to reduce the data crc errors.
But since we still had some errors, last night we rebooted all the OSD
nodes into Linux 4.4.91, instead of Linux 4.9.47 & 4.9.53.

For the last 7 hours we haven't had any data crc errors from the OSDs,
but we had one from a MON. Without any hang/crash.

About the workload: the Xen VMs are mainly LAMP servers, with HTTP
traffic handled by nginx or Apache, PHP, and MySQL databases.

Thanks,

Olivier
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
^ permalink raw reply	[flat|nested] 6+ messages in thread
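[Editor's note] The `read_partial_message ... data crc X != exp. Y` lines above mean libceph recomputed the CRC-32C (Castagnoli) checksum over a received message's data payload and it disagreed with the value carried in the message footer. A minimal pure-Python sketch of that kind of check (illustrative only; the kernel uses libcrc32c / crc32c_intel, and `data_crc_ok` is a hypothetical helper, not the kernel's code):

```python
# CRC-32C (Castagnoli), table-driven, reflected form -- the checksum
# behind libceph's "data crc ... != exp. ..." messages.

POLY = 0x82F63B78  # reflected Castagnoli polynomial

def _make_table():
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table

_TABLE = _make_table()

def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF                 # standard initial value
    for b in data:
        crc = (crc >> 8) ^ _TABLE[(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF          # final inversion

def data_crc_ok(payload: bytes, expected: int) -> bool:
    # read_partial_message-style check: recompute and compare to the
    # value carried in the peer's message footer
    return crc32c(payload) == expected

# Known check value for CRC-32C: crc32c(b"123456789") == 0xE3069283
```

A mismatch here only tells you the bytes that arrived differ from the bytes the checksum was computed over; it does not say where the corruption happened (sender, network, or receiver memory).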
* Re: Re : [ceph-users] general protection fault: 0000 [#1] SMP
  2017-10-12  7:26 ` Re : " Olivier Bonvalet
@ 2017-10-12 13:58   ` Luis Henriques
  0 siblings, 0 replies; 6+ messages in thread
From: Luis Henriques @ 2017-10-12 13:58 UTC (permalink / raw)
To: Olivier Bonvalet; +Cc: Jeff Layton, Ilya Dryomov, Ceph Users, Ceph Development

Olivier Bonvalet <ceph.list@daevel.fr> writes:

> Le jeudi 12 octobre 2017 à 09:12 +0200, Ilya Dryomov a écrit :
>> It's a crash in memcpy() in skb_copy_ubufs().  It's not in ceph, but
>> ceph-induced, it looks like.  I don't remember seeing anything similar
>> in the context of krbd.
>>
>> This is a Xen dom0 kernel, right?  What did the workload look like?
>> Can you provide dmesg before the crash?
>
> Hi,
>
> yes it's a Xen dom0 kernel. Linux 4.13.3, Xen 4.8.2, with an old
> 0.94.10 Ceph (so, Hammer).
>
> Before this error, I add this in logs :
>
> [libceph "bad crc/signature" log lines snipped; quoted in full upthread]
>
> We had to switch to TCP Cubic (instead of badly configured TCP BBR, without FQ), to reduce the data crc errors.
> But since we still had some errors, last night we rebooted all the OSD nodes into Linux 4.4.91, instead of Linux 4.9.47 & 4.9.53.
>
> Since the last 7 hours, we haven't got any data crc errors from OSD, but we had one from a MON. Without hang/crash.

Since there are a bunch of errors before the GPF, I suspect this bug is
related to some error paths that haven't been thoroughly tested (as is
the case for error paths in general, I guess). My initial guess was a
race in ceph_con_workfn:

- An error returned from try_read() would cause a delayed retry (in
  function con_fault()).
- con_fault_finish() would then trigger a ceph_con_close/ceph_con_open
  in osd_fault.
- The delayed retry kicks in, and the above close+open, which includes
  releasing con->in_msg and con->out_msg, could cause this GPF.

Unfortunately, I wasn't yet able to find any race there (probably
because there's none), but maybe there's a small window where this
could occur.

I wonder if this occurred only once, or if this is something that is
easily triggerable.

Cheers,
--
Luis
^ permalink raw reply	[flat|nested] 6+ messages in thread
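[Editor's note] The interleaving Luis suspects can be replayed deterministically in miniature. The sketch below is a loose Python model, not kernel code: the names only echo the libceph identifiers he mentions, and forcing the racy order by hand stands in for the narrow timing window he describes. In Python the stale access raises `TypeError`; in the kernel the analogous stale pointer would feed a wild `memcpy()` and take a GPF.

```python
# Deterministic replay of the suspected ordering (hypothetical model):
# 1) try_read() fails -> con_fault() schedules a delayed retry
# 2) con_fault_finish() -> ceph_con_close()/ceph_con_open() releases
#    con->in_msg / con->out_msg
# 3) the delayed retry fires and touches the released message

class Connection:
    def __init__(self):
        self.in_msg = bytearray(16)   # buffer the read path copies into

def con_fault(con, pending):
    # step 1: an error surfaces and a retry is queued for later
    pending.append(lambda: delayed_retry(con))

def con_fault_finish(con):
    # step 2: close+open releases the in-flight message
    con.in_msg = None

def delayed_retry(con):
    # step 3: the retry still holds the old reference
    con.in_msg[0] = 0                 # stale access happens here

def replay():
    con, pending = Connection(), []
    con_fault(con, pending)
    con_fault_finish(con)
    try:
        pending.pop()()
        return "no crash"
    except TypeError:
        return "use-after-release"
```

Running `replay()` in this forced order reproduces the use-after-release; whether the real code can actually interleave this way is exactly the open question in Luis's message.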
* Re: [ceph-users] general protection fault: 0000 [#1] SMP
  2017-10-12  7:12 ` [ceph-users] " Ilya Dryomov
  [not found]   ` <CAOi1vP--q8y696g5W_AUmR9Yxe5Xop3BH3xjEQG6_pmQmXO6kA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-10-12 10:23   ` Jeff Layton
  [not found]     ` <1507803838.5310.9.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 6+ messages in thread
From: Jeff Layton @ 2017-10-12 10:23 UTC (permalink / raw)
To: Ilya Dryomov, Olivier Bonvalet; +Cc: Ceph Users, Ceph Development

On Thu, 2017-10-12 at 09:12 +0200, Ilya Dryomov wrote:
> On Wed, Oct 11, 2017 at 4:40 PM, Olivier Bonvalet <ceph.list@daevel.fr> wrote:
> > Hi,
> >
> > I had a "general protection fault: 0000" with Ceph RBD kernel client.
> > Not sure how to read the call, is it Ceph related ?
> >
> > [full oops and call trace snipped; quoted in full in the first message of the thread]
>
> It's a crash in memcpy() in skb_copy_ubufs().  It's not in ceph, but
> ceph-induced, it looks like.  I don't remember seeing anything similar
> in the context of krbd.
>
> This is a Xen dom0 kernel, right?  What did the workload look like?
> Can you provide dmesg before the crash?
>

...and to be clear:

(gdb) list *(memcpy_erms+0x6)
0xffffffff8188f136 is at arch/x86/lib/memcpy_64.S:54.
49       * simpler than memcpy. Use memcpy_erms when possible.
50       */
51      ENTRY(memcpy_erms)
52              movq %rdi, %rax
53              movq %rdx, %rcx
54              rep movsb
55              ret
56      ENDPROC(memcpy_erms)
57
58      ENTRY(memcpy_orig)

So either %rsi or %rdi held a bogus address at the time of the crash,
most likely. If you have a vmcore, you may be able to dig in with crash
and tell which address it was, and trace back up the call stack to where
it came from.

That said... %rcx looks quite large -- 81723392 bytes still to go in the
copy. This might be a case where the copy length got screwed up somehow
and it overran its bounds.
--
Jeff Layton <jlayton@redhat.com>
^ permalink raw reply	[flat|nested] 6+ messages in thread
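[Editor's note] Jeff's two observations about the register dump are easy to check by hand. A short sketch (the "mostly printable" reading of %rsi is an informal heuristic, not something the oops itself proves):

```python
# The arithmetic behind Jeff's reading of the register dump.

rcx = 0x0000000004df0000            # RCX at fault time: bytes still to copy
assert rcx == 81723392              # Jeff's "81723392 bytes still to go"

rsi = 0x4450736e24806564            # RSI: the bogus source pointer
raw = rsi.to_bytes(8, "little")     # the pointer's bytes as stored in memory
printable = sum(0x20 <= b < 0x7F for b in raw)
# 7 of the 8 bytes decode as printable ASCII ('d', 'e', '$', 'n', 's',
# 'P', 'D'); a "pointer" that decodes to text is a classic hint that
# payload data was written over a pointer field.
assert printable == 7
```

This is consistent with Ilya's later "src-side bustage" suspicion: the destination (%rdi == %rax, a plausible kernel address) looks sane, while the source pointer looks like data.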
[parent not found: <1507803838.5310.9.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: general protection fault: 0000 [#1] SMP
  [not found] ` <1507803838.5310.9.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-10-12 10:50   ` Ilya Dryomov
  0 siblings, 0 replies; 6+ messages in thread
From: Ilya Dryomov @ 2017-10-12 10:50 UTC (permalink / raw)
To: Jeff Layton; +Cc: Ceph Development, Ceph Users

On Thu, Oct 12, 2017 at 12:23 PM, Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Thu, 2017-10-12 at 09:12 +0200, Ilya Dryomov wrote:
>> On Wed, Oct 11, 2017 at 4:40 PM, Olivier Bonvalet <ceph.list-PaEMFeTk6C1QFI55V6+gNQ@public.gmane.org> wrote:
>> > Hi,
>> >
>> > I had a "general protection fault: 0000" with Ceph RBD kernel client.
>> > Not sure how to read the call, is it Ceph related ?
>> >
>> > [full oops and call trace snipped; quoted in full in the first message of the thread]
>>
>> It's a crash in memcpy() in skb_copy_ubufs().  It's not in ceph, but
>> ceph-induced, it looks like.  I don't remember seeing anything similar
>> in the context of krbd.
>>
>> This is a Xen dom0 kernel, right?  What did the workload look like?
>> Can you provide dmesg before the crash?
>>
>
> ...and to be clear:
>
> (gdb) list *(memcpy_erms+0x6)
> 0xffffffff8188f136 is at arch/x86/lib/memcpy_64.S:54.
> 49       * simpler than memcpy. Use memcpy_erms when possible.
> 50       */
> 51      ENTRY(memcpy_erms)
> 52              movq %rdi, %rax
> 53              movq %rdx, %rcx
> 54              rep movsb
> 55              ret
> 56      ENDPROC(memcpy_erms)
> 57
> 58      ENTRY(memcpy_orig)
>
> So either %rsi or %rdi held a bogus address at the time of the crash,
> most likely. If you have a vmcore, you may be able to dig in with crash
> and tell which address it was, and trace back up the call stack to where
> it came from.

I suspect src-side bustage.

> That said... %rcx looks quite large -- 81723392 bytes still to go in the
> copy. This might be a case where the copy length got screwed up somehow
> and it overran its bounds.

Yeah, suspiciously large.  I don't think it copied a single byte
though: %rcx never got decremented.

Thanks,

                Ilya
^ permalink raw reply	[flat|nested] 6+ messages in thread
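[Editor's note] Ilya's closing point follows from `rep movsb` semantics: the instruction copies one byte per iteration and counts %rcx down as it goes, so %rcx still equal to %rdx (the full length, loaded at `movq %rdx, %rcx`) means the very first byte access faulted. A byte-wise model (a sketch; `rep_movsb` is a hypothetical helper, and an `IndexError` stands in for the GPF on a wild %rsi):

```python
def rep_movsb(dst, src, count):
    """Byte-wise model of `rep movsb`: %rcx counts down one per byte."""
    rcx = count
    try:
        for i in range(count):
            dst[i] = src[i]     # one byte copied per iteration
            rcx -= 1            # hardware decrements %rcx after each byte
    except IndexError:          # stand-in for faulting on a bogus source
        pass
    return rcx                  # bytes still to go when the copy stopped

# A source that faults immediately leaves rcx untouched, matching the
# oops (RCX == RDX == 0x4df0000):
assert rep_movsb(bytearray(4), b"", 4) == 4
# A good source runs rcx down to zero:
assert rep_movsb(bytearray(4), b"abcd", 4) == 0
```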
end of thread, other threads:[~2017-10-12 13:58 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-11 14:40 general protection fault: 0000 [#1] SMP Olivier Bonvalet
2017-10-12  7:12 ` [ceph-users] " Ilya Dryomov
[not found]   ` <CAOi1vP--q8y696g5W_AUmR9Yxe5Xop3BH3xjEQG6_pmQmXO6kA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-10-12  7:26     ` Re : " Olivier Bonvalet
2017-10-12 13:58       ` Re : [ceph-users] " Luis Henriques
2017-10-12 10:23   ` Jeff Layton
[not found]     ` <1507803838.5310.9.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-10-12 10:50       ` Ilya Dryomov