* general protection fault: 0000 [#1] SMP
@ 2017-10-11 14:40 Olivier Bonvalet
  2017-10-12  7:12 ` [ceph-users] " Ilya Dryomov
  0 siblings, 1 reply; 6+ messages in thread
From: Olivier Bonvalet @ 2017-10-11 14:40 UTC (permalink / raw)
  To: Ceph Users, ceph-devel-u79uwXL29TY76Z2rM5mHXA

Hi,

I had a "general protection fault: 0000" with Ceph RBD kernel client.
Not sure how to read the call, is it Ceph related ?


Oct 11 16:15:11 lorunde kernel: [311418.891238] general protection fault: 0000 [#1] SMP
Oct 11 16:15:11 lorunde kernel: [311418.891855] Modules linked in: cpuid binfmt_misc nls_iso8859_1 nls_cp437 vfat fat tcp_diag inet_diag xt_physdev br_netfilter iptable_filter xen_netback loop xen_blkback cbc rbd libceph xen_gntdev xen_evtchn xenfs xen_privcmd ipmi_ssif intel_rapl iosf_mbi sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul ghash_clmulni_intel iTCO_wdt pcbc iTCO_vendor_support mxm_wmi aesni_intel aes_x86_64 crypto_simd glue_helper cryptd mgag200 i2c_algo_bit drm_kms_helper intel_rapl_perf ttm drm syscopyarea sysfillrect efi_pstore sysimgblt fb_sys_fops lpc_ich efivars mfd_core evdev ioatdma shpchp acpi_power_meter ipmi_si wmi button ipmi_devintf ipmi_msghandler bridge efivarfs ip_tables x_tables autofs4 dm_mod dax raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq
Oct 11 16:15:11 lorunde kernel: [311418.895403]  libcrc32c raid1 raid0 multipath linear md_mod hid_generic usbhid i2c_i801 crc32c_intel i2c_core xhci_pci ahci ixgbe xhci_hcd libahci ehci_pci ehci_hcd libata usbcore dca ptp usb_common pps_core mdio
Oct 11 16:15:11 lorunde kernel: [311418.896551] CPU: 1 PID: 4916 Comm: kworker/1:0 Not tainted 4.13-dae-dom0 #2
Oct 11 16:15:11 lorunde kernel: [311418.897134] Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5C610.86B.01.01.0019.101220160604 10/12/2016
Oct 11 16:15:11 lorunde kernel: [311418.897745] Workqueue: ceph-msgr ceph_con_workfn [libceph]
Oct 11 16:15:11 lorunde kernel: [311418.898355] task: ffff8801ce434280 task.stack: ffffc900151bc000
Oct 11 16:15:11 lorunde kernel: [311418.899007] RIP: e030:memcpy_erms+0x6/0x10
Oct 11 16:15:11 lorunde kernel: [311418.899616] RSP: e02b:ffffc900151bfac0 EFLAGS: 00010202
Oct 11 16:15:11 lorunde kernel: [311418.900228] RAX: ffff8801b63df000 RBX: ffff88021b41be00 RCX: 0000000004df0000
Oct 11 16:15:11 lorunde kernel: [311418.900848] RDX: 0000000004df0000 RSI: 4450736e24806564 RDI: ffff8801b63df000
Oct 11 16:15:11 lorunde kernel: [311418.901479] RBP: ffffea0005fdd8c8 R08: ffff88028545d618 R09: 0000000000000010
Oct 11 16:15:11 lorunde kernel: [311418.902104] R10: 0000000000000000 R11: ffff880215815000 R12: 0000000000000000
Oct 11 16:15:11 lorunde kernel: [311418.902723] R13: ffff8802158156c0 R14: 0000000000000000 R15: ffff8801ce434280
Oct 11 16:15:11 lorunde kernel: [311418.903359] FS:  0000000000000000(0000) GS:ffff880285440000(0000) knlGS:ffff880285440000
Oct 11 16:15:11 lorunde kernel: [311418.903994] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 11 16:15:11 lorunde kernel: [311418.904627] CR2: 000055a8461cfc20 CR3: 0000000001809000 CR4: 0000000000042660
Oct 11 16:15:11 lorunde kernel: [311418.905271] Call Trace:
Oct 11 16:15:11 lorunde kernel: [311418.905909]  ? skb_copy_ubufs+0xef/0x290
Oct 11 16:15:11 lorunde kernel: [311418.906548]  ? skb_clone+0x82/0x90
Oct 11 16:15:11 lorunde kernel: [311418.907225]  ? tcp_transmit_skb+0x74/0x930
Oct 11 16:15:11 lorunde kernel: [311418.907858]  ? tcp_write_xmit+0x1bd/0xfb0
Oct 11 16:15:11 lorunde kernel: [311418.908490]  ? __sk_mem_raise_allocated+0x4e/0x220
Oct 11 16:15:11 lorunde kernel: [311418.909122]  ? __tcp_push_pending_frames+0x28/0x90
Oct 11 16:15:11 lorunde kernel: [311418.909755]  ? do_tcp_sendpages+0x4fc/0x590
Oct 11 16:15:11 lorunde kernel: [311418.910386]  ? tcp_sendpage+0x7c/0xa0
Oct 11 16:15:11 lorunde kernel: [311418.911026]  ? inet_sendpage+0x37/0xe0
Oct 11 16:15:11 lorunde kernel: [311418.911655]  ? kernel_sendpage+0x12/0x20
Oct 11 16:15:11 lorunde kernel: [311418.912297]  ? ceph_tcp_sendpage+0x5c/0xc0 [libceph]
Oct 11 16:15:11 lorunde kernel: [311418.912926]  ? ceph_tcp_recvmsg+0x53/0x70 [libceph]
Oct 11 16:15:11 lorunde kernel: [311418.913553]  ? ceph_con_workfn+0xd08/0x22a0 [libceph]
Oct 11 16:15:11 lorunde kernel: [311418.914179]  ? ceph_osdc_start_request+0x23/0x30 [libceph]
Oct 11 16:15:11 lorunde kernel: [311418.914807]  ? rbd_img_obj_request_submit+0x1ac/0x3c0 [rbd]
Oct 11 16:15:11 lorunde kernel: [311418.915458]  ? process_one_work+0x1ad/0x340
Oct 11 16:15:11 lorunde kernel: [311418.916083]  ? worker_thread+0x45/0x3f0
Oct 11 16:15:11 lorunde kernel: [311418.916706]  ? kthread+0xf2/0x130
Oct 11 16:15:11 lorunde kernel: [311418.917327]  ? process_one_work+0x340/0x340
Oct 11 16:15:11 lorunde kernel: [311418.917946]  ? kthread_create_on_node+0x40/0x40
Oct 11 16:15:11 lorunde kernel: [311418.918565]  ? do_group_exit+0x35/0xa0
Oct 11 16:15:11 lorunde kernel: [311418.919215]  ? ret_from_fork+0x25/0x30
Oct 11 16:15:11 lorunde kernel: [311418.919826] Code: 43 4e 5b eb ec eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 
Oct 11 16:15:11 lorunde kernel: [311418.921094] RIP: memcpy_erms+0x6/0x10 RSP: ffffc900151bfac0
Oct 11 16:15:11 lorunde kernel: [311418.921970] ---[ end trace 904278a63cb49fca ]---


* Re: [ceph-users] general protection fault: 0000 [#1] SMP
  2017-10-11 14:40 general protection fault: 0000 [#1] SMP Olivier Bonvalet
@ 2017-10-12  7:12 ` Ilya Dryomov
       [not found]   ` <CAOi1vP--q8y696g5W_AUmR9Yxe5Xop3BH3xjEQG6_pmQmXO6kA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-10-12 10:23   ` Jeff Layton
  0 siblings, 2 replies; 6+ messages in thread
From: Ilya Dryomov @ 2017-10-12  7:12 UTC (permalink / raw)
  To: Olivier Bonvalet; +Cc: Ceph Users, Ceph Development

On Wed, Oct 11, 2017 at 4:40 PM, Olivier Bonvalet <ceph.list@daevel.fr> wrote:
> Hi,
>
> I had a "general protection fault: 0000" with the Ceph RBD kernel client.
> Not sure how to read the call trace; is it Ceph related?
>
>
> Oct 11 16:15:11 lorunde kernel: [311418.891238] general protection fault: 0000 [#1] SMP
> Oct 11 16:15:11 lorunde kernel: [311418.891855] Modules linked in: cpuid binfmt_misc nls_iso8859_1 nls_cp437 vfat fat tcp_diag inet_diag xt_physdev br_netfilter iptable_filter xen_netback loop xen_blkback cbc rbd libceph xen_gntdev xen_evtchn xenfs xen_privcmd ipmi_ssif intel_rapl iosf_mbi sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul ghash_clmulni_intel iTCO_wdt pcbc iTCO_vendor_support mxm_wmi aesni_intel aes_x86_64 crypto_simd glue_helper cryptd mgag200 i2c_algo_bit drm_kms_helper intel_rapl_perf ttm drm syscopyarea sysfillrect efi_pstore sysimgblt fb_sys_fops lpc_ich efivars mfd_core evdev ioatdma shpchp acpi_power_meter ipmi_si wmi button ipmi_devintf ipmi_msghandler bridge efivarfs ip_tables x_tables autofs4 dm_mod dax raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq
> Oct 11 16:15:11 lorunde kernel: [311418.895403]  libcrc32c raid1 raid0 multipath linear md_mod hid_generic usbhid i2c_i801 crc32c_intel i2c_core xhci_pci ahci ixgbe xhci_hcd libahci ehci_pci ehci_hcd libata usbcore dca ptp usb_common pps_core mdio
> Oct 11 16:15:11 lorunde kernel: [311418.896551] CPU: 1 PID: 4916 Comm: kworker/1:0 Not tainted 4.13-dae-dom0 #2
> Oct 11 16:15:11 lorunde kernel: [311418.897134] Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5C610.86B.01.01.0019.101220160604 10/12/2016
> Oct 11 16:15:11 lorunde kernel: [311418.897745] Workqueue: ceph-msgr ceph_con_workfn [libceph]
> Oct 11 16:15:11 lorunde kernel: [311418.898355] task: ffff8801ce434280 task.stack: ffffc900151bc000
> Oct 11 16:15:11 lorunde kernel: [311418.899007] RIP: e030:memcpy_erms+0x6/0x10
> Oct 11 16:15:11 lorunde kernel: [311418.899616] RSP: e02b:ffffc900151bfac0 EFLAGS: 00010202
> Oct 11 16:15:11 lorunde kernel: [311418.900228] RAX: ffff8801b63df000 RBX: ffff88021b41be00 RCX: 0000000004df0000
> Oct 11 16:15:11 lorunde kernel: [311418.900848] RDX: 0000000004df0000 RSI: 4450736e24806564 RDI: ffff8801b63df000
> Oct 11 16:15:11 lorunde kernel: [311418.901479] RBP: ffffea0005fdd8c8 R08: ffff88028545d618 R09: 0000000000000010
> Oct 11 16:15:11 lorunde kernel: [311418.902104] R10: 0000000000000000 R11: ffff880215815000 R12: 0000000000000000
> Oct 11 16:15:11 lorunde kernel: [311418.902723] R13: ffff8802158156c0 R14: 0000000000000000 R15: ffff8801ce434280
> Oct 11 16:15:11 lorunde kernel: [311418.903359] FS:  0000000000000000(0000) GS:ffff880285440000(0000) knlGS:ffff880285440000
> Oct 11 16:15:11 lorunde kernel: [311418.903994] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> Oct 11 16:15:11 lorunde kernel: [311418.904627] CR2: 000055a8461cfc20 CR3: 0000000001809000 CR4: 0000000000042660
> Oct 11 16:15:11 lorunde kernel: [311418.905271] Call Trace:
> Oct 11 16:15:11 lorunde kernel: [311418.905909]  ? skb_copy_ubufs+0xef/0x290
> Oct 11 16:15:11 lorunde kernel: [311418.906548]  ? skb_clone+0x82/0x90
> Oct 11 16:15:11 lorunde kernel: [311418.907225]  ? tcp_transmit_skb+0x74/0x930
> Oct 11 16:15:11 lorunde kernel: [311418.907858]  ? tcp_write_xmit+0x1bd/0xfb0
> Oct 11 16:15:11 lorunde kernel: [311418.908490]  ? __sk_mem_raise_allocated+0x4e/0x220
> Oct 11 16:15:11 lorunde kernel: [311418.909122]  ? __tcp_push_pending_frames+0x28/0x90
> Oct 11 16:15:11 lorunde kernel: [311418.909755]  ? do_tcp_sendpages+0x4fc/0x590
> Oct 11 16:15:11 lorunde kernel: [311418.910386]  ? tcp_sendpage+0x7c/0xa0
> Oct 11 16:15:11 lorunde kernel: [311418.911026]  ? inet_sendpage+0x37/0xe0
> Oct 11 16:15:11 lorunde kernel: [311418.911655]  ? kernel_sendpage+0x12/0x20
> Oct 11 16:15:11 lorunde kernel: [311418.912297]  ? ceph_tcp_sendpage+0x5c/0xc0 [libceph]
> Oct 11 16:15:11 lorunde kernel: [311418.912926]  ? ceph_tcp_recvmsg+0x53/0x70 [libceph]
> Oct 11 16:15:11 lorunde kernel: [311418.913553]  ? ceph_con_workfn+0xd08/0x22a0 [libceph]
> Oct 11 16:15:11 lorunde kernel: [311418.914179]  ? ceph_osdc_start_request+0x23/0x30 [libceph]
> Oct 11 16:15:11 lorunde kernel: [311418.914807]  ? rbd_img_obj_request_submit+0x1ac/0x3c0 [rbd]
> Oct 11 16:15:11 lorunde kernel: [311418.915458]  ? process_one_work+0x1ad/0x340
> Oct 11 16:15:11 lorunde kernel: [311418.916083]  ? worker_thread+0x45/0x3f0
> Oct 11 16:15:11 lorunde kernel: [311418.916706]  ? kthread+0xf2/0x130
> Oct 11 16:15:11 lorunde kernel: [311418.917327]  ? process_one_work+0x340/0x340
> Oct 11 16:15:11 lorunde kernel: [311418.917946]  ? kthread_create_on_node+0x40/0x40
> Oct 11 16:15:11 lorunde kernel: [311418.918565]  ? do_group_exit+0x35/0xa0
> Oct 11 16:15:11 lorunde kernel: [311418.919215]  ? ret_from_fork+0x25/0x30
> Oct 11 16:15:11 lorunde kernel: [311418.919826] Code: 43 4e 5b eb ec eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38
> Oct 11 16:15:11 lorunde kernel: [311418.921094] RIP: memcpy_erms+0x6/0x10 RSP: ffffc900151bfac0
> Oct 11 16:15:11 lorunde kernel: [311418.921970] ---[ end trace 904278a63cb49fca ]---

It's a crash in memcpy() in skb_copy_ubufs().  It's not in ceph, but
ceph-induced, it looks like.  I don't remember seeing anything similar
in the context of krbd.

This is a Xen dom0 kernel, right?  What did the workload look like?
Can you provide dmesg before the crash?

Thanks,

                Ilya


* Re :  general protection fault: 0000 [#1] SMP
       [not found]   ` <CAOi1vP--q8y696g5W_AUmR9Yxe5Xop3BH3xjEQG6_pmQmXO6kA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-10-12  7:26     ` Olivier Bonvalet
  2017-10-12 13:58       ` Re : [ceph-users] " Luis Henriques
  0 siblings, 1 reply; 6+ messages in thread
From: Olivier Bonvalet @ 2017-10-12  7:26 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Ceph Users, Ceph Development

On Thursday 12 October 2017 at 09:12 +0200, Ilya Dryomov wrote:
> It's a crash in memcpy() in skb_copy_ubufs().  It's not in ceph, but
> ceph-induced, it looks like.  I don't remember seeing anything
> similar
> in the context of krbd.
> 
> This is a Xen dom0 kernel, right?  What did the workload look like?
> Can you provide dmesg before the crash?

Hi,

Yes, it's a Xen dom0 kernel: Linux 4.13.3, Xen 4.8.2, with an old
Ceph 0.94.10 (so, Hammer).

Before this error, I had this in the logs:

Oct 11 16:00:41 lorunde kernel: [310548.899082] libceph: read_partial_message ffff88021a910200 data crc 2306836368 != exp. 2215155875
Oct 11 16:00:41 lorunde kernel: [310548.899841] libceph: osd117 10.0.0.31:6804 bad crc/signature
Oct 11 16:02:25 lorunde kernel: [310652.695015] libceph: read_partial_message ffff880220b10100 data crc 842840543 != exp. 2657161714
Oct 11 16:02:25 lorunde kernel: [310652.695731] libceph: osd3 10.0.0.26:6804 bad crc/signature
Oct 11 16:07:24 lorunde kernel: [310952.485202] libceph: read_partial_message ffff88025d1aa400 data crc 938978341 != exp. 4154366769
Oct 11 16:07:24 lorunde kernel: [310952.485870] libceph: osd117 10.0.0.31:6804 bad crc/signature
Oct 11 16:10:44 lorunde kernel: [311151.841812] libceph: read_partial_message ffff880260300400 data crc 2988747958 != exp. 319958859
Oct 11 16:10:44 lorunde kernel: [311151.842672] libceph: osd9 10.0.0.51:6802 bad crc/signature
Oct 11 16:10:57 lorunde kernel: [311165.211412] libceph: read_partial_message ffff8802208b8300 data crc 369498361 != exp. 906022772
Oct 11 16:10:57 lorunde kernel: [311165.212135] libceph: osd87 10.0.0.5:6800 bad crc/signature
Oct 11 16:12:27 lorunde kernel: [311254.635767] libceph: read_partial_message ffff880236f9a000 data crc 2586662963 != exp. 2886241494
Oct 11 16:12:27 lorunde kernel: [311254.636493] libceph: osd90 10.0.0.5:6814 bad crc/signature
Oct 11 16:14:31 lorunde kernel: [311378.808191] libceph: read_partial_message ffff88027e633c00 data crc 1102363051 != exp. 679243837
Oct 11 16:14:31 lorunde kernel: [311378.808889] libceph: osd13 10.0.0.21:6804 bad crc/signature
Oct 11 16:15:01 lorunde kernel: [311409.431034] libceph: read_partial_message ffff88024ce0a800 data crc 2467415342 != exp. 1753860323
Oct 11 16:15:01 lorunde kernel: [311409.431718] libceph: osd111 10.0.0.30:6804 bad crc/signature
Oct 11 16:15:11 lorunde kernel: [311418.891238] general protection fault: 0000 [#1] SMP


We had to switch to TCP Cubic (instead of a badly configured TCP BBR, running without FQ) to reduce the data crc errors.
But since we still had some errors, last night we rebooted all the OSD nodes into Linux 4.4.91, instead of Linux 4.9.47 & 4.9.53.
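
For context: as far as I know, on 4.9-era kernels BBR relied on the fq
qdisc for pacing, so BBR without FQ (the badly configured setup above) is
a known trouble spot. The algorithm is normally selected system-wide via
the net.ipv4.tcp_congestion_control sysctl; below is a minimal userspace
sketch of the equivalent per-socket knob, purely illustrative and not how
the in-kernel libceph sockets are configured:

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

int main(void)
{
	char cc[16] = "cubic";          /* or "bbr" -- needs the module loaded */
	socklen_t len = sizeof(cc);
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0) { perror("socket"); return 1; }

	/* select the congestion control algorithm for this socket only */
	if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cc, strlen(cc)) < 0)
		perror("setsockopt(TCP_CONGESTION)");

	/* read back what the kernel actually picked */
	if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cc, &len) == 0)
		printf("congestion control: %s\n", cc);

	return 0;
}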

In the last 7 hours we haven't had any data crc errors from the OSDs, but we did get one from a MON, without any hang or crash.
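
For anyone reading along: those read_partial_message lines mean the
CRC-32C computed over the received payload didn't match the checksum
carried in the message footer, i.e. the bytes were corrupted somewhere in
transit. A rough sketch of that style of receiver-side check follows
(generic bitwise CRC-32C; seeding and finalization conventions vary
between implementations, so this is illustrative rather than libceph's
exact code):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78 */
static uint32_t crc32c(uint32_t crc, const unsigned char *buf, size_t len)
{
	crc = ~crc;
	while (len--) {
		crc ^= *buf++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1));
	}
	return ~crc;
}

int main(void)
{
	const unsigned char sent[] = "example rbd object data";
	unsigned char rcvd[sizeof(sent)];
	uint32_t exp, got;

	memcpy(rcvd, sent, sizeof(sent));
	rcvd[3] ^= 0x40;                          /* simulate a bit flip in transit */

	exp = crc32c(0, sent, sizeof(sent) - 1);  /* checksum carried in the footer */
	got = crc32c(0, rcvd, sizeof(rcvd) - 1);  /* recomputed by the receiver */

	if (got != exp)
		fprintf(stderr, "data crc %u != exp. %u\n", got, exp);

	return 0;
}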

About the workload: the Xen VMs are mainly LAMP servers, with HTTP traffic handled by nginx or Apache, PHP, and MySQL databases.

Thanks,

Olivier
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


* Re: [ceph-users] general protection fault: 0000 [#1] SMP
  2017-10-12  7:12 ` [ceph-users] " Ilya Dryomov
       [not found]   ` <CAOi1vP--q8y696g5W_AUmR9Yxe5Xop3BH3xjEQG6_pmQmXO6kA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-10-12 10:23   ` Jeff Layton
       [not found]     ` <1507803838.5310.9.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 6+ messages in thread
From: Jeff Layton @ 2017-10-12 10:23 UTC (permalink / raw)
  To: Ilya Dryomov, Olivier Bonvalet; +Cc: Ceph Users, Ceph Development

On Thu, 2017-10-12 at 09:12 +0200, Ilya Dryomov wrote:
> On Wed, Oct 11, 2017 at 4:40 PM, Olivier Bonvalet <ceph.list@daevel.fr> wrote:
> > Hi,
> > 
> > I had a "general protection fault: 0000" with the Ceph RBD kernel client.
> > Not sure how to read the call trace; is it Ceph related?
> > 
> > 
> > Oct 11 16:15:11 lorunde kernel: [311418.891238] general protection fault: 0000 [#1] SMP
> > Oct 11 16:15:11 lorunde kernel: [311418.891855] Modules linked in: cpuid binfmt_misc nls_iso8859_1 nls_cp437 vfat fat tcp_diag inet_diag xt_physdev br_netfilter iptable_filter xen_netback loop xen_blkback cbc rbd libceph xen_gntdev xen_evtchn xenfs xen_privcmd ipmi_ssif intel_rapl iosf_mbi sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul ghash_clmulni_intel iTCO_wdt pcbc iTCO_vendor_support mxm_wmi aesni_intel aes_x86_64 crypto_simd glue_helper cryptd mgag200 i2c_algo_bit drm_kms_helper intel_rapl_perf ttm drm syscopyarea sysfillrect efi_pstore sysimgblt fb_sys_fops lpc_ich efivars mfd_core evdev ioatdma shpchp acpi_power_meter ipmi_si wmi button ipmi_devintf ipmi_msghandler bridge efivarfs ip_tables x_tables autofs4 dm_mod dax raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq
> > Oct 11 16:15:11 lorunde kernel: [311418.895403]  libcrc32c raid1 raid0 multipath linear md_mod hid_generic usbhid i2c_i801 crc32c_intel i2c_core xhci_pci ahci ixgbe xhci_hcd libahci ehci_pci ehci_hcd libata usbcore dca ptp usb_common pps_core mdio
> > Oct 11 16:15:11 lorunde kernel: [311418.896551] CPU: 1 PID: 4916 Comm: kworker/1:0 Not tainted 4.13-dae-dom0 #2
> > Oct 11 16:15:11 lorunde kernel: [311418.897134] Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5C610.86B.01.01.0019.101220160604 10/12/2016
> > Oct 11 16:15:11 lorunde kernel: [311418.897745] Workqueue: ceph-msgr ceph_con_workfn [libceph]
> > Oct 11 16:15:11 lorunde kernel: [311418.898355] task: ffff8801ce434280 task.stack: ffffc900151bc000
> > Oct 11 16:15:11 lorunde kernel: [311418.899007] RIP: e030:memcpy_erms+0x6/0x10
> > Oct 11 16:15:11 lorunde kernel: [311418.899616] RSP: e02b:ffffc900151bfac0 EFLAGS: 00010202
> > Oct 11 16:15:11 lorunde kernel: [311418.900228] RAX: ffff8801b63df000 RBX: ffff88021b41be00 RCX: 0000000004df0000
> > Oct 11 16:15:11 lorunde kernel: [311418.900848] RDX: 0000000004df0000 RSI: 4450736e24806564 RDI: ffff8801b63df000
> > Oct 11 16:15:11 lorunde kernel: [311418.901479] RBP: ffffea0005fdd8c8 R08: ffff88028545d618 R09: 0000000000000010
> > Oct 11 16:15:11 lorunde kernel: [311418.902104] R10: 0000000000000000 R11: ffff880215815000 R12: 0000000000000000
> > Oct 11 16:15:11 lorunde kernel: [311418.902723] R13: ffff8802158156c0 R14: 0000000000000000 R15: ffff8801ce434280
> > Oct 11 16:15:11 lorunde kernel: [311418.903359] FS:  0000000000000000(0000) GS:ffff880285440000(0000) knlGS:ffff880285440000
> > Oct 11 16:15:11 lorunde kernel: [311418.903994] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> > Oct 11 16:15:11 lorunde kernel: [311418.904627] CR2: 000055a8461cfc20 CR3: 0000000001809000 CR4: 0000000000042660
> > Oct 11 16:15:11 lorunde kernel: [311418.905271] Call Trace:
> > Oct 11 16:15:11 lorunde kernel: [311418.905909]  ? skb_copy_ubufs+0xef/0x290
> > Oct 11 16:15:11 lorunde kernel: [311418.906548]  ? skb_clone+0x82/0x90
> > Oct 11 16:15:11 lorunde kernel: [311418.907225]  ? tcp_transmit_skb+0x74/0x930
> > Oct 11 16:15:11 lorunde kernel: [311418.907858]  ? tcp_write_xmit+0x1bd/0xfb0
> > Oct 11 16:15:11 lorunde kernel: [311418.908490]  ? __sk_mem_raise_allocated+0x4e/0x220
> > Oct 11 16:15:11 lorunde kernel: [311418.909122]  ? __tcp_push_pending_frames+0x28/0x90
> > Oct 11 16:15:11 lorunde kernel: [311418.909755]  ? do_tcp_sendpages+0x4fc/0x590
> > Oct 11 16:15:11 lorunde kernel: [311418.910386]  ? tcp_sendpage+0x7c/0xa0
> > Oct 11 16:15:11 lorunde kernel: [311418.911026]  ? inet_sendpage+0x37/0xe0
> > Oct 11 16:15:11 lorunde kernel: [311418.911655]  ? kernel_sendpage+0x12/0x20
> > Oct 11 16:15:11 lorunde kernel: [311418.912297]  ? ceph_tcp_sendpage+0x5c/0xc0 [libceph]
> > Oct 11 16:15:11 lorunde kernel: [311418.912926]  ? ceph_tcp_recvmsg+0x53/0x70 [libceph]
> > Oct 11 16:15:11 lorunde kernel: [311418.913553]  ? ceph_con_workfn+0xd08/0x22a0 [libceph]
> > Oct 11 16:15:11 lorunde kernel: [311418.914179]  ? ceph_osdc_start_request+0x23/0x30 [libceph]
> > Oct 11 16:15:11 lorunde kernel: [311418.914807]  ? rbd_img_obj_request_submit+0x1ac/0x3c0 [rbd]
> > Oct 11 16:15:11 lorunde kernel: [311418.915458]  ? process_one_work+0x1ad/0x340
> > Oct 11 16:15:11 lorunde kernel: [311418.916083]  ? worker_thread+0x45/0x3f0
> > Oct 11 16:15:11 lorunde kernel: [311418.916706]  ? kthread+0xf2/0x130
> > Oct 11 16:15:11 lorunde kernel: [311418.917327]  ? process_one_work+0x340/0x340
> > Oct 11 16:15:11 lorunde kernel: [311418.917946]  ? kthread_create_on_node+0x40/0x40
> > Oct 11 16:15:11 lorunde kernel: [311418.918565]  ? do_group_exit+0x35/0xa0
> > Oct 11 16:15:11 lorunde kernel: [311418.919215]  ? ret_from_fork+0x25/0x30
> > Oct 11 16:15:11 lorunde kernel: [311418.919826] Code: 43 4e 5b eb ec eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38
> > Oct 11 16:15:11 lorunde kernel: [311418.921094] RIP: memcpy_erms+0x6/0x10 RSP: ffffc900151bfac0
> > Oct 11 16:15:11 lorunde kernel: [311418.921970] ---[ end trace 904278a63cb49fca ]---
> 
> It's a crash in memcpy() in skb_copy_ubufs().  It's not in ceph, but
> ceph-induced, it looks like.  I don't remember seeing anything similar
> in the context of krbd.
> 
> This is a Xen dom0 kernel, right?  What did the workload look like?
> Can you provide dmesg before the crash?
> 

...and to be clear:

(gdb) list *(memcpy_erms+0x6)
0xffffffff8188f136 is at arch/x86/lib/memcpy_64.S:54.
49	 * simpler than memcpy. Use memcpy_erms when possible.
50	 */
51	ENTRY(memcpy_erms)
52		movq %rdi, %rax
53		movq %rdx, %rcx
54		rep movsb
55		ret
56	ENDPROC(memcpy_erms)
57	
58	ENTRY(memcpy_orig)

So either %rsi or %rdi held a bogus address at the time of the crash,
most likely. If you have a vmcore, you may be able to dig in with crash
and tell which address it was, and trace back up the call stack to where
it came from.

That said... %rcx looks quite large -- 81723392 bytes still to go in the
copy. This might be a case where the copy length got screwed up somehow
and it overran its bounds.
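
That figure can be verified straight from the register dump, where RCX
and RDX both read 0x0000000004df0000; a trivial check:

#include <stdio.h>

int main(void)
{
	unsigned long rcx = 0x0000000004df0000UL;  /* bytes still to copy */
	unsigned long rdx = 0x0000000004df0000UL;  /* requested copy length */

	printf("%lu bytes (~%lu MiB) to go\n", rcx, rcx >> 20);  /* 81723392, ~77 MiB */
	printf("%lu bytes copied so far\n", rdx - rcx);          /* 0 */

	return 0;
}
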
-- 
Jeff Layton <jlayton@redhat.com>


* Re: general protection fault: 0000 [#1] SMP
       [not found]     ` <1507803838.5310.9.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-10-12 10:50       ` Ilya Dryomov
  0 siblings, 0 replies; 6+ messages in thread
From: Ilya Dryomov @ 2017-10-12 10:50 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Ceph Development, Ceph Users

On Thu, Oct 12, 2017 at 12:23 PM, Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Thu, 2017-10-12 at 09:12 +0200, Ilya Dryomov wrote:
>> On Wed, Oct 11, 2017 at 4:40 PM, Olivier Bonvalet <ceph.list-PaEMFeTk6C1QFI55V6+gNQ@public.gmane.org> wrote:
>> > Hi,
>> >
>> > I had a "general protection fault: 0000" with the Ceph RBD kernel client.
>> > Not sure how to read the call trace; is it Ceph related?
>> >
>> >
>> > Oct 11 16:15:11 lorunde kernel: [311418.891238] general protection fault: 0000 [#1] SMP
>> > Oct 11 16:15:11 lorunde kernel: [311418.891855] Modules linked in: cpuid binfmt_misc nls_iso8859_1 nls_cp437 vfat fat tcp_diag inet_diag xt_physdev br_netfilter iptable_filter xen_netback loop xen_blkback cbc rbd libceph xen_gntdev xen_evtchn xenfs xen_privcmd ipmi_ssif intel_rapl iosf_mbi sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul ghash_clmulni_intel iTCO_wdt pcbc iTCO_vendor_support mxm_wmi aesni_intel aes_x86_64 crypto_simd glue_helper cryptd mgag200 i2c_algo_bit drm_kms_helper intel_rapl_perf ttm drm syscopyarea sysfillrect efi_pstore sysimgblt fb_sys_fops lpc_ich efivars mfd_core evdev ioatdma shpchp acpi_power_meter ipmi_si wmi button ipmi_devintf ipmi_msghandler bridge efivarfs ip_tables x_tables autofs4 dm_mod dax raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq
>> > Oct 11 16:15:11 lorunde kernel: [311418.895403]  libcrc32c raid1 raid0 multipath linear md_mod hid_generic usbhid i2c_i801 crc32c_intel i2c_core xhci_pci ahci ixgbe xhci_hcd libahci ehci_pci ehci_hcd libata usbcore dca ptp usb_common pps_core mdio
>> > Oct 11 16:15:11 lorunde kernel: [311418.896551] CPU: 1 PID: 4916 Comm: kworker/1:0 Not tainted 4.13-dae-dom0 #2
>> > Oct 11 16:15:11 lorunde kernel: [311418.897134] Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5C610.86B.01.01.0019.101220160604 10/12/2016
>> > Oct 11 16:15:11 lorunde kernel: [311418.897745] Workqueue: ceph-msgr ceph_con_workfn [libceph]
>> > Oct 11 16:15:11 lorunde kernel: [311418.898355] task: ffff8801ce434280 task.stack: ffffc900151bc000
>> > Oct 11 16:15:11 lorunde kernel: [311418.899007] RIP: e030:memcpy_erms+0x6/0x10
>> > Oct 11 16:15:11 lorunde kernel: [311418.899616] RSP: e02b:ffffc900151bfac0 EFLAGS: 00010202
>> > Oct 11 16:15:11 lorunde kernel: [311418.900228] RAX: ffff8801b63df000 RBX: ffff88021b41be00 RCX: 0000000004df0000
>> > Oct 11 16:15:11 lorunde kernel: [311418.900848] RDX: 0000000004df0000 RSI: 4450736e24806564 RDI: ffff8801b63df000
>> > Oct 11 16:15:11 lorunde kernel: [311418.901479] RBP: ffffea0005fdd8c8 R08: ffff88028545d618 R09: 0000000000000010
>> > Oct 11 16:15:11 lorunde kernel: [311418.902104] R10: 0000000000000000 R11: ffff880215815000 R12: 0000000000000000
>> > Oct 11 16:15:11 lorunde kernel: [311418.902723] R13: ffff8802158156c0 R14: 0000000000000000 R15: ffff8801ce434280
>> > Oct 11 16:15:11 lorunde kernel: [311418.903359] FS:  0000000000000000(0000) GS:ffff880285440000(0000) knlGS:ffff880285440000
>> > Oct 11 16:15:11 lorunde kernel: [311418.903994] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > Oct 11 16:15:11 lorunde kernel: [311418.904627] CR2: 000055a8461cfc20 CR3: 0000000001809000 CR4: 0000000000042660
>> > Oct 11 16:15:11 lorunde kernel: [311418.905271] Call Trace:
>> > Oct 11 16:15:11 lorunde kernel: [311418.905909]  ? skb_copy_ubufs+0xef/0x290
>> > Oct 11 16:15:11 lorunde kernel: [311418.906548]  ? skb_clone+0x82/0x90
>> > Oct 11 16:15:11 lorunde kernel: [311418.907225]  ? tcp_transmit_skb+0x74/0x930
>> > Oct 11 16:15:11 lorunde kernel: [311418.907858]  ? tcp_write_xmit+0x1bd/0xfb0
>> > Oct 11 16:15:11 lorunde kernel: [311418.908490]  ? __sk_mem_raise_allocated+0x4e/0x220
>> > Oct 11 16:15:11 lorunde kernel: [311418.909122]  ? __tcp_push_pending_frames+0x28/0x90
>> > Oct 11 16:15:11 lorunde kernel: [311418.909755]  ? do_tcp_sendpages+0x4fc/0x590
>> > Oct 11 16:15:11 lorunde kernel: [311418.910386]  ? tcp_sendpage+0x7c/0xa0
>> > Oct 11 16:15:11 lorunde kernel: [311418.911026]  ? inet_sendpage+0x37/0xe0
>> > Oct 11 16:15:11 lorunde kernel: [311418.911655]  ? kernel_sendpage+0x12/0x20
>> > Oct 11 16:15:11 lorunde kernel: [311418.912297]  ? ceph_tcp_sendpage+0x5c/0xc0 [libceph]
>> > Oct 11 16:15:11 lorunde kernel: [311418.912926]  ? ceph_tcp_recvmsg+0x53/0x70 [libceph]
>> > Oct 11 16:15:11 lorunde kernel: [311418.913553]  ? ceph_con_workfn+0xd08/0x22a0 [libceph]
>> > Oct 11 16:15:11 lorunde kernel: [311418.914179]  ? ceph_osdc_start_request+0x23/0x30 [libceph]
>> > Oct 11 16:15:11 lorunde kernel: [311418.914807]  ? rbd_img_obj_request_submit+0x1ac/0x3c0 [rbd]
>> > Oct 11 16:15:11 lorunde kernel: [311418.915458]  ? process_one_work+0x1ad/0x340
>> > Oct 11 16:15:11 lorunde kernel: [311418.916083]  ? worker_thread+0x45/0x3f0
>> > Oct 11 16:15:11 lorunde kernel: [311418.916706]  ? kthread+0xf2/0x130
>> > Oct 11 16:15:11 lorunde kernel: [311418.917327]  ? process_one_work+0x340/0x340
>> > Oct 11 16:15:11 lorunde kernel: [311418.917946]  ? kthread_create_on_node+0x40/0x40
>> > Oct 11 16:15:11 lorunde kernel: [311418.918565]  ? do_group_exit+0x35/0xa0
>> > Oct 11 16:15:11 lorunde kernel: [311418.919215]  ? ret_from_fork+0x25/0x30
>> > Oct 11 16:15:11 lorunde kernel: [311418.919826] Code: 43 4e 5b eb ec eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38
>> > Oct 11 16:15:11 lorunde kernel: [311418.921094] RIP: memcpy_erms+0x6/0x10 RSP: ffffc900151bfac0
>> > Oct 11 16:15:11 lorunde kernel: [311418.921970] ---[ end trace 904278a63cb49fca ]---
>>
>> It's a crash in memcpy() in skb_copy_ubufs().  It's not in ceph, but
>> ceph-induced, it looks like.  I don't remember seeing anything similar
>> in the context of krbd.
>>
>> This is a Xen dom0 kernel, right?  What did the workload look like?
>> Can you provide dmesg before the crash?
>>
>
> ...and to be clear:
>
> (gdb) list *(memcpy_erms+0x6)
> 0xffffffff8188f136 is at arch/x86/lib/memcpy_64.S:54.
> 49       * simpler than memcpy. Use memcpy_erms when possible.
> 50       */
> 51      ENTRY(memcpy_erms)
> 52              movq %rdi, %rax
> 53              movq %rdx, %rcx
> 54              rep movsb
> 55              ret
> 56      ENDPROC(memcpy_erms)
> 57
> 58      ENTRY(memcpy_orig)
>
> So either %rsi or %rdi held a bogus address at the time of the crash,
> most likely. If you have a vmcore, you may be able to dig in with crash
> and tell which address it was, and trace back up the call stack to where
> it came from.

I suspect src-side bustage.

>
> That said... %rcx looks quite large -- 81723392 bytes still to go in the
> copy. This might be a case where the copy length got screwed up somehow
> and it overran its bounds.

Yeah, suspiciously large.  I don't think it copied a single byte
though: %rcx never got decremented.
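
In C terms memcpy_erms amounts to the model below (illustrative only --
the real thing is just the three instructions in the listing above).
rep movsb decrements %rcx once per byte copied, so RCX == RDX in the oops
means the very first load from the bogus %rsi faulted:

#include <stddef.h>

/* C model of memcpy_erms: movq %rdi,%rax; movq %rdx,%rcx; rep movsb */
static void *memcpy_erms_model(void *rdi, const void *rsi, size_t rdx)
{
	unsigned char *d = rdi;
	const unsigned char *s = rsi;   /* %rsi held 4450736e24806564 here */
	size_t rcx = rdx;               /* movq %rdx, %rcx */

	while (rcx) {                   /* rep movsb, one byte per iteration */
		*d++ = *s++;            /* the very first such load faulted */
		rcx--;
	}
	return rdi;                     /* %rax: memcpy returns dest */
}

int main(void)
{
	char src[4] = "abc", dst[4];
	return memcpy_erms_model(dst, src, sizeof(src)) == dst ? 0 : 1;
}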

Thanks,

                Ilya


* Re: Re : [ceph-users] general protection fault: 0000 [#1] SMP
  2017-10-12  7:26     ` Re : " Olivier Bonvalet
@ 2017-10-12 13:58       ` Luis Henriques
  0 siblings, 0 replies; 6+ messages in thread
From: Luis Henriques @ 2017-10-12 13:58 UTC (permalink / raw)
  To: Olivier Bonvalet; +Cc: Jeff Layton, Ilya Dryomov, Ceph Users, Ceph Development

Olivier Bonvalet <ceph.list@daevel.fr> writes:

> On Thursday 12 October 2017 at 09:12 +0200, Ilya Dryomov wrote:
>> It's a crash in memcpy() in skb_copy_ubufs().  It's not in ceph, but
>> ceph-induced, it looks like.  I don't remember seeing anything
>> similar
>> in the context of krbd.
>> 
>> This is a Xen dom0 kernel, right?  What did the workload look like?
>> Can you provide dmesg before the crash?
>
> Hi,
>
> Yes, it's a Xen dom0 kernel: Linux 4.13.3, Xen 4.8.2, with an old
> Ceph 0.94.10 (so, Hammer).
>
> Before this error, I had this in the logs:
>
> Oct 11 16:00:41 lorunde kernel: [310548.899082] libceph: read_partial_message ffff88021a910200 data crc 2306836368 != exp. 2215155875
> Oct 11 16:00:41 lorunde kernel: [310548.899841] libceph: osd117 10.0.0.31:6804 bad crc/signature
> Oct 11 16:02:25 lorunde kernel: [310652.695015] libceph: read_partial_message ffff880220b10100 data crc 842840543 != exp. 2657161714
> Oct 11 16:02:25 lorunde kernel: [310652.695731] libceph: osd3 10.0.0.26:6804 bad crc/signature
> Oct 11 16:07:24 lorunde kernel: [310952.485202] libceph: read_partial_message ffff88025d1aa400 data crc 938978341 != exp. 4154366769
> Oct 11 16:07:24 lorunde kernel: [310952.485870] libceph: osd117 10.0.0.31:6804 bad crc/signature
> Oct 11 16:10:44 lorunde kernel: [311151.841812] libceph: read_partial_message ffff880260300400 data crc 2988747958 != exp. 319958859
> Oct 11 16:10:44 lorunde kernel: [311151.842672] libceph: osd9 10.0.0.51:6802 bad crc/signature
> Oct 11 16:10:57 lorunde kernel: [311165.211412] libceph: read_partial_message ffff8802208b8300 data crc 369498361 != exp. 906022772
> Oct 11 16:10:57 lorunde kernel: [311165.212135] libceph: osd87 10.0.0.5:6800 bad crc/signature
> Oct 11 16:12:27 lorunde kernel: [311254.635767] libceph: read_partial_message ffff880236f9a000 data crc 2586662963 != exp. 2886241494
> Oct 11 16:12:27 lorunde kernel: [311254.636493] libceph: osd90 10.0.0.5:6814 bad crc/signature
> Oct 11 16:14:31 lorunde kernel: [311378.808191] libceph: read_partial_message ffff88027e633c00 data crc 1102363051 != exp. 679243837
> Oct 11 16:14:31 lorunde kernel: [311378.808889] libceph: osd13 10.0.0.21:6804 bad crc/signature
> Oct 11 16:15:01 lorunde kernel: [311409.431034] libceph: read_partial_message ffff88024ce0a800 data crc 2467415342 != exp. 1753860323
> Oct 11 16:15:01 lorunde kernel: [311409.431718] libceph: osd111 10.0.0.30:6804 bad crc/signature
> Oct 11 16:15:11 lorunde kernel: [311418.891238] general protection fault: 0000 [#1] SMP
>
>
> We had to switch to TCP Cubic (instead of a badly configured TCP BBR, running without FQ) to reduce the data crc errors.
> But since we still had some errors, last night we rebooted all the OSD nodes into Linux 4.4.91, instead of Linux 4.9.47 & 4.9.53.
>
> In the last 7 hours we haven't had any data crc errors from the OSDs, but we did get one from a MON, without any hang or crash.

Since there are a bunch of errors before the GPF, I suspect this bug is
related to some error paths that haven't been thoroughly tested (as is
the case for error paths in general, I guess).

My initial guess was a race in ceph_con_workfn:

 - An error returned from try_read() would cause a delayed retry (in
   function con_fault())
 - con_fault_finish() would then trigger a ceph_con_close/ceph_con_open in
   osd_fault.
 - the delayed retry kicks in and the above close+open, which includes
   releasing con->in_msg and con->out_msg, could cause this GPF.

Unfortunately, I haven't yet been able to find any race there (probably
because there is none), but maybe there's a small window where this
could occur.
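
To make the suspected window concrete, here is a deliberately simplified
userspace model of the interleaving above. Every type and helper here is
a hypothetical stand-in rather than the real libceph code; it only shows
the shape of the hazard (a delayed work item acting on a message that
close+open has already released):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* hypothetical stand-ins for struct ceph_msg / struct ceph_connection */
struct msg  { unsigned char *data; size_t len; };
struct conn { struct msg *out_msg; };

/* 2. con_fault_finish() -> osd_fault -> ceph_con_close()+ceph_con_open():
 *    the in-flight message is released */
static void close_and_reopen(struct conn *con)
{
	free(con->out_msg->data);
	free(con->out_msg);
	con->out_msg = NULL;   /* doesn't help anyone who cached the pointer */
}

/* 3. the delayed retry runs with the pointer it sampled before the fault */
static void delayed_retry(struct conn *con, struct msg *cached)
{
	unsigned char page[16];

	if (cached != con->out_msg) {  /* stale: points into freed memory */
		fprintf(stderr, "would memcpy() from freed msg %p\n", (void *)cached);
		return;                /* the kernel has no such check -> GPF */
	}
	memcpy(page, cached->data, cached->len < sizeof(page) ? cached->len : sizeof(page));
}

int main(void)
{
	struct conn con;
	struct msg *cached;

	con.out_msg = calloc(1, sizeof(*con.out_msg));
	con.out_msg->len  = 16;
	con.out_msg->data = calloc(1, con.out_msg->len);

	cached = con.out_msg;         /* 1. try_read() fails; a retry is queued */
	close_and_reopen(&con);       /* 2. fault handling closes and reopens */
	delayed_retry(&con, cached);  /* 3. the retry fires against stale state */

	return 0;
}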

I wonder if this occurred only once, or if this is something that is
easily triggerable.

Cheers,
-- 
Luis


end of thread, other threads:[~2017-10-12 13:58 UTC | newest]

Thread overview: 6+ messages
-- links below jump to the message on this page --
2017-10-11 14:40 general protection fault: 0000 [#1] SMP Olivier Bonvalet
2017-10-12  7:12 ` [ceph-users] " Ilya Dryomov
     [not found]   ` <CAOi1vP--q8y696g5W_AUmR9Yxe5Xop3BH3xjEQG6_pmQmXO6kA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-10-12  7:26     ` Re : " Olivier Bonvalet
2017-10-12 13:58       ` Re : [ceph-users] " Luis Henriques
2017-10-12 10:23   ` Jeff Layton
     [not found]     ` <1507803838.5310.9.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-10-12 10:50       ` Ilya Dryomov
