* ceph kernel bug
From: Martin Mailand @ 2011-09-10 21:12 UTC
  To: ceph-devel

Hi,
I hit the following bug. My setup is very simple: I have two OSDs (osd1 
and osd2) and one monitor.
On a fourth machine I map an rbd device and use it for a qemu instance.
When I reboot one of the two OSDs, I reproducibly hit this bug.
On all machines I use kernel version 3.1.0-rc5 and ceph version 
0.34-1natty from the newdream repo.

Regards,
  martin
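
For anyone trying to reproduce this: the client side boils down to mapping
an rbd image through the kernel driver and keeping it busy from a qemu
guest while an OSD reboots. A rough sketch; the monitor address, pool and
image name here are placeholders, and the option string follows the
3.1-era /sys/bus/rbd interface, so adjust as needed:

MON=192.168.42.112:6789                      # hypothetical monitor address
echo "$MON name=admin rbd testimg" > /sys/bus/rbd/add
ls /sys/bus/rbd/devices                      # note the new device id
# point the qemu instance at /dev/rbd0, then reboot osd1 or osd2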

[  105.746163] libceph: osd2 192.168.42.114:6800 socket closed
[  105.757635] libceph: osd2 192.168.42.114:6800 connection failed
[  106.040203] libceph: osd2 192.168.42.114:6800 connection failed
[  107.040231] libceph: osd2 192.168.42.114:6800 connection failed
[  109.040508] libceph: osd2 192.168.42.114:6800 connection failed
[  113.050453] libceph: osd2 192.168.42.114:6800 connection failed
[  121.060191] libceph: osd2 192.168.42.114:6800 connection failed
[  137.090484] libceph: osd2 192.168.42.114:6800 connection failed
[  198.237123] ------------[ cut here ]------------
[  198.246419] kernel BUG at net/ceph/messenger.c:2193!
[  198.246949] invalid opcode: 0000 [#1] SMP
[  198.246949] CPU 0
[  198.246949] Modules linked in: rbd libceph libcrc32c ip6table_filter 
ip6_tables iptable_filter ip_tables x_tables nv_tco bridge stp kvm_amd 
kvm radeon ttm psmouse drm_kms_helper drm i2c_algo_bit k10temp 
i2c_nforce2 shpchp amd64_edac_mod serio_raw edac_core edac_mce_amd lp 
parport ses enclosure aacraid forcedeth
[  198.246949]
[  198.246949] Pid: 10, comm: kworker/0:1 Not tainted 3.1.0-rc5-custom 
#1 Supermicro H8DM8-2/H8DM8-2
[  198.246949] RIP: 0010:[<ffffffffa02d83f1>]  [<ffffffffa02d83f1>] 
ceph_con_send+0x111/0x120 [libceph]
[  198.246949] RSP: 0018:ffff880405cd5bc0  EFLAGS: 00010202
[  198.246949] RAX: ffff880803fe7878 RBX: ffff880403fb8030 RCX: 
ffff880803fd1650
[  198.246949] RDX: ffff880405cd5fd8 RSI: ffff880803fe7800 RDI: 
ffff880403fb81a8
[  198.246949] RBP: ffff880405cd5be0 R08: ffff880405cd5b70 R09: 
0000000000000002
[  198.246949] R10: 0000000000000002 R11: 0000000000000072 R12: 
ffff880403fb81a8
[  198.246949] R13: ffff880803fe7800 R14: ffff880803fd1660 R15: 
ffff880803fd1650
[  198.246949] FS:  00007fea65610700(0000) GS:ffff88040fc00000(0000) 
knlGS:0000000000000000
[  198.246949] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  198.246949] CR2: 00007f61e407f000 CR3: 0000000001a05000 CR4: 
00000000000006f0
[  198.246949] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[  198.246949] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
0000000000000400
[  198.246949] Process kworker/0:1 (pid: 10, threadinfo 
ffff880405cd4000, task ffff880405cc5bc0)
[  198.246949] Stack:
[  198.246949]  ffff880405cd5be0 ffff880804fb5800 ffff880803fd1630 
ffff880803fd15a8
[  198.246949]  ffff880405cd5c30 ffffffffa02dd8ad ffff880803fd1480 
ffff880803fd1600
[  198.246949]  ffff880405cd5c30 ffff8803fde4c644 ffff880803fd15a8 
0000000000000000
[  198.246949] Call Trace:
[  198.246949]  [<ffffffffa02dd8ad>] send_queued+0xed/0x130 [libceph]
[  198.246949]  [<ffffffffa02dfd81>] ceph_osdc_handle_map+0x261/0x3b0 
[libceph]
[  198.246949]  [<ffffffffa02d711c>] ? ceph_msg_new+0x15c/0x230 [libceph]
[  198.246949]  [<ffffffffa02e01e0>] dispatch+0x150/0x360 [libceph]
[  198.246949]  [<ffffffffa02da54f>] con_work+0x214f/0x21d0 [libceph]
[  198.246949]  [<ffffffffa02d8400>] ? ceph_con_send+0x120/0x120 [libceph]
[  198.246949]  [<ffffffff8108110d>] process_one_work+0x11d/0x430
[  198.246949]  [<ffffffff81081c69>] worker_thread+0x169/0x360
[  198.246949]  [<ffffffff81081b00>] ? manage_workers.clone.21+0x240/0x240
[  198.246949]  [<ffffffff81086496>] kthread+0x96/0xa0
[  198.246949]  [<ffffffff815e5bb4>] kernel_thread_helper+0x4/0x10
[  198.246949]  [<ffffffff81086400>] ? flush_kthread_worker+0xb0/0xb0
[  198.246949]  [<ffffffff815e5bb0>] ? gs_change+0x13/0x13
[  198.246949] Code: 65 f0 4c 8b 6d f8 c9 c3 66 90 48 8d be 88 00 00 00 
48 c7 c6 70 a8 2d a0 e8 dd 9c 00 e1 48 8b 5d e8 4c 8b 65 f0 4c 8b 6d f8 
c9 c3 <0f> 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 57
[  198.246949] RIP  [<ffffffffa02d83f1>] ceph_con_send+0x111/0x120 [libceph]
[  198.246949]  RSP <ffff880405cd5bc0>
[  198.927024] ---[ end trace 03cb81299b093f05 ]---
[  198.940010] BUG: unable to handle kernel paging request at 
fffffffffffffff8
[  198.949892] IP: [<ffffffff810868f0>] kthread_data+0x10/0x20
[  198.949892] PGD 1a07067 PUD 1a08067 PMD 0
[  198.949892] Oops: 0000 [#2] SMP
[  198.949892] CPU 0
[  198.949892] Modules linked in: rbd libceph libcrc32c ip6table_filter 
ip6_tables iptable_filter ip_tables x_tables nv_tco bridge stp kvm_amd 
kvm radeon ttm psmouse drm_kms_helper drm i2c_algo_bit k10temp 
i2c_nforce2 shpchp amd64_edac_mod serio_raw edac_core edac_mce_amd lp 
parport ses enclosure aacraid forcedeth
[  198.949892]
[  198.949892] Pid: 10, comm: kworker/0:1 Tainted: G      D 
3.1.0-rc5-custom #1 Supermicro H8DM8-2/H8DM8-2
[  198.949892] RIP: 0010:[<ffffffff810868f0>]  [<ffffffff810868f0>] 
kthread_data+0x10/0x20
[  198.949892] RSP: 0018:ffff880405cd5868  EFLAGS: 00010096
[  198.949892] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 
0000000000000000
[  198.949892] RDX: ffff880405cc5bc0 RSI: 0000000000000000 RDI: 
ffff880405cc5bc0
[  198.949892] RBP: ffff880405cd5868 R08: 0000000000989680 R09: 
0000000000000000
[  198.949892] R10: 0000000000000400 R11: 0000000000000006 R12: 
ffff880405cc5f88
[  198.949892] R13: 0000000000000000 R14: 0000000000000000 R15: 
ffff880405cc5e90
[  198.949892] FS:  00007fea65610700(0000) GS:ffff88040fc00000(0000) 
knlGS:0000000000000000
[  198.949892] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  198.949892] CR2: fffffffffffffff8 CR3: 0000000001a05000 CR4: 
00000000000006f0
[  198.949892] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[  198.949892] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
0000000000000400
[  198.949892] Process kworker/0:1 (pid: 10, threadinfo 
ffff880405cd4000, task ffff880405cc5bc0)
[  198.949892] Stack:
[  198.949892]  ffff880405cd5888 ffffffff81082345 ffff880405cd5888 
ffff88040fc13080
[  198.949892]  ffff880405cd5918 ffffffff815d9092 ffff880405e5a558 
ffff880405cc5bc0
[  198.949892]  ffff880405cd58d8 ffff880405cd5fd8 ffff880405cd4000 
ffff880405cd5fd8
[  198.949892] Call Trace:
[  198.949892]  [<ffffffff81082345>] wq_worker_sleeping+0x15/0xa0
[  198.949892]  [<ffffffff815d9092>] __schedule+0x5c2/0x8b0
[  198.949892]  [<ffffffff812caf96>] ? put_io_context+0x46/0x70
[  198.949892]  [<ffffffff8105b72f>] schedule+0x3f/0x60
[  198.949892]  [<ffffffff81068223>] do_exit+0x5e3/0x8a0
[  198.949892]  [<ffffffff815dcc4f>] oops_end+0xaf/0xf0
[  198.949892]  [<ffffffff8101689b>] die+0x5b/0x90
[  198.949892]  [<ffffffff815dc354>] do_trap+0xc4/0x170
[  198.949892]  [<ffffffff81013f25>] do_invalid_op+0x95/0xb0
[  198.949892]  [<ffffffffa02d83f1>] ? ceph_con_send+0x111/0x120 [libceph]
[  198.949892]  [<ffffffffa02e276a>] ? ceph_calc_pg_acting+0x2a/0x90 
[libceph]
[  198.949892]  [<ffffffff815e5a2b>] invalid_op+0x1b/0x20
[  198.949892]  [<ffffffffa02d83f1>] ? ceph_con_send+0x111/0x120 [libceph]
[  198.949892]  [<ffffffffa02dd8ad>] send_queued+0xed/0x130 [libceph]
[  198.949892]  [<ffffffffa02dfd81>] ceph_osdc_handle_map+0x261/0x3b0 
[libceph]
[  198.949892]  [<ffffffffa02d711c>] ? ceph_msg_new+0x15c/0x230 [libceph]
[  198.949892]  [<ffffffffa02e01e0>] dispatch+0x150/0x360 [libceph]
[  198.949892]  [<ffffffffa02da54f>] con_work+0x214f/0x21d0 [libceph]
[  198.949892]  [<ffffffffa02d8400>] ? ceph_con_send+0x120/0x120 [libceph]
[  198.949892]  [<ffffffff8108110d>] process_one_work+0x11d/0x430
[  198.949892]  [<ffffffff81081c69>] worker_thread+0x169/0x360
[  198.949892]  [<ffffffff81081b00>] ? manage_workers.clone.21+0x240/0x240
[  198.949892]  [<ffffffff81086496>] kthread+0x96/0xa0
[  198.949892]  [<ffffffff815e5bb4>] kernel_thread_helper+0x4/0x10
[  198.949892]  [<ffffffff81086400>] ? flush_kthread_worker+0xb0/0xb0
[  198.949892]  [<ffffffff815e5bb0>] ? gs_change+0x13/0x13
[  198.949892] Code: 5e 41 5f c9 c3 be 3e 01 00 00 48 c7 c7 5b 3a 7d 81 
e8 85 d3 fd ff e9 84 fe ff ff 55 48 89 e5 66 66 66 66 90 48 8b 87 70 03 
00 00
[  198.949892]  8b 40 f8 c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66
[  198.949892] RIP  [<ffffffff810868f0>] kthread_data+0x10/0x20
[  198.949892]  RSP <ffff880405cd5868>
[  198.949892] CR2: fffffffffffffff8
[  198.949892] ---[ end trace 03cb81299b093f06 ]---
[  198.949892] Fixing recursive fault but reboot is needed!



* Re: ceph kernel bug
From: Sage Weil @ 2011-09-10 22:47 UTC
  To: Martin Mailand; +Cc: ceph-devel

Hi Martin,

Is this reproducible?  If so, does the patch below fix it?

Thanks!
sage

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 5634216..dcd3475 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -31,6 +31,7 @@ static void __unregister_linger_request(struct ceph_osd_client *osdc,
                                        struct ceph_osd_request *req);
 static int __send_request(struct ceph_osd_client *osdc,
                          struct ceph_osd_request *req);
+static void __cancel_request(struct ceph_osd_request *req);
 
 static int op_needs_trail(int op)
 {
@@ -571,6 +572,7 @@ static void __kick_osd_requests(struct ceph_osd_client *osdc,
                return;
 
        list_for_each_entry(req, &osd->o_requests, r_osd_item) {
+               __cancel_request(req);
                list_move(&req->r_req_lru_item, &osdc->req_unsent);
                dout("requeued %p tid %llu osd%d\n", req, req->r_tid,
                     osd->o_osd);
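
For context, the hunk above only adds a forward declaration;
__cancel_request itself already lives further down in osd_client.c. In the
3.1-era source it looks roughly like this (a sketch from memory, not a
verbatim copy) -- it revokes the request's message from the old OSD's
connection, so the message is no longer linked on that connection's queues
when it gets resent:

static void __cancel_request(struct ceph_osd_request *req)
{
	if (req->r_sent && req->r_osd) {
		/* unlink r_request from the old connection */
		ceph_con_revoke(&req->r_osd->o_con, req->r_request);
		req->r_sent = 0;
	}
}

That is presumably what the BUG_ON at messenger.c:2193 guards against:
ceph_con_send() refusing to queue a message that is still linked elsewhere.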



On Sat, 10 Sep 2011, Martin Mailand wrote:

> [...]


* Re: ceph kernel bug
From: Martin Mailand @ 2011-09-10 23:46 UTC
  To: Sage Weil; +Cc: ceph-devel

Hi Sage,
no, it did not fix it. Here is the new trace.

Regards,
  martin

[  182.721180] libceph: osd2 192.168.42.114:6800 socket closed
[  182.732642] libceph: osd2 192.168.42.114:6800 connection failed
[  183.040233] libceph: osd2 192.168.42.114:6800 connection failed
[  184.040204] libceph: osd2 192.168.42.114:6800 connection failed
[  186.040244] libceph: osd2 192.168.42.114:6800 connection failed
[  190.060233] libceph: osd2 192.168.42.114:6800 connection failed
[  198.060214] libceph: osd2 192.168.42.114:6800 connection failed
[  213.964994] ------------[ cut here ]------------
[  213.974288] kernel BUG at net/ceph/messenger.c:2193!
[  213.974470] invalid opcode: 0000 [#1] SMP
[  213.974470] CPU 0
[  213.974470] Modules linked in: rbd libceph libcrc32c ip6table_filter 
ip6_tables iptable_filter ip_tables x_tables nv_tco bridge stp kvm_amd 
kvm radeon lp psmouse shpchp parport i2c_nforce2 amd64_edac_mod ttm 
drm_kms_helper drm edac_core i2c_algo_bit edac_mce_amd serio_raw k10temp 
ses enclosure aacraid forcedeth
[  213.974470]
[  213.974470] Pid: 10, comm: kworker/0:1 Not tainted 3.1.0-rc5-custom 
#3 Supermicro H8DM8-2/H8DM8-2
[  213.974470] RIP: 0010:[<ffffffffa02cf3f1>]  [<ffffffffa02cf3f1>] 
ceph_con_send+0x111/0x120 [libceph]
[  213.974470] RSP: 0018:ffff880405cddbd0  EFLAGS: 00010283
[  213.974470] RAX: ffff880403e93c78 RBX: ffff880803f97030 RCX: 
ffff8808034d2e50
[  213.974470] RDX: ffff880405cddfd8 RSI: ffff880403e93c00 RDI: 
ffff880803f971a8
[  213.974470] RBP: ffff880405cddbf0 R08: ffff88040fc0de40 R09: 
000000000000fffb
[  213.974470] R10: 0000000000000000 R11: 0000000000000001 R12: 
ffff880803f971a8
[  213.974470] R13: ffff880403e93c00 R14: ffff8808034d2e60 R15: 
ffff8808034d2e50
[  213.974470] FS:  00007f5909978720(0000) GS:ffff88040fc00000(0000) 
knlGS:0000000000000000
[  213.974470] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  213.974470] CR2: ffffffffff600400 CR3: 0000000404e6f000 CR4: 
00000000000006f0
[  213.974470] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[  213.974470] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
0000000000000400
[  213.974470] Process kworker/0:1 (pid: 10, threadinfo 
ffff880405cdc000, task ffff880405cb5bc0)
[  213.974470] Stack:
[  213.974470]  ffff880405cddbf0 ffff880403e0ac00 ffff8808034d2e30 
ffff8808034d2da8
[  213.974470]  ffff880405cddc40 ffffffffa02d490d ffff8808034d2c80 
ffff8808034d2e00
[  213.974470]  ffff880405cddc40 ffff8804041d1c91 ffff8808034d2da8 
0000000000000000
[  213.974470] Call Trace:
[  213.974470]  [<ffffffffa02d490d>] send_queued+0xed/0x130 [libceph]
[  213.974470]  [<ffffffffa02d6d91>] ceph_osdc_handle_map+0x261/0x3b0 
[libceph]
[  213.974470]  [<ffffffffa02d331f>] dispatch+0x10f/0x580 [libceph]
[  213.974470]  [<ffffffffa02d154f>] con_work+0x214f/0x21d0 [libceph]
[  213.974470]  [<ffffffffa02cf400>] ? ceph_con_send+0x120/0x120 [libceph]
[  213.974470]  [<ffffffff8108110d>] process_one_work+0x11d/0x430
[  213.974470]  [<ffffffff81081c69>] worker_thread+0x169/0x360
[  213.974470]  [<ffffffff81081b00>] ? manage_workers.clone.21+0x240/0x240
[  213.974470]  [<ffffffff81086496>] kthread+0x96/0xa0
[  213.974470]  [<ffffffff815e5bb4>] kernel_thread_helper+0x4/0x10
[  213.974470]  [<ffffffff81086400>] ? flush_kthread_worker+0xb0/0xb0
[  213.974470]  [<ffffffff815e5bb0>] ? gs_change+0x13/0x13
[  213.974470] Code: 65 f0 4c 8b 6d f8 c9 c3 66 90 48 8d be 88 00 00 00 
48 c7 c6 70 18 2d a0 e8 dd 2c 01 e1 48 8b 5d e8 4c 8b 65 f0 4c 8b 6d f8 
c9 c3 <0f> 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 57
[  213.974470] RIP  [<ffffffffa02cf3f1>] ceph_con_send+0x111/0x120 [libceph]
[  213.974470]  RSP <ffff880405cddbd0>
[  214.640753] ---[ end trace 837698aee31a73fc ]---
[  214.653687] BUG: unable to handle kernel paging request at 
fffffffffffffff8
[  214.663571] IP: [<ffffffff810868f0>] kthread_data+0x10/0x20
[  214.663571] PGD 1a07067 PUD 1a08067 PMD 0
[  214.663571] Oops: 0000 [#2] SMP
[  214.663571] CPU 0
[  214.663571] Modules linked in: rbd libceph libcrc32c ip6table_filter 
ip6_tables iptable_filter ip_tables x_tables nv_tco bridge stp kvm_amd 
kvm radeon lp psmouse shpchp parport i2c_nforce2 amd64_edac_mod ttm 
drm_kms_helper drm edac_core i2c_algo_bit edac_mce_amd serio_raw k10temp 
ses enclosure aacraid forcedeth
[  214.663571]
[  214.663571] Pid: 10, comm: kworker/0:1 Tainted: G      D 
3.1.0-rc5-custom #3 Supermicro H8DM8-2/H8DM8-2
[  214.663571] RIP: 0010:[<ffffffff810868f0>]  [<ffffffff810868f0>] 
kthread_data+0x10/0x20
[  214.663571] RSP: 0018:ffff880405cdd878  EFLAGS: 00010096
[  214.663571] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 
0000000000000000
[  214.663571] RDX: ffff880405cb5bc0 RSI: 0000000000000000 RDI: 
ffff880405cb5bc0
[  214.663571] RBP: ffff880405cdd878 R08: 0000000000989680 R09: 
0000000000000000
[  214.663571] R10: 0000000000000400 R11: 0000000000000006 R12: 
ffff880405cb5f88
[  214.663571] R13: 0000000000000000 R14: 0000000000000000 R15: 
ffff880405cb5e90
[  214.663571] FS:  00007f5909978720(0000) GS:ffff88040fc00000(0000) 
knlGS:0000000000000000
[  214.663571] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  214.663571] CR2: fffffffffffffff8 CR3: 0000000404e6f000 CR4: 
00000000000006f0
[  214.663571] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[  214.663571] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
0000000000000400
[  214.663571] Process kworker/0:1 (pid: 10, threadinfo 
ffff880405cdc000, task ffff880405cb5bc0)
[  214.663571] Stack:
[  214.663571]  ffff880405cdd898 ffffffff81082345 ffff880405cdd898 
ffff88040fc13080
[  214.663571]  ffff880405cdd928 ffffffff815d9092 ffff8804050938b8 
ffff880405cb5bc0
[  214.663571]  ffff880405cdd8e8 ffff880405cddfd8 ffff880405cdc000 
ffff880405cddfd8
[  214.663571] Call Trace:
[  214.663571]  [<ffffffff81082345>] wq_worker_sleeping+0x15/0xa0
[  214.663571]  [<ffffffff815d9092>] __schedule+0x5c2/0x8b0
[  214.663571]  [<ffffffff812caf96>] ? put_io_context+0x46/0x70
[  214.663571]  [<ffffffff8105b72f>] schedule+0x3f/0x60
[  214.663571]  [<ffffffff81068223>] do_exit+0x5e3/0x8a0
[  214.663571]  [<ffffffff815dcc4f>] oops_end+0xaf/0xf0
[  214.663571]  [<ffffffff8101689b>] die+0x5b/0x90
[  214.663571]  [<ffffffff815dc354>] do_trap+0xc4/0x170
[  214.663571]  [<ffffffff81013f25>] do_invalid_op+0x95/0xb0
[  214.663571]  [<ffffffffa02cf3f1>] ? ceph_con_send+0x111/0x120 [libceph]
[  214.663571]  [<ffffffff812e9759>] ? vsnprintf+0x479/0x620
[  214.663571]  [<ffffffff8103be49>] ? default_spin_lock_flags+0x9/0x10
[  214.663571]  [<ffffffff815e5a2b>] invalid_op+0x1b/0x20
[  214.663571]  [<ffffffffa02cf3f1>] ? ceph_con_send+0x111/0x120 [libceph]
[  214.663571]  [<ffffffffa02d490d>] send_queued+0xed/0x130 [libceph]
[  214.663571]  [<ffffffffa02d6d91>] ceph_osdc_handle_map+0x261/0x3b0 
[libceph]
[  214.663571]  [<ffffffffa02d331f>] dispatch+0x10f/0x580 [libceph]
[  214.663571]  [<ffffffffa02d154f>] con_work+0x214f/0x21d0 [libceph]
[  214.663571]  [<ffffffffa02cf400>] ? ceph_con_send+0x120/0x120 [libceph]
[  214.663571]  [<ffffffff8108110d>] process_one_work+0x11d/0x430
[  214.663571]  [<ffffffff81081c69>] worker_thread+0x169/0x360
[  214.663571]  [<ffffffff81081b00>] ? manage_workers.clone.21+0x240/0x240
[  214.663571]  [<ffffffff81086496>] kthread+0x96/0xa0
[  214.663571]  [<ffffffff815e5bb4>] kernel_thread_helper+0x4/0x10
[  214.663571]  [<ffffffff81086400>] ? flush_kthread_worker+0xb0/0xb0
[  214.663571]  [<ffffffff815e5bb0>] ? gs_change+0x13/0x13
[  214.663571] Code: 5e 41 5f c9 c3 be 3e 01 00 00 48 c7 c7 5b 3a 7d 81 
e8 85 d3 fd ff e9 84 fe ff ff 55 48 89 e5 66 66 66 66 90 48 8b 87 70 03 
00 00
[  214.663571]  8b 40 f8 c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66
[  214.663571] RIP  [<ffffffff810868f0>] kthread_data+0x10/0x20
[  214.663571]  RSP <ffff880405cdd878>
[  214.663571] CR2: fffffffffffffff8
[  214.663571] ---[ end trace 837698aee31a73fd ]---
[  214.663571] Fixing recursive fault but reboot is needed!



Sage Weil wrote:
> Hi Martin,
> 
> Is this reproducible?  If so, does the patch below fix it?
> 
> Thanks!
> sage
> 
> [...]


* Re: ceph kernel bug
From: Martin Mailand @ 2011-09-15 19:41 UTC
  To: martin; +Cc: Sage Weil, ceph-devel

Hi Sage,
I am still hitting this in -rc6. It happens every time I stop an OSD.
Do you need more information to reproduce it?

Best Regards,
  martin

[103159.164630] libceph: osd0 192.168.42.113:6800 socket closed
[103169.153484] ------------[ cut here ]------------
[103169.162935] kernel BUG at net/ceph/messenger.c:2193!
[103169.163332] invalid opcode: 0000 [#1] SMP
[103169.163332] CPU 0
[103169.163332] Modules linked in: btrfs zlib_deflate rbd libceph 
libcrc32c ip6table_filter ip6_tables iptable_filter ip_tables x_tables 
kvm_amd kvm bridge nv_tco stp radeon ttm drm_kms_helper drm lp parport 
i2c_algo_bit amd64_edac_mod i2c_nforce2 edac_core edac_mce_amd k10temp 
shpchp psmouse serio_raw ses enclosure aacraid forcedeth
[103169.163332]
[103169.163332] Pid: 4405, comm: kworker/0:1 Not tainted 3.1.0-rc6 #1 
Supermicro H8DM8-2/H8DM8-2
[103169.163332] RIP: 0010:[<ffffffffa02b73f1>]  [<ffffffffa02b73f1>] 
ceph_con_send+0x111/0x120 [libceph]
[103169.163332] RSP: 0018:ffff88031c5b3bd0  EFLAGS: 00010202
[103169.163332] RAX: ffff88040502c678 RBX: ffff88040452b030 RCX: 
ffff88031c8a9e50
[103169.163332] RDX: ffff88031c5b3fd8 RSI: ffff88040502c600 RDI: 
ffff88040452b1a8
[103169.163332] RBP: ffff88031c5b3bf0 R08: ffff88040fc0de40 R09: 
0000000000000002
[103169.163332] R10: 0000000000000002 R11: 0000000000000072 R12: 
ffff88040452b1a8
[103169.163332] R13: ffff88040502c600 R14: ffff88031c8a9e60 R15: 
ffff88031c8a9e50
[103169.163332] FS:  00007f6d43dd2700(0000) GS:ffff88040fc00000(0000) 
knlGS:0000000000000000
[103169.163332] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[103169.163332] CR2: ffffffffff600400 CR3: 0000000403fb1000 CR4: 
00000000000006f0
[103169.163332] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[103169.163332] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
0000000000000400
[103169.163332] Process kworker/0:1 (pid: 4405, threadinfo 
ffff88031c5b2000, task ffff880405cd5bc0)
[103169.163332] Stack:
[103169.163332]  ffff88031c5b3bf0 ffff880404632a00 ffff88031c8a9e30 
ffff88031c8a9da8
[103169.163332]  ffff88031c5b3c40 ffffffffa02bc8ad ffff88031c8a9c80 
ffff88031c8a9e00
[103169.163332]  ffff88031c5b3c40 ffff8804045b7151 ffff88031c8a9da8 
0000000000000000
[103169.163332] Call Trace:
[103169.163332]  [<ffffffffa02bc8ad>] send_queued+0xed/0x130 [libceph]
[103169.163332]  [<ffffffffa02bed81>] ceph_osdc_handle_map+0x261/0x3b0 
[libceph]
[103169.163332]  [<ffffffffa02bb31f>] dispatch+0x10f/0x580 [libceph]
[103169.163332]  [<ffffffffa02b954f>] con_work+0x214f/0x21d0 [libceph]
[103169.163332]  [<ffffffffa02b7400>] ? ceph_con_send+0x120/0x120 [libceph]
[103169.163332]  [<ffffffff8108110d>] process_one_work+0x11d/0x430
[103169.163332]  [<ffffffff81081c69>] worker_thread+0x169/0x360
[103169.163332]  [<ffffffff81081b00>] ? manage_workers.clone.21+0x240/0x240
[103169.163332]  [<ffffffff81086496>] kthread+0x96/0xa0
[103169.163332]  [<ffffffff815e5c34>] kernel_thread_helper+0x4/0x10
[103169.163332]  [<ffffffff81086400>] ? flush_kthread_worker+0xb0/0xb0
[103169.163332]  [<ffffffff815e5c30>] ? gs_change+0x13/0x13
[103169.163332] Code: 65 f0 4c 8b 6d f8 c9 c3 66 90 48 8d be 88 00 00 00 
48 c7 c6 70 98 2b a0 e8 1d ad 02 e1 48 8b 5d e8 4c 8b 65 f0 4c 8b 6d f8 
c9 c3 <0f> 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 57
[103169.163332] RIP  [<ffffffffa02b73f1>] ceph_con_send+0x111/0x120 
[libceph]
[103169.163332]  RSP <ffff88031c5b3bd0>
[103169.805672] ---[ end trace 49d197af1dff5a93 ]---
[103169.818910] BUG: unable to handle kernel paging request at 
fffffffffffffff8
[103169.828781] IP: [<ffffffff810868f0>] kthread_data+0x10/0x20
[103169.828781] PGD 1a07067 PUD 1a08067 PMD 0
[103169.828781] Oops: 0000 [#2] SMP
[103169.828781] CPU 0
[103169.828781] Modules linked in: btrfs zlib_deflate rbd libceph 
libcrc32c ip6table_filter ip6_tables iptable_filter ip_tables x_tables 
kvm_amd kvm bridge nv_tco stp radeon ttm drm_kms_helper drm lp parport 
i2c_algo_bit amd64_edac_mod i2c_nforce2 edac_core edac_mce_amd k10temp 
shpchp psmouse serio_raw ses enclosure aacraid forcedeth
[103169.828781]
[103169.828781] Pid: 4405, comm: kworker/0:1 Tainted: G      D 
3.1.0-rc6 #1 Supermicro H8DM8-2/H8DM8-2
[103169.828781] RIP: 0010:[<ffffffff810868f0>]  [<ffffffff810868f0>] 
kthread_data+0x10/0x20
[103169.828781] RSP: 0018:ffff88031c5b3878  EFLAGS: 00010096
[103169.828781] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 
0000000000000000
[103169.828781] RDX: ffff880405cd5bc0 RSI: 0000000000000000 RDI: 
ffff880405cd5bc0
[103169.828781] RBP: ffff88031c5b3878 R08: 0000000000989680 R09: 
0000000000000000
[103169.828781] R10: 0000000000000400 R11: 0000000000000005 R12: 
ffff880405cd5f88
[103169.828781] R13: 0000000000000000 R14: 0000000000000000 R15: 
ffff880405cd5e90
[103169.828781] FS:  00007f6d43dd2700(0000) GS:ffff88040fc00000(0000) 
knlGS:0000000000000000
[103169.828781] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[103169.828781] CR2: fffffffffffffff8 CR3: 0000000403fb1000 CR4: 
00000000000006f0
[103169.828781] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[103169.828781] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
0000000000000400
[103169.828781] Process kworker/0:1 (pid: 4405, threadinfo 
ffff88031c5b2000, task ffff880405cd5bc0)
[103169.828781] Stack:
[103169.828781]  ffff88031c5b3898 ffffffff81082345 ffff88031c5b3898 
ffff88040fc13080
[103169.828781]  ffff88031c5b3928 ffffffff815d9142 ffff88031c5b38c8 
ffff880405cd5bc0
[103169.828781]  ffff880405cd5bc0 ffff88031c5b3fd8 ffff88031c5b2000 
ffff88031c5b3fd8
[103169.828781] Call Trace:
[103169.828781]  [<ffffffff81082345>] wq_worker_sleeping+0x15/0xa0
[103169.828781]  [<ffffffff815d9142>] __schedule+0x5c2/0x8b0
[103169.828781]  [<ffffffff8105b72f>] schedule+0x3f/0x60
[103169.828781]  [<ffffffff81068223>] do_exit+0x5e3/0x8a0
[103169.828781]  [<ffffffff815dcccf>] oops_end+0xaf/0xf0
[103169.828781]  [<ffffffff8101689b>] die+0x5b/0x90
[103169.828781]  [<ffffffff815dc3d4>] do_trap+0xc4/0x170
[103169.828781]  [<ffffffff81013f25>] do_invalid_op+0x95/0xb0
[103169.828781]  [<ffffffffa02b73f1>] ? ceph_con_send+0x111/0x120 [libceph]
[103169.828781]  [<ffffffff8103be49>] ? default_spin_lock_flags+0x9/0x10
[103169.828781]  [<ffffffff815e5aab>] invalid_op+0x1b/0x20
[103169.828781]  [<ffffffffa02b73f1>] ? ceph_con_send+0x111/0x120 [libceph]
[103169.828781]  [<ffffffffa02bc8ad>] send_queued+0xed/0x130 [libceph]
[103169.828781]  [<ffffffffa02bed81>] ceph_osdc_handle_map+0x261/0x3b0 
[libceph]
[103169.828781]  [<ffffffffa02bb31f>] dispatch+0x10f/0x580 [libceph]
[103169.828781]  [<ffffffffa02b954f>] con_work+0x214f/0x21d0 [libceph]
[103169.828781]  [<ffffffffa02b7400>] ? ceph_con_send+0x120/0x120 [libceph]
[103169.828781]  [<ffffffff8108110d>] process_one_work+0x11d/0x430
[103169.828781]  [<ffffffff81081c69>] worker_thread+0x169/0x360
[103169.828781]  [<ffffffff81081b00>] ? manage_workers.clone.21+0x240/0x240
[103169.828781]  [<ffffffff81086496>] kthread+0x96/0xa0
[103169.828781]  [<ffffffff815e5c34>] kernel_thread_helper+0x4/0x10
[103169.828781]  [<ffffffff81086400>] ? flush_kthread_worker+0xb0/0xb0
[103169.828781]  [<ffffffff815e5c30>] ? gs_change+0x13/0x13
[103169.828781] Code: 5e 41 5f c9 c3 be 3e 01 00 00 48 c7 c7 54 3a 7d 81 
e8 85 d3 fd ff e9 84 fe ff ff 55 48 89 e5 66 66 66 66 90 48 8b 87 70 03 
00 00
[103169.828781]  8b 40 f8 c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 
e5 66
[103169.828781] RIP  [<ffffffff810868f0>] kthread_data+0x10/0x20
[103169.828781]  RSP <ffff88031c5b3878>
[103169.828781] CR2: fffffffffffffff8
[103169.828781] ---[ end trace 49d197af1dff5a94 ]---
[103169.828781] Fixing recursive fault but reboot is needed!

> Sage Weil wrote:
>> [...]


* Re: ceph kernel bug
From: Sage Weil @ 2011-09-15 20:06 UTC
  To: Martin Mailand; +Cc: ceph-devel

On Thu, 15 Sep 2011, Martin Mailand wrote:
> Hi Sage,
> I am still hitting this in -rc6. It happens every time I stop an OSD.
> Do you need more information to reproduce it?

Oh, great to hear it's easy to reproduce!  I was trying (in my uml 
environment) and failing.

Can you run the script below right before stopping the osd, and send the dmesg 
output along?  (Or attach to http://tracker.newdream.net/issues/1382)

Thanks!
sage


#!/bin/sh -x

p() {
 echo "$*" > /sys/kernel/debug/dynamic_debug/control
}

p 'module ceph +p'
p 'module libceph +p'
p 'module rbd +p'
p 'file net/ceph/messenger.c -p'
p 'file' `grep -- --- /sys/kernel/debug/dynamic_debug/control | grep ceph \
	| awk '{print $1}' | sed 's/:/ line /'` '+p'
p 'file' `grep -- === /sys/kernel/debug/dynamic_debug/control | grep ceph \
	| awk '{print $1}' | sed 's/:/ line /'` '+p'
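
Each line the script writes is a dynamic_debug query of the form
'<match-spec> <+|->p'; '+p' enables the matching pr_debug/dout call sites.
The '-p' on messenger.c plus the two grep pipelines then re-enable only the
'---'/'===' banner call sites in the ceph files. A minimal hand-run example
(same control file, assuming debugfs is mounted):

echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control  # enable
echo 'module libceph -p' > /sys/kernel/debug/dynamic_debug/control  # disable
grep '=p' /sys/kernel/debug/dynamic_debug/control  # list enabled call sites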



* Re: ceph kernel bug
From: Martin Mailand @ 2011-09-15 21:10 UTC
  To: Sage Weil; +Cc: ceph-devel

Hi Sage,
that's quite a bit of output, so I put it in a pastebin: 
http://pastebin.com/9CNJk0Pw

Best Regards,
  martin

Sage Weil wrote:
> [...]

* Re: ceph kernel bug
From: Sage Weil @ 2011-09-15 22:54 UTC
  To: Martin Mailand; +Cc: ceph-devel

On Thu, 15 Sep 2011, Martin Mailand wrote:
> Hi Sage,
> that's quite a bit of output, so I put it in a pastebin:
> http://pastebin.com/9CNJk0Pw

Any chance you can include the output of 'objdump -rdS libceph.ko'?  
ceph.ko too, for good measure.

This looks like a slightly different crash than the one on that bug!

Thanks!
sage
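
(Roughly, on the test box -- module paths assumed, and -S only interleaves
source if the modules were built with debug info:

objdump -rdS /lib/modules/$(uname -r)/kernel/net/ceph/libceph.ko > libceph.ko_dump
objdump -rdS /lib/modules/$(uname -r)/kernel/fs/ceph/ceph.ko > ceph.ko_dump

That should make it possible to resolve ceph_con_send+0x111 in the trace to
an exact source line.)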




* Re: ceph kernel bug
From: Martin Mailand @ 2011-09-16  9:05 UTC
  To: Sage Weil; +Cc: ceph-devel

Hi Sage,
I reran the test and I think I triggered the first bug again.
http://pastebin.com/ydNm0pff

I did also the dumps for you.
http://tuxadero.com/multistorage/ceph.ko_dump
http://tuxadero.com/multistorage/libceph.ko_dump

Best Regards,
  martin
On 16.09.2011 00:54, Sage Weil wrote:
> On Thu, 15 Sep 2011, Martin Mailand wrote:
>> Hi Sage,
>> that's quite a bit of output, so I put it in a pastebin:
>> http://pastebin.com/9CNJk0Pw
> Any chance you can include the output of 'objdump -rdS libceph.ko'?
> ceph.ko too, for good measure.
>
> This looks like a slightly different crash than the one on that bug!
>
> Thanks!
> sage



* Re: ceph kernel bug
From: Sage Weil @ 2011-09-16 18:22 UTC
  To: Martin Mailand; +Cc: ceph-devel

Hi Martin,

Thanks, this was enough to help me reproduce it, and I believe I have a 
correct fix (it's working for me).  Can you try commit 935b639 'libceph: 
fix linger request requeuing' (for-linus branch of 
git://github.com/NewDreamNetwork/ceph-client.git) and confirm that it 
fixes things for you as well?

Thanks!
sage
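
(Something along these lines should do it -- the remote name is arbitrary:

git remote add ceph-client git://github.com/NewDreamNetwork/ceph-client.git
git fetch ceph-client for-linus
git checkout 935b639    # libceph: fix linger request requeuing

then rebuild at least the libceph and rbd modules and retest.)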


On Fri, 16 Sep 2011, Martin Mailand wrote:

> Hi Sage,
> I reran the test and I think I triggered the first bug again.
> http://pastebin.com/ydNm0pff
> 
> I did also the dumps for you.
> http://tuxadero.com/multistorage/ceph.ko_dump
> http://tuxadero.com/multistorage/libceph.ko_dump
> 
> Best Regards,
>  martin
> [...]


* Re: ceph kernel bug
From: Martin Mailand @ 2011-09-16 20:17 UTC
  To: Sage Weil; +Cc: ceph-devel

Hi Sage,
yes, it fixes things for me as well.

Best Regards,
  martin

Sage Weil wrote:
> [...]

