* Kernel crashes with RBD
@ 2012-04-11 22:30 Danny Kukawka
  2012-04-13 17:48 ` Josh Durgin
  2012-06-06  7:32 ` Yan, Zheng
  0 siblings, 2 replies; 7+ messages in thread

From: Danny Kukawka @ 2012-04-11 22:30 UTC (permalink / raw)
To: ceph-devel

Hi,

we are currently testing CEPH with RBD on a cluster with 1GBit and
10Gbit interfaces. While we see no kernel crashes with RBD if the
cluster runs on the 1GBit interfaces, we see very frequent kernel
crashes with the 10Gbit network while running tests with e.g. fio
against the RBDs.

I've tested it with kernel v3.0 and also 3.3.0 (with the patches from
the 'for-linus' branch of ceph-client.git at git.kernel.org).

With more client machines running tests, the crashes occur even
faster. The issue is fully reproducible here.

Has anyone seen similar problems? See the backtrace below.

Regards

Danny

PID: 10902  TASK: ffff88032a9a2080  CPU: 0  COMMAND: "kworker/0:0"
 #0 [ffff8803235fd950] machine_kexec at ffffffff810265ee
 #1 [ffff8803235fd9a0] crash_kexec at ffffffff810a3bda
 #2 [ffff8803235fda70] oops_end at ffffffff81444688
 #3 [ffff8803235fda90] __bad_area_nosemaphore at ffffffff81032a35
 #4 [ffff8803235fdb50] do_page_fault at ffffffff81446d3e
 #5 [ffff8803235fdc50] page_fault at ffffffff81443865
    [exception RIP: read_partial_message+816]
    RIP: ffffffffa041e500  RSP: ffff8803235fdd00  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: 00000000000009d7  RCX: 0000000000008000
    RDX: 0000000000000000  RSI: 00000000000009d7  RDI: ffffffff813c8d78
    RBP: ffff880328827030  R8:  00000000000009d7  R9:  0000000000004000
    R10: 0000000000000000  R11: ffffffff81205800  R12: 0000000000000000
    R13: 0000000000000069  R14: ffff88032a9bc780  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffff8803235fdd38] thread_return at ffffffff81440e82
 #7 [ffff8803235fdd78] try_read at ffffffffa041ed58 [libceph]
 #8 [ffff8803235fddf8] con_work at ffffffffa041fb2e [libceph]
 #9 [ffff8803235fde28] process_one_work at ffffffff8107487c
#10 [ffff8803235fde78] worker_thread at ffffffff8107740a
#11 [ffff8803235fdee8] kthread at ffffffff8107b736
#12 [ffff8803235fdf48] kernel_thread_helper at ffffffff8144c144
* Re: Kernel crashes with RBD
  2012-04-11 22:30 Kernel crashes with RBD Danny Kukawka
@ 2012-04-13 17:48 ` Josh Durgin
  2012-04-13 18:18   ` Danny Kukawka
  2012-06-06  7:32 ` Yan, Zheng
  1 sibling, 1 reply; 7+ messages in thread

From: Josh Durgin @ 2012-04-13 17:48 UTC (permalink / raw)
To: Danny Kukawka; +Cc: ceph-devel, Alex Elder

On 04/11/2012 03:30 PM, Danny Kukawka wrote:
> Hi,
>
> we are currently testing CEPH with RBD on a cluster with 1GBit and
> 10Gbit interfaces. While we see no kernel crashes with RBD if the
> cluster runs on the 1GBit interfaces, we see very frequent kernel
> crashes with the 10Gbit network while running tests with e.g. fio
> against the RBDs.
[...]
> Has anyone seen similar problems? See the backtrace below.
[...]

This looks similar to http://tracker.newdream.net/issues/2261. What do
you think, Alex?

Josh
* Re: Kernel crashes with RBD
  2012-04-13 17:48 ` Josh Durgin
@ 2012-04-13 18:18   ` Danny Kukawka
  2012-04-13 20:56     ` Josh Durgin
  0 siblings, 1 reply; 7+ messages in thread

From: Danny Kukawka @ 2012-04-13 18:18 UTC (permalink / raw)
To: ceph-devel; +Cc: Josh Durgin, Alex Elder

Hi,

On 13.04.2012 19:48, Josh Durgin wrote:
> On 04/11/2012 03:30 PM, Danny Kukawka wrote:
[...]
>
> This looks similar to http://tracker.newdream.net/issues/2261. What do
> you think, Alex?

Not sure about that, since this crashes only the clients and not the
OSDs. We see no crashes in the cluster.

I analyzed it a bit more and found that the last working version was
0.43. Any later released version leads to this crash sooner or later,
but as I already said, only on a 10Gbit (FC) network. I didn't see any
crash on the 1Gbit net on the same machines.

What kind of network do you use at DreamHost for testing?

If you need more info, let me know.

Danny
* Re: Kernel crashes with RBD
  2012-04-13 18:18 ` Danny Kukawka
@ 2012-04-13 20:56   ` Josh Durgin
  2012-04-13 23:03     ` Danny Kukawka
  0 siblings, 1 reply; 7+ messages in thread

From: Josh Durgin @ 2012-04-13 20:56 UTC (permalink / raw)
To: Danny Kukawka; +Cc: ceph-devel, Alex Elder

On 04/13/2012 11:18 AM, Danny Kukawka wrote:
> Hi,
>
> On 13.04.2012 19:48, Josh Durgin wrote:
>> On 04/11/2012 03:30 PM, Danny Kukawka wrote:
> [...]
>>
>> This looks similar to http://tracker.newdream.net/issues/2261. What do
>> you think, Alex?
>
> Not sure about that, since this crashes only the clients and not the
> OSDs. We see no crashes in the cluster.

These are both rbd kernel client crashes, but looking again they seem
like different underlying issues, so I opened
http://tracker.newdream.net/issues/2287 to track this problem.

> I analyzed it a bit more and found that the last working version was
> 0.43. Any later released version leads to this crash sooner or later,
> but as I already said, only on a 10Gbit (FC) network. I didn't see any
> crash on the 1Gbit net on the same machines.
>
> What kind of network do you use at DreamHost for testing?

Mostly 1Gbit, some 10Gbit, but I don't think we've tested the kernel
client on 10Gbit yet.

> If you need more info, let me know.

Do the crashes always have the same stack trace? When you say 10Gbit
for the cluster, does that include the client using rbd, or just the
OSDs?

Josh
* Re: Kernel crashes with RBD
  2012-04-13 20:56 ` Josh Durgin
@ 2012-04-13 23:03   ` Danny Kukawka
  2012-04-14 13:32     ` Danny Kukawka
  0 siblings, 1 reply; 7+ messages in thread

From: Danny Kukawka @ 2012-04-13 23:03 UTC (permalink / raw)
To: ceph-devel; +Cc: Josh Durgin, Alex Elder, Danny Kukawka

On 13.04.2012 22:56, Josh Durgin wrote:
> On 04/13/2012 11:18 AM, Danny Kukawka wrote:
[...]
>> What kind of network do you use at DreamHost for testing?
>
> Mostly 1Gbit, some 10Gbit, but I don't think we've tested the kernel
> client on 10Gbit yet.

That's what I assumed ;-)

>> If you need more info, let me know.
>
> Do the crashes always have the same stack trace? When you say 10Gbit
> for the cluster, does that include the client using rbd, or just the
> OSDs?

It's always the same stack trace (sometimes an address is different,
but everything else looks identical).

We tested basically the following setups, and the crash happened with
all of them:
1) OSD, MON and clients in the same 10Gbit network
2) OSD, MON and clients in different public/cluster 10Gbit networks
3) OSD and clients in the same 10Gbit network, MON in a 1Gbit network
4) OSD and clients in the same 10Gbit network, MON in a 1Gbit network,
   with different public/cluster networks

The number of OSDs (tested: 4 nodes with 10 OSDs per node, each on one
physical hard disk) didn't matter in this case. If I use 2 clients
running fio tests against one 50GByte RBD per client, I hit the
problem faster than with one client. If you need information about the
fio tests we used, let me know.

As I already said: we haven't hit this problem on 1Gbit networks yet.

Danny
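[Editor's note: the actual fio job files are offered above on request but never posted in the thread. A minimal job that drives direct random writes against a kernel-mapped RBD device might look like the following sketch; the device path `/dev/rbd0`, block size, queue depth, and runtime are illustrative assumptions, not the parameters Danny used.]

```ini
; Hypothetical fio job for a kernel-mapped RBD device.
; All values are illustrative; the thread does not include
; the real job file.
[global]
ioengine=libaio        ; async I/O, keeps the device queue busy
direct=1               ; bypass the page cache, hit the rbd device directly
bs=4k
iodepth=32
runtime=600
time_based=1

[rbd-randwrite]
filename=/dev/rbd0     ; device created by 'rbd map'
rw=randwrite
size=50g               ; matches the 50GByte images mentioned above
```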
* Re: Kernel crashes with RBD
  2012-04-13 23:03 ` Danny Kukawka
@ 2012-04-14 13:32   ` Danny Kukawka
  0 siblings, 0 replies; 7+ messages in thread

From: Danny Kukawka @ 2012-04-14 13:32 UTC (permalink / raw)
To: ceph-devel; +Cc: Josh Durgin, Alex Elder, Danny Kukawka

On 14.04.2012 01:03, Danny Kukawka wrote:
> It's always the same stack trace (sometimes an address is different,
> but everything else looks identical).
>
> We tested basically the following setups, and the crash happened with
> all of them:
[...]
> As I already said: we haven't hit this problem on 1Gbit networks yet.

Now I see this kind of crash also on 1Gbit while running tests against
RBD, with the following setup:
- 30 OSD nodes, 3 OSDs per node
- 11 clients
- 3 MON
- 2 MDS (1 active, 1 standby)
- different public/cluster 1Gbit networks
- ceph 0.45

ceph -s:
------------
2012-04-14 15:29:53.744829    pg v5609: 18018 pgs: 18018 active+clean; 549 GB data, 1107 GB used, 22382 GB / 23490 GB avail
2012-04-14 15:29:53.781966   mds e6: 1/1/1 up {0=alpha=up:active}, 1 up:standby-replay
2012-04-14 15:29:53.782002   osd e26: 90 osds: 90 up, 90 in
2012-04-14 15:29:53.782233   log 2012-04-14 15:11:51.857127 osd.68 192.168.111.43:6801/23591 1372 : [WRN]
  old request osd_sub_op(client.4180.1:931208 2.dfc eca8dfc/rb.0.3.000000000e76/head [] v 26'1534 snapset=0=[]:[] snapc=0=[]) v6 received at 2012-04-14 15:11:20.882595 currently started
  old request osd_sub_op(client.4250.1:885302 2.898 6bf1d898/rb.0.6.000000000b61/head [] v 26'1662 snapset=0=[]:[] snapc=0=[]) v6 received at 2012-04-14 15:11:21.663493 currently started
  old request osd_op(client.4262.1:888106 rb.0.9.000000000672 [write 1286144~4096] 2.45aa94a) received at 2012-04-14 15:11:19.879144 currently waiting for sub ops
  old request osd_op(client.4250.1:885252 rb.0.6.0000000018a2 [write 1073152~4096] 2.fb3ab288) received at 2012-04-14 15:11:19.927412 currently waiting for sub ops
  old request osd_op(client.4241.1:930301 rb.0.1.000000002fdc [write 8192~4096] 2.9110ba90) received at 2012-04-14 15:11:20.645040 currently waiting for sub ops
  old request osd_op(client.4250.1:885278 rb.0.6.0000000013cf [write 1892352~4096] 2.cfe911f0) received at 2012-04-14 15:11:20.616330 currently waiting for sub ops
2012-04-14 15:29:53.782370   mon e1: 3 mons at {alpha=192.168.111.33:6789/0,beta=192.168.111.34:6789/0,gamma=192.168.111.35:6789/0}
------------

crash backtrace:
------------
PID: 86  TASK: ffff880432326040  CPU: 13  COMMAND: "kworker/13:1"
 #0 [ffff880432329970] machine_kexec at ffffffff810265ee
 #1 [ffff8804323299c0] crash_kexec at ffffffff810a3bda
 #2 [ffff880432329a90] oops_end at ffffffff81444688
 #3 [ffff880432329ab0] __bad_area_nosemaphore at ffffffff81032a35
 #4 [ffff880432329b70] do_page_fault at ffffffff81446d3e
 #5 [ffff880432329c70] page_fault at ffffffff81443865
    [exception RIP: write_partial_msg_pages+1181]
    RIP: ffffffffa0352e8d  RSP: ffff880432329d20  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff88043268a030  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000001000
    RBP: 0000000000000000  R8:  0000000000000000  R9:  00000000479aae8f
    R10: 00000000000005a8  R11: 0000000000000000  R12: 0000000000001000
    R13: ffffea000e917608  R14: 0000160000000000  R15: ffff88043217fbc0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffff880432329d78] try_write at ffffffffa0355345 [libceph]
 #7 [ffff880432329df8] con_work at ffffffffa0355bdd [libceph]
 #8 [ffff880432329e28] process_one_work at ffffffff8107487c
 #9 [ffff880432329e78] worker_thread at ffffffff8107740a
#10 [ffff880432329ee8] kthread at ffffffff8107b736
#11 [ffff880432329f48] kernel_thread_helper at ffffffff8144c144
------------

Danny
* Re: Kernel crashes with RBD
  2012-04-11 22:30 Kernel crashes with RBD Danny Kukawka
  2012-04-13 17:48 ` Josh Durgin
@ 2012-06-06  7:32 ` Yan, Zheng
  1 sibling, 0 replies; 7+ messages in thread

From: Yan, Zheng @ 2012-06-06 7:32 UTC (permalink / raw)
To: ceph-devel

I think I tracked this bug down; the Oops is due to
'msg->bio_iter == NULL'.

---
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index f0993af..ac16f13 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -549,6 +549,10 @@ static void prepare_write_message(struct ceph_connection *con)
 	}
 	m = list_first_entry(&con->out_queue, struct ceph_msg, list_head);
+#ifdef CONFIG_BLOCK
+	if (m->bio && m->bio_iter)
+		m->bio_iter = NULL;
+#endif
 	con->out_msg = m;

 	/* put message on sent list */

On Thu, Apr 12, 2012 at 6:30 AM, Danny Kukawka <danny.kukawka@bisect.de> wrote:
> Hi,
>
> we are currently testing CEPH with RBD on a cluster with 1GBit and
> 10Gbit interfaces. While we see no kernel crashes with RBD if the
> cluster runs on the 1GBit interfaces, we see very frequent kernel
> crashes with the 10Gbit network while running tests with e.g. fio
> against the RBDs.
[...]
> Has anyone seen similar problems? See the backtrace below.
[...]

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
end of thread, other threads: [~2012-06-06 7:32 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-11 22:30 Kernel crashes with RBD Danny Kukawka
2012-04-13 17:48 ` Josh Durgin
2012-04-13 18:18   ` Danny Kukawka
2012-04-13 20:56     ` Josh Durgin
2012-04-13 23:03       ` Danny Kukawka
2012-04-14 13:32         ` Danny Kukawka
2012-06-06  7:32 ` Yan, Zheng