* oops in rbd module (con_work in libceph)
@ 2012-07-06  8:31 Yann Dupont
  2012-07-06 15:35 ` Yann Dupont
  0 siblings, 1 reply; 10+ messages in thread
From: Yann Dupont @ 2012-07-06  8:31 UTC (permalink / raw)
  To: ceph-devel

Hello.

The bug happens in the rbd client, at least in kernel 3.4.4. I have a 
completely reproducible bug.

Here is the oops:


Jul  6 10:16:52 label5.u14.univ-nantes.prive kernel: [  329.456285] 
EXT4-fs (rbd1): mounted filesystem with ordered data mode. Opts: (null)
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.709145] 
libceph: osd1 172.20.14.131:6801 socket closed
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.715245] BUG: 
unable to handle kernel NULL pointer dereference at 0000000000000048
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.715430] IP: 
[<ffffffffa08488f0>] con_work+0xfb0/0x20b0 [libceph]
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.715554] PGD 
a094cb067 PUD a0a7a7067 PMD 0
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.715758] 
Oops: 0000 [#1] SMP
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.715914] CPU 0
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.715963] 
Modules linked in: ext4 jbd2 crc16 rbd libceph drbd lru_cache cn 
ip6table_filter ip6_tables iptable_filt
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.720338]
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.720406] Pid: 
1007, comm: kworker/0:2 Not tainted 3.4.4-dsiun-120521 #111 Dell Inc. 
PowerEdge M610/0V56FN
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.720637] RIP: 
0010:[<ffffffffa08488f0>]  [<ffffffffa08488f0>] con_work+0xfb0/0x20b0 
[libceph]
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.720779] RSP: 
0000:ffff880a1036dd50  EFLAGS: 00010246
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.720851] RAX: 
0000000000000000 RBX: 0000000000000000 RCX: 0000000000031000
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.720925] RDX: 
0000000000000000 RSI: ffff880a1092c5a0 RDI: ffff880a1092c598
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.721002] RBP: 
000000000004f000 R08: 0000000000000020 R09: 0000000000000000
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.721100] R10: 
0000000000000010 R11: ffff880a122e0f08 R12: 0000000000000001
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.721173] R13: 
ffff880a1092c500 R14: ffffea001430e300 R15: ffff880a0990f030
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.721247] FS:  
0000000000000000(0000) GS:ffff880a2fc00000(0000) knlGS:0000000000000000
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.721337] CS:  
0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.721409] CR2: 
0000000000000048 CR3: 0000000a10823000 CR4: 00000000000007f0
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.721483] DR0: 
0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.721557] DR3: 
0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.721632] 
Process kworker/0:2 (pid: 1007, threadinfo ffff880a1036c000, task 
ffff880a10b2f2c0)
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.721721] Stack:
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.721784] 
0000000200000000 ffff880a1036ddfc 0000000000000400 ffff880a00000000
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.722050] 
ffff880a1036ddd8 000000000004f000 ffff880a0004f000 ffff880a00000000
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.722315] 
ffff880a0990f420 ffff880a1092c5a0 ffff880a0990f308 ffff880a0990f1a8
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.722581] Call 
Trace:
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.722653] 
[<ffffffff810534d2>] ? process_one_work+0x122/0x3f0
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.722728] 
[<ffffffffa0847940>] ? ceph_con_revoke_message+0xc0/0xc0 [libceph]
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.722819] 
[<ffffffff81054c65>] ? worker_thread+0x125/0x2e0
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.722892] 
[<ffffffff81054b40>] ? manage_workers.isra.25+0x1f0/0x1f0
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.722969] 
[<ffffffff81059b85>] ? kthread+0x85/0x90
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.723042] 
[<ffffffff813baee4>] ? kernel_thread_helper+0x4/0x10
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.723116] 
[<ffffffff81059b00>] ? flush_kthread_worker+0x80/0x80
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.723189] 
[<ffffffff813baee0>] ? gs_change+0x13/0x13
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.723258] 
Code: ea f4 ff ff 0f 1f 80 00 00 00 00 49 83 bd 90 00 00 00 00 0f 84 ca 
03 00 00 49 63 85 a0 00 00 00 49
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.727478] RIP  
[<ffffffffa08488f0>] con_work+0xfb0/0x20b0 [libceph]
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.727599] RSP 
<ffff880a1036dd50>
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.727664] CR2: 
0000000000000048
Jul  6 10:18:38 label5.u14.univ-nantes.prive kernel: [  434.727846] ---[ 
end trace 100f342b55356819 ]---
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.728031] BUG: 
unable to handle kernel paging request at fffffffffffffff8
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.728192] IP: 
[<ffffffff81059d27>] kthread_data+0x7/0x10
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.728313] PGD 
14fe067 PUD 14ff067 PMD 0
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.728517] 
Oops: 0000 [#2] SMP
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.728676] CPU 0
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.728725] 
Modules linked in: ext4 jbd2 crc16 rbd libceph drbd lru_cache cn 
ip6table_filter ip6_tables iptable_filt
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.733034]
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.733100] Pid: 
1007, comm: kworker/0:2 Tainted: G      D 3.4.4-dsiun-120521 #111 Dell 
Inc. PowerEdge M610/0V5
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.733330] RIP: 
0010:[<ffffffff81059d27>]  [<ffffffff81059d27>] kthread_data+0x7/0x10
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.733470] RSP: 
0000:ffff880a1036da30  EFLAGS: 00010002
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.733539] RAX: 
0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.733612] RDX: 
ffffffff8164a380 RSI: 0000000000000000 RDI: ffff880a10b2f2c0
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.733686] RBP: 
ffff880a10b2f2c0 R08: 0000000000989680 R09: ffffffff8164a380
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.733758] R10: 
0000000000000800 R11: 000000000000fff8 R12: ffff880a2fc120c0
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.733830] R13: 
0000000000000000 R14: ffff880a10b2f2b0 R15: ffff880a10b2f2c0
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.733904] FS:  
0000000000000000(0000) GS:ffff880a2fc00000(0000) knlGS:0000000000000000
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.733993] CS:  
0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.734064] CR2: 
fffffffffffffff8 CR3: 0000000a10823000 CR4: 00000000000007f0
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.734138] DR0: 
0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.734211] DR3: 
0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.734284] 
Process kworker/0:2 (pid: 1007, threadinfo ffff880a1036c000, task 
ffff880a10b2f2c0)
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.734375] Stack:
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.734439] 
ffffffff81055ae8 ffff880a10b2f590 ffffffff813b807d ffff880a10b2f2c0
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.734706] 
ffff880a10b2f2c0 ffff880a1036dfd8 ffff880a1036dfd8 ffff880a1036dfd8
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.734971] 
ffff880a10b2f2c0 0000000000000001 ffff880a10b2f7a4 0000000000000000
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.735237] Call 
Trace:
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.735309] 
[<ffffffff81055ae8>] ? wq_worker_sleeping+0x8/0x90
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.735386] 
[<ffffffff813b807d>] ? __schedule+0x41d/0x6c0
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.735463] 
[<ffffffff8103e2a2>] ? do_exit+0x592/0x8c0
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.735537] 
[<ffffffff81006068>] ? oops_end+0x98/0xe0
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.735611] 
[<ffffffff813b0f96>] ? no_context+0x24e/0x279
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.735685] 
[<ffffffff8102e31b>] ? do_page_fault+0x3ab/0x460
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.735760] 
[<ffffffff8135677b>] ? tcp_established_options+0x3b/0xd0
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.735833] 
[<ffffffff813589aa>] ? tcp_write_xmit+0x15a/0xac0
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.735907] 
[<ffffffff813b9179>] ? _raw_spin_lock_bh+0x9/0x30
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.735984] 
[<ffffffff812f9a79>] ? release_sock+0x19/0x100
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.736056] 
[<ffffffff8134af43>] ? tcp_sendpage+0xf3/0x700
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.736131] 
[<ffffffff813b94f5>] ? page_fault+0x25/0x30
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.736206] 
[<ffffffffa08488f0>] ? con_work+0xfb0/0x20b0 [libceph]
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.736280] 
[<ffffffff810534d2>] ? process_one_work+0x122/0x3f0
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.736355] 
[<ffffffffa0847940>] ? ceph_con_revoke_message+0xc0/0xc0 [libceph]
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.736446] 
[<ffffffff81054c65>] ? worker_thread+0x125/0x2e0
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.736518] 
[<ffffffff81054b40>] ? manage_workers.isra.25+0x1f0/0x1f0
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.736593] 
[<ffffffff81059b85>] ? kthread+0x85/0x90
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.736664] 
[<ffffffff813baee4>] ? kernel_thread_helper+0x4/0x10
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.736739] 
[<ffffffff81059b00>] ? flush_kthread_worker+0x80/0x80
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.736813] 
[<ffffffff813baee0>] ? gs_change+0x13/0x13
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.736883] 
Code: fe ff ff 90 eb 90 be 57 01 00 00 48 c7 c7 9b 70 47 81 e8 cd 00 fe 
ff e9 94 fe ff ff 0f 1f 84 00 00
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.739914] RIP  
[<ffffffff81059d27>] kthread_data+0x7/0x10
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.740036] RSP 
<ffff880a1036da30>
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.740103] CR2: 
fffffffffffffff8
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.740170] ---[ 
end trace 100f342b5535681a ]---
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  434.740250] 
Fixing recursive fault but reboot is needed!
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  494.699770] 
INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 16, 
t=6002 jiffies)
Jul  6 10:19:38 label5.u14.univ-nantes.prive kernel: [  494.700039] 
INFO: Stall ended before state dump start




Steps (for me) to reproduce:

The volume is on my freshly re-created Ceph cluster with 8 OSD nodes 
(xfs-formatted OSDs). I created an rbd volume (yd-bench) on it; this rbd 
volume is ext4-formatted.

It only contains a copy of the linux-stable git tree.

Then, on a client (running nothing Ceph-related):

modprobe rbd
rbd map yd-bench
mount
cd linux-stable
make -j24 bzImage modules

is sufficient to trigger the crash. The machine has 32 GB of RAM and is 
64-bit, and the same workload works fine on a local disk.
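
For reference, a consolidated sketch of the sequence above (the device name 
/dev/rbd0 and the mount point /mnt/yd-bench are assumptions for 
illustration; the original commands do not spell them out):

modprobe rbd
rbd map yd-bench                  # image appears as a block device, assumed /dev/rbd0
mount /dev/rbd0 /mnt/yd-bench     # the ext4-formatted rbd volume
cd /mnt/yd-bench/linux-stable
make -j24 bzImage modules         # the parallel build is enough to trigger the crash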

The kernel is vanilla 3.4.4.

Any ideas?

Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: oops in rbd module (con_work in libceph)
  2012-07-06  8:31 oops in rbd module (con_work in libceph) Yann Dupont
@ 2012-07-06 15:35 ` Yann Dupont
  2012-07-07  0:16   ` Alex Elder
  0 siblings, 1 reply; 10+ messages in thread
From: Yann Dupont @ 2012-07-06 15:35 UTC (permalink / raw)
  To: Yann Dupont; +Cc: ceph-devel

On 06/07/2012 10:31, Yann Dupont wrote:
> Hello.
>
> The bug happens in the rbd client, at least in kernel 3.4.4. I have a 
> completely reproducible bug.
Just a note: 3.2.22 doesn't seem to exhibit the problem. I repeated 
the process twice without problems on this kernel.

I'll launch a realistic load on our Ceph volume this weekend (Bacula 
backups). I'll see if 3.2.22 is solid.

Then on Monday or Tuesday I'll take the git bisect route to see which 
patch introduced the problem.

Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: oops in rbd module (con_work in libceph)
  2012-07-06 15:35 ` Yann Dupont
@ 2012-07-07  0:16   ` Alex Elder
  2012-07-09 16:54     ` Yann Dupont
  0 siblings, 1 reply; 10+ messages in thread
From: Alex Elder @ 2012-07-07  0:16 UTC (permalink / raw)
  To: Yann Dupont; +Cc: ceph-devel

On 07/06/2012 10:35 AM, Yann Dupont wrote:
> On 06/07/2012 10:31, Yann Dupont wrote:
>> Hello.
>>
>> The bug happens in the rbd client, at least in kernel 3.4.4. I have a
>> completely reproducible bug.
> Just a note: 3.2.22 doesn't seem to exhibit the problem. I repeated
> the process twice without problems on this kernel.

There are a number of bugs that have been fixed since Linux 3.4,
and the fixes have not made it into the 3.4.y stable releases.

I just sent an announcement about the Ceph stable branch that's
available in the Ceph git repository.  If possible I would
recommend you try using that for 3.4 testing.  The branch is here:

    http://github.com/ceph/ceph-client/tree/linux-3.4.4-ceph
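
A minimal sketch of how that branch might be fetched and built (the clone 
URL, config reuse, and build steps are assumptions based on a standard 
kernel workflow, not something stated in this thread):

git clone https://github.com/ceph/ceph-client.git
cd ceph-client
git checkout linux-3.4.4-ceph
cp /boot/config-$(uname -r) .config     # start from the running kernel's config (assumption)
yes "" | make oldconfig                 # accept defaults for any new options
make -j$(nproc) bzImage modules
make modules_install install            # then reboot into the new kernel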

If you do have troubles I would very much like to hear about it.
And if you don't run into the problems you've been seeing that
would be good to know as well.

					-Alex

> I'll launch a realistic load on our Ceph volume this weekend (Bacula
> backups). I'll see if 3.2.22 is solid.
> 
> Then on Monday or Tuesday I'll take the git bisect route to see which
> patch introduced the problem.
> 
> Cheers,
> 


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: oops in rbd module (con_work in libceph)
  2012-07-07  0:16   ` Alex Elder
@ 2012-07-09 16:54     ` Yann Dupont
  2012-07-09 17:04       ` Yann Dupont
  0 siblings, 1 reply; 10+ messages in thread
From: Yann Dupont @ 2012-07-09 16:54 UTC (permalink / raw)
  To: Alex Elder; +Cc: ceph-devel

On 07/07/2012 02:16, Alex Elder wrote:

[...]
> There are a number of bugs that have been fixed since Linux 3.4,
> and the fixes have not made it into the 3.4.y stable releases.
>
> I just sent an announcement about the Ceph stable branch that's
> available in the Ceph git repository.  If possible I would
> recommend you try using that for 3.4 testing.  The branch is here:
>
>      http://github.com/ceph/ceph-client/tree/linux-3.4.4-ceph

OK, I've compiled the kernel this afternoon and tested it, without much 
success:

Jul  9 18:17:23 label5.u14.univ-nantes.prive kernel: [  284.116236] 
libceph: osd0 172.20.14.130:6801 socket closed
Jul  9 18:17:43 label5.u14.univ-nantes.prive kernel: [  304.101545] 
libceph: osd6 172.20.14.137:6800 socket closed
Jul  9 18:17:53 label5.u14.univ-nantes.prive kernel: [  314.095155] 
libceph: osd3 172.20.14.134:6800 socket closed
Jul  9 18:18:38 label5.u14.univ-nantes.prive kernel: [  359.075473] 
libceph: osd5 172.20.14.136:6800 socket closed
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.107334] 
libceph: osd6 172.20.14.137:6800 socket closed
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.121001] BUG: 
unable to handle kernel NULL pointer dereference at 0000000000000048
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.121184] IP: 
[<ffffffffa0822940>] con_work+0xfb0/0x20e0 [libceph]
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.121307] PGD 
a122d4067 PUD a11753067 PMD 0
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.121512] 
Oops: 0000 [#1] SMP
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.121670] CPU 0
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.121721] 
Modules linked in: ext4 jbd2 crc16 rbd libceph drbd lru_cache cn 
ip6table_filter ip6_tables iptable_filter ip_tables x_tables bridge stp 
dlm sctp nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc ocfs2_dlmfs 
ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs 
ipv6 fuse ext2 mbcache dm_round_robin dm_multipath scsi_dh snd_pcm 
snd_timer snd ioatdma soundcore coretemp dcdbas i7core_edac edac_core 
snd_page_alloc iTCO_wdt crc32c_intel dca processor joydev pcspkr hed 
evdev button microcode thermal_sys xfs exportfs btrfs zlib_deflate 
dm_mod sd_mod usbhid hid ata_generic ata_piix libata uhci_hcd mptsas 
mptscsih ide_pci_generic mptbase ide_core scsi_transport_sas lpfc bnx2x 
ehci_hcd scsi_transport_fc scsi_tgt crc32c scsi_mod libcrc32c bnx2 mdio 
[last unloaded: scsi_wait_scan]
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.126062]
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.126127] Pid: 
15886, comm: kworker/0:3 Not tainted 3.4.4-dsiun-120521+ #1 Dell Inc. 
PowerEdge M610/0V56FN
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.126351] RIP: 
0010:[<ffffffffa0822940>]  [<ffffffffa0822940>] con_work+0xfb0/0x20e0 
[libceph]
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.126491] RSP: 
0000:ffff880a08fa5d50  EFLAGS: 00010246
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.126560] RAX: 
0000000000000000 RBX: 0000000000000000 RCX: 0000000000020000
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.126633] RDX: 
0000000000000000 RSI: ffff880a0d9854a0 RDI: ffff880a0d985498
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.126707] RBP: 
0000000000080000 R08: 0000000000000020 R09: 0000000000000000
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.126779] R10: 
0000000000000029 R11: ffff880509d30808 R12: 0000000000000001
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.126852] R13: 
ffff880a0d985400 R14: ffffea0014151f00 R15: ffff880a12ffa830
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.126925] FS:  
0000000000000000(0000) GS:ffff880a2fc00000(0000) knlGS:0000000000000000
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.127014] CS:  
0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.127084] CR2: 
0000000000000048 CR3: 0000000a0d275000 CR4: 00000000000007f0
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.127157] DR0: 
0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.127229] DR3: 
0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.127302] 
Process kworker/0:3 (pid: 15886, threadinfo ffff880a08fa4000, task 
ffff880a07960000)
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.127391] Stack:
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.127454] 
0000000200000000 ffff880a08fa5dfc 0000000000000400 ffffffff00000000
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.127721] 
ffff880a08fa5dd8 0000000000080000 ffff880a00080000 ffff880a00000000
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.127990] 
ffff880a12ffac20 ffff880a0d9854a0 ffff880a12ffab08 ffff880a12ffa9a8
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.128260] Call 
Trace:
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.128335] 
[<ffffffff810534d2>] ? process_one_work+0x122/0x3f0
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.128412] 
[<ffffffffa0821990>] ? ceph_con_revoke_message+0xc0/0xc0 [libceph]
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.128503] 
[<ffffffff81054c65>] ? worker_thread+0x125/0x2e0
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.128576] 
[<ffffffff81054b40>] ? manage_workers.isra.25+0x1f0/0x1f0
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.129863] 
[<ffffffff81059b85>] ? kthread+0x85/0x90
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.129938] 
[<ffffffff813baee4>] ? kernel_thread_helper+0x4/0x10
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.130012] 
[<ffffffff81059b00>] ? flush_kthread_worker+0x80/0x80
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.130085] 
[<ffffffff813baee0>] ? gs_change+0x13/0x13
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.130154] 
Code: ea f4 ff ff 0f 1f 80 00 00 00 00 49 83 bd 90 00 00 00 00 0f 84 ca 
03 00 00 49 63 85 a0 00 00 00 49 8b 95 98 00 00 00 48 c1 e0 04 <48> 03 
42 48 4c 8b 30 44 8b 48 0c 8b 70 08 e9 32 fc ff ff 31 c0
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.133172] RIP  
[<ffffffffa0822940>] con_work+0xfb0/0x20e0 [libceph]
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.133297] RSP 
<ffff880a08fa5d50>
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.133363] CR2: 
0000000000000048
Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [  429.133526] ---[ 
end trace 00282dc1efb5b115 ]---
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.133654] BUG: 
unable to handle kernel paging request at fffffffffffffff8
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.133817] IP: 
[<ffffffff81059d27>] kthread_data+0x7/0x10
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.133938] PGD 
14fe067 PUD 14ff067 PMD 0
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.134141] 
Oops: 0000 [#2] SMP
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.134294] CPU 0
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.134341] 
Modules linked in: ext4 jbd2 crc16 rbd libceph drbd lru_cache cn 
ip6table_filter ip6_tables iptable_filter ip_tables x_tables bridge stp 
dlm sctp nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc ocfs2_dlmfs 
ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs 
ipv6 fuse ext2 mbcache dm_round_robin dm_multipath scsi_dh snd_pcm 
snd_timer snd ioatdma soundcore coretemp dcdbas i7core_edac edac_core 
snd_page_alloc iTCO_wdt crc32c_intel dca processor joydev pcspkr hed 
evdev button microcode thermal_sys xfs exportfs btrfs zlib_deflate 
dm_mod sd_mod usbhid hid ata_generic ata_piix libata uhci_hcd mptsas 
mptscsih ide_pci_generic mptbase ide_core scsi_transport_sas lpfc bnx2x 
ehci_hcd scsi_transport_fc scsi_tgt crc32c scsi_mod libcrc32c bnx2 mdio 
[last unloaded: scsi_wait_scan]
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.138608]
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.138673] Pid: 
15886, comm: kworker/0:3 Tainted: G      D 3.4.4-dsiun-120521+ #1 Dell 
Inc. PowerEdge M610/0V56FN
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.138903] RIP: 
0010:[<ffffffff81059d27>]  [<ffffffff81059d27>] kthread_data+0x7/0x10
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.139041] RSP: 
0000:ffff880a08fa5a30  EFLAGS: 00010002
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.139110] RAX: 
0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.139182] RDX: 
ffffffff8164a380 RSI: 0000000000000000 RDI: ffff880a07960000
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.139254] RBP: 
ffff880a07960000 R08: 0000000000989680 R09: ffffffff8164a380
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.139325] R10: 
0000000000000400 R11: 0000000000000000 R12: ffff880a2fc120c0
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.139397] R13: 
0000000000000000 R14: ffff880a0795fff0 R15: ffff880a07960000
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.139470] FS:  
0000000000000000(0000) GS:ffff880a2fc00000(0000) knlGS:0000000000000000
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.139557] CS:  
0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.139627] CR2: 
fffffffffffffff8 CR3: 0000000a0d275000 CR4: 00000000000007f0
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.139700] DR0: 
0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.139773] DR3: 
0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.139847] 
Process kworker/0:3 (pid: 15886, threadinfo ffff880a08fa4000, task 
ffff880a07960000)
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.139937] Stack:
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.140000] 
ffffffff81055ae8 ffff880a079602d0 ffffffff813b807d ffff880a07960000
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.140266] 
ffff880a07960000 ffff880a08fa5fd8 ffff880a08fa5fd8 ffff880a08fa5fd8
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.140530] 
ffff880a07960000 ffff880a07960000 ffff880a079604e4 0000000000000000
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.140795] Call 
Trace:
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.140865] 
[<ffffffff81055ae8>] ? wq_worker_sleeping+0x8/0x90
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.140943] 
[<ffffffff813b807d>] ? __schedule+0x41d/0x6c0
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.141019] 
[<ffffffff8103e2a2>] ? do_exit+0x592/0x8c0
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.141093] 
[<ffffffff81006068>] ? oops_end+0x98/0xe0
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.141166] 
[<ffffffff813b0f96>] ? no_context+0x24e/0x279
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.141240] 
[<ffffffff8102e31b>] ? do_page_fault+0x3ab/0x460
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.141314] 
[<ffffffff8135677b>] ? tcp_established_options+0x3b/0xd0
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.141387] 
[<ffffffff813589aa>] ? tcp_write_xmit+0x15a/0xac0
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.141460] 
[<ffffffff813b9179>] ? _raw_spin_lock_bh+0x9/0x30
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.141535] 
[<ffffffff812f9a79>] ? release_sock+0x19/0x100
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.141606] 
[<ffffffff8134af43>] ? tcp_sendpage+0xf3/0x700
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.141677] 
[<ffffffff813b94f5>] ? page_fault+0x25/0x30
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.141754] 
[<ffffffffa0822940>] ? con_work+0xfb0/0x20e0 [libceph]
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.141828] 
[<ffffffff810534d2>] ? process_one_work+0x122/0x3f0
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.141901] 
[<ffffffffa0821990>] ? ceph_con_revoke_message+0xc0/0xc0 [libceph]
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.141991] 
[<ffffffff81054c65>] ? worker_thread+0x125/0x2e0
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.142062] 
[<ffffffff81054b40>] ? manage_workers.isra.25+0x1f0/0x1f0
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.142135] 
[<ffffffff81059b85>] ? kthread+0x85/0x90
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.142207] 
[<ffffffff813baee4>] ? kernel_thread_helper+0x4/0x10
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.142281] 
[<ffffffff81059b00>] ? flush_kthread_worker+0x80/0x80
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.142354] 
[<ffffffff813baee0>] ? gs_change+0x13/0x13
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.142423] 
Code: fe ff ff 90 eb 90 be 57 01 00 00 48 c7 c7 9c 70 47 81 e8 cd 00 fe 
ff e9 94 fe ff ff 0f 1f 84 00 00 00 00 00 48 8b 87 78 02 00 00 <48> 8b 
40 f8 c3 0f 1f 40 00 48 3b 3d b1 05 5f 00 74 0f 65 8b 04
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.145429] RIP  
[<ffffffff81059d27>] kthread_data+0x7/0x10
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.145549] RSP 
<ffff880a08fa5a30>
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.145615] CR2: 
fffffffffffffff8
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.145682] ---[ 
end trace 00282dc1efb5b116 ]---
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  429.145764] 
Fixing recursive fault but reboot is needed!
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  489.120508] 
INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 5, 
t=6002 jiffies)
Jul  9 18:20:48 label5.u14.univ-nantes.prive kernel: [  489.120772] 
INFO: Stall ended before state dump start

>
> If you do have troubles I would very much like to hear about it.
> And if you don't run into the problems you've been seeing that
> would be good to know as well.

As the error is the same as with plain 3.4.4, I double-checked to make 
sure I wasn't accidentally running an older kernel, but as far as I can 
tell it's the right kernel.

root@label5:/usr/src/GIT/ceph-client# uname -a
Linux label5 3.4.4-dsiun-120521+ #1 SMP Mon Jul 9 17:23:49 CEST 2012 
x86_64 GNU/Linux


root@label5:~# modinfo libceph
filename: /lib/modules/3.4.4-dsiun-120521+/kernel/net/ceph/libceph.ko
license:        GPL
description:    Ceph filesystem for Linux
author:         Patience Warnick <patience@newdream.net>
author:         Yehuda Sadeh <yehuda@hq.newdream.net>
author:         Sage Weil <sage@newdream.net>
depends:        libcrc32c
intree:         Y
vermagic:       3.4.4-dsiun-120521+ SMP mod_unload modversions
root@label5:~# modinfo ceph
filename: /lib/modules/3.4.4-dsiun-120521+/kernel/fs/ceph/ceph.ko
license:        GPL
description:    Ceph filesystem for Linux
author:         Patience Warnick <patience@newdream.net>
author:         Yehuda Sadeh <yehuda@hq.newdream.net>
author:         Sage Weil <sage@newdream.net>
depends:        libceph
intree:         Y
vermagic:       3.4.4-dsiun-120521+ SMP mod_unload modversions


So maybe I messed up with git? For me, the last commit in my local branch 
(tracking remote branch linux-3.4.4-ceph from origin)
is this:

root@label5:/usr/src/GIT/ceph-client# git branch
   for-linus
* linux-3.4.4-ceph

root@label5:/usr/src/GIT/ceph-client# git log
commit c92a3ead0da1f13f5c971bba4eaa041ed22bb06e
Author: Sage Weil <sage@inktank.com>
Date:   Sun Jun 10 20:43:56 2012 -0700

...

> 					-Alex
>
>> I'll launch a realistic load on our Ceph volume this weekend (Bacula
>> backups). I'll see if 3.2.22 is solid.

At least one piece of good news: as far as I can tell, it is. It has been 
backing up since Friday without problems on 3.2.22.

Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: oops in rbd module (con_work in libceph)
  2012-07-09 16:54     ` Yann Dupont
@ 2012-07-09 17:04       ` Yann Dupont
  2012-07-10 17:46         ` Gregory Farnum
  0 siblings, 1 reply; 10+ messages in thread
From: Yann Dupont @ 2012-07-09 17:04 UTC (permalink / raw)
  To: Yann Dupont; +Cc: Alex Elder, ceph-devel

On 09/07/2012 18:54, Yann Dupont wrote:
>
> OK, I've compiled the kernel this afternoon and tested it, without 
> much success:
>
> Jul  9 18:17:23 label5.u14.univ-nantes.prive kernel: [ 284.116236] 
> libceph: osd0 172.20.14.130:6801 socket closed
> Jul  9 18:17:43 label5.u14.univ-nantes.prive kernel: [ 304.101545] 
> libceph: osd6 172.20.14.137:6800 socket closed
> Jul  9 18:17:53 label5.u14.univ-nantes.prive kernel: [ 314.095155] 
> libceph: osd3 172.20.14.134:6800 socket closed
> Jul  9 18:18:38 label5.u14.univ-nantes.prive kernel: [ 359.075473] 
> libceph: osd5 172.20.14.136:6800 socket closed
> Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [ 429.107334] 
> libceph: osd6 172.20.14.137:6800 socket closed

Just an interesting thing I noticed in the logs:

osd-0.log
2012-07-09 18:17:23.763925 7ff9fc19e700  0 bad crc in data 3071411075 != 
exp 2231697357
2012-07-09 18:17:23.777607 7ff9fc19e700  0 -- 172.20.14.130:6801/5842 >> 
172.20.14.132:0/1974511416 pipe(0x2236c80 sd=38 pgs=0 cs=0 l=0).accept 
peer addr is really 172.20.14.132:0/1974511416 (socket is 
172.20.14.132:57972/0)

osd-3.log
2012-07-09 18:17:53.770111 7fe35461c700  0 bad crc in data 826922774 != 
exp 2498450653
2012-07-09 18:17:53.770972 7fe35461c700  0 -- 172.20.14.134:6800/4495 >> 
172.20.14.132:0/1974511416
  pipe(0xa44ec80 sd=56 pgs=0 cs=0 l=0).accept peer addr is really 
172.20.14.132:0/1974511416 (socket
  is 172.20.14.132:40726/0)

osd-5.log
2012-07-09 18:18:38.766417 7ff4a66cb700  0 bad crc in data 3949121728 != 
exp 2496058560
2012-07-09 18:18:38.773386 7ff4a66cb700  0 -- 172.20.14.136:6800/4876 >> 
172.20.14.132:0/1974511416 pipe(0x20eeb780 sd=56 pgs=0 cs=0 l=0).accept 
peer addr is really 172.20.14.132:0/1974511416 (socket is 
172.20.14.132:57072/0)

osd-6.log
2012-07-09 18:17:43.765740 7fdf86b9d700  0 bad crc in data 2899452345 != 
exp 2656886014
2012-07-09 18:17:43.772599 7fdf86b9d700  0 -- 172.20.14.137:6800/5260 >> 
172.20.14.132:0/1974511416
  pipe(0x1ec64780 sd=31 pgs=0 cs=0 l=0).accept peer addr is really 
172.20.14.132:0/1974511416 (socke
t is 172.20.14.132:48615/0)

2012-07-09 18:17:43.773170 7fdf8c718700  0 osd.6 347 pg[2.60( v 
347'36181 (337'35180,347'36181] n=4
144 ec=1 les/c 6/6 5/5/5) [6,7] r=0 lpr=5 mlcod 347'36180 active+clean] 
watch: ctx->obc=0x102db340
cookie=1 oi.version=36169 ctx->at_version=347'36182
2012-07-09 18:17:43.773209 7fdf8c718700  0 osd.6 347 pg[2.60( v 
347'36181 (337'35180,347'36181] n=4144 ec=1 les/c 6/6 5/5/5) [6,7] r=0 
lpr=5 mlcod 347'36180 active+clean] watch: oi.user_version=1559
2012-07-09 18:19:48.837952 7fdf86b9d700  0 bad crc in data 1231964953 != 
exp 2305533436
2012-07-09 18:19:48.838850 7fdf86b9d700  0 -- 172.20.14.137:6800/5260 >> 
172.20.14.132:0/1974511416 pipe(0x1ec64c80 sd=31 pgs=0 cs=0 l=0).accept 
peer addr is really 172.20.14.132:0/1974511416 (socket is 
172.20.14.132:48618/0)
2012-07-09 18:19:48.839493 7fdf8c718700  0 osd.6 347 pg[2.60( v 
347'36192 (337'35191,347'36192] n=4144 ec=1 les/c 6/6 5/5/5) [6,7] r=0 
lpr=5 mlcod 347'36191 active+clean] watch: ctx->obc=0x102db340 cookie=1 
oi.version=36169 ctx->at_version=347'36193
2012-07-09 18:19:48.839530 7fdf8c718700  0 osd.6 347 pg[2.60( v 
347'36192 (337'35191,347'36192] n=4144 ec=1 les/c 6/6 5/5/5) [6,7] r=0 
lpr=5 mlcod 347'36191 active+clean] watch: oi.user_version=1559


Each time, at that exact moment, a bad CRC (these are the only ones for 
this day, so it seems related).

Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: oops in rbd module (con_work in libceph)
  2012-07-09 17:04       ` Yann Dupont
@ 2012-07-10 17:46         ` Gregory Farnum
  2012-07-11 20:23           ` Yann Dupont
  0 siblings, 1 reply; 10+ messages in thread
From: Gregory Farnum @ 2012-07-10 17:46 UTC (permalink / raw)
  To: Yann Dupont; +Cc: Alex Elder, ceph-devel

On Mon, Jul 9, 2012 at 10:04 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
> On 09/07/2012 18:54, Yann Dupont wrote:
>
>>
>> OK, I've compiled the kernel this afternoon and tested it, without much
>> success:
>>
>> Jul  9 18:17:23 label5.u14.univ-nantes.prive kernel: [ 284.116236]
>> libceph: osd0 172.20.14.130:6801 socket closed
>> Jul  9 18:17:43 label5.u14.univ-nantes.prive kernel: [ 304.101545]
>> libceph: osd6 172.20.14.137:6800 socket closed
>> Jul  9 18:17:53 label5.u14.univ-nantes.prive kernel: [ 314.095155]
>> libceph: osd3 172.20.14.134:6800 socket closed
>> Jul  9 18:18:38 label5.u14.univ-nantes.prive kernel: [ 359.075473]
>> libceph: osd5 172.20.14.136:6800 socket closed
>> Jul  9 18:19:48 label5.u14.univ-nantes.prive kernel: [ 429.107334]
>> libceph: osd6 172.20.14.137:6800 socket closed
>
>
> Just an interesting thing I noticed in the logs:
>
> osd-0.log
> 2012-07-09 18:17:23.763925 7ff9fc19e700  0 bad crc in data 3071411075 != exp
> 2231697357
> 2012-07-09 18:17:23.777607 7ff9fc19e700  0 -- 172.20.14.130:6801/5842 >>
> 172.20.14.132:0/1974511416 pipe(0x2236c80 sd=38 pgs=0 cs=0 l=0).accept peer
> addr is really 172.20.14.132:0/1974511416 (socket is 172.20.14.132:57972/0)
>
> osd-3.log
> 2012-07-09 18:17:53.770111 7fe35461c700  0 bad crc in data 826922774 != exp
> 2498450653
> 2012-07-09 18:17:53.770972 7fe35461c700  0 -- 172.20.14.134:6800/4495 >>
> 172.20.14.132:0/1974511416
>  pipe(0xa44ec80 sd=56 pgs=0 cs=0 l=0).accept peer addr is really
> 172.20.14.132:0/1974511416 (socket
>  is 172.20.14.132:40726/0)
>
> osd-5.log
> 2012-07-09 18:18:38.766417 7ff4a66cb700  0 bad crc in data 3949121728 != exp
> 2496058560
> 2012-07-09 18:18:38.773386 7ff4a66cb700  0 -- 172.20.14.136:6800/4876 >>
> 172.20.14.132:0/1974511416 pipe(0x20eeb780 sd=56 pgs=0 cs=0 l=0).accept peer
> addr is really 172.20.14.132:0/1974511416 (socket is 172.20.14.132:57072/0)
>
> osd-6.log
> 2012-07-09 18:17:43.765740 7fdf86b9d700  0 bad crc in data 2899452345 != exp
> 2656886014
> 2012-07-09 18:17:43.772599 7fdf86b9d700  0 -- 172.20.14.137:6800/5260 >>
> 172.20.14.132:0/1974511416
>  pipe(0x1ec64780 sd=31 pgs=0 cs=0 l=0).accept peer addr is really
> 172.20.14.132:0/1974511416 (socke
> t is 172.20.14.132:48615/0)
>
> 2012-07-09 18:17:43.773170 7fdf8c718700  0 osd.6 347 pg[2.60( v 347'36181
> (337'35180,347'36181] n=4
> 144 ec=1 les/c 6/6 5/5/5) [6,7] r=0 lpr=5 mlcod 347'36180 active+clean]
> watch: ctx->obc=0x102db340
> cookie=1 oi.version=36169 ctx->at_version=347'36182
> 2012-07-09 18:17:43.773209 7fdf8c718700  0 osd.6 347 pg[2.60( v 347'36181
> (337'35180,347'36181] n=4144 ec=1 les/c 6/6 5/5/5) [6,7] r=0 lpr=5 mlcod
> 347'36180 active+clean] watch: oi.user_version=1559
> 2012-07-09 18:19:48.837952 7fdf86b9d700  0 bad crc in data 1231964953 != exp
> 2305533436
> 2012-07-09 18:19:48.838850 7fdf86b9d700  0 -- 172.20.14.137:6800/5260 >>
> 172.20.14.132:0/1974511416 pipe(0x1ec64c80 sd=31 pgs=0 cs=0 l=0).accept peer
> addr is really 172.20.14.132:0/1974511416 (socket is 172.20.14.132:48618/0)
> 2012-07-09 18:19:48.839493 7fdf8c718700  0 osd.6 347 pg[2.60( v 347'36192
> (337'35191,347'36192] n=4144 ec=1 les/c 6/6 5/5/5) [6,7] r=0 lpr=5 mlcod
> 347'36191 active+clean] watch: ctx->obc=0x102db340 cookie=1 oi.version=36169
> ctx->at_version=347'36193
> 2012-07-09 18:19:48.839530 7fdf8c718700  0 osd.6 347 pg[2.60( v 347'36192
> (337'35191,347'36192] n=4144 ec=1 les/c 6/6 5/5/5) [6,7] r=0 lpr=5 mlcod
> 347'36191 active+clean] watch: oi.user_version=1559
>
>
> Each time, at that exact moment, a bad CRC (these are the only ones for
> this day, so it seems related).

Yes; a bad CRC should cause the socket to close — that's intended
behavior (although you might want to look into why that's happening,
since it's not something we've seen locally at all). Not handling that
socket close is definitely a bug in the kernel that needs to get
tracked down, though.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: oops in rbd module (con_work in libceph)
  2012-07-10 17:46         ` Gregory Farnum
@ 2012-07-11 20:23           ` Yann Dupont
  2012-07-12  7:15             ` Yann Dupont
  0 siblings, 1 reply; 10+ messages in thread
From: Yann Dupont @ 2012-07-11 20:23 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Alex Elder, ceph-devel

On 10/07/2012 19:46, Gregory Farnum wrote:
>
>
> Each time, at that exact moment, a bad CRC (these are the only ones for
> this day, so it seems related).
> Yes; a bad CRC should cause the socket to close — that's intended
> behavior (although you might want to look into why that's happening,
Ah! Very interesting! 3.2 is OK, 3.4 is not (even with the latest 
ceph-client patch).
The 8 nodes are similar (PowerEdge M610, Intel 10 Gb), but the client is 
not: also an M610 (older), but with a Brocade 10 Gb NIC.

I'll also try another client with an Intel 10 Gb NIC and a 3.4 kernel, to 
see if that changes things.
Then I'll narrow the gap between working and not working, and try to 
bisect that.

> since it's not something we've seen locally at all). Not handling that
> socket close is definitely a bug in the kernel that needs to get
> tracked down, though.
OK, the oops is not the root cause, just an unfortunate consequence.

Thanks,
Cheers

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: oops in rbd module (con_work in libceph)
  2012-07-11 20:23           ` Yann Dupont
@ 2012-07-12  7:15             ` Yann Dupont
  2012-07-12 20:20               ` Yann Dupont
  0 siblings, 1 reply; 10+ messages in thread
From: Yann Dupont @ 2012-07-12  7:15 UTC (permalink / raw)
  To: Yann Dupont; +Cc: Gregory Farnum, Alex Elder, ceph-devel

On 11/07/2012 22:23, Yann Dupont wrote:
> On 10/07/2012 19:46, Gregory Farnum wrote:
>>
>>
>> Each time, at that exact moment, a bad CRC (these are the only ones for
>> this day, so it seems related).
>> Yes; a bad CRC should cause the socket to close — that's intended
>> behavior (although you might want to look into why that's happening,
> Ah! Very interesting! 3.2 is OK, 3.4 is not (even with the latest 
> ceph-client patch).
> The 8 nodes are similar (PowerEdge M610, Intel 10 Gb), but the client is 
> not: also an M610 (older), but with a Brocade 10 Gb NIC.
>
It is a Broadcom 10 Gb NIC, not Brocade. Sorry for the confusion.

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: oops in rbd module (con_work in libceph)
  2012-07-12  7:15             ` Yann Dupont
@ 2012-07-12 20:20               ` Yann Dupont
  2012-07-12 21:07                 ` Yann Dupont
  0 siblings, 1 reply; 10+ messages in thread
From: Yann Dupont @ 2012-07-12 20:20 UTC (permalink / raw)
  To: Yann Dupont; +Cc: Gregory Farnum, Alex Elder, ceph-devel

On 12/07/2012 09:15, Yann Dupont wrote:
> On 11/07/2012 22:23, Yann Dupont wrote:
>> On 10/07/2012 19:46, Gregory Farnum wrote:
>>>
>>>
>>> Each time, at that exact moment, a bad CRC (these are the only ones 
>>> for this day, so it seems related).
>>> Yes; a bad CRC should cause the socket to close — that's intended
>>> behavior (although you might want to look into why that's happening,
>> Ah! Very interesting! 3.2 is OK, 3.4 is not (even with the latest 
>> ceph-client patch).
To be precise, 3.3.0 is OK and 3.4.0 is not. That was my git bisect starting point.

with a "good kernel", make -j24 on a linux kernel tree is working ok for 
at least 10 minutes.
On a "bad kernel", during make -j24 I see quite quickly (some minutes)  
messages about socket closed, and, then , the kernel oops.

>> The 8 nodes are similar (PowerEdge M610, Intel 10 Gb), but the client is 
>> not: also an M610 (older), but with a Brocade 10 Gb NIC.
>>
> It is a Broadcom 10 Gb NIC, not Brocade. Sorry for the confusion.
>
Probably not related.

Here is the offending commit, as found by git bisect. It's a merge, not 
containing code by itself :/
Commit 69e1aaddd63104f37021d0b0f6abfd9623c9134c. It's ext4-related. I'm 
not sure ext4 itself is to blame; I'm thinking of a race with rbd.

I wish it were an individual patch rather than a merge, but I'm not a git 
expert and I did the git bisect manually, so maybe I missed a bad kernel 
during my test.

BTW, the commit just before this one (as shown by git log) is 
56b59b429b4c26e5e730bc8c3d837de9f7d0a966, which is a Ceph merge; maybe it 
is related too?
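
As a brief sketch: the commits such a merge pulled in can be listed with 
standard git commands, and a bisect can in principle be automated with git 
bisect run (the test script name below is purely hypothetical, and in 
practice each bisect step needs a reboot into the freshly built kernel, so 
a fully automated run may not be practical here):

# commits reachable from the merge but not from its first parent,
# i.e. what the ext4 merge actually brought in
git log --oneline 69e1aaddd63104f37021d0b0f6abfd9623c9134c^1..69e1aaddd63104f37021d0b0f6abfd9623c9134c

# re-running the bisect between the known-bad 3.4 and known-good 3.3 commits
git bisect start 76e10d158efb6d4516018846f60c2ab5501900bc c16fa4f2ad19908a47c63d8fa436a1178438c7e7
git bisect run ./test-rbd-workload.sh   # hypothetical script: build, boot, run make -j24 on the rbd volume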



Here is the git bisect log, in case it matters:

git bisect start
# bad: [76e10d158efb6d4516018846f60c2ab5501900bc] Linux 3.4
git bisect bad 76e10d158efb6d4516018846f60c2ab5501900bc
# good: [c16fa4f2ad19908a47c63d8fa436a1178438c7e7] Linux 3.3
git bisect good c16fa4f2ad19908a47c63d8fa436a1178438c7e7
# good: [141124c02059eee9dbc5c86ea797b1ca888e77f7] Delete all instances 
of asm/system.h
git bisect good 141124c02059eee9dbc5c86ea797b1ca888e77f7
# bad: [55a320308902f7a0746569ee57eeb3f254e6ed16] Merge branch 
'irqdomain/merge' of git://git.secretlab.ca/git/linux-2.6
git bisect bad 55a320308902f7a0746569ee57eeb3f254e6ed16
# good: [281b05392fc2cb26209b4d85abaf4889ab1991f3] Merge tag 'soc' of 
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
git bisect good 281b05392fc2cb26209b4d85abaf4889ab1991f3
# bad: [a8364d5555b2030d093cde0f07951628e55454e1] slub: only IPI CPUs 
that have per cpu obj to flush
git bisect bad a8364d5555b2030d093cde0f07951628e55454e1
# good: [66f03c614c0902ccf7d6160459362a9352f33271] Merge tag 'dt' of 
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
git bisect good 66f03c614c0902ccf7d6160459362a9352f33271
# good: [30eebb54b13ef198a3f1a143ee9dd68f295c60de] Merge branch 'next' 
of git://git.monstr.eu/linux-2.6-microblaze
git bisect good 30eebb54b13ef198a3f1a143ee9dd68f295c60de
# good: [56b59b429b4c26e5e730bc8c3d837de9f7d0a966] Merge branch 
'for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
git bisect good 56b59b429b4c26e5e730bc8c3d837de9f7d0a966
# good: [31d4f3a2f3c73f279ff96a7135d7202ef6833f12] ext4: check for zero 
length extent
git bisect good 31d4f3a2f3c73f279ff96a7135d7202ef6833f12
# good: [21e7fd22a5a0ca83befe12c58cced21975dab213] ext4: fix trimmed 
block count accunting
git bisect good 21e7fd22a5a0ca83befe12c58cced21975dab213
# bad: [69e1aaddd63104f37021d0b0f6abfd9623c9134c] Merge tag 
'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
git bisect bad 69e1aaddd63104f37021d0b0f6abfd9623c9134c
# good: [1b8b9750f07cdd6e13f12c06ae7ec853f2abbe6c] ext4: do not mark 
superblock as dirty unnecessarily
git bisect good 1b8b9750f07cdd6e13f12c06ae7ec853f2abbe6c
# good: [182f514f883abb5f942c94e61c371c4b406352d4] ext4: remove useless 
s_dirt assignment
git bisect good 182f514f883abb5f942c94e61c371c4b406352d4
# good: [9d547c35799a4ddd235f1565cec2fff6c9263504] vfs: remove unused 
superblock helpers


Tomorrow I'll try to see whether I made errors in my git bisect. I'll also 
try formatting the rbd with xfs to see whether it's really ext4-related.

Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber :Yann.Dupont@univ-nantes.fr

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: oops in rbd module (con_work in libceph)
  2012-07-12 20:20               ` Yann Dupont
@ 2012-07-12 21:07                 ` Yann Dupont
  0 siblings, 0 replies; 10+ messages in thread
From: Yann Dupont @ 2012-07-12 21:07 UTC (permalink / raw)
  To: Yann Dupont; +Cc: Gregory Farnum, Alex Elder, ceph-devel

On 12/07/2012 22:20, Yann Dupont wrote:
>
>
> Tomorrow I'll try to see whether I made errors in my git bisect. I'll also 
> try formatting the rbd with xfs to see whether it's really ext4-related.
>

Well, I can confirm it's a race between ext4 & rbd (or maybe just a plain 
ext4 bug, but could that explain the CRC errors?).
With an rbd block device formatted with xfs on vanilla kernel 3.4.4, all 
is working well. With ext4 I get the socket-closed syndrome / kernel oops 
very quickly.
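
A minimal sketch of that comparison (the device name /dev/rbd0 is an 
assumption; running mkfs destroys the image contents):

mkfs.xfs -f /dev/rbd0      # xfs-formatted image: the make -j24 workload completes without problems
mkfs.ext4 -F /dev/rbd0     # ext4-formatted image (alternative run): socket closed messages, then the oops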


Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2012-07-12 21:07 UTC | newest]

Thread overview: 10+ messages
2012-07-06  8:31 oops in rbd module (con_work in libceph) Yann Dupont
2012-07-06 15:35 ` Yann Dupont
2012-07-07  0:16   ` Alex Elder
2012-07-09 16:54     ` Yann Dupont
2012-07-09 17:04       ` Yann Dupont
2012-07-10 17:46         ` Gregory Farnum
2012-07-11 20:23           ` Yann Dupont
2012-07-12  7:15             ` Yann Dupont
2012-07-12 20:20               ` Yann Dupont
2012-07-12 21:07                 ` Yann Dupont
