* reproducable osd crash
@ 2012-06-21 12:55 Stefan Priebe - Profihost AG
  2012-06-21 13:07 ` Stefan Priebe - Profihost AG
  2012-06-22  6:43 ` Stefan Priebe
  0 siblings, 2 replies; 24+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-21 12:55 UTC (permalink / raw)
  To: ceph-devel

Hello list,

I'm able to reproducibly crash OSD daemons.

How I can reproduce it:

Kernel: 3.5.0-rc3
Ceph: 0.47.3
FS: btrfs
Journal: 2GB tmpfs per OSD
OSD: 3x servers with 4x Intel SSD OSDs each
10GBE Network
rbd_cache_max_age: 2.0
rbd_cache_size: 33554432

Disk is set to writeback.
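
For reference, the relevant cache configuration looks roughly like this
(a sketch, not my exact files; it assumes the rbd cache options live in
the [client] section of ceph.conf and that the guest disk is an rbd
image attached through qemu; the pool/image name "rbd/vm-disk-1" is
made up):

[client]
    rbd_cache = true
    rbd_cache_size = 33554432
    rbd_cache_max_age = 2.0

# qemu/KVM drive line with writeback caching (hypothetical image name)
-drive format=rbd,file=rbd:rbd/vm-disk-1,cache=writeback,if=virtio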

Start a KVM VM via PXE with the disk attached in writeback mode.

Then run the randwrite stress test more than two times. In my case it is 
mostly OSD 22 that crashes.

# fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G 
--numjobs=50 --runtime=90 --group_reporting --name=file1; fio 
--filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G 
--numjobs=50 --runtime=90 --group_reporting --name=file1; fio 
--filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G 
--numjobs=50 --runtime=90 --group_reporting --name=file1; halt

Strangely, exactly THIS OSD also has by far the largest log:
64K     ceph-osd.20.log
64K     ceph-osd.21.log
1,3M    ceph-osd.22.log
64K     ceph-osd.23.log

But all OSDs are set to debug osd = 20.

dmesg shows:
ceph-osd[5381]: segfault at 3f592c000 ip 00007fa281d8eb23 sp 
00007fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]
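
If it helps, the faulting instruction can be located inside libtcmalloc
by subtracting the mapping base from the instruction pointer and
feeding the offset to addr2line (a sketch; the library path is an
assumption, and without debug symbols it may only print "??"):

# 0x7fa281d8eb23 (ip) - 0x7fa281d6a000 (mapping base) = 0x24b23
addr2line -f -e /usr/lib/libtcmalloc.so.0.0.0 0x24b23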

I uploaded the following files:
priebe_fio_randwrite_ceph-osd.21.log.bz2 => OSD which was OK and didn't 
crash
priebe_fio_randwrite_ceph-osd.22.log.bz2 => Log from the crashed OSD
priebe_fio_randwrite_core.ssdstor001.27204.bz2 => Core dump
priebe_fio_randwrite_ceph-osd.bz2 => osd binary

Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: reproducable osd crash
  2012-06-21 12:55 reproducable osd crash Stefan Priebe - Profihost AG
@ 2012-06-21 13:07 ` Stefan Priebe - Profihost AG
  2012-06-21 13:13   ` Stefan Priebe - Profihost AG
  2012-06-22  6:43 ` Stefan Priebe
  1 sibling, 1 reply; 24+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-21 13:07 UTC (permalink / raw)
  To: ceph-devel


When I now start the OSD again, it seems to hang forever. Load goes 
up to 200 and I/O wait rises from 0% to 20%.
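
One way to see where it is stuck would be to attach gdb and dump all
thread backtraces (a sketch; it assumes a single ceph-osd process on
the host):

gdb -p $(pidof ceph-osd)
(gdb) thread apply all bt
(gdb) detach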

On 21.06.2012 14:55, Stefan Priebe - Profihost AG wrote:
> Hello list,
>
> i'm able to reproducably crash osd daemons.
>
> How i can reproduce:
>
> Kernel: 3.5.0-rc3
> Ceph: 0.47.3
> FS: btrfs
> Journal: 2GB tmpfs per OSD
> OSD: 3x servers with 4x Intel SSD OSDs each
> 10GBE Network
> rbd_cache_max_age: 2.0
> rbd_cache_size: 33554432
>
> Disk is set to writeback.
>
> Start a KVM VM via PXE with the disk attached in writeback mode.
>
> Then run randwrite stress more than 2 time. Mostly OSD 22 in my case
> crashes.
>
> # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
> --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
> --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
> --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
> --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
> --numjobs=50 --runtime=90 --group_reporting --name=file1; halt
>
> Strangely exactly THIS OSD also has the most log entries:
> 64K ceph-osd.20.log
> 64K ceph-osd.21.log
> 1,3M ceph-osd.22.log
> 64K ceph-osd.23.log
>
> But all OSDs are set to debug osd = 20.
>
> dmesg shows:
> ceph-osd[5381]: segfault at 3f592c000 ip 00007fa281d8eb23 sp
> 00007fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]
>
> I uploaded the following files:
> priebe_fio_randwrite_ceph-osd.21.log.bz2 => OSD which was OK and didn't
> crash
> priebe_fio_randwrite_ceph-osd.22.log.bz2 => Log from the crashed OSD
> priebe_fio_randwrite_core.ssdstor001.27204.bz2 => Core dump
> priebe_fio_randwrite_ceph-osd.bz2 => osd binary
>
> Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: reproducable osd crash
  2012-06-21 13:07 ` Stefan Priebe - Profihost AG
@ 2012-06-21 13:13   ` Stefan Priebe - Profihost AG
  2012-06-21 13:23     ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 24+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-21 13:13 UTC (permalink / raw)
  To: ceph-devel

Another strange thing: why does THIS OSD use 24GB while the others use 
just 650MB?

/dev/sdb1             224G  654M  214G   1% /srv/osd.20
/dev/sdc1             224G  638M  214G   1% /srv/osd.21
/dev/sdd1             224G   24G  190G  12% /srv/osd.22
/dev/sde1             224G  607M  214G   1% /srv/osd.23
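
To see where the extra space on osd.22 sits, something like this should
work (a sketch; it assumes the usual OSD data layout with a current/
directory and snap_* subvolumes under the mount point):

du -sh /srv/osd.22/current /srv/osd.22/snap_* 2>/dev/null
btrfs filesystem df /srv/osd.22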

> When i start now the OSD again it seems to hang for forever. Load goes
> up to 200 and I/O Waits rise vom 0% to 20%.
>
> Am 21.06.2012 14:55, schrieb Stefan Priebe - Profihost AG:
>> Hello list,
>>
>> i'm able to reproducably crash osd daemons.
>>
>> How i can reproduce:
>>
>> Kernel: 3.5.0-rc3
>> Ceph: 0.47.3
>> FS: btrfs
>> Journal: 2GB tmpfs per OSD
>> OSD: 3x servers with 4x Intel SSD OSDs each
>> 10GBE Network
>> rbd_cache_max_age: 2.0
>> rbd_cache_size: 33554432
>>
>> Disk is set to writeback.
>>
>> Start a KVM VM via PXE with the disk attached in writeback mode.
>>
>> Then run randwrite stress more than 2 time. Mostly OSD 22 in my case
>> crashes.
>>
>> # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>> --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
>> --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>> --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
>> --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>> --numjobs=50 --runtime=90 --group_reporting --name=file1; halt
>>
>> Strangely exactly THIS OSD also has the most log entries:
>> 64K ceph-osd.20.log
>> 64K ceph-osd.21.log
>> 1,3M ceph-osd.22.log
>> 64K ceph-osd.23.log
>>
>> But all OSDs are set to debug osd = 20.
>>
>> dmesg shows:
>> ceph-osd[5381]: segfault at 3f592c000 ip 00007fa281d8eb23 sp
>> 00007fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]
>>
>> I uploaded the following files:
>> priebe_fio_randwrite_ceph-osd.21.log.bz2 => OSD which was OK and didn't
>> crash
>> priebe_fio_randwrite_ceph-osd.22.log.bz2 => Log from the crashed OSD
>> priebe_fio_randwrite_core.ssdstor001.27204.bz2 => Core dump
>> priebe_fio_randwrite_ceph-osd.bz2 => osd binary
>>
>> Stefan
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: reproducable osd crash
  2012-06-21 13:13   ` Stefan Priebe - Profihost AG
@ 2012-06-21 13:23     ` Stefan Priebe - Profihost AG
  2012-06-21 19:57       ` Stefan Priebe
  0 siblings, 1 reply; 24+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-21 13:23 UTC (permalink / raw)
  To: ceph-devel

Hmm, is this normal? (ceph health is NOW OK again)

/dev/sdb1             224G  655M  214G   1% /srv/osd.20
/dev/sdc1             224G  640M  214G   1% /srv/osd.21
/dev/sdd1             224G   34G  181G  16% /srv/osd.22
/dev/sde1             224G  608M  214G   1% /srv/osd.23

Why does one OSD use so much more space than the others?

On my other OSD nodes all OSDs use around 600MB-700MB. Even when I 
reformat /dev/sdd1, it is back at 34GB after the backfill.
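
One thing worth checking is whether old btrfs snapshots are holding the
space, since the OSD keeps snap_* subvolumes on btrfs (a sketch,
assuming the default data layout):

btrfs subvolume list /srv/osd.22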

Stefan

On 21.06.2012 15:13, Stefan Priebe - Profihost AG wrote:
> Another strange thing. Why does THIS OSD have 24GB and the others just
> 650MB?
>
> /dev/sdb1 224G 654M 214G 1% /srv/osd.20
> /dev/sdc1 224G 638M 214G 1% /srv/osd.21
> /dev/sdd1 224G 24G 190G 12% /srv/osd.22
> /dev/sde1 224G 607M 214G 1% /srv/osd.23
>
>> When i start now the OSD again it seems to hang for forever. Load goes
>> up to 200 and I/O Waits rise vom 0% to 20%.
>>
>> Am 21.06.2012 14:55, schrieb Stefan Priebe - Profihost AG:
>>> Hello list,
>>>
>>> i'm able to reproducably crash osd daemons.
>>>
>>> How i can reproduce:
>>>
>>> Kernel: 3.5.0-rc3
>>> Ceph: 0.47.3
>>> FS: btrfs
>>> Journal: 2GB tmpfs per OSD
>>> OSD: 3x servers with 4x Intel SSD OSDs each
>>> 10GBE Network
>>> rbd_cache_max_age: 2.0
>>> rbd_cache_size: 33554432
>>>
>>> Disk is set to writeback.
>>>
>>> Start a KVM VM via PXE with the disk attached in writeback mode.
>>>
>>> Then run randwrite stress more than 2 time. Mostly OSD 22 in my case
>>> crashes.
>>>
>>> # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
>>> --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
>>> --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; halt
>>>
>>> Strangely exactly THIS OSD also has the most log entries:
>>> 64K ceph-osd.20.log
>>> 64K ceph-osd.21.log
>>> 1,3M ceph-osd.22.log
>>> 64K ceph-osd.23.log
>>>
>>> But all OSDs are set to debug osd = 20.
>>>
>>> dmesg shows:
>>> ceph-osd[5381]: segfault at 3f592c000 ip 00007fa281d8eb23 sp
>>> 00007fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]
>>>
>>> I uploaded the following files:
>>> priebe_fio_randwrite_ceph-osd.21.log.bz2 => OSD which was OK and didn't
>>> crash
>>> priebe_fio_randwrite_ceph-osd.22.log.bz2 => Log from the crashed OSD
>>> priebe_fio_randwrite_core.ssdstor001.27204.bz2 => Core dump
>>> priebe_fio_randwrite_ceph-osd.bz2 => osd binary
>>>
>>> Stefan
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: reproducable osd crash
  2012-06-21 13:23     ` Stefan Priebe - Profihost AG
@ 2012-06-21 19:57       ` Stefan Priebe
  2012-06-22 16:01         ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 24+ messages in thread
From: Stefan Priebe @ 2012-06-21 19:57 UTC (permalink / raw)
  To: ceph-devel

OK, this time I discovered that all OSDs had the same disk usage before 
the crash. After starting the OSD again I got this:
/dev/sdb1             224G   23G  191G  11% /srv/osd.30
/dev/sdc1             224G  1,5G  213G   1% /srv/osd.31
/dev/sdd1             224G  1,5G  213G   1% /srv/osd.32
/dev/sde1             224G  1,6G  213G   1% /srv/osd.33

So instead of 1.5GB, osd.30 now uses 23GB.

Stefan

On 21.06.2012 15:23, Stefan Priebe - Profihost AG wrote:
> Mhm is this normal (ceph health is NOW OK again)
>
> /dev/sdb1             224G  655M  214G   1% /srv/osd.20
> /dev/sdc1             224G  640M  214G   1% /srv/osd.21
> /dev/sdd1             224G   34G  181G  16% /srv/osd.22
> /dev/sde1             224G  608M  214G   1% /srv/osd.23
>
> Why does one OSD has so much more used space than the others?
>
> On my other OSD nodes all have around 600MB-700MB. Even when i reformat
> /dev/sdd1 after the backfill it has again 34GB?
>
> Stefan
>
> Am 21.06.2012 15:13, schrieb Stefan Priebe - Profihost AG:
>> Another strange thing. Why does THIS OSD have 24GB and the others just
>> 650MB?
>>
>> /dev/sdb1 224G 654M 214G 1% /srv/osd.20
>> /dev/sdc1 224G 638M 214G 1% /srv/osd.21
>> /dev/sdd1 224G 24G 190G 12% /srv/osd.22
>> /dev/sde1 224G 607M 214G 1% /srv/osd.23
>>
>>> When i start now the OSD again it seems to hang for forever. Load goes
>>> up to 200 and I/O Waits rise vom 0% to 20%.
>>>
>>> Am 21.06.2012 14:55, schrieb Stefan Priebe - Profihost AG:
>>>> Hello list,
>>>>
>>>> i'm able to reproducably crash osd daemons.
>>>>
>>>> How i can reproduce:
>>>>
>>>> Kernel: 3.5.0-rc3
>>>> Ceph: 0.47.3
>>>> FS: btrfs
>>>> Journal: 2GB tmpfs per OSD
>>>> OSD: 3x servers with 4x Intel SSD OSDs each
>>>> 10GBE Network
>>>> rbd_cache_max_age: 2.0
>>>> rbd_cache_size: 33554432
>>>>
>>>> Disk is set to writeback.
>>>>
>>>> Start a KVM VM via PXE with the disk attached in writeback mode.
>>>>
>>>> Then run randwrite stress more than 2 time. Mostly OSD 22 in my case
>>>> crashes.
>>>>
>>>> # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k
>>>> --size=200G
>>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
>>>> --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
>>>> --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; halt
>>>>
>>>> Strangely exactly THIS OSD also has the most log entries:
>>>> 64K ceph-osd.20.log
>>>> 64K ceph-osd.21.log
>>>> 1,3M ceph-osd.22.log
>>>> 64K ceph-osd.23.log
>>>>
>>>> But all OSDs are set to debug osd = 20.
>>>>
>>>> dmesg shows:
>>>> ceph-osd[5381]: segfault at 3f592c000 ip 00007fa281d8eb23 sp
>>>> 00007fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]
>>>>
>>>> I uploaded the following files:
>>>> priebe_fio_randwrite_ceph-osd.21.log.bz2 => OSD which was OK and didn't
>>>> crash
>>>> priebe_fio_randwrite_ceph-osd.22.log.bz2 => Log from the crashed OSD
>>>> priebe_fio_randwrite_core.ssdstor001.27204.bz2 => Core dump
>>>> priebe_fio_randwrite_ceph-osd.bz2 => osd binary
>>>>
>>>> Stefan
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: reproducable osd crash
  2012-06-21 12:55 reproducable osd crash Stefan Priebe - Profihost AG
  2012-06-21 13:07 ` Stefan Priebe - Profihost AG
@ 2012-06-22  6:43 ` Stefan Priebe
  2012-06-22 22:56   ` Dan Mick
  2012-06-23  0:26   ` Dan Mick
  1 sibling, 2 replies; 24+ messages in thread
From: Stefan Priebe @ 2012-06-22  6:43 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel

Does anybody have an idea? This is a showstopper for me right now.

On 21.06.2012 at 14:55, Stefan Priebe - Profihost AG <s.priebe@profihost.ag> wrote:

> Hello list,
> 
> i'm able to reproducably crash osd daemons.
> 
> How i can reproduce:
> 
> Kernel: 3.5.0-rc3
> Ceph: 0.47.3
> FS: btrfs
> Journal: 2GB tmpfs per OSD
> OSD: 3x servers with 4x Intel SSD OSDs each
> 10GBE Network
> rbd_cache_max_age: 2.0
> rbd_cache_size: 33554432
> 
> Disk is set to writeback.
> 
> Start a KVM VM via PXE with the disk attached in writeback mode.
> 
> Then run randwrite stress more than 2 time. Mostly OSD 22 in my case crashes.
> 
> # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; halt
> 
> Strangely exactly THIS OSD also has the most log entries:
> 64K     ceph-osd.20.log
> 64K     ceph-osd.21.log
> 1,3M    ceph-osd.22.log
> 64K     ceph-osd.23.log
> 
> But all OSDs are set to debug osd = 20.
> 
> dmesg shows:
> ceph-osd[5381]: segfault at 3f592c000 ip 00007fa281d8eb23 sp 00007fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]
> 
> I uploaded the following files:
> priebe_fio_randwrite_ceph-osd.21.log.bz2 => OSD which was OK and didn't crash
> priebe_fio_randwrite_ceph-osd.22.log.bz2 => Log from the crashed OSD
> priebe_fio_randwrite_core.ssdstor001.27204.bz2 => Core dump
> priebe_fio_randwrite_ceph-osd.bz2 => osd binary
> 
> Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: reproducable osd crash
  2012-06-21 19:57       ` Stefan Priebe
@ 2012-06-22 16:01         ` Stefan Priebe - Profihost AG
  0 siblings, 0 replies; 24+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-22 16:01 UTC (permalink / raw)
  To: ceph-devel

I'm still able to crash the Ceph cluster by doing a lot of random I/O 
and then shutting down the KVM VM.

Stefan

On 21.06.2012 21:57, Stefan Priebe wrote:
> OK i discovered this time that all osds had the same disk usage before
> crash. After starting the osd again i got this one:
> /dev/sdb1 224G 23G 191G 11% /srv/osd.30
> /dev/sdc1 224G 1,5G 213G 1% /srv/osd.31
> /dev/sdd1 224G 1,5G 213G 1% /srv/osd.32
> /dev/sde1 224G 1,6G 213G 1% /srv/osd.33
>
> So instead of 1,5GB osd 30 now uses 23G.
>
> Stefan
>
> Am 21.06.2012 15:23, schrieb Stefan Priebe - Profihost AG:
>> Mhm is this normal (ceph health is NOW OK again)
>>
>> /dev/sdb1 224G 655M 214G 1% /srv/osd.20
>> /dev/sdc1 224G 640M 214G 1% /srv/osd.21
>> /dev/sdd1 224G 34G 181G 16% /srv/osd.22
>> /dev/sde1 224G 608M 214G 1% /srv/osd.23
>>
>> Why does one OSD has so much more used space than the others?
>>
>> On my other OSD nodes all have around 600MB-700MB. Even when i reformat
>> /dev/sdd1 after the backfill it has again 34GB?
>>
>> Stefan
>>
>> Am 21.06.2012 15:13, schrieb Stefan Priebe - Profihost AG:
>>> Another strange thing. Why does THIS OSD have 24GB and the others just
>>> 650MB?
>>>
>>> /dev/sdb1 224G 654M 214G 1% /srv/osd.20
>>> /dev/sdc1 224G 638M 214G 1% /srv/osd.21
>>> /dev/sdd1 224G 24G 190G 12% /srv/osd.22
>>> /dev/sde1 224G 607M 214G 1% /srv/osd.23
>>>
>>>> When i start now the OSD again it seems to hang for forever. Load goes
>>>> up to 200 and I/O Waits rise vom 0% to 20%.
>>>>
>>>> Am 21.06.2012 14:55, schrieb Stefan Priebe - Profihost AG:
>>>>> Hello list,
>>>>>
>>>>> i'm able to reproducably crash osd daemons.
>>>>>
>>>>> How i can reproduce:
>>>>>
>>>>> Kernel: 3.5.0-rc3
>>>>> Ceph: 0.47.3
>>>>> FS: btrfs
>>>>> Journal: 2GB tmpfs per OSD
>>>>> OSD: 3x servers with 4x Intel SSD OSDs each
>>>>> 10GBE Network
>>>>> rbd_cache_max_age: 2.0
>>>>> rbd_cache_size: 33554432
>>>>>
>>>>> Disk is set to writeback.
>>>>>
>>>>> Start a KVM VM via PXE with the disk attached in writeback mode.
>>>>>
>>>>> Then run randwrite stress more than 2 time. Mostly OSD 22 in my case
>>>>> crashes.
>>>>>
>>>>> # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k
>>>>> --size=200G
>>>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
>>>>> --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>>>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
>>>>> --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>>>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; halt
>>>>>
>>>>> Strangely exactly THIS OSD also has the most log entries:
>>>>> 64K ceph-osd.20.log
>>>>> 64K ceph-osd.21.log
>>>>> 1,3M ceph-osd.22.log
>>>>> 64K ceph-osd.23.log
>>>>>
>>>>> But all OSDs are set to debug osd = 20.
>>>>>
>>>>> dmesg shows:
>>>>> ceph-osd[5381]: segfault at 3f592c000 ip 00007fa281d8eb23 sp
>>>>> 00007fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]
>>>>>
>>>>> I uploaded the following files:
>>>>> priebe_fio_randwrite_ceph-osd.21.log.bz2 => OSD which was OK and
>>>>> didn't
>>>>> crash
>>>>> priebe_fio_randwrite_ceph-osd.22.log.bz2 => Log from the crashed OSD
>>>>> priebe_fio_randwrite_core.ssdstor001.27204.bz2 => Core dump
>>>>> priebe_fio_randwrite_ceph-osd.bz2 => osd binary
>>>>>
>>>>> Stefan
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe
>>>> ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: reproducable osd crash
  2012-06-22  6:43 ` Stefan Priebe
@ 2012-06-22 22:56   ` Dan Mick
  2012-06-22 23:59     ` Sam Just
  2012-06-23  0:26   ` Dan Mick
  1 sibling, 1 reply; 24+ messages in thread
From: Dan Mick @ 2012-06-22 22:56 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: ceph-devel

Stefan, I'm looking at your logs and coredump now.

On 06/21/2012 11:43 PM, Stefan Priebe wrote:
> Does anybody have an idea? This is right now a showstopper to me.
>
> Am 21.06.2012 um 14:55 schrieb Stefan Priebe - Profihost AG<s.priebe@profihost.ag>:
>
>> Hello list,
>>
>> i'm able to reproducably crash osd daemons.
>>
>> How i can reproduce:
>>
>> Kernel: 3.5.0-rc3
>> Ceph: 0.47.3
>> FS: btrfs
>> Journal: 2GB tmpfs per OSD
>> OSD: 3x servers with 4x Intel SSD OSDs each
>> 10GBE Network
>> rbd_cache_max_age: 2.0
>> rbd_cache_size: 33554432
>>
>> Disk is set to writeback.
>>
>> Start a KVM VM via PXE with the disk attached in writeback mode.
>>
>> Then run randwrite stress more than 2 time. Mostly OSD 22 in my case crashes.
>>
>> # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; halt
>>
>> Strangely exactly THIS OSD also has the most log entries:
>> 64K     ceph-osd.20.log
>> 64K     ceph-osd.21.log
>> 1,3M    ceph-osd.22.log
>> 64K     ceph-osd.23.log
>>
>> But all OSDs are set to debug osd = 20.
>>
>> dmesg shows:
>> ceph-osd[5381]: segfault at 3f592c000 ip 00007fa281d8eb23 sp 00007fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]
>>
>> I uploaded the following files:
>> priebe_fio_randwrite_ceph-osd.21.log.bz2 =>  OSD which was OK and didn't crash
>> priebe_fio_randwrite_ceph-osd.22.log.bz2 =>  Log from the crashed OSD
>> priebe_fio_randwrite_core.ssdstor001.27204.bz2 =>  Core dump
>> priebe_fio_randwrite_ceph-osd.bz2 =>  osd binary
>>
>> Stefan
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: reproducable osd crash
  2012-06-22 22:56   ` Dan Mick
@ 2012-06-22 23:59     ` Sam Just
  2012-06-23  6:32       ` Stefan Priebe
  0 siblings, 1 reply; 24+ messages in thread
From: Sam Just @ 2012-06-22 23:59 UTC (permalink / raw)
  To: Dan Mick; +Cc: Stefan Priebe, ceph-devel

I am still looking into the logs.
-Sam

On Fri, Jun 22, 2012 at 3:56 PM, Dan Mick <dan.mick@inktank.com> wrote:
> Stefan, I'm looking at your logs and coredump now.
>
>
> On 06/21/2012 11:43 PM, Stefan Priebe wrote:
>>
>> Does anybody have an idea? This is right now a showstopper to me.
>>
>> Am 21.06.2012 um 14:55 schrieb Stefan Priebe - Profihost
>> AG<s.priebe@profihost.ag>:
>>
>>> Hello list,
>>>
>>> i'm able to reproducably crash osd daemons.
>>>
>>> How i can reproduce:
>>>
>>> Kernel: 3.5.0-rc3
>>> Ceph: 0.47.3
>>> FS: btrfs
>>> Journal: 2GB tmpfs per OSD
>>> OSD: 3x servers with 4x Intel SSD OSDs each
>>> 10GBE Network
>>> rbd_cache_max_age: 2.0
>>> rbd_cache_size: 33554432
>>>
>>> Disk is set to writeback.
>>>
>>> Start a KVM VM via PXE with the disk attached in writeback mode.
>>>
>>> Then run randwrite stress more than 2 time. Mostly OSD 22 in my case
>>> crashes.
>>>
>>> # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
>>> --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
>>> --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; halt
>>>
>>> Strangely exactly THIS OSD also has the most log entries:
>>> 64K     ceph-osd.20.log
>>> 64K     ceph-osd.21.log
>>> 1,3M    ceph-osd.22.log
>>> 64K     ceph-osd.23.log
>>>
>>> But all OSDs are set to debug osd = 20.
>>>
>>> dmesg shows:
>>> ceph-osd[5381]: segfault at 3f592c000 ip 00007fa281d8eb23 sp
>>> 00007fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]
>>>
>>> I uploaded the following files:
>>> priebe_fio_randwrite_ceph-osd.21.log.bz2 =>  OSD which was OK and didn't
>>> crash
>>> priebe_fio_randwrite_ceph-osd.22.log.bz2 =>  Log from the crashed OSD
>>> priebe_fio_randwrite_core.ssdstor001.27204.bz2 =>  Core dump
>>> priebe_fio_randwrite_ceph-osd.bz2 =>  osd binary
>>>
>>> Stefan
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: reproducable osd crash
  2012-06-22  6:43 ` Stefan Priebe
  2012-06-22 22:56   ` Dan Mick
@ 2012-06-23  0:26   ` Dan Mick
  2012-06-23  6:32     ` Stefan Priebe
  1 sibling, 1 reply; 24+ messages in thread
From: Dan Mick @ 2012-06-23  0:26 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: ceph-devel

The ceph-osd binary you sent claims to be version 0.47.2-521-g88c762, 
which is not quite 0.47.3.  You can get the version with <binary> -v, or 
(in my case) by examining strings in the binary.  I'm retrieving that 
version to analyze the core dump.
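
For example (a sketch; the install path is an assumption):

/usr/bin/ceph-osd -v
strings /usr/bin/ceph-osd | grep 'ceph version'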


On 06/21/2012 11:43 PM, Stefan Priebe wrote:
> Does anybody have an idea? This is right now a showstopper to me.
>
> Am 21.06.2012 um 14:55 schrieb Stefan Priebe - Profihost AG<s.priebe@profihost.ag>:
>
>> Hello list,
>>
>> i'm able to reproducably crash osd daemons.
>>
>> How i can reproduce:
>>
>> Kernel: 3.5.0-rc3
>> Ceph: 0.47.3
>> FS: btrfs
>> Journal: 2GB tmpfs per OSD
>> OSD: 3x servers with 4x Intel SSD OSDs each
>> 10GBE Network
>> rbd_cache_max_age: 2.0
>> rbd_cache_size: 33554432
>>
>> Disk is set to writeback.
>>
>> Start a KVM VM via PXE with the disk attached in writeback mode.
>>
>> Then run randwrite stress more than 2 time. Mostly OSD 22 in my case crashes.
>>
>> # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; halt
>>
>> Strangely exactly THIS OSD also has the most log entries:
>> 64K     ceph-osd.20.log
>> 64K     ceph-osd.21.log
>> 1,3M    ceph-osd.22.log
>> 64K     ceph-osd.23.log
>>
>> But all OSDs are set to debug osd = 20.
>>
>> dmesg shows:
>> ceph-osd[5381]: segfault at 3f592c000 ip 00007fa281d8eb23 sp 00007fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]
>>
>> I uploaded the following files:
>> priebe_fio_randwrite_ceph-osd.21.log.bz2 =>  OSD which was OK and didn't crash
>> priebe_fio_randwrite_ceph-osd.22.log.bz2 =>  Log from the crashed OSD
>> priebe_fio_randwrite_core.ssdstor001.27204.bz2 =>  Core dump
>> priebe_fio_randwrite_ceph-osd.bz2 =>  osd binary
>>
>> Stefan
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: reproducable osd crash
  2012-06-23  0:26   ` Dan Mick
@ 2012-06-23  6:32     ` Stefan Priebe
  0 siblings, 0 replies; 24+ messages in thread
From: Stefan Priebe @ 2012-06-23  6:32 UTC (permalink / raw)
  To: Dan Mick; +Cc: ceph-devel

Thanks, yes, it is from the next branch.

On 23.06.2012 at 02:26, Dan Mick <dan.mick@inktank.com> wrote:

> The ceph-osd binary you sent claims to be version 0.47.2-521-g88c762, which is not quite 0.47.3.  You can get the version with <binary> -v, or (in my case) examining strings in the binary.  I'm retrieving that version to analyze the core dump.
> 
> 
> On 06/21/2012 11:43 PM, Stefan Priebe wrote:
>> Does anybody have an idea? This is right now a showstopper to me.
>> 
>> Am 21.06.2012 um 14:55 schrieb Stefan Priebe - Profihost AG<s.priebe@profihost.ag>:
>> 
>>> Hello list,
>>> 
>>> i'm able to reproducably crash osd daemons.
>>> 
>>> How i can reproduce:
>>> 
>>> Kernel: 3.5.0-rc3
>>> Ceph: 0.47.3
>>> FS: btrfs
>>> Journal: 2GB tmpfs per OSD
>>> OSD: 3x servers with 4x Intel SSD OSDs each
>>> 10GBE Network
>>> rbd_cache_max_age: 2.0
>>> rbd_cache_size: 33554432
>>> 
>>> Disk is set to writeback.
>>> 
>>> Start a KVM VM via PXE with the disk attached in writeback mode.
>>> 
>>> Then run randwrite stress more than 2 time. Mostly OSD 22 in my case crashes.
>>> 
>>> # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; halt
>>> 
>>> Strangely exactly THIS OSD also has the most log entries:
>>> 64K     ceph-osd.20.log
>>> 64K     ceph-osd.21.log
>>> 1,3M    ceph-osd.22.log
>>> 64K     ceph-osd.23.log
>>> 
>>> But all OSDs are set to debug osd = 20.
>>> 
>>> dmesg shows:
>>> ceph-osd[5381]: segfault at 3f592c000 ip 00007fa281d8eb23 sp 00007fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]
>>> 
>>> I uploaded the following files:
>>> priebe_fio_randwrite_ceph-osd.21.log.bz2 =>  OSD which was OK and didn't crash
>>> priebe_fio_randwrite_ceph-osd.22.log.bz2 =>  Log from the crashed OSD
>>> priebe_fio_randwrite_core.ssdstor001.27204.bz2 =>  Core dump
>>> priebe_fio_randwrite_ceph-osd.bz2 =>  osd binary
>>> 
>>> Stefan
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: reproducable osd crash
  2012-06-22 23:59     ` Sam Just
@ 2012-06-23  6:32       ` Stefan Priebe
  2012-06-25 16:39         ` Dan Mick
  0 siblings, 1 reply; 24+ messages in thread
From: Stefan Priebe @ 2012-06-23  6:32 UTC (permalink / raw)
  To: Sam Just; +Cc: Dan Mick, ceph-devel

Thanks, did you find anything?

On 23.06.2012 at 01:59, Sam Just <sam.just@inktank.com> wrote:

> I am still looking into the logs.
> -Sam
> 
> On Fri, Jun 22, 2012 at 3:56 PM, Dan Mick <dan.mick@inktank.com> wrote:
>> Stefan, I'm looking at your logs and coredump now.
>> 
>> 
>> On 06/21/2012 11:43 PM, Stefan Priebe wrote:
>>> 
>>> Does anybody have an idea? This is right now a showstopper to me.
>>> 
>>> Am 21.06.2012 um 14:55 schrieb Stefan Priebe - Profihost
>>> AG<s.priebe@profihost.ag>:
>>> 
>>>> Hello list,
>>>> 
>>>> i'm able to reproducably crash osd daemons.
>>>> 
>>>> How i can reproduce:
>>>> 
>>>> Kernel: 3.5.0-rc3
>>>> Ceph: 0.47.3
>>>> FS: btrfs
>>>> Journal: 2GB tmpfs per OSD
>>>> OSD: 3x servers with 4x Intel SSD OSDs each
>>>> 10GBE Network
>>>> rbd_cache_max_age: 2.0
>>>> rbd_cache_size: 33554432
>>>> 
>>>> Disk is set to writeback.
>>>> 
>>>> Start a KVM VM via PXE with the disk attached in writeback mode.
>>>> 
>>>> Then run randwrite stress more than 2 time. Mostly OSD 22 in my case
>>>> crashes.
>>>> 
>>>> # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
>>>> --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
>>>> --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; halt
>>>> 
>>>> Strangely exactly THIS OSD also has the most log entries:
>>>> 64K     ceph-osd.20.log
>>>> 64K     ceph-osd.21.log
>>>> 1,3M    ceph-osd.22.log
>>>> 64K     ceph-osd.23.log
>>>> 
>>>> But all OSDs are set to debug osd = 20.
>>>> 
>>>> dmesg shows:
>>>> ceph-osd[5381]: segfault at 3f592c000 ip 00007fa281d8eb23 sp
>>>> 00007fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]
>>>> 
>>>> I uploaded the following files:
>>>> priebe_fio_randwrite_ceph-osd.21.log.bz2 =>  OSD which was OK and didn't
>>>> crash
>>>> priebe_fio_randwrite_ceph-osd.22.log.bz2 =>  Log from the crashed OSD
>>>> priebe_fio_randwrite_core.ssdstor001.27204.bz2 =>  Core dump
>>>> priebe_fio_randwrite_ceph-osd.bz2 =>  osd binary
>>>> 
>>>> Stefan
>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: reproducable osd crash
  2012-06-23  6:32       ` Stefan Priebe
@ 2012-06-25 16:39         ` Dan Mick
  2012-06-25 17:19           ` Stefan Priebe
  0 siblings, 1 reply; 24+ messages in thread
From: Dan Mick @ 2012-06-25 16:39 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: Sam Just, ceph-devel

I've yet to make the core match the binary.  

On Jun 22, 2012, at 11:32 PM, Stefan Priebe <s.priebe@profihost.ag> wrote:

> Thanks did you find anything?
> 
> Am 23.06.2012 um 01:59 schrieb Sam Just <sam.just@inktank.com>:
> 
>> I am still looking into the logs.
>> -Sam
>> 
>> On Fri, Jun 22, 2012 at 3:56 PM, Dan Mick <dan.mick@inktank.com> wrote:
>>> Stefan, I'm looking at your logs and coredump now.
>>> 
>>> 
>>> On 06/21/2012 11:43 PM, Stefan Priebe wrote:
>>>> 
>>>> Does anybody have an idea? This is right now a showstopper to me.
>>>> 
>>>> Am 21.06.2012 um 14:55 schrieb Stefan Priebe - Profihost
>>>> AG<s.priebe@profihost.ag>:
>>>> 
>>>>> Hello list,
>>>>> 
>>>>> i'm able to reproducably crash osd daemons.
>>>>> 
>>>>> How i can reproduce:
>>>>> 
>>>>> Kernel: 3.5.0-rc3
>>>>> Ceph: 0.47.3
>>>>> FS: btrfs
>>>>> Journal: 2GB tmpfs per OSD
>>>>> OSD: 3x servers with 4x Intel SSD OSDs each
>>>>> 10GBE Network
>>>>> rbd_cache_max_age: 2.0
>>>>> rbd_cache_size: 33554432
>>>>> 
>>>>> Disk is set to writeback.
>>>>> 
>>>>> Start a KVM VM via PXE with the disk attached in writeback mode.
>>>>> 
>>>>> Then run randwrite stress more than 2 time. Mostly OSD 22 in my case
>>>>> crashes.
>>>>> 
>>>>> # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>>>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
>>>>> --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>>>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
>>>>> --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>>>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; halt
>>>>> 
>>>>> Strangely exactly THIS OSD also has the most log entries:
>>>>> 64K     ceph-osd.20.log
>>>>> 64K     ceph-osd.21.log
>>>>> 1,3M    ceph-osd.22.log
>>>>> 64K     ceph-osd.23.log
>>>>> 
>>>>> But all OSDs are set to debug osd = 20.
>>>>> 
>>>>> dmesg shows:
>>>>> ceph-osd[5381]: segfault at 3f592c000 ip 00007fa281d8eb23 sp
>>>>> 00007fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]
>>>>> 
>>>>> I uploaded the following files:
>>>>> priebe_fio_randwrite_ceph-osd.21.log.bz2 =>  OSD which was OK and didn't
>>>>> crash
>>>>> priebe_fio_randwrite_ceph-osd.22.log.bz2 =>  Log from the crashed OSD
>>>>> priebe_fio_randwrite_core.ssdstor001.27204.bz2 =>  Core dump
>>>>> priebe_fio_randwrite_ceph-osd.bz2 =>  osd binary
>>>>> 
>>>>> Stefan
>>>> 
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: reproducable osd crash
  2012-06-25 16:39         ` Dan Mick
@ 2012-06-25 17:19           ` Stefan Priebe
  2012-06-25 21:01             ` Dan Mick
  0 siblings, 1 reply; 24+ messages in thread
From: Stefan Priebe @ 2012-06-25 17:19 UTC (permalink / raw)
  To: Dan Mick; +Cc: Sam Just, ceph-devel


> I've yet to make the core match the binary.

I'm sorry. Is there a way to provide more / better information?

Stefan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: reproducable osd crash
  2012-06-25 17:19           ` Stefan Priebe
@ 2012-06-25 21:01             ` Dan Mick
  2012-06-25 21:18               ` Stefan Priebe
  0 siblings, 1 reply; 24+ messages in thread
From: Dan Mick @ 2012-06-25 21:01 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: Sam Just, ceph-devel



On 06/25/2012 10:19 AM, Stefan Priebe wrote:
>
>> I've yet to make the core match the binary.
>
> I'm sorry. Is there a way to provide more / better information.
>
> Stefan

Well, I was still looking at the possibility that it was my problem, but 
here are versions:

- the osd binary you sent is version 0.47.2-521-g88c7629, corresponding 
to sha1 5ce8d71fb6d2e3065c8b274e196c26288c7a93d0

- the corefile is from version 0.47.3 (sha1 
c467d9d1b2eac9d3d4706b8e044979aa63b009f8)

Using any combination of your binary or my build of the binary from 
either version does not result in a matching stack backtrace from gdb.

If you can get a corefile and binary that match and send them again, 
perhaps we can get further; I'm not sure why what you sent didn't seem 
to match.  Are you running the same version of Ceph on each host?
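
For reference, the check is roughly this (a sketch; the file names are
placeholders for the uploaded binary and core): gdb warns about a
mismatch and produces a garbled backtrace when the binary does not
correspond to the core:

gdb ./ceph-osd ./core.ssdstor001
(gdb) bt
(gdb) info sharedlibrary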


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: reproducable osd crash
  2012-06-25 21:01             ` Dan Mick
@ 2012-06-25 21:18               ` Stefan Priebe
  2012-06-26  0:11                 ` Dan Mick
  0 siblings, 1 reply; 24+ messages in thread
From: Stefan Priebe @ 2012-06-25 21:18 UTC (permalink / raw)
  To: Dan Mick; +Cc: Sam Just, ceph-devel

On 25.06.2012 23:01, Dan Mick wrote:
> On 06/25/2012 10:19 AM, Stefan Priebe wrote:
>>
>>> I've yet to make the core match the binary.
>>
>> I'm sorry. Is there a way to provide more / better information.
>>
>> Stefan
>
> Well, I was still looking at the possibility that it was my problem, but
> here are versions:
>
> - the osd binary you sent is version 0.47.2-521-g88c7629, corresponding
> to sha1 5ce8d71fb6d2e3065c8b274e196c26288c7a93d0
>
> - the corefile is from version 0.47.3 (sha1
> c467d9d1b2eac9d3d4706b8e044979aa63b009f8)

Ouch, I'm really sorry. No idea right now how this happened.

Here are two new files:
priebe_core.ssdstor001.gz
priebe_ceph-osd.gz

Most probably it is the same crash.

Thanks, sorry and
Greets
Stefan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: reproducable osd crash
  2012-06-25 21:18               ` Stefan Priebe
@ 2012-06-26  0:11                 ` Dan Mick
  2012-06-26  5:15                   ` Stefan Priebe
  0 siblings, 1 reply; 24+ messages in thread
From: Dan Mick @ 2012-06-26  0:11 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: Sam Just, ceph-devel



On 06/25/2012 02:18 PM, Stefan Priebe wrote:
> Am 25.06.2012 23:01, schrieb Dan Mick:
>> On 06/25/2012 10:19 AM, Stefan Priebe wrote:
>>>
>>>> I've yet to make the core match the binary.
>>>
>>> I'm sorry. Is there a way to provide more / better information.
>>>
>>> Stefan
>>
>> Well, I was still looking at the possibility that it was my problem, but
>> here are versions:
>>
>> - the osd binary you sent is version 0.47.2-521-g88c7629, corresponding
>> to sha1 5ce8d71fb6d2e3065c8b274e196c26288c7a93d0
>>
>> - the corefile is from version 0.47.3 (sha1
>> c467d9d1b2eac9d3d4706b8e044979aa63b009f8)
>
> ouch i'm really sorry. No idea right know how this happened.
>
> Here are two new files:
> priebe_core.ssdstor001.gz
> priebe_ceph-osd.gz
>
> most probably the same crash.
>
> Thanks, sorry and
> Greets
> Stefan

These show

core: ceph version 0.47.3 (commit:c467d9d1b2eac9d3d4706b8e044979aa63b009f8)
binary: 0.47.3-528-g9fcc3de

which don't match either.  How are you collecting the binary and the 
coredump file?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: reproducable osd crash
  2012-06-26  0:11                 ` Dan Mick
@ 2012-06-26  5:15                   ` Stefan Priebe
  2012-06-26  5:48                     ` Stefan Priebe
  0 siblings, 1 reply; 24+ messages in thread
From: Stefan Priebe @ 2012-06-26  5:15 UTC (permalink / raw)
  To: Dan Mick; +Cc: Sam Just, ceph-devel

Strange! I just copied the core dump file and the ceph-osd binary from /us

On 26.06.2012 at 02:11, Dan Mick <dan.mick@inktank.com> wrote:

> 
> 
> On 06/25/2012 02:18 PM, Stefan Priebe wrote:
>> Am 25.06.2012 23:01, schrieb Dan Mick:
>>> On 06/25/2012 10:19 AM, Stefan Priebe wrote:
>>>> 
>>>>> I've yet to make the core match the binary.
>>>> 
>>>> I'm sorry. Is there a way to provide more / better information.
>>>> 
>>>> Stefan
>>> 
>>> Well, I was still looking at the possibility that it was my problem, but
>>> here are versions:
>>> 
>>> - the osd binary you sent is version 0.47.2-521-g88c7629, corresponding
>>> to sha1 5ce8d71fb6d2e3065c8b274e196c26288c7a93d0
>>> 
>>> - the corefile is from version 0.47.3 (sha1
>>> c467d9d1b2eac9d3d4706b8e044979aa63b009f8)
>> 
>> ouch i'm really sorry. No idea right know how this happened.
>> 
>> Here are two new files:
>> priebe_core.ssdstor001.gz
>> priebe_ceph-osd.gz
>> 
>> most probably the same crash.
>> 
>> Thanks, sorry and
>> Greets
>> Stefan
> 
> These show
> 
> core: ceph version 0.47.3 (commit:c467d9d1b2eac9d3d4706b8e044979aa63b009f8)
> binary: 0.47.3-528-g9fcc3de
> 
> which don't match either.  How are you collecting the binary and the coredump file?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: reproducable osd crash
  2012-06-26  5:15                   ` Stefan Priebe
@ 2012-06-26  5:48                     ` Stefan Priebe
  2012-06-26 16:05                       ` Tommi Virtanen
  0 siblings, 1 reply; 24+ messages in thread
From: Stefan Priebe @ 2012-06-26  5:48 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: Dan Mick, Sam Just, ceph-devel

Strange, I just copied /core.hostname and /usr/bin/ceph-osd; no idea how this
can happen. For building I use the provided Debian scripts.
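
One way to make the pairing unambiguous next time would be to let the
kernel name core files after the executable, PID and timestamp (a
sketch; the target path is an assumption and writing it needs root):

echo '/core.%e.%p.%t' > /proc/sys/kernel/core_pattern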

On 26.06.2012 at 07:15, Stefan Priebe <s.priebe@profihost.ag> wrote:

> Strange! I just copied the core dump file and the ceph-osd binary from /us
> 
> Am 26.06.2012 um 02:11 schrieb Dan Mick <dan.mick@inktank.com>:
> 
>> 
>> 
>> On 06/25/2012 02:18 PM, Stefan Priebe wrote:
>>> Am 25.06.2012 23:01, schrieb Dan Mick:
>>>> On 06/25/2012 10:19 AM, Stefan Priebe wrote:
>>>>> 
>>>>>> I've yet to make the core match the binary.
>>>>> 
>>>>> I'm sorry. Is there a way to provide more / better information.
>>>>> 
>>>>> Stefan
>>>> 
>>>> Well, I was still looking at the possibility that it was my problem, but
>>>> here are versions:
>>>> 
>>>> - the osd binary you sent is version 0.47.2-521-g88c7629, corresponding
>>>> to sha1 5ce8d71fb6d2e3065c8b274e196c26288c7a93d0
>>>> 
>>>> - the corefile is from version 0.47.3 (sha1
>>>> c467d9d1b2eac9d3d4706b8e044979aa63b009f8)
>>> 
>>> ouch i'm really sorry. No idea right know how this happened.
>>> 
>>> Here are two new files:
>>> priebe_core.ssdstor001.gz
>>> priebe_ceph-osd.gz
>>> 
>>> most probably the same crash.
>>> 
>>> Thanks, sorry and
>>> Greets
>>> Stefan
>> 
>> These show
>> 
>> core: ceph version 0.47.3 (commit:c467d9d1b2eac9d3d4706b8e044979aa63b009f8)
>> binary: 0.47.3-528-g9fcc3de
>> 
>> which don't match either.  How are you collecting the binary and the coredump file?
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: reproducable osd crash
  2012-06-26  5:48                     ` Stefan Priebe
@ 2012-06-26 16:05                       ` Tommi Virtanen
  2012-06-26 16:47                         ` Stefan Priebe
  0 siblings, 1 reply; 24+ messages in thread
From: Tommi Virtanen @ 2012-06-26 16:05 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: Dan Mick, Sam Just, ceph-devel

On Mon, Jun 25, 2012 at 10:48 PM, Stefan Priebe <s.priebe@profihost.ag> wrote:
> Strange just copied /core.hostname and /usr/bin/ceph-osd no idea how this
> can happen. For building I use the provided Debian scripts.

Perhaps you upgraded the debs but did not restart the daemons? That
would make the on-disk executable with that name not match the
in-memory one.
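
A quick way to check for that (a sketch; it assumes a single ceph-osd
process) is to look at the exe link of the running daemon, which shows
"(deleted)" if the on-disk binary was replaced after the daemon started:

ls -l /proc/$(pidof ceph-osd)/exe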

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: reproducable osd crash
  2012-06-26 16:05                       ` Tommi Virtanen
@ 2012-06-26 16:47                         ` Stefan Priebe
  2012-06-26 18:01                           ` Sam Just
  0 siblings, 1 reply; 24+ messages in thread
From: Stefan Priebe @ 2012-06-26 16:47 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Dan Mick, Sam Just, ceph-devel

On 26.06.2012 18:05, Tommi Virtanen wrote:
> On Mon, Jun 25, 2012 at 10:48 PM, Stefan Priebe <s.priebe@profihost.ag> wrote:
>> Strange just copied /core.hostname and /usr/bin/ceph-osd no idea how this
>> can happen. For building I use the provided Debian scripts.
>
> Perhaps you upgraded the debs but did not restart the daemons? That
> would make the on-disk executable with that name not match the
> in-memory one.

No, I reboot after each upgrade ;-)

Right now I'm waiting for an FS fix (XFS or btrfs) and I will then reproduce 
the issue.

Stefan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: reproducable osd crash
  2012-06-26 16:47                         ` Stefan Priebe
@ 2012-06-26 18:01                           ` Sam Just
  2012-06-27  7:22                             ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 24+ messages in thread
From: Sam Just @ 2012-06-26 18:01 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: Tommi Virtanen, Dan Mick, ceph-devel

Stefan,

Sorry for the delay, I think I've found the problem.  Could you give
wip_ms_handle_reset_race a try?
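
Something like this should get you the branch for a test build (a
sketch; it assumes you build from a clone of the upstream ceph.git
repository and use the Debian packaging you mentioned):

git fetch origin
git checkout -b wip_ms_handle_reset_race origin/wip_ms_handle_reset_race
dpkg-buildpackage -b
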
-Sam

On Tue, Jun 26, 2012 at 9:47 AM, Stefan Priebe <s.priebe@profihost.ag> wrote:
> Am 26.06.2012 18:05, schrieb Tommi Virtanen:
>
>> On Mon, Jun 25, 2012 at 10:48 PM, Stefan Priebe <s.priebe@profihost.ag>
>> wrote:
>>>
>>> Strange just copied /core.hostname and /usr/bin/ceph-osd no idea how this
>>> can happen. For building I use the provided Debian scripts.
>>
>>
>> Perhaps you upgraded the debs but did not restart the daemons? That
>> would make the on-disk executable with that name not match the
>> in-memory one.
>
>
> No, i reboot after each upgrade ;-)
>
> Right now i'm witing for a FS fix xfs or btrfs and i will then reproduce the
> issue.
>
> Stefan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: reproducable osd crash
  2012-06-26 18:01                           ` Sam Just
@ 2012-06-27  7:22                             ` Stefan Priebe - Profihost AG
  2012-06-27 15:19                               ` Sage Weil
  0 siblings, 1 reply; 24+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-27  7:22 UTC (permalink / raw)
  To: Sam Just; +Cc: Tommi Virtanen, Dan Mick, ceph-devel

THANKS a lot. This fixes it. I've merged your branch into next and I 
wasn't able to trigger the OSD crash again. So please include this in 0.48.

Greets
Stefan

On 26.06.2012 20:01, Sam Just wrote:
> Stefan,
>
> Sorry for the delay, I think I've found the problem.  Could you give
> wip_ms_handle_reset_race a try?
> -Sam
>
> On Tue, Jun 26, 2012 at 9:47 AM, Stefan Priebe <s.priebe@profihost.ag> wrote:
>> Am 26.06.2012 18:05, schrieb Tommi Virtanen:
>>
>>> On Mon, Jun 25, 2012 at 10:48 PM, Stefan Priebe <s.priebe@profihost.ag>
>>> wrote:
>>>>
>>>> Strange just copied /core.hostname and /usr/bin/ceph-osd no idea how this
>>>> can happen. For building I use the provided Debian scripts.
>>>
>>>
>>> Perhaps you upgraded the debs but did not restart the daemons? That
>>> would make the on-disk executable with that name not match the
>>> in-memory one.
>>
>>
>> No, i reboot after each upgrade ;-)
>>
>> Right now i'm witing for a FS fix xfs or btrfs and i will then reproduce the
>> issue.
>>
>> Stefan
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: reproducable osd crash
  2012-06-27  7:22                             ` Stefan Priebe - Profihost AG
@ 2012-06-27 15:19                               ` Sage Weil
  0 siblings, 0 replies; 24+ messages in thread
From: Sage Weil @ 2012-06-27 15:19 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Sam Just, Tommi Virtanen, Dan Mick, ceph-devel

On Wed, 27 Jun 2012, Stefan Priebe - Profihost AG wrote:
> THANKS a lot. This fixes it. I've merged your branch into next and i wsn't
> able to trigger the osd crash again. So please include this into 0.48.

Excellent.  Thanks for testing!  This is now in next.

sage


> 
> Greets
> Stefan
> 
> Am 26.06.2012 20:01, schrieb Sam Just:
> > Stefan,
> > 
> > Sorry for the delay, I think I've found the problem.  Could you give
> > wip_ms_handle_reset_race a try?
> > -Sam
> > 
> > On Tue, Jun 26, 2012 at 9:47 AM, Stefan Priebe <s.priebe@profihost.ag>
> > wrote:
> > > Am 26.06.2012 18:05, schrieb Tommi Virtanen:
> > > 
> > > > On Mon, Jun 25, 2012 at 10:48 PM, Stefan Priebe <s.priebe@profihost.ag>
> > > > wrote:
> > > > > 
> > > > > Strange just copied /core.hostname and /usr/bin/ceph-osd no idea how
> > > > > this
> > > > > can happen. For building I use the provided Debian scripts.
> > > > 
> > > > 
> > > > Perhaps you upgraded the debs but did not restart the daemons? That
> > > > would make the on-disk executable with that name not match the
> > > > in-memory one.
> > > 
> > > 
> > > No, i reboot after each upgrade ;-)
> > > 
> > > Right now i'm witing for a FS fix xfs or btrfs and i will then reproduce
> > > the
> > > issue.
> > > 
> > > Stefan
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2012-06-27 15:19 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-21 12:55 reproducable osd crash Stefan Priebe - Profihost AG
2012-06-21 13:07 ` Stefan Priebe - Profihost AG
2012-06-21 13:13   ` Stefan Priebe - Profihost AG
2012-06-21 13:23     ` Stefan Priebe - Profihost AG
2012-06-21 19:57       ` Stefan Priebe
2012-06-22 16:01         ` Stefan Priebe - Profihost AG
2012-06-22  6:43 ` Stefan Priebe
2012-06-22 22:56   ` Dan Mick
2012-06-22 23:59     ` Sam Just
2012-06-23  6:32       ` Stefan Priebe
2012-06-25 16:39         ` Dan Mick
2012-06-25 17:19           ` Stefan Priebe
2012-06-25 21:01             ` Dan Mick
2012-06-25 21:18               ` Stefan Priebe
2012-06-26  0:11                 ` Dan Mick
2012-06-26  5:15                   ` Stefan Priebe
2012-06-26  5:48                     ` Stefan Priebe
2012-06-26 16:05                       ` Tommi Virtanen
2012-06-26 16:47                         ` Stefan Priebe
2012-06-26 18:01                           ` Sam Just
2012-06-27  7:22                             ` Stefan Priebe - Profihost AG
2012-06-27 15:19                               ` Sage Weil
2012-06-23  0:26   ` Dan Mick
2012-06-23  6:32     ` Stefan Priebe
