* reproducible osd crash
@ 2012-06-21 12:55 Stefan Priebe - Profihost AG
2012-06-21 13:07 ` Stefan Priebe - Profihost AG
2012-06-22 6:43 ` Stefan Priebe
0 siblings, 2 replies; 24+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-21 12:55 UTC (permalink / raw)
To: ceph-devel
Hello list,
I'm able to reproducibly crash OSD daemons.
How I can reproduce it:
Kernel: 3.5.0-rc3
Ceph: 0.47.3
FS: btrfs
Journal: 2GB tmpfs per OSD
OSD: 3x servers with 4x Intel SSD OSDs each
10GBE Network
rbd_cache_max_age: 2.0
rbd_cache_size: 33554432
Disk is set to writeback.
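For reference, the rbd_cache_size value above is given in bytes; a quick check (nothing Ceph-specific, just arithmetic) confirms it is 32 MiB:

```shell
# rbd_cache_size is specified in bytes; 33554432 bytes = 32 MiB
echo $(( 33554432 / 1024 / 1024 ))   # prints 32
```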
Start a KVM VM via PXE with the disk attached in writeback mode.
Then run the randwrite stress test more than two times. In my case, mostly
OSD 22 crashes.
# fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
--numjobs=50 --runtime=90 --group_reporting --name=file1; fio
--filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
--numjobs=50 --runtime=90 --group_reporting --name=file1; fio
--filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
--numjobs=50 --runtime=90 --group_reporting --name=file1; halt
Strangely, exactly THIS OSD also has the most log entries:
64K ceph-osd.20.log
64K ceph-osd.21.log
1,3M ceph-osd.22.log
64K ceph-osd.23.log
But all OSDs are set to debug osd = 20.
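For completeness, the debug level mentioned above would sit in ceph.conf roughly like this (a sketch; the section placement follows common ceph.conf conventions):

```ini
[osd]
    debug osd = 20
```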
dmesg shows:
ceph-osd[5381]: segfault at 3f592c000 ip 00007fa281d8eb23 sp
00007fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]
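The segfault line can be decoded a bit further: subtracting the library load address from the faulting instruction pointer gives the offset inside libtcmalloc, which indeed lies within the mapped 0x3d000 bytes:

```shell
# Offset of the faulting instruction inside libtcmalloc.so.0.0.0
printf '0x%x\n' $(( 0x7fa281d8eb23 - 0x7fa281d6a000 ))   # prints 0x24b23, which is < 0x3d000
```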
I uploaded the following files:
priebe_fio_randwrite_ceph-osd.21.log.bz2 => OSD which was OK and didn't
crash
priebe_fio_randwrite_ceph-osd.22.log.bz2 => Log from the crashed OSD
priebe_fio_randwrite_core.ssdstor001.27204.bz2 => Core dump
priebe_fio_randwrite_ceph-osd.bz2 => osd binary
Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: reproducible osd crash
2012-06-21 12:55 reproducible osd crash Stefan Priebe - Profihost AG
@ 2012-06-21 13:07 ` Stefan Priebe - Profihost AG
2012-06-21 13:13 ` Stefan Priebe - Profihost AG
2012-06-22 6:43 ` Stefan Priebe
1 sibling, 1 reply; 24+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-21 13:07 UTC (permalink / raw)
To: ceph-devel
When I now start the OSD again, it seems to hang forever. Load goes
up to 200 and I/O wait rises from 0% to 20%.
On 21.06.2012 14:55, Stefan Priebe - Profihost AG wrote:
> [...]
* Re: reproducible osd crash
2012-06-21 13:07 ` Stefan Priebe - Profihost AG
@ 2012-06-21 13:13 ` Stefan Priebe - Profihost AG
2012-06-21 13:23 ` Stefan Priebe - Profihost AG
0 siblings, 1 reply; 24+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-21 13:13 UTC (permalink / raw)
To: ceph-devel
Another strange thing: why does THIS OSD use 24 GB while the others use
just 650 MB?
/dev/sdb1 224G 654M 214G 1% /srv/osd.20
/dev/sdc1 224G 638M 214G 1% /srv/osd.21
/dev/sdd1 224G 24G 190G 12% /srv/osd.22
/dev/sde1 224G 607M 214G 1% /srv/osd.23
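One quick way to pick the outlier out of such df output is to filter on the use% column; the data is inlined here so the snippet is self-contained (on a live node you would pipe df output in directly):

```shell
# Print any OSD mount whose use% is 10% or more (data copied from the df output above)
df_out='/dev/sdb1 224G 654M 214G 1% /srv/osd.20
/dev/sdc1 224G 638M 214G 1% /srv/osd.21
/dev/sdd1 224G 24G 190G 12% /srv/osd.22
/dev/sde1 224G 607M 214G 1% /srv/osd.23'
printf '%s\n' "$df_out" | awk '{ sub(/%/, "", $5); if ($5 + 0 >= 10) print $6 }'   # prints /srv/osd.22
```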
> [...]
* Re: reproducible osd crash
2012-06-21 13:13 ` Stefan Priebe - Profihost AG
@ 2012-06-21 13:23 ` Stefan Priebe - Profihost AG
2012-06-21 19:57 ` Stefan Priebe
0 siblings, 1 reply; 24+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-21 13:23 UTC (permalink / raw)
To: ceph-devel
Hmm, is this normal? (ceph health is NOW OK again)
/dev/sdb1 224G 655M 214G 1% /srv/osd.20
/dev/sdc1 224G 640M 214G 1% /srv/osd.21
/dev/sdd1 224G 34G 181G 16% /srv/osd.22
/dev/sde1 224G 608M 214G 1% /srv/osd.23
Why does one OSD have so much more used space than the others?
On my other OSD nodes, all OSDs use around 600-700 MB. Even when I reformat
/dev/sdd1, after the backfill it again uses 34 GB.
Stefan
On 21.06.2012 15:13, Stefan Priebe - Profihost AG wrote:
> [...]
* Re: reproducible osd crash
2012-06-21 13:23 ` Stefan Priebe - Profihost AG
@ 2012-06-21 19:57 ` Stefan Priebe
2012-06-22 16:01 ` Stefan Priebe - Profihost AG
0 siblings, 1 reply; 24+ messages in thread
From: Stefan Priebe @ 2012-06-21 19:57 UTC (permalink / raw)
To: ceph-devel
OK, I discovered that this time all OSDs had the same disk usage before
the crash. After starting the OSD again, I got this:
/dev/sdb1 224G 23G 191G 11% /srv/osd.30
/dev/sdc1 224G 1,5G 213G 1% /srv/osd.31
/dev/sdd1 224G 1,5G 213G 1% /srv/osd.32
/dev/sde1 224G 1,6G 213G 1% /srv/osd.33
So instead of 1.5 GB, osd.30 now uses 23 GB.
Stefan
On 21.06.2012 15:23, Stefan Priebe - Profihost AG wrote:
> [...]
* Re: reproducible osd crash
2012-06-21 12:55 reproducible osd crash Stefan Priebe - Profihost AG
2012-06-21 13:07 ` Stefan Priebe - Profihost AG
@ 2012-06-22 6:43 ` Stefan Priebe
2012-06-22 22:56 ` Dan Mick
2012-06-23 0:26 ` Dan Mick
1 sibling, 2 replies; 24+ messages in thread
From: Stefan Priebe @ 2012-06-22 6:43 UTC (permalink / raw)
To: Stefan Priebe - Profihost AG; +Cc: ceph-devel
Does anybody have an idea? This is a showstopper for me right now.
On 21.06.2012 at 14:55, Stefan Priebe - Profihost AG <s.priebe@profihost.ag> wrote:
> [...]
* Re: reproducible osd crash
2012-06-21 19:57 ` Stefan Priebe
@ 2012-06-22 16:01 ` Stefan Priebe - Profihost AG
0 siblings, 0 replies; 24+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-22 16:01 UTC (permalink / raw)
To: ceph-devel
I'm still able to crash the Ceph cluster by doing a lot of random I/O
and then shutting down the KVM.
Stefan
On 21.06.2012 21:57, Stefan Priebe wrote:
> [...]
* Re: reproducible osd crash
2012-06-22 6:43 ` Stefan Priebe
@ 2012-06-22 22:56 ` Dan Mick
2012-06-22 23:59 ` Sam Just
2012-06-23 0:26 ` Dan Mick
1 sibling, 1 reply; 24+ messages in thread
From: Dan Mick @ 2012-06-22 22:56 UTC (permalink / raw)
To: Stefan Priebe; +Cc: ceph-devel
Stefan, I'm looking at your logs and coredump now.
On 06/21/2012 11:43 PM, Stefan Priebe wrote:
> [...]
* Re: reproducible osd crash
2012-06-22 22:56 ` Dan Mick
@ 2012-06-22 23:59 ` Sam Just
2012-06-23 6:32 ` Stefan Priebe
0 siblings, 1 reply; 24+ messages in thread
From: Sam Just @ 2012-06-22 23:59 UTC (permalink / raw)
To: Dan Mick; +Cc: Stefan Priebe, ceph-devel
I am still looking into the logs.
-Sam
On Fri, Jun 22, 2012 at 3:56 PM, Dan Mick <dan.mick@inktank.com> wrote:
> [...]
* Re: reproducible osd crash
2012-06-22 6:43 ` Stefan Priebe
2012-06-22 22:56 ` Dan Mick
@ 2012-06-23 0:26 ` Dan Mick
2012-06-23 6:32 ` Stefan Priebe
1 sibling, 1 reply; 24+ messages in thread
From: Dan Mick @ 2012-06-23 0:26 UTC (permalink / raw)
To: Stefan Priebe; +Cc: ceph-devel
The ceph-osd binary you sent claims to be version 0.47.2-521-g88c762,
which is not quite 0.47.3. You can get the version with <binary> -v or,
as in my case, by examining strings in the binary. I'm retrieving that
version to analyze the core dump.
On 06/21/2012 11:43 PM, Stefan Priebe wrote:
> [...]
* Re: reproducible osd crash
2012-06-23 0:26 ` Dan Mick
@ 2012-06-23 6:32 ` Stefan Priebe
0 siblings, 0 replies; 24+ messages in thread
From: Stefan Priebe @ 2012-06-23 6:32 UTC (permalink / raw)
To: Dan Mick; +Cc: ceph-devel
Thanks, yes, it is from the next branch.
On 23.06.2012 at 02:26, Dan Mick <dan.mick@inktank.com> wrote:
> [...]
* Re: reproducible osd crash
2012-06-22 23:59 ` Sam Just
@ 2012-06-23 6:32 ` Stefan Priebe
2012-06-25 16:39 ` Dan Mick
0 siblings, 1 reply; 24+ messages in thread
From: Stefan Priebe @ 2012-06-23 6:32 UTC (permalink / raw)
To: Sam Just; +Cc: Dan Mick, ceph-devel
Thanks, did you find anything?
On 23.06.2012 at 01:59, Sam Just <sam.just@inktank.com> wrote:
> I am still looking into the logs.
> -Sam
>
> On Fri, Jun 22, 2012 at 3:56 PM, Dan Mick <dan.mick@inktank.com> wrote:
>> Stefan, I'm looking at your logs and coredump now.
>>
>>
>> On 06/21/2012 11:43 PM, Stefan Priebe wrote:
>>>
>>> Does anybody have an idea? This is currently a showstopper for me.
>>>
>>> Am 21.06.2012 um 14:55 schrieb Stefan Priebe - Profihost
>>> AG<s.priebe@profihost.ag>:
>>>
>>>> Hello list,
>>>>
>>>> I'm able to reproducibly crash OSD daemons.
>>>>
>>>> How I can reproduce:
>>>>
>>>> Kernel: 3.5.0-rc3
>>>> Ceph: 0.47.3
>>>> FS: btrfs
>>>> Journal: 2GB tmpfs per OSD
>>>> OSD: 3x servers with 4x Intel SSD OSDs each
>>>> 10GBE Network
>>>> rbd_cache_max_age: 2.0
>>>> rbd_cache_size: 33554432
>>>>
>>>> Disk is set to writeback.
>>>>
>>>> Start a KVM VM via PXE with the disk attached in writeback mode.
>>>>
>>>> Then run the randwrite stress test more than two times. In my case it is
>>>> mostly OSD 22 that crashes.
>>>>
>>>> # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
>>>> --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
>>>> --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; halt
>>>>
>>>> Strangely, exactly THIS OSD also has the most log entries:
>>>> 64K ceph-osd.20.log
>>>> 64K ceph-osd.21.log
>>>> 1,3M ceph-osd.22.log
>>>> 64K ceph-osd.23.log
>>>>
>>>> But all OSDs are set to debug osd = 20.
>>>>
>>>> dmesg shows:
>>>> ceph-osd[5381]: segfault at 3f592c000 ip 00007fa281d8eb23 sp
>>>> 00007fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]
>>>>
>>>> I uploaded the following files:
>>>> priebe_fio_randwrite_ceph-osd.21.log.bz2 => OSD which was OK and didn't
>>>> crash
>>>> priebe_fio_randwrite_ceph-osd.22.log.bz2 => Log from the crashed OSD
>>>> priebe_fio_randwrite_core.ssdstor001.27204.bz2 => Core dump
>>>> priebe_fio_randwrite_ceph-osd.bz2 => osd binary
>>>>
>>>> Stefan
>>>
* Re: reproducable osd crash
2012-06-23 6:32 ` Stefan Priebe
@ 2012-06-25 16:39 ` Dan Mick
2012-06-25 17:19 ` Stefan Priebe
0 siblings, 1 reply; 24+ messages in thread
From: Dan Mick @ 2012-06-25 16:39 UTC (permalink / raw)
To: Stefan Priebe; +Cc: Sam Just, ceph-devel
I've yet to make the core match the binary.
On Jun 22, 2012, at 11:32 PM, Stefan Priebe <s.priebe@profihost.ag> wrote:
> Thanks. Did you find anything?
>
> Am 23.06.2012 um 01:59 schrieb Sam Just <sam.just@inktank.com>:
>
>> I am still looking into the logs.
>> -Sam
>>
>> On Fri, Jun 22, 2012 at 3:56 PM, Dan Mick <dan.mick@inktank.com> wrote:
>>> Stefan, I'm looking at your logs and coredump now.
>>>
>>>
>>> On 06/21/2012 11:43 PM, Stefan Priebe wrote:
>>>>
>>>> Does anybody have an idea? This is currently a showstopper for me.
>>>>
>>>> Am 21.06.2012 um 14:55 schrieb Stefan Priebe - Profihost
>>>> AG<s.priebe@profihost.ag>:
>>>>
>>>>> Hello list,
>>>>>
>>>>> I'm able to reproducibly crash OSD daemons.
>>>>>
>>>>> How I can reproduce:
>>>>>
>>>>> Kernel: 3.5.0-rc3
>>>>> Ceph: 0.47.3
>>>>> FS: btrfs
>>>>> Journal: 2GB tmpfs per OSD
>>>>> OSD: 3x servers with 4x Intel SSD OSDs each
>>>>> 10GBE Network
>>>>> rbd_cache_max_age: 2.0
>>>>> rbd_cache_size: 33554432
>>>>>
>>>>> Disk is set to writeback.
>>>>>
>>>>> Start a KVM VM via PXE with the disk attached in writeback mode.
>>>>>
>>>>> Then run the randwrite stress test more than two times. In my case it is
>>>>> mostly OSD 22 that crashes.
>>>>>
>>>>> # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>>>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
>>>>> --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>>>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
>>>>> --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
>>>>> --numjobs=50 --runtime=90 --group_reporting --name=file1; halt
>>>>>
>>>>> Strangely, exactly THIS OSD also has the most log entries:
>>>>> 64K ceph-osd.20.log
>>>>> 64K ceph-osd.21.log
>>>>> 1,3M ceph-osd.22.log
>>>>> 64K ceph-osd.23.log
>>>>>
>>>>> But all OSDs are set to debug osd = 20.
>>>>>
>>>>> dmesg shows:
>>>>> ceph-osd[5381]: segfault at 3f592c000 ip 00007fa281d8eb23 sp
>>>>> 00007fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]
>>>>>
>>>>> I uploaded the following files:
>>>>> priebe_fio_randwrite_ceph-osd.21.log.bz2 => OSD which was OK and didn't
>>>>> crash
>>>>> priebe_fio_randwrite_ceph-osd.22.log.bz2 => Log from the crashed OSD
>>>>> priebe_fio_randwrite_core.ssdstor001.27204.bz2 => Core dump
>>>>> priebe_fio_randwrite_ceph-osd.bz2 => osd binary
>>>>>
>>>>> Stefan
>>>>
* Re: reproducable osd crash
2012-06-25 16:39 ` Dan Mick
@ 2012-06-25 17:19 ` Stefan Priebe
2012-06-25 21:01 ` Dan Mick
0 siblings, 1 reply; 24+ messages in thread
From: Stefan Priebe @ 2012-06-25 17:19 UTC (permalink / raw)
To: Dan Mick; +Cc: Sam Just, ceph-devel
> I've yet to make the core match the binary.
I'm sorry. Is there a way to provide more or better information?
Stefan
* Re: reproducable osd crash
2012-06-25 17:19 ` Stefan Priebe
@ 2012-06-25 21:01 ` Dan Mick
2012-06-25 21:18 ` Stefan Priebe
0 siblings, 1 reply; 24+ messages in thread
From: Dan Mick @ 2012-06-25 21:01 UTC (permalink / raw)
To: Stefan Priebe; +Cc: Sam Just, ceph-devel
On 06/25/2012 10:19 AM, Stefan Priebe wrote:
>
>> I've yet to make the core match the binary.
>
> I'm sorry. Is there a way to provide more or better information?
>
> Stefan
Well, I was still looking at the possibility that it was my problem, but
here are versions:
- the osd binary you sent is version 0.47.2-521-g88c7629, corresponding
to sha1 5ce8d71fb6d2e3065c8b274e196c26288c7a93d0
- the corefile is from version 0.47.3 (sha1
c467d9d1b2eac9d3d4706b8e044979aa63b009f8)
Using any combination of your binary or my build of the binary from
either version does not result in a matching stack backtrace from gdb.
If you can get a corefile and binary that match and send them again,
perhaps we can get further; I'm not sure why what you sent didn't seem
to match. Are you running the same version of Ceph on each host?
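A quick way to check the pairing before uploading: both the ceph-osd binary and its core embed a "ceph version" string, so a mismatch can be spotted without gdb. A sketch using hypothetical local file names for the pair:

```shell
# Hypothetical local names for the binary/core pair.
BIN=./ceph-osd
CORE=./core.ssdstor001

# Pull the first embedded version string out of a file.
version_of() { strings "$1" | grep -m1 'ceph version'; }

if [ -f "$BIN" ] && [ -f "$CORE" ]; then
    version_of "$BIN"
    version_of "$CORE"
    # Only when the two strings agree is a backtrace meaningful:
    gdb --batch -ex 'thread apply all bt' "$BIN" "$CORE"
fi
```

If the two version strings differ, gdb will happily print a backtrace anyway, but the frames will be garbage, which matches the mismatched traces described here.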
* Re: reproducable osd crash
2012-06-25 21:01 ` Dan Mick
@ 2012-06-25 21:18 ` Stefan Priebe
2012-06-26 0:11 ` Dan Mick
0 siblings, 1 reply; 24+ messages in thread
From: Stefan Priebe @ 2012-06-25 21:18 UTC (permalink / raw)
To: Dan Mick; +Cc: Sam Just, ceph-devel
Am 25.06.2012 23:01, schrieb Dan Mick:
> On 06/25/2012 10:19 AM, Stefan Priebe wrote:
>>
>>> I've yet to make the core match the binary.
>>
>> I'm sorry. Is there a way to provide more or better information?
>>
>> Stefan
>
> Well, I was still looking at the possibility that it was my problem, but
> here are versions:
>
> - the osd binary you sent is version 0.47.2-521-g88c7629, corresponding
> to sha1 5ce8d71fb6d2e3065c8b274e196c26288c7a93d0
>
> - the corefile is from version 0.47.3 (sha1
> c467d9d1b2eac9d3d4706b8e044979aa63b009f8)
Ouch, I'm really sorry. No idea right now how this happened.
Here are two new files:
priebe_core.ssdstor001.gz
priebe_ceph-osd.gz
Most probably the same crash.
Thanks, sorry and
Greets
Stefan
* Re: reproducable osd crash
2012-06-25 21:18 ` Stefan Priebe
@ 2012-06-26 0:11 ` Dan Mick
2012-06-26 5:15 ` Stefan Priebe
0 siblings, 1 reply; 24+ messages in thread
From: Dan Mick @ 2012-06-26 0:11 UTC (permalink / raw)
To: Stefan Priebe; +Cc: Sam Just, ceph-devel
On 06/25/2012 02:18 PM, Stefan Priebe wrote:
> Am 25.06.2012 23:01, schrieb Dan Mick:
>> On 06/25/2012 10:19 AM, Stefan Priebe wrote:
>>>
>>>> I've yet to make the core match the binary.
>>>
>>> I'm sorry. Is there a way to provide more or better information?
>>>
>>> Stefan
>>
>> Well, I was still looking at the possibility that it was my problem, but
>> here are versions:
>>
>> - the osd binary you sent is version 0.47.2-521-g88c7629, corresponding
>> to sha1 5ce8d71fb6d2e3065c8b274e196c26288c7a93d0
>>
>> - the corefile is from version 0.47.3 (sha1
>> c467d9d1b2eac9d3d4706b8e044979aa63b009f8)
>
> Ouch, I'm really sorry. No idea right now how this happened.
>
> Here are two new files:
> priebe_core.ssdstor001.gz
> priebe_ceph-osd.gz
>
> Most probably the same crash.
>
> Thanks, sorry and
> Greets
> Stefan
These show
core: ceph version 0.47.3 (commit:c467d9d1b2eac9d3d4706b8e044979aa63b009f8)
binary: 0.47.3-528-g9fcc3de
which don't match either. How are you collecting the binary and the
coredump file?
* Re: reproducable osd crash
2012-06-26 0:11 ` Dan Mick
@ 2012-06-26 5:15 ` Stefan Priebe
2012-06-26 5:48 ` Stefan Priebe
0 siblings, 1 reply; 24+ messages in thread
From: Stefan Priebe @ 2012-06-26 5:15 UTC (permalink / raw)
To: Dan Mick; +Cc: Sam Just, ceph-devel
Strange! I just copied the core dump file and the ceph-osd binary from /us
Am 26.06.2012 um 02:11 schrieb Dan Mick <dan.mick@inktank.com>:
>
>
> On 06/25/2012 02:18 PM, Stefan Priebe wrote:
>> Am 25.06.2012 23:01, schrieb Dan Mick:
>>> On 06/25/2012 10:19 AM, Stefan Priebe wrote:
>>>>
>>>>> I've yet to make the core match the binary.
>>>>
>>>> I'm sorry. Is there a way to provide more or better information?
>>>>
>>>> Stefan
>>>
>>> Well, I was still looking at the possibility that it was my problem, but
>>> here are versions:
>>>
>>> - the osd binary you sent is version 0.47.2-521-g88c7629, corresponding
>>> to sha1 5ce8d71fb6d2e3065c8b274e196c26288c7a93d0
>>>
>>> - the corefile is from version 0.47.3 (sha1
>>> c467d9d1b2eac9d3d4706b8e044979aa63b009f8)
>>
>> Ouch, I'm really sorry. No idea right now how this happened.
>>
>> Here are two new files:
>> priebe_core.ssdstor001.gz
>> priebe_ceph-osd.gz
>>
>> Most probably the same crash.
>>
>> Thanks, sorry and
>> Greets
>> Stefan
>
> These show
>
> core: ceph version 0.47.3 (commit:c467d9d1b2eac9d3d4706b8e044979aa63b009f8)
> binary: 0.47.3-528-g9fcc3de
>
> which don't match either. How are you collecting the binary and the coredump file?
* Re: reproducable osd crash
2012-06-26 5:15 ` Stefan Priebe
@ 2012-06-26 5:48 ` Stefan Priebe
2012-06-26 16:05 ` Tommi Virtanen
0 siblings, 1 reply; 24+ messages in thread
From: Stefan Priebe @ 2012-06-26 5:48 UTC (permalink / raw)
To: Stefan Priebe; +Cc: Dan Mick, Sam Just, ceph-devel
Strange. I just copied /core.hostname and /usr/bin/ceph-osd; no idea how this
can happen. For building I use the provided Debian scripts.
Am 26.06.2012 um 07:15 schrieb Stefan Priebe <s.priebe@profihost.ag>:
> Strange! I just copied the core dump file and the ceph-osd binary from /us
>
> Am 26.06.2012 um 02:11 schrieb Dan Mick <dan.mick@inktank.com>:
>
>>
>>
>> On 06/25/2012 02:18 PM, Stefan Priebe wrote:
>>> Am 25.06.2012 23:01, schrieb Dan Mick:
>>>> On 06/25/2012 10:19 AM, Stefan Priebe wrote:
>>>>>
>>>>>> I've yet to make the core match the binary.
>>>>>
>>>>> I'm sorry. Is there a way to provide more or better information?
>>>>>
>>>>> Stefan
>>>>
>>>> Well, I was still looking at the possibility that it was my problem, but
>>>> here are versions:
>>>>
>>>> - the osd binary you sent is version 0.47.2-521-g88c7629, corresponding
>>>> to sha1 5ce8d71fb6d2e3065c8b274e196c26288c7a93d0
>>>>
>>>> - the corefile is from version 0.47.3 (sha1
>>>> c467d9d1b2eac9d3d4706b8e044979aa63b009f8)
>>>
>>> Ouch, I'm really sorry. No idea right now how this happened.
>>>
>>> Here are two new files:
>>> priebe_core.ssdstor001.gz
>>> priebe_ceph-osd.gz
>>>
>>> Most probably the same crash.
>>>
>>> Thanks, sorry and
>>> Greets
>>> Stefan
>>
>> These show
>>
>> core: ceph version 0.47.3 (commit:c467d9d1b2eac9d3d4706b8e044979aa63b009f8)
>> binary: 0.47.3-528-g9fcc3de
>>
>> which don't match either. How are you collecting the binary and the coredump file?
* Re: reproducable osd crash
2012-06-26 5:48 ` Stefan Priebe
@ 2012-06-26 16:05 ` Tommi Virtanen
2012-06-26 16:47 ` Stefan Priebe
0 siblings, 1 reply; 24+ messages in thread
From: Tommi Virtanen @ 2012-06-26 16:05 UTC (permalink / raw)
To: Stefan Priebe; +Cc: Dan Mick, Sam Just, ceph-devel
On Mon, Jun 25, 2012 at 10:48 PM, Stefan Priebe <s.priebe@profihost.ag> wrote:
> Strange. I just copied /core.hostname and /usr/bin/ceph-osd; no idea how this
> can happen. For building I use the provided Debian scripts.
Perhaps you upgraded the debs but did not restart the daemons? That
would make the on-disk executable with that name not match the
in-memory one.
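This scenario can be ruled in or out on a live host: /proc/&lt;pid&gt;/exe is a link to the image the daemon was actually started from, and the kernel appends " (deleted)" to the link target once an upgrade replaces the file on disk. A sketch (assumes a single local ceph-osd; a harmless no-op if none is running):

```shell
# First PID of a running ceph-osd, empty if none is found.
PID=$( (pidof ceph-osd 2>/dev/null || true) | awk '{print $1}')

if [ -n "$PID" ]; then
    # Shows " (deleted)" if the started-from binary was replaced.
    ls -l "/proc/$PID/exe"
    # Identical checksums mean the running daemon really is the
    # on-disk build, so a core/binary mismatch lies elsewhere.
    md5sum "/proc/$PID/exe" /usr/bin/ceph-osd
fi
```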
* Re: reproducable osd crash
2012-06-26 16:05 ` Tommi Virtanen
@ 2012-06-26 16:47 ` Stefan Priebe
2012-06-26 18:01 ` Sam Just
0 siblings, 1 reply; 24+ messages in thread
From: Stefan Priebe @ 2012-06-26 16:47 UTC (permalink / raw)
To: Tommi Virtanen; +Cc: Dan Mick, Sam Just, ceph-devel
Am 26.06.2012 18:05, schrieb Tommi Virtanen:
> On Mon, Jun 25, 2012 at 10:48 PM, Stefan Priebe <s.priebe@profihost.ag> wrote:
>> Strange. I just copied /core.hostname and /usr/bin/ceph-osd; no idea how this
>> can happen. For building I use the provided Debian scripts.
>
> Perhaps you upgraded the debs but did not restart the daemons? That
> would make the on-disk executable with that name not match the
> in-memory one.
No, I reboot after each upgrade ;-)
Right now I'm waiting for an FS fix (xfs or btrfs) and I will then reproduce
the issue.
Stefan
* Re: reproducable osd crash
2012-06-26 16:47 ` Stefan Priebe
@ 2012-06-26 18:01 ` Sam Just
2012-06-27 7:22 ` Stefan Priebe - Profihost AG
0 siblings, 1 reply; 24+ messages in thread
From: Sam Just @ 2012-06-26 18:01 UTC (permalink / raw)
To: Stefan Priebe; +Cc: Tommi Virtanen, Dan Mick, ceph-devel
Stefan,
Sorry for the delay, I think I've found the problem. Could you give
wip_ms_handle_reset_race a try?
-Sam
On Tue, Jun 26, 2012 at 9:47 AM, Stefan Priebe <s.priebe@profihost.ag> wrote:
> Am 26.06.2012 18:05, schrieb Tommi Virtanen:
>
>> On Mon, Jun 25, 2012 at 10:48 PM, Stefan Priebe <s.priebe@profihost.ag>
>> wrote:
>>>
>>> Strange. I just copied /core.hostname and /usr/bin/ceph-osd; no idea how this
>>> can happen. For building I use the provided Debian scripts.
>>
>>
>> Perhaps you upgraded the debs but did not restart the daemons? That
>> would make the on-disk executable with that name not match the
>> in-memory one.
>
>
> No, I reboot after each upgrade ;-)
>
> Right now I'm waiting for an FS fix (xfs or btrfs) and I will then reproduce
> the issue.
>
> Stefan
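For the record, trying Sam's fix branch looks roughly like the following. The branch name comes from this thread; the repository URL and the exact Debian build invocation are assumptions (Stefan only says he builds with "the provided Debian scripts"). The network and build steps are gated behind RUN_BUILD=1 so the sketch is safe to paste:

```shell
BRANCH=wip_ms_handle_reset_race   # branch named by Sam in this thread

if [ "${RUN_BUILD:-0}" = "1" ]; then
    # Repository URL is an assumption.
    git clone https://github.com/ceph/ceph.git
    cd ceph
    git checkout "$BRANCH"
    # "Provided Debian scripts" -- the exact invocation is an assumption.
    dpkg-buildpackage -b -uc -us
fi
```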
* Re: reproducable osd crash
2012-06-26 18:01 ` Sam Just
@ 2012-06-27 7:22 ` Stefan Priebe - Profihost AG
2012-06-27 15:19 ` Sage Weil
0 siblings, 1 reply; 24+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-27 7:22 UTC (permalink / raw)
To: Sam Just; +Cc: Tommi Virtanen, Dan Mick, ceph-devel
THANKS a lot. This fixes it. I've merged your branch into next and I
wasn't able to trigger the OSD crash again. So please include this in 0.48.
Greets
Stefan
Am 26.06.2012 20:01, schrieb Sam Just:
> Stefan,
>
> Sorry for the delay, I think I've found the problem. Could you give
> wip_ms_handle_reset_race a try?
> -Sam
>
> On Tue, Jun 26, 2012 at 9:47 AM, Stefan Priebe <s.priebe@profihost.ag> wrote:
>> Am 26.06.2012 18:05, schrieb Tommi Virtanen:
>>
>>> On Mon, Jun 25, 2012 at 10:48 PM, Stefan Priebe <s.priebe@profihost.ag>
>>> wrote:
>>>>
>>>> Strange. I just copied /core.hostname and /usr/bin/ceph-osd; no idea how this
>>>> can happen. For building I use the provided Debian scripts.
>>>
>>>
>>> Perhaps you upgraded the debs but did not restart the daemons? That
>>> would make the on-disk executable with that name not match the
>>> in-memory one.
>>
>>
>> No, I reboot after each upgrade ;-)
>>
>> Right now I'm waiting for an FS fix (xfs or btrfs) and I will then reproduce
>> the issue.
>>
>> Stefan
* Re: reproducable osd crash
2012-06-27 7:22 ` Stefan Priebe - Profihost AG
@ 2012-06-27 15:19 ` Sage Weil
0 siblings, 0 replies; 24+ messages in thread
From: Sage Weil @ 2012-06-27 15:19 UTC (permalink / raw)
To: Stefan Priebe - Profihost AG
Cc: Sam Just, Tommi Virtanen, Dan Mick, ceph-devel
On Wed, 27 Jun 2012, Stefan Priebe - Profihost AG wrote:
> THANKS a lot. This fixes it. I've merged your branch into next and I wasn't
> able to trigger the OSD crash again. So please include this in 0.48.
Excellent. Thanks for testing! This is now in next.
sage
>
> Greets
> Stefan
>
> Am 26.06.2012 20:01, schrieb Sam Just:
> > Stefan,
> >
> > Sorry for the delay, I think I've found the problem. Could you give
> > wip_ms_handle_reset_race a try?
> > -Sam
> >
> > On Tue, Jun 26, 2012 at 9:47 AM, Stefan Priebe <s.priebe@profihost.ag>
> > wrote:
> > > Am 26.06.2012 18:05, schrieb Tommi Virtanen:
> > >
> > > > On Mon, Jun 25, 2012 at 10:48 PM, Stefan Priebe <s.priebe@profihost.ag>
> > > > wrote:
> > > > >
> > > > > Strange. I just copied /core.hostname and /usr/bin/ceph-osd; no idea
> > > > > how this can happen. For building I use the provided Debian scripts.
> > > >
> > > >
> > > > Perhaps you upgraded the debs but did not restart the daemons? That
> > > > would make the on-disk executable with that name not match the
> > > > in-memory one.
> > >
> > >
> > > No, I reboot after each upgrade ;-)
> > >
> > > Right now I'm waiting for an FS fix (xfs or btrfs) and I will then
> > > reproduce the issue.
> > >
> > > Stefan
>
Thread overview: 24+ messages
-- links below jump to the message on this page --
2012-06-21 12:55 reproducable osd crash Stefan Priebe - Profihost AG
2012-06-21 13:07 ` Stefan Priebe - Profihost AG
2012-06-21 13:13 ` Stefan Priebe - Profihost AG
2012-06-21 13:23 ` Stefan Priebe - Profihost AG
2012-06-21 19:57 ` Stefan Priebe
2012-06-22 16:01 ` Stefan Priebe - Profihost AG
2012-06-22 6:43 ` Stefan Priebe
2012-06-22 22:56 ` Dan Mick
2012-06-22 23:59 ` Sam Just
2012-06-23 6:32 ` Stefan Priebe
2012-06-25 16:39 ` Dan Mick
2012-06-25 17:19 ` Stefan Priebe
2012-06-25 21:01 ` Dan Mick
2012-06-25 21:18 ` Stefan Priebe
2012-06-26 0:11 ` Dan Mick
2012-06-26 5:15 ` Stefan Priebe
2012-06-26 5:48 ` Stefan Priebe
2012-06-26 16:05 ` Tommi Virtanen
2012-06-26 16:47 ` Stefan Priebe
2012-06-26 18:01 ` Sam Just
2012-06-27 7:22 ` Stefan Priebe - Profihost AG
2012-06-27 15:19 ` Sage Weil
2012-06-23 0:26 ` Dan Mick
2012-06-23 6:32 ` Stefan Priebe