* osd crash in ReplicatedPG::add_object_context_to_pg_stat(ReplicatedPG::ObjectContext*, pg_stat_t*)
@ 2012-10-11 22:17 Yann Dupont
2012-10-15 19:57 ` Samuel Just
0 siblings, 1 reply; 4+ messages in thread
From: Yann Dupont @ 2012-10-11 22:17 UTC (permalink / raw)
To: ceph-devel
Hello everybody.
I'm currently having a problem with one of my OSDs, which crashes with this trace:
ceph version 0.52 (commit:e48859474c4944d4ff201ddc9f5fd400e8898173)
1: /usr/bin/ceph-osd() [0x737879]
2: (()+0xf030) [0x7f43f0af0030]
3:
(ReplicatedPG::add_object_context_to_pg_stat(ReplicatedPG::ObjectContext*,
pg_stat_t*)+0x292) [0x555262]
4: (ReplicatedPG::recover_backfill(int)+0x1c1a) [0x55c93a]
5: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*)+0x26a)
[0x563c1a]
6: (OSD::do_recovery(PG*)+0x39d) [0x5d3c9d]
7: (OSD::RecoveryWQ::_process(PG*)+0xd) [0x6119fd]
8: (ThreadPool::worker()+0x82b) [0x7c176b]
9: (ThreadPool::WorkThread::entry()+0xd) [0x5f609d]
10: (()+0x6b50) [0x7f43f0ae7b50]
11: (clone()+0x6d) [0x7f43ef81b78d]
Restarting it gives the same message after a few seconds.
I've been watching the bug tracker but I don't see anything related.
Some information: the kernel is 3.6.1, with "standard" Debian packages
from ceph.com.
My ceph cluster had been running well and stable on 6 OSDs since June (3
datacenters, 2 with 2 nodes and 1 with 4 nodes, a replication factor of
2, and adjusted weights to try to balance data evenly). It began with the
then-current version, then 0.48, 0.49, 0.50, 0.51... The data store is on XFS.
I'm currently in the process of growing my ceph cluster from 6 nodes to
12 nodes. 11 nodes are currently in ceph, for 130 TB total. Declaring the
new OSDs went OK and the data moved "quite" OK (in fact I had some OSD
crashes - not fatal, the OSDs restarted OK - maybe related to an error in
my new nodes' network configuration that I discovered afterwards. More on
that later; I can find the traces, but I'm not sure it's related.)
When ceph was finally stable again, with HEALTH_OK, I decided to
reweight the OSDs (that was Tuesday). The operation went quite OK, but
near the end (0.085% left), one of my OSDs crashed and won't start
again.
More problematic: with this OSD down, I have 1 incomplete PG:
ceph -s
health HEALTH_WARN 86 pgs backfill; 231 pgs degraded; 4 pgs down; 15
pgs incomplete; 4 pgs peering; 134 pgs recovering; 19 pgs stuck
inactive; 455 pgs stuck unclean; recovery 2122878/23181946 degraded
(9.157%); 2321/11590973 unfound (0.020%); 1 near full osd(s)
monmap e1: 3 mons at
{chichibu=172.20.14.130:6789/0,glenesk=172.20.14.131:6789/0,karuizawa=172.20.14.133:6789/0},
election epoch 20, quorum 0,1,2 chichibu,glenesk,karuizawa
osdmap e13184: 11 osds: 10 up, 10 in
pgmap v2399093: 1728 pgs: 165 active, 1270 active+clean, 8
active+recovering+degraded, 41
active+recovering+degraded+remapped+backfill, 4 down+peering, 137
active+degraded, 3 active+clean+scrubbing, 15 incomplete, 40
active+recovering, 45 active+recovering+degraded+backfill; 44119 GB
data, 84824 GB used, 37643 GB / 119 TB avail; 2122878/23181946 degraded
(9.157%); 2321/11590973 unfound (0.020%)
mdsmap e321: 1/1/1 up {0=karuizawa=up:active}, 2 up:standby
How is this possible, as I have a replication factor of 2?
Is this a known problem?
Cheers,
* Re: osd crash in ReplicatedPG::add_object_context_to_pg_stat(ReplicatedPG::ObjectContext*, pg_stat_t*)
2012-10-11 22:17 osd crash in ReplicatedPG::add_object_context_to_pg_stat(ReplicatedPG::ObjectContext*, pg_stat_t*) Yann Dupont
@ 2012-10-15 19:57 ` Samuel Just
[not found] ` <50856854.6050505@univ-nantes.fr>
0 siblings, 1 reply; 4+ messages in thread
From: Samuel Just @ 2012-10-15 19:57 UTC (permalink / raw)
To: Yann Dupont; +Cc: ceph-devel
Do you have a coredump for the crash? Can you reproduce the crash with:
debug filestore = 20
debug osd = 20
and post the logs?
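For reference, those two settings would go in the `[osd]` section of ceph.conf on the affected node, something like:

```ini
[osd]
    debug osd = 20
    debug filestore = 20
```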
As far as the incomplete pg goes, can you post the output of
ceph pg <pgid> query
where <pgid> is the pgid of the incomplete pg (e.g. 1.34)?
Thanks
-Sam
On Thu, Oct 11, 2012 at 3:17 PM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
> Hello everybody.
>
> I'm currently having a problem with one of my OSDs, which crashes with this trace:
>
> ceph version 0.52 (commit:e48859474c4944d4ff201ddc9f5fd400e8898173)
> 1: /usr/bin/ceph-osd() [0x737879]
> 2: (()+0xf030) [0x7f43f0af0030]
> 3:
> (ReplicatedPG::add_object_context_to_pg_stat(ReplicatedPG::ObjectContext*,
> pg_stat_t*)+0x292) [0x555262]
> 4: (ReplicatedPG::recover_backfill(int)+0x1c1a) [0x55c93a]
> 5: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*)+0x26a)
> [0x563c1a]
> 6: (OSD::do_recovery(PG*)+0x39d) [0x5d3c9d]
> 7: (OSD::RecoveryWQ::_process(PG*)+0xd) [0x6119fd]
> 8: (ThreadPool::worker()+0x82b) [0x7c176b]
> 9: (ThreadPool::WorkThread::entry()+0xd) [0x5f609d]
> 10: (()+0x6b50) [0x7f43f0ae7b50]
> 11: (clone()+0x6d) [0x7f43ef81b78d]
>
> Restarting it gives the same message after a few seconds.
> I've been watching the bug tracker but I don't see anything related.
>
> Some information: the kernel is 3.6.1, with "standard" Debian packages
> from ceph.com.
>
> My ceph cluster had been running well and stable on 6 OSDs since June (3
> datacenters, 2 with 2 nodes and 1 with 4 nodes, a replication factor of 2,
> and adjusted weights to try to balance data evenly). It began with the
> then-current version, then 0.48, 0.49, 0.50, 0.51... The data store is on XFS.
>
> I'm currently in the process of growing my ceph cluster from 6 nodes to 12
> nodes. 11 nodes are currently in ceph, for 130 TB total. Declaring the new
> OSDs went OK and the data moved "quite" OK (in fact I had some OSD crashes
> - not fatal, the OSDs restarted OK - maybe related to an error in my new
> nodes' network configuration that I discovered afterwards. More on that
> later; I can find the traces, but I'm not sure it's related.)
>
> When ceph was finally stable again, with HEALTH_OK, I decided to reweight
> the OSDs (that was Tuesday). The operation went quite OK, but near the end
> (0.085% left), one of my OSDs crashed and won't start again.
>
> More problematic: with this OSD down, I have 1 incomplete PG:
>
> ceph -s
> health HEALTH_WARN 86 pgs backfill; 231 pgs degraded; 4 pgs down; 15 pgs
> incomplete; 4 pgs peering; 134 pgs recovering; 19 pgs stuck inactive; 455
> pgs stuck unclean; recovery 2122878/23181946 degraded (9.157%);
> 2321/11590973 unfound (0.020%); 1 near full osd(s)
> monmap e1: 3 mons at
> {chichibu=172.20.14.130:6789/0,glenesk=172.20.14.131:6789/0,karuizawa=172.20.14.133:6789/0},
> election epoch 20, quorum 0,1,2 chichibu,glenesk,karuizawa
> osdmap e13184: 11 osds: 10 up, 10 in
> pgmap v2399093: 1728 pgs: 165 active, 1270 active+clean, 8
> active+recovering+degraded, 41 active+recovering+degraded+remapped+backfill,
> 4 down+peering, 137 active+degraded, 3 active+clean+scrubbing, 15
> incomplete, 40 active+recovering, 45 active+recovering+degraded+backfill;
> 44119 GB data, 84824 GB used, 37643 GB / 119 TB avail; 2122878/23181946
> degraded (9.157%); 2321/11590973 unfound (0.020%)
> mdsmap e321: 1/1/1 up {0=karuizawa=up:active}, 2 up:standby
>
> How is this possible, as I have a replication factor of 2?
>
> Is this a known problem?
>
> Cheers,
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: osd crash in ReplicatedPG::add_object_context_to_pg_stat(ReplicatedPG::ObjectContext*, pg_stat_t*)
[not found] ` <50856854.6050505@univ-nantes.fr>
@ 2012-10-22 17:40 ` Yann Dupont
2012-10-22 23:33 ` Samuel Just
0 siblings, 1 reply; 4+ messages in thread
From: Yann Dupont @ 2012-10-22 17:40 UTC (permalink / raw)
To: Samuel Just, ceph-devel@vger.kernel.org
On 22/10/2012 17:37, Yann Dupont wrote:
> On 15/10/2012 21:57, Samuel Just wrote:
>> debug filestore = 20
>> debug osd = 20
> ok, just had a core
>
> you can grab it here :
> http://filex.univ-nantes.fr/get?k=xojcpgmGoN4pR1rpqf5
>
> now, I'll run with debug options
>
> cheers
>
Ok, I've collected a big log. I'll try to put it online later (I'm at
home right now).
Anyway, a quick gdb backtrace on the core shows this:
#0 0x00007f9aaace6efb in raise () from
/lib/x86_64-linux-gnu/libpthread.so.0
#1 0x0000000000737ca7 in reraise_fatal (signum=11) at
global/signal_handler.cc:58
#2 handle_fatal_signal (signum=11) at global/signal_handler.cc:104
#3 <signal handler called>
#4 std::_Rb_tree<snapid_t, std::pair<snapid_t const,
interval_set<unsigned long> >, std::_Select1st<std::pair<snapid_t const,
interval_set<unsigned long> > >, std::less<snapid_t>,
std::allocator<std::pair<snapid_t const, interval_set<unsigned long> > >
>::_M_begin (this=0x8654000, obc=0xe342000, pgstat=0x7f9a8bf97c10) at
/usr/include/c++/4.4/bits/stl_tree.h:488
#5 std::_Rb_tree<snapid_t, std::pair<snapid_t const,
interval_set<unsigned long> >, std::_Select1st<std::pair<snapid_t const,
interval_set<unsigned long> > >, std::less<snapid_t>,
std::allocator<std::pair<snapid_t const, interval_set<unsigned long> > >
>::find (this=0x8654000, obc=0xe342000, pgstat=0x7f9a8bf97c10) at
/usr/include/c++/4.4/bits/stl_tree.h:1434
#6 std::map<snapid_t, interval_set<unsigned long>, std::less<snapid_t>,
std::allocator<std::pair<snapid_t const, interval_set<unsigned long> > >
>::count (this=0x8654000, obc=0xe342000,
pgstat=0x7f9a8bf97c10) at /usr/include/c++/4.4/bits/stl_map.h:686
#7 ReplicatedPG::add_object_context_to_pg_stat (this=0x8654000,
obc=0xe342000, pgstat=0x7f9a8bf97c10) at osd/ReplicatedPG.cc:4145
#8 0x000000000055c93a in ReplicatedPG::recover_backfill
(this=0x8654000, max=<value optimized out>) at osd/ReplicatedPG.cc:6381
#9 0x0000000000563c1a in ReplicatedPG::start_recovery_ops
(this=0x8654000, max=1, prctx=<value optimized out>) at
osd/ReplicatedPG.cc:5959
#10 0x00000000005d3c9d in OSD::do_recovery (this=0x30c9000,
pg=0x8654000) at osd/OSD.cc:5121
#11 0x00000000006119fd in OSD::RecoveryWQ::_process(PG*) ()
#12 0x00000000007c176b in ThreadPool::worker (this=0x30c9568) at
common/WorkQueue.cc:54
#13 0x00000000005f609d in ThreadPool::WorkThread::entry() ()
#14 0x00007f9aaacdeb50 in start_thread () from
/lib/x86_64-linux-gnu/libpthread.so.0
#15 0x00007f9aa9a1278d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#16 0x0000000000000000 in ?? ()
Does this help you?
Cheers,
* Re: osd crash in ReplicatedPG::add_object_context_to_pg_stat(ReplicatedPG::ObjectContext*, pg_stat_t*)
2012-10-22 17:40 ` Yann Dupont
@ 2012-10-22 23:33 ` Samuel Just
0 siblings, 0 replies; 4+ messages in thread
From: Samuel Just @ 2012-10-22 23:33 UTC (permalink / raw)
To: Yann Dupont; +Cc: ceph-devel@vger.kernel.org
Yeah, I think I've seen that before, but not yet with logs. filestore
and osd logging would help greatly if it's reproducible.
I've put it in as #3386.
-Sam
On Mon, Oct 22, 2012 at 10:40 AM, Yann Dupont
<Yann.Dupont@univ-nantes.fr> wrote:
> On 22/10/2012 17:37, Yann Dupont wrote:
>
>> On 15/10/2012 21:57, Samuel Just wrote:
>>>
>>> debug filestore = 20
>>> debug osd = 20
>>
>> ok, just had a core
>>
>> you can grab it here :
>> http://filex.univ-nantes.fr/get?k=xojcpgmGoN4pR1rpqf5
>>
>> now, I'll run with debug options
>>
>> cheers
>>
> Ok, I've collected a big log. I'll try to put it online later (I'm at home
> right now).
> Anyway, a quick gdb backtrace on the core shows this:
>
> #0 0x00007f9aaace6efb in raise () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #1 0x0000000000737ca7 in reraise_fatal (signum=11) at
> global/signal_handler.cc:58
> #2 handle_fatal_signal (signum=11) at global/signal_handler.cc:104
> #3 <signal handler called>
> #4 std::_Rb_tree<snapid_t, std::pair<snapid_t const, interval_set<unsigned
> long> >, std::_Select1st<std::pair<snapid_t const, interval_set<unsigned
> long> > >, std::less<snapid_t>, std::allocator<std::pair<snapid_t const,
> interval_set<unsigned long> > > >::_M_begin (this=0x8654000, obc=0xe342000,
> pgstat=0x7f9a8bf97c10) at /usr/include/c++/4.4/bits/stl_tree.h:488
> #5 std::_Rb_tree<snapid_t, std::pair<snapid_t const, interval_set<unsigned
> long> >, std::_Select1st<std::pair<snapid_t const, interval_set<unsigned
> long> > >, std::less<snapid_t>, std::allocator<std::pair<snapid_t const,
> interval_set<unsigned long> > > >::find (this=0x8654000, obc=0xe342000,
> pgstat=0x7f9a8bf97c10) at /usr/include/c++/4.4/bits/stl_tree.h:1434
> #6 std::map<snapid_t, interval_set<unsigned long>, std::less<snapid_t>,
> std::allocator<std::pair<snapid_t const, interval_set<unsigned long> > >
>>::count (this=0x8654000, obc=0xe342000,
> pgstat=0x7f9a8bf97c10) at /usr/include/c++/4.4/bits/stl_map.h:686
> #7 ReplicatedPG::add_object_context_to_pg_stat (this=0x8654000,
> obc=0xe342000, pgstat=0x7f9a8bf97c10) at osd/ReplicatedPG.cc:4145
> #8 0x000000000055c93a in ReplicatedPG::recover_backfill (this=0x8654000,
> max=<value optimized out>) at osd/ReplicatedPG.cc:6381
> #9 0x0000000000563c1a in ReplicatedPG::start_recovery_ops (this=0x8654000,
> max=1, prctx=<value optimized out>) at osd/ReplicatedPG.cc:5959
> #10 0x00000000005d3c9d in OSD::do_recovery (this=0x30c9000, pg=0x8654000) at
> osd/OSD.cc:5121
> #11 0x00000000006119fd in OSD::RecoveryWQ::_process(PG*) ()
> #12 0x00000000007c176b in ThreadPool::worker (this=0x30c9568) at
> common/WorkQueue.cc:54
> #13 0x00000000005f609d in ThreadPool::WorkThread::entry() ()
> #14 0x00007f9aaacdeb50 in start_thread () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #15 0x00007f9aa9a1278d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #16 0x0000000000000000 in ?? ()
>
> Does this help you?
>
> Cheers,