* Do not understand some terms about cluster health
@ 2011-12-22  3:47 Eric_YH_Chen
  2011-12-22 20:40 ` Gregory Farnum
  0 siblings, 1 reply; 3+ messages in thread
From: Eric_YH_Chen @ 2011-12-22  3:47 UTC (permalink / raw)
  To: ceph-devel; +Cc: Chris_YT_Huang

Hi, All

   When I type 'ceph health' to get the status of the cluster, it shows
some information.

   Would you please explain the terms?

   Ex: HEALTH_WARN 3/54 degraded (5.556%)

         What does "degraded" mean?  Is it a serious error, and how do
we fix it?

   Ex: HEALTH_WARN 264 pgs degraded, 6/60 degraded (10.000%); 3/27
unfound (11.111%)

         What does "unfound" mean?  Can we recover the data?
         Would it corrupt all the data in the rbd image and make it
permanently inaccessible?

   When I type 'ceph pg dump', it shows output like this.  Would you
please explain what "hb in" and "hb out" are?

	osdstat  kbused     kbavail      kb           hb in            hb out
	0        17300872   1884175720   1906311168   [6,7,8,9,10,11]  [6,7,8,9,10,11]
	1        16661664   1884808728   1906311168   [6,7,8,9,10,11]  [6,7,8,9,10,11]
	2        15695664   1886027584   1906311168   [6,7,8,9,10,11]  [6,7,8,9,10,11]
	3        16463440   1885005328   1906311168   [6,7,8,9,10,11]  [6,7,8,9,10,11]
	4        14101016   1888130760   1906311168   [6,7,8,9,10,11]  [6,7,8,9,10,11]
	5        14015804   1888215124   1906311168   [6,7,8,9,10,11]  [6,7,8,9,10,11]
	6        19312280   1881660776   1906311168   [0,1,2,3,4,5]    [0,1,2,3,4,5]
	7        14451992   1887521200   1906311168   [0,1,2,3,4,5]    [0,1,2,3,4,5]
	8        16336028   1885393468   1906311168   [0,1,2,3,4,5]    [0,1,2,3,4,5]
	9        16697868   1884773940   1906311168   [0,1,2,3,4,5]    [0,1,2,3,4,5]
	10       13530456   1888695776   1906311168   [0,1,2,3,4,5]    [0,1,2,3,4,5]
	11       13921908   1888307364   1906311168   [0,1,2,3,4,5]    [0,1,2,3,4,5]
	sum      188488992  22632715768  22875734016


  Also, from the latest documentation, I see that we can take a cluster
snapshot with "ceph osd cluster_snap <name>".
  Does that mean we can roll back the data from the snapshot? Is there
any documentation showing how to use it?

  Thanks a lot!






* Re: Do not understand some terms about cluster health
  2011-12-22  3:47 Do not understand some terms about cluster health Eric_YH_Chen
@ 2011-12-22 20:40 ` Gregory Farnum
  2011-12-23 19:46   ` Gregory Farnum
  0 siblings, 1 reply; 3+ messages in thread
From: Gregory Farnum @ 2011-12-22 20:40 UTC (permalink / raw)
  To: Eric_YH_Chen; +Cc: ceph-devel, Chris_YT_Huang

On Wed, Dec 21, 2011 at 7:47 PM,  <Eric_YH_Chen@wistron.com> wrote:
> Hi, All
>
>   When I type 'ceph health' to get the status of cluster, it will show
> some information.
>
>   Would you please to explain the term?
>
>   Ex: HEALTH_WARN 3/54 degraded (5.556%)
>
>         What does "degraded" mean ?  Is it a serious error and how to
> fix it ?
>
>   Ex: HEALTH_WARN 264 pgs degraded, 6/60 degraded (10.000%); 3/27
> unfound (11.111%)
There are two meanings of degraded here. The degraded PGs are those
which don't yet have as many active OSDs as they should (i.e., the
PG wants 3 OSDs to be holding it and only 2 are). The number of
degraded objects is the number of missing replicas of objects. The
difference here is that an OSD can be an active member of a PG without
holding all the objects yet. The general sequence is that you lose an
OSD, so a bunch of PGs go degraded; then the OSDs peer and bring in
a new replica, so the PG is no longer degraded, but most of its objects
still are until they get copied over.
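
As a rough illustration of the two counts, here is a toy Python model.
It is not Ceph's data structure or accounting code: the names
PlacementGroup, wanted_replicas, acting and objects are hypothetical,
and it assumes degraded objects are counted as missing replica copies,
as described above.

```python
from dataclasses import dataclass, field


@dataclass
class PlacementGroup:
    """Toy model of a PG (hypothetical; not Ceph's real structure)."""
    wanted_replicas: int
    # Names of the objects that belong to this PG.
    objects: set = field(default_factory=set)
    # Maps an active OSD id to the set of objects it currently holds.
    acting: dict = field(default_factory=dict)

    def is_degraded(self) -> bool:
        # PG-level: fewer active OSDs than the PG wants.
        return len(self.acting) < self.wanted_replicas

    def degraded_objects(self) -> int:
        # Object-level: count replica copies that should exist but do not.
        present = sum(len(held & self.objects) for held in self.acting.values())
        return len(self.objects) * self.wanted_replicas - present


if __name__ == "__main__":
    objs = {"a", "b"}
    # One of three OSDs has been lost: the PG itself is degraded.
    pg = PlacementGroup(wanted_replicas=3, objects=objs,
                        acting={0: set(objs), 1: set(objs)})
    print(pg.is_degraded(), pg.degraded_objects())
    # A new OSD joins after peering but holds nothing yet: the PG is no
    # longer degraded, while its objects still are.
    pg.acting[3] = set()
    print(pg.is_degraded(), pg.degraded_objects())
    # Recovery copies the objects over and the count drops to zero.
    pg.acting[3] = set(objs)
    print(pg.is_degraded(), pg.degraded_objects())
```

The middle step is the state where "ceph health" would report degraded
objects without reporting degraded PGs.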
Unfound objects are those which the cluster believes should exist but
can't find anywhere, either because the only copy is on a down OSD or
because there's a bug which caused the cluster to believe in
non-existent objects.
Are you using the RADOS gateway? If you are, that's probably where
your unfound objects came from; there was a long-standing accounting
bug which had a fix merged earlier this week.
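
The fractions in the health line can be checked mechanically. Below is
a small Python sketch, assuming the summary keeps the "N/M degraded
(P%)" shape shown in the question; the function name
parse_health_fractions and the regular expression are illustrative and
not part of any Ceph tool.

```python
import re

# Matches "6/60 degraded" or "3/27 unfound"; "264 pgs degraded" has no
# slash and is deliberately not matched.
FRACTION = re.compile(r"(\d+)/(\d+) (degraded|unfound)")


def parse_health_fractions(line: str) -> dict:
    """Return {label: (count, total, percent)} for each N/M entry."""
    result = {}
    for count, total, label in FRACTION.findall(line):
        count, total = int(count), int(total)
        result[label] = (count, total, round(100 * count / total, 3))
    return result


if __name__ == "__main__":
    line = ("HEALTH_WARN 264 pgs degraded, 6/60 degraded (10.000%); "
            "3/27 unfound (11.111%)")
    for label, (count, total, pct) in parse_health_fractions(line).items():
        print(f"{label}: {count} of {total} ({pct:.3f}%)")
```

For the second example in the question this gives 6 of 60 degraded
(10.000%) and 3 of 27 unfound (11.111%), matching the reported
percentages.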

>         What does "unfound" mean?  Could we recover the data?
>         Would it cause the whole data in rbd image corrupted and never
> access ?
Nope; an unfound object only blocks access to that specific object.
I'll have to look into whether rbd could trigger the same bug that RGW
was triggering.

>
>   When I type 'ceph pg dump', it would show like this.  Would you
> please explain what is "hb in" and "hb out" ?
Those are the lists of OSDs which exchange heartbeats with the given
OSD, in and out. The first group ("hb in") is the OSDs which the one
in question is keeping track of; the second ("hb out") is the OSDs
which the one in question should be reporting to.
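
To make the two lists concrete, here is a Python sketch that parses
osdstat rows in the layout from the question and checks one plausible
consistency property: if OSD A should report to OSD B ("hb out"), then
B should be keeping track of A ("hb in"). That reading of the columns
is an assumption based on the explanation above, and the helper names
parse_osdstat and unreciprocated are hypothetical. The parser also
assumes the bracketed lists contain no spaces, as in the dump shown.

```python
import ast

# Two rows copied from the 'ceph pg dump' output in the question.
SAMPLE = """\
0 17300872 1884175720 1906311168 [6,7,8,9,10,11] [6,7,8,9,10,11]
6 19312280 1881660776 1906311168 [0,1,2,3,4,5] [0,1,2,3,4,5]
"""


def parse_osdstat(text: str) -> dict:
    """Parse whitespace-separated osdstat rows into a dict keyed by OSD id."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        osd, kbused, kbavail, kb, hb_in, hb_out = line.split()
        stats[int(osd)] = {
            "kbused": int(kbused),
            "kbavail": int(kbavail),
            "kb": int(kb),
            "hb_in": ast.literal_eval(hb_in),    # OSDs this one tracks
            "hb_out": ast.literal_eval(hb_out),  # OSDs this one reports to
        }
    return stats


def unreciprocated(stats: dict) -> list:
    """List (osd, peer) pairs where osd reports to peer but peer does not track osd.

    Peers that have no row in stats are skipped.
    """
    missing = []
    for osd, row in stats.items():
        for peer in row["hb_out"]:
            if peer in stats and osd not in stats[peer]["hb_in"]:
                missing.append((osd, peer))
    return missing


if __name__ == "__main__":
    stats = parse_osdstat(SAMPLE)
    print(unreciprocated(stats))
```

In the dump from the question, OSDs 0-5 and OSDs 6-11 list each other
in both columns, so no such mismatch appears in the sample rows.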

>  And from the latest document, I know we can do the cluster snapshot by
> " ceph osd cluster_snap <name>"
>  Is that means we can rollback the data from the snapshot? Do you have
> any related document to show how to operate it?
That's the intention, but it's not a well-tested or complete solution
at this time. You shouldn't use it yet.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Do not understand some terms about cluster health
  2011-12-22 20:40 ` Gregory Farnum
@ 2011-12-23 19:46   ` Gregory Farnum
  0 siblings, 0 replies; 3+ messages in thread
From: Gregory Farnum @ 2011-12-23 19:46 UTC (permalink / raw)
  To: Eric_YH_Chen; +Cc: ceph-devel, Chris_YT_Huang

On Thu, Dec 22, 2011 at 12:40 PM, Gregory Farnum
<gregory.farnum@dreamhost.com> wrote:
> On Wed, Dec 21, 2011 at 7:47 PM,  <Eric_YH_Chen@wistron.com> wrote:
>>         What does "unfound" mean?  Could we recover the data?
>>         Would it cause the whole data in rbd image corrupted and never
>> access ?
> Nope; unfound objects will only block access to that specific object.
> I'll have to look into whether rbd could trigger the same bug that RGW
> was or not.

And the answer to this appears to be "no". If you've got unfound
objects and you aren't using the RADOS gateway, we should figure out
how it happened! Do you have any down OSDs?