* Problem with inconsistent PG
@ 2012-02-10  7:43 Jens Rehpöhler
  2012-02-10 22:30 ` Sage Weil
  0 siblings, 1 reply; 11+ messages in thread
From: Jens Rehpöhler @ 2012-02-10  7:43 UTC (permalink / raw)
  To: ceph-devel; +Cc: oliver.francke

Hi list,

today I've got another problem.

ceph -w shows an inconsistent PG that appeared overnight:

2012-02-10 08:38:48.701775    pg v441251: 1982 pgs: 1981 active+clean, 1
active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
GB avail
2012-02-10 08:38:49.702789    pg v441252: 1982 pgs: 1981 active+clean, 1
active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
GB avail

I've identified it with "ceph pg dump - | grep inconsistent":

109.6    141    0    0    0    463820288    111780    111780   
active+clean+inconsistent    485'7115    480'7301    [3,4]    [3,4]   
485'7061    2012-02-10 08:02:12.043986

Now I've tried to repair it with: ceph pg repair 109.6

2012-02-10 08:35:52.276325 mon <- [pg,repair,109.6]
2012-02-10 08:35:52.276776 mon.1 -> 'instructing pg 109.6 on osd.3 to
repair' (0)

but I only get the following result:

2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455420 osd.3
10.10.10.8:6801/25980 6913 : [ERR] 109.6 osd.4: soid
1ef398ce/rb.0.0.0000000000bd/head size 2736128 != known size 3145728
2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455426 osd.3
10.10.10.8:6801/25980 6914 : [ERR] 109.6 scrub 0 missing, 1 inconsistent
objects
2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455799 osd.3
10.10.10.8:6801/25980 6915 : [ERR] 109.6 scrub 1 errors

Can someone please explain to me what to do in this case and how to
recover the PG?

Thanks a lot !

Jens

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Problem with inconsistent PG
  2012-02-10  7:43 Problem with inconsistent PG Jens Rehpöhler
@ 2012-02-10 22:30 ` Sage Weil
  0 siblings, 0 replies; 11+ messages in thread
From: Sage Weil @ 2012-02-10 22:30 UTC (permalink / raw)
  To: Jens Rehpöhler; +Cc: ceph-devel, oliver.francke


On Fri, 10 Feb 2012, Jens Rehpöhler wrote:
> Hi Liste,
> 
> today i've got another problem.
> 
> ceph -w shows up with an inconsistent PG over night:
> 
> 2012-02-10 08:38:48.701775    pg v441251: 1982 pgs: 1981 active+clean, 1
> active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
> GB avail
> 2012-02-10 08:38:49.702789    pg v441252: 1982 pgs: 1981 active+clean, 1
> active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
> GB avail
> 
> I've identified it with "ceph pg dump - | grep inconsistent
> 
> 109.6    141    0    0    0    463820288    111780    111780   
> active+clean+inconsistent    485'7115    480'7301    [3,4]    [3,4]   
> 485'7061    2012-02-10 08:02:12.043986
> 
> Now I've tried to repair it with: ceph pg repair 109.6
> 
> 2012-02-10 08:35:52.276325 mon <- [pg,repair,109.6]
> 2012-02-10 08:35:52.276776 mon.1 -> 'instructing pg 109.6 on osd.3 to
> repair' (0)
> 
> but i only get the following result:
> 
> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455420 osd.3
> 10.10.10.8:6801/25980 6913 : [ERR] 109.6 osd.4: soid
> 1ef398ce/rb.0.0.0000000000bd/headsize 2736128 != known size 3145728
> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455426 osd.3
> 10.10.10.8:6801/25980 6914 : [ERR] 109.6 scrub 0 missing, 1 inconsistent
> objects
> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455799 osd.3
> 10.10.10.8:6801/25980 6915 : [ERR] 109.6 scrub 1 errors
> 
> Can someone please explain me what to do in this case and how to recover
> the pg ?

So the "fix" is just to truncate the file to the expected size, 3145728, 
by finding it in the current/ directory.  The name/path will be slightly 
weird; look for 'rb.0.0.0000000000bd'.
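
A rough sketch of that (the osd data directory is an assumption, e.g. the
/data/osdN layout mentioned elsewhere in this thread; locate the actual file
with find first):

 # on the osd holding the short replica
 find /data/osd4/current -name '*rb.0.0.0000000000bd*'
 # then truncate whatever file find reports to the expected size
 truncate -s 3145728 <path reported by find>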

The data is still suspect, though.  Did the ceph-osd restart or crash 
recently?  I would do that, repair (it should succeed), and then fsck the 
file system in that rbd image.
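
For the fsck, one option (a sketch only; pool/image names are placeholders,
and it assumes the kernel rbd client with the VM shut down) is to map the
image on a client and check it there:

 rbd map imagename -p poolname   # exposes the image as e.g. /dev/rbd0
 fsck -n /dev/rbd0               # read-only check; use the right partition if the image is partitioned
 rbd unmap /dev/rbd0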

We just fixed a bug that was causing transactions to leak across 
checkpoint/snapshot boundaries.  That could be responsible for causing all 
sorts of subtle corruptions, including this one.  It'll be included in 
v0.42 (out next week).

sage

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Problem with inconsistent PG
  2012-02-17 17:54             ` Sage Weil
@ 2012-02-17 18:13               ` Oliver Francke
  0 siblings, 0 replies; 11+ messages in thread
From: Oliver Francke @ 2012-02-17 18:13 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Well,

On 17.02.2012 at 18:54, Sage Weil wrote:

> On Fri, 17 Feb 2012, Oliver Francke wrote:
>> Well then,
>> 
>> found it via the "ceph osd dump" via the pool-id, thanks. The according customer
>> opened a ticket this morning for not being able to boot his VM after shutdown.
>> So I had to do some testdisk/fsck and tar the content into a new image.
>> 
>> I hope, there are no other "bad blocks" not being visible as "inconsistencies".
>> 
>> As these faulty images were easy detected as the boot-block was affected, how
>> big is the chance, that there are more rb..-fragments being corrupted within a image
>> in reference to what you mentioned below:
>> 
>> "...transactions to leak across checkpoint/snapshot boundaries."
>> 
>> Do we have a chance to detect it? I fear not, cause it will perhaps only be visible while
>> doing a "fsck" inside the VM?!
> 
> It is hard to say.  There is a small chance that it will trigger any time 
> ceph-osd is restarted.  The bug is fixed in the next release (which should 
> be out today), but of course upgrading involves shutting down :(.  
> Alternatively, you can cherry-pick the fixes, 
> 1009d1a016f049e19ad729a0c00a354a3956caf7 and 
> 93d7ef96316f30d3d7caefe07a5a747ce883ca2d.  v0.42 includes some encoding 
> changes that means you can upgrade but you can't downgrade again.  (These 
> encoding changes are being made so that in the future, you _can_ 
> downgrade.)
> 
> Here's what I suggest:
> 
> - don't restart any ceph-osds if you can help it
> - wait for v0.42 to come out, and wait until Monday at least
> - pause read/write traffic to the cluster with
> 
> ceph osd pause
> 
> - wait at least 30 seconds for osds to do a commit without any load.  
>   this makes it extremely unlikely you'd trigger the bug.
> - upgrade to v0.42, or restart with a patched ceph-osd.
> - unpause io with
> 
> ceph osd unpause
> 

that sounds reasonable, cool stuff ;-)

Thnx again,

Oliver.

> sage
> 
> 
> 
>> 
>> Anyway, thanks for your help and best regards,
>> 
>> Oliver.
>> 
>> Am 16.02.2012 um 19:02 schrieb Sage Weil:
>> 
>>> On Thu, 16 Feb 2012, Oliver Francke wrote:
>>>> Hi Sage,
>>>> 
>>>> thnx for the quick response,
>>>> 
>>>> Am 16.02.2012 um 18:17 schrieb Sage Weil:
>>>> 
>>>>> On Thu, 16 Feb 2012, Oliver Francke wrote:
>>>>>> Hi Sage, *,
>>>>>> 
>>>>>> your tip with truncating from below did not solve the problem. Just to recap:
>>>>>> 
>>>>>> we had two inconsistencies, which we could break down to something like:
>>>>>> 
>>>>>> rb.0.0.000000000000__head_DA680EE2
>>>>>> 
>>>>>> according to the ceph dump from below. Walking to the node with the OSD mounted on /data/osd3
>>>>>> for example, and a stupid "find ?" brings up a couple of them, so the pg number is relevant too -
>>>>>> makes sense - we went into lets say "/data/osd3/current/84.2_head/" and did a hex dump from the file, looked really
>>>>>> like the "head", in means of signs from an installed grub-loader. But a corrupted partition-table.
>>>>>> From other of these files one could do a "fdisk -l <file>" and at least a partition-table could have been
>>>>>> found.
>>>>>> Two days later we got a customers big complaint about not being able to boot his VM anymore. The point now is,
>>>>>> from such a file with name and pg, how can we identify the real file being associated with, cause there is another
>>>>>> customer with a potential problem with next reboot ( second inconsistency).
>>>>>> 
>>>>>> We also had some VM's in a big test-phase with similar problems? grub going into rescue-prompt, invalid/corrupted
>>>>>> partition tables, so all in the first "head-file"?
>>>>>> Would be cool to get some more infos? and sched some light into the structures ( myself not really being a good code-reader
>>>>>> anymore ;) ).
>>>>> 
>>>>> 'head' in this case means the object hasn't been COWed (snapshotted and 
>>>>> then overwritten), and 000000000000 means its the first 4MB block of the 
>>>>> rbd image/disk.
>>>>> 
>>>> 
>>>> yes, true,
>>>> 
>>>>> We you able to use the 'rbd info' in the previous email to identify which 
>>>>> image it is?  Is that what you mean by 'identify the real file'?
>>>>> 
>>>> 
>>>> that's the point, from the object I would like to identify the complete image location ala:
>>>> 
>>>> <pool>/<image>
>>>> 
>>>> from there I'd know, which customer's rbd disk-image is affected.
>>> 
>>> For pool, look at the pgid, in this case '109.6'.  109 is the pool id.  
>>> Look at the pool list from 'ceph osd dump' output to see which pool name 
>>> that is.
>>> 
>>> For the image, rb.0.0 is the image prefix.  Look at each rbd image in that 
>>> pool, and check for the image whose prefix matches.  e.g.,
>>> 
>>> for img in `rbd -p poolname list` ; do rbd info $img -p poolname | grep 
>>> -q rb.0.0 && echo found $img ; done
>>> 
>>> BTW, are you creating a pool per customer here?  You need to be a little 
>>> bit careful about creating large numbers of pools; the system isn't really 
>>> designed to be used that way.  You should use a pool if you have a 
>>> distinct data placement requirement (e.g., put these objects on this set 
>>> of ceph-osds).  But because of the way things work internally creating 
>>> hundreds/thousands of them won't be very efficient.
>>> 
>>> sage
>>> 
>>> 
>>>> 
>>>> Thnx for your patience,
>>>> 
>>>> Oliver.
>>>> 
>>>>> I'm not sure I understand exactly what your question is.  I would have 
>>>>> expected modifying the file with fdisk -l to work (if fdisk sees a valid 
>>>>> partition table, it should be able to write it too).
>>>>> 
>>>>> sage
>>>>> 
>>>>> 
>>>>>> 
>>>>>> Thanks in@vance and kind regards,
>>>>>> 
>>>>>> Oliver.
>>>>>> 
>>>>>> Am 13.02.2012 um 18:13 schrieb Sage Weil:
>>>>>> 
>>>>>>> On Sun, 12 Feb 2012, Jens Rehpoehler wrote:
>>>>>>> 
>>>>>>>>>> Hi Liste,
>>>>>>>>>> 
>>>>>>>>>> today i've got another problem.
>>>>>>>>>> 
>>>>>>>>>> ceph -w shows up with an inconsistent PG over night:
>>>>>>>>>> 
>>>>>>>>>> 2012-02-10 08:38:48.701775    pg v441251: 1982 pgs: 1981 active+clean, 1
>>>>>>>>>> active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
>>>>>>>>>> GB avail
>>>>>>>>>> 2012-02-10 08:38:49.702789    pg v441252: 1982 pgs: 1981 active+clean, 1
>>>>>>>>>> active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
>>>>>>>>>> GB avail
>>>>>>>>>> 
>>>>>>>>>> I've identified it with "ceph pg dump - | grep inconsistent
>>>>>>>>>> ...
>>>>>>>>>> 
>>>>>>>>>> Now I've tried to repair it with: ceph pg repair 109.6
>>>>>>>>>> 
>>>>>>>>>> 2012-02-10 08:35:52.276325 mon<- [pg,repair,109.6]
>>>>>>>>>> 2012-02-10 08:35:52.276776 mon.1 ->  'instructing pg 109.6 on osd.3 to
>>>>>>>>>> repair' (0)
>>>>>>>>>> 
>>>>>>>>>> but i only get the following result:
>>>>>>>>>> 
>>>>>>>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455420 osd.3
>>>>>>>>>> 10.10.10.8:6801/25980 6913 : [ERR] 109.6 osd.4: soid
>>>>>>>>>> 1ef398ce/rb.0.0.0000000000bd/headsize 2736128 != known size 3145728
>>>>>>>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455426 osd.3
>>>>>>>>>> 10.10.10.8:6801/25980 6914 : [ERR] 109.6 scrub 0 missing, 1 inconsistent
>>>>>>>>>> objects
>>>>>>>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455799 osd.3
>>>>>>>>>> 10.10.10.8:6801/25980 6915 : [ERR] 109.6 scrub 1 errors
>>>>>>>>>> 
>>>>>>>>>> Can someone please explain me what to do in this case and how to recover
>>>>>>>>>> the pg ?
>>>>>>>>> 
>>>>>>>>> So the "fix" is just to truncate the file to the expected size, 3145728,
>>>>>>>>> by finding it in the current/ directory.  The name/path will be slightly
>>>>>>>>> weird; look for 'rb.0.0.0000000000bd'.
>>>>>>>>> 
>>>>>>>>> The data is still suspect, though.  Did the ceph-osd restart or crash
>>>>>>>>> recently?  I would do that, repair (it should succeed), and then fsck the
>>>>>>>>> file system in that rbd image.
>>>>>>>>> 
>>>>>>>>> We just fixed a bug that was causing transactions to leak across
>>>>>>>>> checkpoint/snapshot boundaries.  That could be responsible for causing all
>>>>>>>>> sorts of subtle corruptions, including this one.  It'll be included in
>>>>>>>>> v0.42 (out next week).
>>>>>>>>> 
>>>>>>>>> sage
>>>>>>>> 
>>>>>>>> Hi Sarge,
>>>>>>>> 
>>>>>>>> no ... the osd didn't crash. I had to do some hardware maintainance and push
>>>>>>>> it
>>>>>>>> out of distribution with "ceph osd out 3". After a short while i used
>>>>>>>> "/etc/init.d/ceph stop" on that osd.
>>>>>>>> Then, after my work i've started ceph and push it in the distribution with
>>>>>>>> "ceph osd in 3".
>>>>>>> 
>>>>>>> For the bug I'm worried about, stopping the daemon and crashing are 
>>>>>>> equivalent.  In both cases, a transaction may have been only partially 
>>>>>>> included in the checkpoint.
>>>>>>> 
>>>>>>>> Could you please tell me if this is the right way to get an osd out for
>>>>>>>> maintainance ? Is there
>>>>>>>> any other thing i should do to keep data consistent ?
>>>>>>> 
>>>>>>> You followed the right procedure.  There is (hopefully, was!) just a bug.
>>>>>>> 
>>>>>>> sage
>>>>>>> 
>>>>>>> 
>>>>>>>> My structure is ->  3 MDS/MON Server on seperate Hardware Nodes an 3 OSD Nodes
>>>>>>>> with a each a total capacity
>>>>>>>> of 8 TB. Journaling is done on a separate SSD per node. The whole thing is a
>>>>>>>> data store for a kvm virtualisation
>>>>>>>> farm. The farm is accessing the data directly per rbd.
>>>>>>>> 
>>>>>>>> Thank you
>>>>>>>> 
>>>>>>>> Jens
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>> 
>>>>>>>> 
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>> 
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>> 
>>>>>> 
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> 
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> 
>>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Problem with inconsistent PG
  2012-02-17 14:00           ` Oliver Francke
@ 2012-02-17 17:54             ` Sage Weil
  2012-02-17 18:13               ` Oliver Francke
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2012-02-17 17:54 UTC (permalink / raw)
  To: Oliver Francke; +Cc: ceph-devel

On Fri, 17 Feb 2012, Oliver Francke wrote:
> Well then,
> 
> found it via the "ceph osd dump" via the pool-id, thanks. The according customer
> opened a ticket this morning for not being able to boot his VM after shutdown.
> So I had to do some testdisk/fsck and tar the content into a new image.
> 
> I hope, there are no other "bad blocks" not being visible as "inconsistencies".
> 
> As these faulty images were easy detected as the boot-block was affected, how
> big is the chance, that there are more rb..-fragments being corrupted within a image
> in reference to what you mentioned below:
> 
> "...transactions to leak across checkpoint/snapshot boundaries."
> 
> Do we have a chance to detect it? I fear not, cause it will perhaps only be visible while
> doing a "fsck" inside the VM?!

It is hard to say.  There is a small chance that it will trigger any time 
ceph-osd is restarted.  The bug is fixed in the next release (which should 
be out today), but of course upgrading involves shutting down :(.  
Alternatively, you can cherry-pick the fixes, 
1009d1a016f049e19ad729a0c00a354a3956caf7 and 
93d7ef96316f30d3d7caefe07a5a747ce883ca2d.  v0.42 includes some encoding 
changes that mean you can upgrade but you can't downgrade again.  (These 
encoding changes are being made so that in the future, you _can_ 
downgrade.)
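
If you cherry-pick, a minimal sketch on a local git checkout of the ceph tree
(build and packaging steps left out):

 git cherry-pick 1009d1a016f049e19ad729a0c00a354a3956caf7 \
                 93d7ef96316f30d3d7caefe07a5a747ce883ca2d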

Here's what I suggest:

 - don't restart any ceph-osds if you can help it
 - wait for v0.42 to come out, and wait until Monday at least
 - pause read/write traffic to the cluster with

 ceph osd pause

 - wait at least 30 seconds for osds to do a commit without any load.  
   this makes it extremely unlikely you'd trigger the bug.
 - upgrade to v0.42, or restart with a patched ceph-osd.
 - unpause io with

 ceph osd unpause
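
Put together as a rough shell sketch (the upgrade/restart step is just a
placeholder for however you deploy):

 ceph osd pause        # stop client io to the cluster
 sleep 30              # let the osds finish a commit with no load
 # upgrade to v0.42 / restart the patched ceph-osd here, e.g. via your init script
 ceph osd unpause      # resume client io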

sage



> 
> Anyway, thanks for your help and best regards,
> 
> Oliver.
> 
> Am 16.02.2012 um 19:02 schrieb Sage Weil:
> 
> > On Thu, 16 Feb 2012, Oliver Francke wrote:
> >> Hi Sage,
> >> 
> >> thnx for the quick response,
> >> 
> >> Am 16.02.2012 um 18:17 schrieb Sage Weil:
> >> 
> >>> On Thu, 16 Feb 2012, Oliver Francke wrote:
> >>>> Hi Sage, *,
> >>>> 
> >>>> your tip with truncating from below did not solve the problem. Just to recap:
> >>>> 
> >>>> we had two inconsistencies, which we could break down to something like:
> >>>> 
> >>>> rb.0.0.000000000000__head_DA680EE2
> >>>> 
> >>>> according to the ceph dump from below. Walking to the node with the OSD mounted on /data/osd3
> >>>> for example, and a stupid "find ?" brings up a couple of them, so the pg number is relevant too -
> >>>> makes sense - we went into lets say "/data/osd3/current/84.2_head/" and did a hex dump from the file, looked really
> >>>> like the "head", in means of signs from an installed grub-loader. But a corrupted partition-table.
> >>>> From other of these files one could do a "fdisk -l <file>" and at least a partition-table could have been
> >>>> found.
> >>>> Two days later we got a customers big complaint about not being able to boot his VM anymore. The point now is,
> >>>> from such a file with name and pg, how can we identify the real file being associated with, cause there is another
> >>>> customer with a potential problem with next reboot ( second inconsistency).
> >>>> 
> >>>> We also had some VM's in a big test-phase with similar problems? grub going into rescue-prompt, invalid/corrupted
> >>>> partition tables, so all in the first "head-file"?
> >>>> Would be cool to get some more infos? and sched some light into the structures ( myself not really being a good code-reader
> >>>> anymore ;) ).
> >>> 
> >>> 'head' in this case means the object hasn't been COWed (snapshotted and 
> >>> then overwritten), and 000000000000 means its the first 4MB block of the 
> >>> rbd image/disk.
> >>> 
> >> 
> >> yes, true,
> >> 
> >>> We you able to use the 'rbd info' in the previous email to identify which 
> >>> image it is?  Is that what you mean by 'identify the real file'?
> >>> 
> >> 
> >> that's the point, from the object I would like to identify the complete image location ala:
> >> 
> >> <pool>/<image>
> >> 
> >> from there I'd know, which customer's rbd disk-image is affected.
> > 
> > For pool, look at the pgid, in this case '109.6'.  109 is the pool id.  
> > Look at the pool list from 'ceph osd dump' output to see which pool name 
> > that is.
> > 
> > For the image, rb.0.0 is the image prefix.  Look at each rbd image in that 
> > pool, and check for the image whose prefix matches.  e.g.,
> > 
> > for img in `rbd -p poolname list` ; do rbd info $img -p poolname | grep 
> > -q rb.0.0 && echo found $img ; done
> > 
> > BTW, are you creating a pool per customer here?  You need to be a little 
> > bit careful about creating large numbers of pools; the system isn't really 
> > designed to be used that way.  You should use a pool if you have a 
> > distinct data placement requirement (e.g., put these objects on this set 
> > of ceph-osds).  But because of the way things work internally creating 
> > hundreds/thousands of them won't be very efficient.
> > 
> > sage
> > 
> > 
> >> 
> >> Thnx for your patience,
> >> 
> >> Oliver.
> >> 
> >>> I'm not sure I understand exactly what your question is.  I would have 
> >>> expected modifying the file with fdisk -l to work (if fdisk sees a valid 
> >>> partition table, it should be able to write it too).
> >>> 
> >>> sage
> >>> 
> >>> 
> >>>> 
> >>>> Thanks in@vance and kind regards,
> >>>> 
> >>>> Oliver.
> >>>> 
> >>>> Am 13.02.2012 um 18:13 schrieb Sage Weil:
> >>>> 
> >>>>> On Sun, 12 Feb 2012, Jens Rehpoehler wrote:
> >>>>> 
> >>>>>>>> Hi Liste,
> >>>>>>>> 
> >>>>>>>> today i've got another problem.
> >>>>>>>> 
> >>>>>>>> ceph -w shows up with an inconsistent PG over night:
> >>>>>>>> 
> >>>>>>>> 2012-02-10 08:38:48.701775    pg v441251: 1982 pgs: 1981 active+clean, 1
> >>>>>>>> active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
> >>>>>>>> GB avail
> >>>>>>>> 2012-02-10 08:38:49.702789    pg v441252: 1982 pgs: 1981 active+clean, 1
> >>>>>>>> active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
> >>>>>>>> GB avail
> >>>>>>>> 
> >>>>>>>> I've identified it with "ceph pg dump - | grep inconsistent
> >>>>>>>> ...
> >>>>>>>> 
> >>>>>>>> Now I've tried to repair it with: ceph pg repair 109.6
> >>>>>>>> 
> >>>>>>>> 2012-02-10 08:35:52.276325 mon<- [pg,repair,109.6]
> >>>>>>>> 2012-02-10 08:35:52.276776 mon.1 ->  'instructing pg 109.6 on osd.3 to
> >>>>>>>> repair' (0)
> >>>>>>>> 
> >>>>>>>> but i only get the following result:
> >>>>>>>> 
> >>>>>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455420 osd.3
> >>>>>>>> 10.10.10.8:6801/25980 6913 : [ERR] 109.6 osd.4: soid
> >>>>>>>> 1ef398ce/rb.0.0.0000000000bd/headsize 2736128 != known size 3145728
> >>>>>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455426 osd.3
> >>>>>>>> 10.10.10.8:6801/25980 6914 : [ERR] 109.6 scrub 0 missing, 1 inconsistent
> >>>>>>>> objects
> >>>>>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455799 osd.3
> >>>>>>>> 10.10.10.8:6801/25980 6915 : [ERR] 109.6 scrub 1 errors
> >>>>>>>> 
> >>>>>>>> Can someone please explain me what to do in this case and how to recover
> >>>>>>>> the pg ?
> >>>>>>> 
> >>>>>>> So the "fix" is just to truncate the file to the expected size, 3145728,
> >>>>>>> by finding it in the current/ directory.  The name/path will be slightly
> >>>>>>> weird; look for 'rb.0.0.0000000000bd'.
> >>>>>>> 
> >>>>>>> The data is still suspect, though.  Did the ceph-osd restart or crash
> >>>>>>> recently?  I would do that, repair (it should succeed), and then fsck the
> >>>>>>> file system in that rbd image.
> >>>>>>> 
> >>>>>>> We just fixed a bug that was causing transactions to leak across
> >>>>>>> checkpoint/snapshot boundaries.  That could be responsible for causing all
> >>>>>>> sorts of subtle corruptions, including this one.  It'll be included in
> >>>>>>> v0.42 (out next week).
> >>>>>>> 
> >>>>>>> sage
> >>>>>> 
> >>>>>> Hi Sarge,
> >>>>>> 
> >>>>>> no ... the osd didn't crash. I had to do some hardware maintainance and push
> >>>>>> it
> >>>>>> out of distribution with "ceph osd out 3". After a short while i used
> >>>>>> "/etc/init.d/ceph stop" on that osd.
> >>>>>> Then, after my work i've started ceph and push it in the distribution with
> >>>>>> "ceph osd in 3".
> >>>>> 
> >>>>> For the bug I'm worried about, stopping the daemon and crashing are 
> >>>>> equivalent.  In both cases, a transaction may have been only partially 
> >>>>> included in the checkpoint.
> >>>>> 
> >>>>>> Could you please tell me if this is the right way to get an osd out for
> >>>>>> maintainance ? Is there
> >>>>>> any other thing i should do to keep data consistent ?
> >>>>> 
> >>>>> You followed the right procedure.  There is (hopefully, was!) just a bug.
> >>>>> 
> >>>>> sage
> >>>>> 
> >>>>> 
> >>>>>> My structure is ->  3 MDS/MON Server on seperate Hardware Nodes an 3 OSD Nodes
> >>>>>> with a each a total capacity
> >>>>>> of 8 TB. Journaling is done on a separate SSD per node. The whole thing is a
> >>>>>> data store for a kvm virtualisation
> >>>>>> farm. The farm is accessing the data directly per rbd.
> >>>>>> 
> >>>>>> Thank you
> >>>>>> 
> >>>>>> Jens
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> --
> >>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>>>>> the body of a message to majordomo@vger.kernel.org
> >>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>>>> 
> >>>>>> 
> >>>>> --
> >>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>>>> the body of a message to majordomo@vger.kernel.org
> >>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>> 
> >>>> --
> >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>>> the body of a message to majordomo@vger.kernel.org
> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>> 
> >>>> 
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>> the body of a message to majordomo@vger.kernel.org
> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> 
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> 
> >> 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Problem with inconsistent PG
  2012-02-16 18:02         ` Sage Weil
@ 2012-02-17 14:00           ` Oliver Francke
  2012-02-17 17:54             ` Sage Weil
  0 siblings, 1 reply; 11+ messages in thread
From: Oliver Francke @ 2012-02-17 14:00 UTC (permalink / raw)
  To: ceph-devel

Well then,

found it via "ceph osd dump" and the pool id, thanks. The affected customer
opened a ticket this morning because he could not boot his VM after shutdown.
So I had to do some testdisk/fsck and tar the content into a new image.

I hope there are no other "bad blocks" that are not visible as "inconsistencies".

As these faulty images were easy to detect because the boot block was affected, how
big is the chance that there are more rb..-fragments corrupted within an image,
in reference to what you mentioned below:

"...transactions to leak across checkpoint/snapshot boundaries."

Do we have a chance to detect it? I fear not, because it will perhaps only be visible
while doing an "fsck" inside the VM?!

Anyway, thanks for your help and best regards,

Oliver.

On 16.02.2012 at 19:02, Sage Weil wrote:

> On Thu, 16 Feb 2012, Oliver Francke wrote:
>> Hi Sage,
>> 
>> thnx for the quick response,
>> 
>> Am 16.02.2012 um 18:17 schrieb Sage Weil:
>> 
>>> On Thu, 16 Feb 2012, Oliver Francke wrote:
>>>> Hi Sage, *,
>>>> 
>>>> your tip with truncating from below did not solve the problem. Just to recap:
>>>> 
>>>> we had two inconsistencies, which we could break down to something like:
>>>> 
>>>> rb.0.0.000000000000__head_DA680EE2
>>>> 
>>>> according to the ceph dump from below. Walking to the node with the OSD mounted on /data/osd3
>>>> for example, and a stupid "find ?" brings up a couple of them, so the pg number is relevant too -
>>>> makes sense - we went into lets say "/data/osd3/current/84.2_head/" and did a hex dump from the file, looked really
>>>> like the "head", in means of signs from an installed grub-loader. But a corrupted partition-table.
>>>> From other of these files one could do a "fdisk -l <file>" and at least a partition-table could have been
>>>> found.
>>>> Two days later we got a customers big complaint about not being able to boot his VM anymore. The point now is,
>>>> from such a file with name and pg, how can we identify the real file being associated with, cause there is another
>>>> customer with a potential problem with next reboot ( second inconsistency).
>>>> 
>>>> We also had some VM's in a big test-phase with similar problems? grub going into rescue-prompt, invalid/corrupted
>>>> partition tables, so all in the first "head-file"?
>>>> Would be cool to get some more infos? and sched some light into the structures ( myself not really being a good code-reader
>>>> anymore ;) ).
>>> 
>>> 'head' in this case means the object hasn't been COWed (snapshotted and 
>>> then overwritten), and 000000000000 means its the first 4MB block of the 
>>> rbd image/disk.
>>> 
>> 
>> yes, true,
>> 
>>> We you able to use the 'rbd info' in the previous email to identify which 
>>> image it is?  Is that what you mean by 'identify the real file'?
>>> 
>> 
>> that's the point, from the object I would like to identify the complete image location ala:
>> 
>> <pool>/<image>
>> 
>> from there I'd know, which customer's rbd disk-image is affected.
> 
> For pool, look at the pgid, in this case '109.6'.  109 is the pool id.  
> Look at the pool list from 'ceph osd dump' output to see which pool name 
> that is.
> 
> For the image, rb.0.0 is the image prefix.  Look at each rbd image in that 
> pool, and check for the image whose prefix matches.  e.g.,
> 
> for img in `rbd -p poolname list` ; do rbd info $img -p poolname | grep 
> -q rb.0.0 && echo found $img ; done
> 
> BTW, are you creating a pool per customer here?  You need to be a little 
> bit careful about creating large numbers of pools; the system isn't really 
> designed to be used that way.  You should use a pool if you have a 
> distinct data placement requirement (e.g., put these objects on this set 
> of ceph-osds).  But because of the way things work internally creating 
> hundreds/thousands of them won't be very efficient.
> 
> sage
> 
> 
>> 
>> Thnx for your patience,
>> 
>> Oliver.
>> 
>>> I'm not sure I understand exactly what your question is.  I would have 
>>> expected modifying the file with fdisk -l to work (if fdisk sees a valid 
>>> partition table, it should be able to write it too).
>>> 
>>> sage
>>> 
>>> 
>>>> 
>>>> Thanks in@vance and kind regards,
>>>> 
>>>> Oliver.
>>>> 
>>>> Am 13.02.2012 um 18:13 schrieb Sage Weil:
>>>> 
>>>>> On Sun, 12 Feb 2012, Jens Rehpoehler wrote:
>>>>> 
>>>>>>>> Hi Liste,
>>>>>>>> 
>>>>>>>> today i've got another problem.
>>>>>>>> 
>>>>>>>> ceph -w shows up with an inconsistent PG over night:
>>>>>>>> 
>>>>>>>> 2012-02-10 08:38:48.701775    pg v441251: 1982 pgs: 1981 active+clean, 1
>>>>>>>> active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
>>>>>>>> GB avail
>>>>>>>> 2012-02-10 08:38:49.702789    pg v441252: 1982 pgs: 1981 active+clean, 1
>>>>>>>> active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
>>>>>>>> GB avail
>>>>>>>> 
>>>>>>>> I've identified it with "ceph pg dump - | grep inconsistent
>>>>>>>> ...
>>>>>>>> 
>>>>>>>> Now I've tried to repair it with: ceph pg repair 109.6
>>>>>>>> 
>>>>>>>> 2012-02-10 08:35:52.276325 mon<- [pg,repair,109.6]
>>>>>>>> 2012-02-10 08:35:52.276776 mon.1 ->  'instructing pg 109.6 on osd.3 to
>>>>>>>> repair' (0)
>>>>>>>> 
>>>>>>>> but i only get the following result:
>>>>>>>> 
>>>>>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455420 osd.3
>>>>>>>> 10.10.10.8:6801/25980 6913 : [ERR] 109.6 osd.4: soid
>>>>>>>> 1ef398ce/rb.0.0.0000000000bd/headsize 2736128 != known size 3145728
>>>>>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455426 osd.3
>>>>>>>> 10.10.10.8:6801/25980 6914 : [ERR] 109.6 scrub 0 missing, 1 inconsistent
>>>>>>>> objects
>>>>>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455799 osd.3
>>>>>>>> 10.10.10.8:6801/25980 6915 : [ERR] 109.6 scrub 1 errors
>>>>>>>> 
>>>>>>>> Can someone please explain me what to do in this case and how to recover
>>>>>>>> the pg ?
>>>>>>> 
>>>>>>> So the "fix" is just to truncate the file to the expected size, 3145728,
>>>>>>> by finding it in the current/ directory.  The name/path will be slightly
>>>>>>> weird; look for 'rb.0.0.0000000000bd'.
>>>>>>> 
>>>>>>> The data is still suspect, though.  Did the ceph-osd restart or crash
>>>>>>> recently?  I would do that, repair (it should succeed), and then fsck the
>>>>>>> file system in that rbd image.
>>>>>>> 
>>>>>>> We just fixed a bug that was causing transactions to leak across
>>>>>>> checkpoint/snapshot boundaries.  That could be responsible for causing all
>>>>>>> sorts of subtle corruptions, including this one.  It'll be included in
>>>>>>> v0.42 (out next week).
>>>>>>> 
>>>>>>> sage
>>>>>> 
>>>>>> Hi Sarge,
>>>>>> 
>>>>>> no ... the osd didn't crash. I had to do some hardware maintainance and push
>>>>>> it
>>>>>> out of distribution with "ceph osd out 3". After a short while i used
>>>>>> "/etc/init.d/ceph stop" on that osd.
>>>>>> Then, after my work i've started ceph and push it in the distribution with
>>>>>> "ceph osd in 3".
>>>>> 
>>>>> For the bug I'm worried about, stopping the daemon and crashing are 
>>>>> equivalent.  In both cases, a transaction may have been only partially 
>>>>> included in the checkpoint.
>>>>> 
>>>>>> Could you please tell me if this is the right way to get an osd out for
>>>>>> maintainance ? Is there
>>>>>> any other thing i should do to keep data consistent ?
>>>>> 
>>>>> You followed the right procedure.  There is (hopefully, was!) just a bug.
>>>>> 
>>>>> sage
>>>>> 
>>>>> 
>>>>>> My structure is ->  3 MDS/MON Server on seperate Hardware Nodes an 3 OSD Nodes
>>>>>> with a each a total capacity
>>>>>> of 8 TB. Journaling is done on a separate SSD per node. The whole thing is a
>>>>>> data store for a kvm virtualisation
>>>>>> farm. The farm is accessing the data directly per rbd.
>>>>>> 
>>>>>> Thank you
>>>>>> 
>>>>>> Jens
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>> 
>>>>>> 
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> 
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> 
>>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Problem with inconsistent PG
  2012-02-16 17:53       ` Oliver Francke
@ 2012-02-16 18:02         ` Sage Weil
  2012-02-17 14:00           ` Oliver Francke
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2012-02-16 18:02 UTC (permalink / raw)
  To: Oliver Francke; +Cc: Jens Rehpoehler, ceph-devel

On Thu, 16 Feb 2012, Oliver Francke wrote:
> Hi Sage,
> 
> thnx for the quick response,
> 
> Am 16.02.2012 um 18:17 schrieb Sage Weil:
> 
> > On Thu, 16 Feb 2012, Oliver Francke wrote:
> >> Hi Sage, *,
> >> 
> >> your tip with truncating from below did not solve the problem. Just to recap:
> >> 
> >> we had two inconsistencies, which we could break down to something like:
> >> 
> >> rb.0.0.000000000000__head_DA680EE2
> >> 
> >> according to the ceph dump from below. Walking to the node with the OSD mounted on /data/osd3
> >> for example, and a stupid "find ?" brings up a couple of them, so the pg number is relevant too -
> >> makes sense - we went into lets say "/data/osd3/current/84.2_head/" and did a hex dump from the file, looked really
> >> like the "head", in means of signs from an installed grub-loader. But a corrupted partition-table.
> >> From other of these files one could do a "fdisk -l <file>" and at least a partition-table could have been
> >> found.
> >> Two days later we got a customers big complaint about not being able to boot his VM anymore. The point now is,
> >> from such a file with name and pg, how can we identify the real file being associated with, cause there is another
> >> customer with a potential problem with next reboot ( second inconsistency).
> >> 
> >> We also had some VM's in a big test-phase with similar problems? grub going into rescue-prompt, invalid/corrupted
> >> partition tables, so all in the first "head-file"?
> >> Would be cool to get some more infos? and sched some light into the structures ( myself not really being a good code-reader
> >> anymore ;) ).
> > 
> > 'head' in this case means the object hasn't been COWed (snapshotted and 
> > then overwritten), and 000000000000 means its the first 4MB block of the 
> > rbd image/disk.
> > 
> 
> yes, true,
> 
> > We you able to use the 'rbd info' in the previous email to identify which 
> > image it is?  Is that what you mean by 'identify the real file'?
> > 
> 
> that's the point, from the object I would like to identify the complete image location ala:
> 
> <pool>/<image>
> 
> from there I'd know, which customer's rbd disk-image is affected.

For pool, look at the pgid, in this case '109.6'.  109 is the pool id.  
Look at the pool list from 'ceph osd dump' output to see which pool name 
that is.
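
For example (the exact 'ceph osd dump' output format varies between versions):

 ceph osd dump | grep 'pool 109'   # the matching line shows the pool name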

For the image, rb.0.0 is the image prefix.  Look at each rbd image in that 
pool, and check for the image whose prefix matches.  e.g.,

 for img in `rbd -p poolname list` ; do
     rbd info $img -p poolname | grep -q rb.0.0 && echo found $img
 done

BTW, are you creating a pool per customer here?  You need to be a little 
bit careful about creating large numbers of pools; the system isn't really 
designed to be used that way.  You should use a pool if you have a 
distinct data placement requirement (e.g., put these objects on this set 
of ceph-osds).  But because of the way things work internally creating 
hundreds/thousands of them won't be very efficient.

sage


> 
> Thnx for your patience,
> 
> Oliver.
> 
> > I'm not sure I understand exactly what your question is.  I would have 
> > expected modifying the file with fdisk -l to work (if fdisk sees a valid 
> > partition table, it should be able to write it too).
> > 
> > sage
> > 
> > 
> >> 
> >> Thanks in@vance and kind regards,
> >> 
> >> Oliver.
> >> 
> >> Am 13.02.2012 um 18:13 schrieb Sage Weil:
> >> 
> >>> On Sun, 12 Feb 2012, Jens Rehpoehler wrote:
> >>> 
> >>>>>> Hi Liste,
> >>>>>> 
> >>>>>> today i've got another problem.
> >>>>>> 
> >>>>>> ceph -w shows up with an inconsistent PG over night:
> >>>>>> 
> >>>>>> 2012-02-10 08:38:48.701775    pg v441251: 1982 pgs: 1981 active+clean, 1
> >>>>>> active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
> >>>>>> GB avail
> >>>>>> 2012-02-10 08:38:49.702789    pg v441252: 1982 pgs: 1981 active+clean, 1
> >>>>>> active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
> >>>>>> GB avail
> >>>>>> 
> >>>>>> I've identified it with "ceph pg dump - | grep inconsistent
> >>>>>> 
> >>>>>> 109.6    141    0    0    0    463820288    111780    111780
> >>>>>> active+clean+inconsistent    485'7115    480'7301    [3
> >>>>>> <http://marc.info/?l=ceph-devel&m=132891306919981&w=2#3>,4
> >>>>>> <http://marc.info/?l=ceph-devel&m=132891306919981&w=2#4>]    [3
> >>>>>> <http://marc.info/?l=ceph-devel&m=132891306919981&w=2#3>,4
> >>>>>> <http://marc.info/?l=ceph-devel&m=132891306919981&w=2#4>]
> >>>>>> 485'7061    2012-02-10 08:02:12.043986
> >>>>>> 
> >>>>>> Now I've tried to repair it with: ceph pg repair 109.6
> >>>>>> 
> >>>>>> 2012-02-10 08:35:52.276325 mon<- [pg,repair,109.6]
> >>>>>> 2012-02-10 08:35:52.276776 mon.1 ->  'instructing pg 109.6 on osd.3 to
> >>>>>> repair' (0)
> >>>>>> 
> >>>>>> but i only get the following result:
> >>>>>> 
> >>>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455420 osd.3
> >>>>>> 10.10.10.8:6801/25980 6913 : [ERR] 109.6 osd.4: soid
> >>>>>> 1ef398ce/rb.0.0.0000000000bd/headsize 2736128 != known size 3145728
> >>>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455426 osd.3
> >>>>>> 10.10.10.8:6801/25980 6914 : [ERR] 109.6 scrub 0 missing, 1 inconsistent
> >>>>>> objects
> >>>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455799 osd.3
> >>>>>> 10.10.10.8:6801/25980 6915 : [ERR] 109.6 scrub 1 errors
> >>>>>> 
> >>>>>> Can someone please explain me what to do in this case and how to recover
> >>>>>> the pg ?
> >>>>> 
> >>>>> So the "fix" is just to truncate the file to the expected size, 3145728,
> >>>>> by finding it in the current/ directory.  The name/path will be slightly
> >>>>> weird; look for 'rb.0.0.0000000000bd'.
> >>>>> 
> >>>>> The data is still suspect, though.  Did the ceph-osd restart or crash
> >>>>> recently?  I would do that, repair (it should succeed), and then fsck the
> >>>>> file system in that rbd image.
> >>>>> 
> >>>>> We just fixed a bug that was causing transactions to leak across
> >>>>> checkpoint/snapshot boundaries.  That could be responsible for causing all
> >>>>> sorts of subtle corruptions, including this one.  It'll be included in
> >>>>> v0.42 (out next week).
> >>>>> 
> >>>>> sage
> >>>> 
> >>>> Hi Sarge,
> >>>> 
> >>>> no ... the osd didn't crash. I had to do some hardware maintainance and push
> >>>> it
> >>>> out of distribution with "ceph osd out 3". After a short while i used
> >>>> "/etc/init.d/ceph stop" on that osd.
> >>>> Then, after my work i've started ceph and push it in the distribution with
> >>>> "ceph osd in 3".
> >>> 
> >>> For the bug I'm worried about, stopping the daemon and crashing are 
> >>> equivalent.  In both cases, a transaction may have been only partially 
> >>> included in the checkpoint.
> >>> 
> >>>> Could you please tell me if this is the right way to get an osd out for
> >>>> maintainance ? Is there
> >>>> any other thing i should do to keep data consistent ?
> >>> 
> >>> You followed the right procedure.  There is (hopefully, was!) just a bug.
> >>> 
> >>> sage
> >>> 
> >>> 
> >>>> My structure is ->  3 MDS/MON Server on seperate Hardware Nodes an 3 OSD Nodes
> >>>> with a each a total capacity
> >>>> of 8 TB. Journaling is done on a separate SSD per node. The whole thing is a
> >>>> data store for a kvm virtualisation
> >>>> farm. The farm is accessing the data directly per rbd.
> >>>> 
> >>>> Thank you
> >>>> 
> >>>> Jens
> >>>> 
> >>>> 
> >>>> 
> >>>> 
> >>>> --
> >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>>> the body of a message to majordomo@vger.kernel.org
> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>> 
> >>>> 
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>> the body of a message to majordomo@vger.kernel.org
> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> 
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> 
> >> 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Problem with inconsistent PG
  2012-02-16 17:17     ` Sage Weil
@ 2012-02-16 17:53       ` Oliver Francke
  2012-02-16 18:02         ` Sage Weil
  0 siblings, 1 reply; 11+ messages in thread
From: Oliver Francke @ 2012-02-16 17:53 UTC (permalink / raw)
  To: Sage Weil; +Cc: Jens Rehpoehler, ceph-devel

Hi Sage,

thnx for the quick response,

On 16.02.2012 at 18:17, Sage Weil wrote:

> On Thu, 16 Feb 2012, Oliver Francke wrote:
>> Hi Sage, *,
>> 
>> your tip with truncating from below did not solve the problem. Just to recap:
>> 
>> we had two inconsistencies, which we could break down to something like:
>> 
>> rb.0.0.000000000000__head_DA680EE2
>> 
>> according to the ceph dump from below. Walking to the node with the OSD mounted on /data/osd3
>> for example, and a stupid "find ?" brings up a couple of them, so the pg number is relevant too -
>> makes sense - we went into lets say "/data/osd3/current/84.2_head/" and did a hex dump from the file, looked really
>> like the "head", in means of signs from an installed grub-loader. But a corrupted partition-table.
>> From other of these files one could do a "fdisk -l <file>" and at least a partition-table could have been
>> found.
>> Two days later we got a customers big complaint about not being able to boot his VM anymore. The point now is,
>> from such a file with name and pg, how can we identify the real file being associated with, cause there is another
>> customer with a potential problem with next reboot ( second inconsistency).
>> 
>> We also had some VM's in a big test-phase with similar problems? grub going into rescue-prompt, invalid/corrupted
>> partition tables, so all in the first "head-file"?
>> Would be cool to get some more infos? and sched some light into the structures ( myself not really being a good code-reader
>> anymore ;) ).
> 
> 'head' in this case means the object hasn't been COWed (snapshotted and 
> then overwritten), and 000000000000 means its the first 4MB block of the 
> rbd image/disk.
> 

yes, true,

> We you able to use the 'rbd info' in the previous email to identify which 
> image it is?  Is that what you mean by 'identify the real file'?
> 

that's the point: from the object I would like to identify the complete image location, a la:

<pool>/<image>

from there I'd know which customer's rbd disk-image is affected.

Thnx for your patience,

Oliver.

> I'm not sure I understand exactly what your question is.  I would have 
> expected modifying the file with fdisk -l to work (if fdisk sees a valid 
> partition table, it should be able to write it too).
> 
> sage
> 
> 
>> 
>> Thanks in@vance and kind regards,
>> 
>> Oliver.
>> 
>> Am 13.02.2012 um 18:13 schrieb Sage Weil:
>> 
>>> On Sun, 12 Feb 2012, Jens Rehpoehler wrote:
>>> 
>>>>>> Hi Liste,
>>>>>> 
>>>>>> today i've got another problem.
>>>>>> 
>>>>>> ceph -w shows up with an inconsistent PG over night:
>>>>>> 
>>>>>> 2012-02-10 08:38:48.701775    pg v441251: 1982 pgs: 1981 active+clean, 1
>>>>>> active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
>>>>>> GB avail
>>>>>> 2012-02-10 08:38:49.702789    pg v441252: 1982 pgs: 1981 active+clean, 1
>>>>>> active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
>>>>>> GB avail
>>>>>> 
>>>>>> I've identified it with "ceph pg dump - | grep inconsistent
>>>>>> 
>>>>>> 109.6    141    0    0    0    463820288    111780    111780
>>>>>> active+clean+inconsistent    485'7115    480'7301    [3
>>>>>> <http://marc.info/?l=ceph-devel&m=132891306919981&w=2#3>,4
>>>>>> <http://marc.info/?l=ceph-devel&m=132891306919981&w=2#4>]    [3
>>>>>> <http://marc.info/?l=ceph-devel&m=132891306919981&w=2#3>,4
>>>>>> <http://marc.info/?l=ceph-devel&m=132891306919981&w=2#4>]
>>>>>> 485'7061    2012-02-10 08:02:12.043986
>>>>>> 
>>>>>> Now I've tried to repair it with: ceph pg repair 109.6
>>>>>> 
>>>>>> 2012-02-10 08:35:52.276325 mon<- [pg,repair,109.6]
>>>>>> 2012-02-10 08:35:52.276776 mon.1 ->  'instructing pg 109.6 on osd.3 to
>>>>>> repair' (0)
>>>>>> 
>>>>>> but i only get the following result:
>>>>>> 
>>>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455420 osd.3
>>>>>> 10.10.10.8:6801/25980 6913 : [ERR] 109.6 osd.4: soid
>>>>>> 1ef398ce/rb.0.0.0000000000bd/headsize 2736128 != known size 3145728
>>>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455426 osd.3
>>>>>> 10.10.10.8:6801/25980 6914 : [ERR] 109.6 scrub 0 missing, 1 inconsistent
>>>>>> objects
>>>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455799 osd.3
>>>>>> 10.10.10.8:6801/25980 6915 : [ERR] 109.6 scrub 1 errors
>>>>>> 
>>>>>> Can someone please explain me what to do in this case and how to recover
>>>>>> the pg ?
>>>>> 
>>>>> So the "fix" is just to truncate the file to the expected size, 3145728,
>>>>> by finding it in the current/ directory.  The name/path will be slightly
>>>>> weird; look for 'rb.0.0.0000000000bd'.
>>>>> 
>>>>> The data is still suspect, though.  Did the ceph-osd restart or crash
>>>>> recently?  I would do that, repair (it should succeed), and then fsck the
>>>>> file system in that rbd image.
>>>>> 
>>>>> We just fixed a bug that was causing transactions to leak across
>>>>> checkpoint/snapshot boundaries.  That could be responsible for causing all
>>>>> sorts of subtle corruptions, including this one.  It'll be included in
>>>>> v0.42 (out next week).
>>>>> 
>>>>> sage
>>>> 
>>>> Hi Sarge,
>>>> 
>>>> no ... the osd didn't crash. I had to do some hardware maintainance and push
>>>> it
>>>> out of distribution with "ceph osd out 3". After a short while i used
>>>> "/etc/init.d/ceph stop" on that osd.
>>>> Then, after my work i've started ceph and push it in the distribution with
>>>> "ceph osd in 3".
>>> 
>>> For the bug I'm worried about, stopping the daemon and crashing are 
>>> equivalent.  In both cases, a transaction may have been only partially 
>>> included in the checkpoint.
>>> 
>>>> Could you please tell me if this is the right way to get an osd out for
>>>> maintainance ? Is there
>>>> any other thing i should do to keep data consistent ?
>>> 
>>> You followed the right procedure.  There is (hopefully, was!) just a bug.
>>> 
>>> sage
>>> 
>>> 
>>>> My structure is ->  3 MDS/MON Server on seperate Hardware Nodes an 3 OSD Nodes
>>>> with a each a total capacity
>>>> of 8 TB. Journaling is done on a separate SSD per node. The whole thing is a
>>>> data store for a kvm virtualisation
>>>> farm. The farm is accessing the data directly per rbd.
>>>> 
>>>> Thank you
>>>> 
>>>> Jens
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> 
>>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Problem with inconsistent PG
  2012-02-16 14:42   ` Oliver Francke
@ 2012-02-16 17:17     ` Sage Weil
  2012-02-16 17:53       ` Oliver Francke
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2012-02-16 17:17 UTC (permalink / raw)
  To: Oliver Francke; +Cc: Jens Rehpoehler, ceph-devel

On Thu, 16 Feb 2012, Oliver Francke wrote:
> Hi Sage, *,
> 
> your tip with truncating from below did not solve the problem. Just to recap:
> 
> we had two inconsistencies, which we could break down to something like:
> 
> rb.0.0.000000000000__head_DA680EE2
> 
> according to the ceph dump from below. Walking to the node with the OSD mounted on /data/osd3
> for example, and a stupid "find ?" brings up a couple of them, so the pg number is relevant too -
> makes sense - we went into lets say "/data/osd3/current/84.2_head/" and did a hex dump from the file, looked really
> like the "head", in means of signs from an installed grub-loader. But a corrupted partition-table.
> From other of these files one could do a "fdisk -l <file>" and at least a partition-table could have been
> found.
> Two days later we got a customers big complaint about not being able to boot his VM anymore. The point now is,
> from such a file with name and pg, how can we identify the real file being associated with, cause there is another
> customer with a potential problem with next reboot ( second inconsistency).
>
> We also had some VM's in a big test-phase with similar problems? grub going into rescue-prompt, invalid/corrupted
> partition tables, so all in the first "head-file"?
> Would be cool to get some more infos? and sched some light into the structures ( myself not really being a good code-reader
> anymore ;) ).

'head' in this case means the object hasn't been COWed (snapshotted and 
then overwritten), and 000000000000 means it's the first 4MB block of the 
rbd image/disk.
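
So, assuming the default 4 MB object size (an assumption; the thread doesn't
state it), the object number maps straight to a byte offset in the image; for
the object from the scrub error:

 rb.0.0.0000000000bd  ->  object 0xbd = 189  ->  byte offset 189 * 4194304 = 792723456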

Were you able to use the 'rbd info' in the previous email to identify which 
image it is?  Is that what you mean by 'identify the real file'?

I'm not sure I understand exactly what your question is.  I would have 
expected modifying the file with fdisk -l to work (if fdisk sees a valid 
partition table, it should be able to write it too).

sage


> 
> Thanks in@vance and kind regards,
> 
> Oliver.
> 
> Am 13.02.2012 um 18:13 schrieb Sage Weil:
> 
> > On Sun, 12 Feb 2012, Jens Rehpoehler wrote:
> > 
> >>>> Hi Liste,
> >>>> 
> >>>> today i've got another problem.
> >>>> 
> >>>> ceph -w shows up with an inconsistent PG over night:
> >>>> 
> >>>> 2012-02-10 08:38:48.701775    pg v441251: 1982 pgs: 1981 active+clean, 1
> >>>> active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
> >>>> GB avail
> >>>> 2012-02-10 08:38:49.702789    pg v441252: 1982 pgs: 1981 active+clean, 1
> >>>> active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
> >>>> GB avail
> >>>> 
> >>>> I've identified it with "ceph pg dump - | grep inconsistent
> >>>> 
> >>>> 109.6    141    0    0    0    463820288    111780    111780
> >>>> active+clean+inconsistent    485'7115    480'7301    [3
> >>>> <http://marc.info/?l=ceph-devel&m=132891306919981&w=2#3>,4
> >>>> <http://marc.info/?l=ceph-devel&m=132891306919981&w=2#4>]    [3
> >>>> <http://marc.info/?l=ceph-devel&m=132891306919981&w=2#3>,4
> >>>> <http://marc.info/?l=ceph-devel&m=132891306919981&w=2#4>]
> >>>> 485'7061    2012-02-10 08:02:12.043986
> >>>> 
> >>>> Now I've tried to repair it with: ceph pg repair 109.6
> >>>> 
> >>>> 2012-02-10 08:35:52.276325 mon<- [pg,repair,109.6]
> >>>> 2012-02-10 08:35:52.276776 mon.1 ->  'instructing pg 109.6 on osd.3 to
> >>>> repair' (0)
> >>>> 
> >>>> but i only get the following result:
> >>>> 
> >>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455420 osd.3
> >>>> 10.10.10.8:6801/25980 6913 : [ERR] 109.6 osd.4: soid
> >>>> 1ef398ce/rb.0.0.0000000000bd/headsize 2736128 != known size 3145728
> >>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455426 osd.3
> >>>> 10.10.10.8:6801/25980 6914 : [ERR] 109.6 scrub 0 missing, 1 inconsistent
> >>>> objects
> >>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455799 osd.3
> >>>> 10.10.10.8:6801/25980 6915 : [ERR] 109.6 scrub 1 errors
> >>>> 
> >>>> Can someone please explain me what to do in this case and how to recover
> >>>> the pg ?
> >>> 
> >>> So the "fix" is just to truncate the file to the expected size, 3145728,
> >>> by finding it in the current/ directory.  The name/path will be slightly
> >>> weird; look for 'rb.0.0.0000000000bd'.
> >>> 
> >>> The data is still suspect, though.  Did the ceph-osd restart or crash
> >>> recently?  I would do that, repair (it should succeed), and then fsck the
> >>> file system in that rbd image.
> >>> 
> >>> We just fixed a bug that was causing transactions to leak across
> >>> checkpoint/snapshot boundaries.  That could be responsible for causing all
> >>> sorts of subtle corruptions, including this one.  It'll be included in
> >>> v0.42 (out next week).
> >>> 
> >>> sage
> >> 
> >> Hi Sage,
> >> 
> >> no ... the osd didn't crash. I had to do some hardware maintenance and pushed
> >> it
> >> out of distribution with "ceph osd out 3". After a short while I used
> >> "/etc/init.d/ceph stop" on that osd.
> >> Then, after my work, I started ceph and pushed it back into the distribution with
> >> "ceph osd in 3".
> > 
> > For the bug I'm worried about, stopping the daemon and crashing are 
> > equivalent.  In both cases, a transaction may have been only partially 
> > included in the checkpoint.
> > 
> >> Could you please tell me if this is the right way to get an osd out for
> >> maintenance? Is there
> >> anything else I should do to keep the data consistent?
> > 
> > You followed the right procedure.  There is (hopefully, was!) just a bug.
> > 
> > sage
> > 
> > 
> >> My structure is ->  3 MDS/MON servers on separate hardware nodes and 3 OSD nodes,
> >> each with a total capacity
> >> of 8 TB. Journaling is done on a separate SSD per node. The whole thing is a
> >> data store for a kvm virtualisation
> >> farm. The farm accesses the data directly via rbd.
> >> 
> >> Thank you
> >> 
> >> Jens
> >> 
> >> 
> >> 
> >> 
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> 
> >> 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Problem with inconsistent PG
  2012-02-13 17:13 ` Sage Weil
@ 2012-02-16 14:42   ` Oliver Francke
  2012-02-16 17:17     ` Sage Weil
  0 siblings, 1 reply; 11+ messages in thread
From: Oliver Francke @ 2012-02-16 14:42 UTC (permalink / raw)
  To: Sage Weil; +Cc: Jens Rehpoehler, ceph-devel

Hi Sage, *,

your tip with truncating from below did not solve the problem. Just to recap:

we had two inconsistencies, which we could break down to something like:

rb.0.0.000000000000__head_DA680EE2

according to the ceph pg dump from below. Going to the node with the OSD mounted on /data/osd3,
for example, a simple "find …" brings up a couple of them, so the pg number is relevant too -
makes sense - we went into, let's say, "/data/osd3/current/84.2_head/" and did a hex dump of the file. It really looked
like the "head", in the sense of showing signs of an installed grub loader, but with a corrupted partition table.
For some of the other files one could run "fdisk -l <file>" and at least a partition table could be
found.
Two days later we got a big complaint from a customer who could not boot his VM anymore. The point now is:
from such a file with its name and pg, how can we identify the real file (i.e. the VM) it is associated with? There is another
customer with a potential problem at the next reboot (the second inconsistency).

We also had some VMs in a big test phase with similar problems… grub dropping into the rescue prompt, invalid/corrupted
partition tables, so all in the first "head" file?
Would be cool to get some more info… and shed some light on the structures (myself not really being a good code reader
anymore ;) ).
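
In case it helps, roughly what we did to look at the head objects, as a sketch
(the paths and names below are just examples from our setup):

 $ find /data/osd*/current -name 'rb.0.0.000000000000__head_*'
 $ dd if=/data/osd3/current/84.2_head/rb.0.0.000000000000__head_DA680EE2 bs=512 count=1 2>/dev/null | hexdump -C
   # a usable MBR ends with 55 aa at offset 0x1fe; here the table itself was garbage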

Thanks in advance and kind regards,

Oliver.

On 13.02.2012 at 18:13, Sage Weil wrote:

> On Sun, 12 Feb 2012, Jens Rehpoehler wrote:
> 
>>>> Hi Liste,
>>>> 
>>>> today i've got another problem.
>>>> 
>>>> ceph -w shows up with an inconsistent PG over night:
>>>> 
>>>> 2012-02-10 08:38:48.701775    pg v441251: 1982 pgs: 1981 active+clean, 1
>>>> active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
>>>> GB avail
>>>> 2012-02-10 08:38:49.702789    pg v441252: 1982 pgs: 1981 active+clean, 1
>>>> active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
>>>> GB avail
>>>> 
>>>> I've identified it with "ceph pg dump - | grep inconsistent
>>>> 
>>>> 109.6    141    0    0    0    463820288    111780    111780
>>>> active+clean+inconsistent    485'7115    480'7301    [3,4]    [3,4]
>>>> 485'7061    2012-02-10 08:02:12.043986
>>>> 
>>>> Now I've tried to repair it with: ceph pg repair 109.6
>>>> 
>>>> 2012-02-10 08:35:52.276325 mon<- [pg,repair,109.6]
>>>> 2012-02-10 08:35:52.276776 mon.1 ->  'instructing pg 109.6 on osd.3 to
>>>> repair' (0)
>>>> 
>>>> but i only get the following result:
>>>> 
>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455420 osd.3
>>>> 10.10.10.8:6801/25980 6913 : [ERR] 109.6 osd.4: soid
>>>> 1ef398ce/rb.0.0.0000000000bd/headsize 2736128 != known size 3145728
>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455426 osd.3
>>>> 10.10.10.8:6801/25980 6914 : [ERR] 109.6 scrub 0 missing, 1 inconsistent
>>>> objects
>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455799 osd.3
>>>> 10.10.10.8:6801/25980 6915 : [ERR] 109.6 scrub 1 errors
>>>> 
>>>> Can someone please explain me what to do in this case and how to recover
>>>> the pg ?
>>> 
>>> So the "fix" is just to truncate the file to the expected size, 3145728,
>>> by finding it in the current/ directory.  The name/path will be slightly
>>> weird; look for 'rb.0.0.0000000000bd'.
>>> 
>>> The data is still suspect, though.  Did the ceph-osd restart or crash
>>> recently?  I would do that, repair (it should succeed), and then fsck the
>>> file system in that rbd image.
>>> 
>>> We just fixed a bug that was causing transactions to leak across
>>> checkpoint/snapshot boundaries.  That could be responsible for causing all
>>> sorts of subtle corruptions, including this one.  It'll be included in
>>> v0.42 (out next week).
>>> 
>>> sage
>> 
>> Hi Sage,
>> 
>> no ... the osd didn't crash. I had to do some hardware maintenance and pushed
>> it
>> out of distribution with "ceph osd out 3". After a short while I used
>> "/etc/init.d/ceph stop" on that osd.
>> Then, after my work, I started ceph and pushed it back into the distribution with
>> "ceph osd in 3".
> 
> For the bug I'm worried about, stopping the daemon and crashing are 
> equivalent.  In both cases, a transaction may have been only partially 
> included in the checkpoint.
> 
>> Could you please tell me if this is the right way to get an osd out for
>> maintenance? Is there
>> anything else I should do to keep the data consistent?
> 
> You followed the right procedure.  There is (hopefully, was!) just a bug.
> 
> sage
> 
> 
>> My structure is ->  3 MDS/MON servers on separate hardware nodes and 3 OSD nodes,
>> each with a total capacity
>> of 8 TB. Journaling is done on a separate SSD per node. The whole thing is a
>> data store for a kvm virtualisation
>> farm. The farm accesses the data directly via rbd.
>> 
>> Thank you
>> 
>> Jens
>> 
>> 
>> 
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Problem with inconsistent PG
  2012-02-12 17:46 ` Jens Rehpoehler
@ 2012-02-13 17:10   ` Sage Weil
  0 siblings, 0 replies; 11+ messages in thread
From: Sage Weil @ 2012-02-13 17:10 UTC (permalink / raw)
  To: Jens Rehpoehler; +Cc: ceph-devel

On Sun, 12 Feb 2012, Jens Rehpoehler wrote:
> On 12.02.2012 at 13:00, Jens Rehpoehler wrote:
> > > >  Hi Liste,
> > > > 
> > > >  today i've got another problem.
> > > > 
> > > >  ceph -w shows up with an inconsistent PG over night:
> > > > 
> > > >  2012-02-10 08:38:48.701775    pg v441251: 1982 pgs: 1981 active+clean,
> > > > 1
> > > >  active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
> > > >  GB avail
> > > >  2012-02-10 08:38:49.702789    pg v441252: 1982 pgs: 1981 active+clean,
> > > > 1
> > > >  active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
> > > >  GB avail
> > > > 
> > > >  I've identified it with "ceph pg dump - | grep inconsistent
> > > > 
> > > >  109.6    141    0    0    0    463820288    111780    111780
> > > >  active+clean+inconsistent    485'7115    480'7301    [3,4]    [3,4]
> > > >  485'7061    2012-02-10 08:02:12.043986
> > > > 
> > > >  Now I've tried to repair it with: ceph pg repair 109.6
> > > > 
> > > >  2012-02-10 08:35:52.276325 mon<- [pg,repair,109.6]
> > > >  2012-02-10 08:35:52.276776 mon.1 ->  'instructing pg 109.6 on osd.3 to
> > > >  repair' (0)
> > > > 
> > > >  but i only get the following result:
> > > > 
> > > >  2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455420 osd.3
> > > >  10.10.10.8:6801/25980 6913 : [ERR] 109.6 osd.4: soid
> > > >  1ef398ce/rb.0.0.0000000000bd/headsize 2736128 != known size 3145728
> > > >  2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455426 osd.3
> > > >  10.10.10.8:6801/25980 6914 : [ERR] 109.6 scrub 0 missing, 1
> > > > inconsistent
> > > >  objects
> > > >  2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455799 osd.3
> > > >  10.10.10.8:6801/25980 6915 : [ERR] 109.6 scrub 1 errors
> > > > 
> > > >  Can someone please explain me what to do in this case and how to
> > > > recover
> > > >  the pg ?
> > > 
> > > So the "fix" is just to truncate the file to the expected size, 3145728,
> > > by finding it in the current/ directory.  The name/path will be slightly
> > > weird; look for 'rb.0.0.0000000000bd'.
> > > 
> > > The data is still suspect, though.  Did the ceph-osd restart or crash
> > > recently?  I would do that, repair (it should succeed), and then fsck the
> > > file system in that rbd image.
> > > 
> > > We just fixed a bug that was causing transactions to leak across
> > > checkpoint/snapshot boundaries.  That could be responsible for causing all
> > > sorts of subtle corruptions, including this one.  It'll be included in
> > > v0.42 (out next week).
> > > 
> > > sage
> > 
> > Hi Sage,
> > 
> > no ... the osd didn't crash. I had to do some hardware maintenance and pushed
> > it
> > out of distribution with "ceph osd out 3". After a short while I used
> > "/etc/init.d/ceph stop" on that osd.
> > Then, after my work, I started ceph and pushed it back into the distribution with
> > "ceph osd in 3".
> > 
> > Could you please tell me if this is the right way to get an osd out for
> > maintenance? Is there
> > anything else I should do to keep the data consistent?
> > 
> > My structure is ->  3 MDS/MON servers on separate hardware nodes and 3 OSD
> > nodes, each with a total capacity
> > of 8 TB. Journaling is done on a separate SSD per node. The whole thing is a
> > data store for a kvm virtualisation
> > farm. The farm accesses the data directly via rbd.
> > 
> > Thank you
> > 
> > Jens
> > 
> > 
> > 
> > 
> Hi Sage,
> 
> just another addition:
> 
> root@fcmsmon0:~# ceph pg dump -|grep inconsi
> 109.6   141     0       0       0       463820288       111780  111780
> active+clean+inconsistent       558'14530       510'14829       [3,4]   [3,4]
> 558'14515       2012-02-12 18:29:07.793725
> 84.2    279     0       0       0       722016776       111780  111780
> active+clean+inconsistent       558'22106       510'22528       [3,4]   [3,4]
> 558'22089       2012-02-12 18:29:37.089054
> 
> The repair output for the new inconsistency is:
> 
> 2012-02-12 18:29:23.933162   log 2012-02-12 18:29:20.936261 osd.3
> 10.10.10.8:6800/12718 1868 : [ERR] 84.2 osd.4: soid
> da680ee2/rb.0.0.000000000000/headsize 2666496 != known size 3145728
> 2012-02-12 18:29:23.933162   log 2012-02-12 18:29:20.936274 osd.3
> 10.10.10.8:6800/12718 1869 : [ERR] 84.2 repair 0 missing, 1 inconsistent
> objects
> 2012-02-12 18:29:23.933162   log 2012-02-12 18:29:20.937164 osd.3
> 10.10.10.8:6800/12718 1870 : [ERR] 84.2 repair stat mismatch, got 279/279
> objects, 0/0 clones, 722016776/721537544 bytes.
> 2012-02-12 18:29:23.933162   log 2012-02-12 18:29:20.937206 osd.3
> 10.10.10.8:6800/12718 1871 : [ERR] 84.2 repair 2 errors, 1 fixed
> 
> Please note that the osd hasn't been down in the last few days. The filesystem is
> under heavy load from more than 150 KVM VMs.
> 
> Could you also please explain how I may find the VM corresponding to the
> inconsistency, so I can do a filesystem check?

The 'rbd info' output shows the object prefix, e.g.

rbd image 'foo':
        size 10000 MB in 2500 objects
        order 22 (4096 KB objects)
        block_name_prefix: rb.0.0
        parent:  (pool -1)

If it's rb.0.0, it's probably the first image you created.  Or you should 
be able to find it with something like

 $ for f in `rbd list`; do rbd info $f | grep -q 'rb.0.0' && echo $f ; done
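
If more than one image starts with a similar prefix, the same loop with a
stricter match on the block_name_prefix line (a small variation, assuming the
output format shown above) avoids false positives:

 $ for f in `rbd list`; do rbd info $f | grep -q 'block_name_prefix: rb\.0\.0$' && echo $f ; done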

sage

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Problem with inconsistent PG
  2012-02-12 12:00 Jens Rehpoehler
@ 2012-02-12 17:46 ` Jens Rehpoehler
  2012-02-13 17:10   ` Sage Weil
  2012-02-13 17:13 ` Sage Weil
  1 sibling, 1 reply; 11+ messages in thread
From: Jens Rehpoehler @ 2012-02-12 17:46 UTC (permalink / raw)
  To: ceph-devel; +Cc: sage

[-- Attachment #1: Type: text/plain, Size: 4784 bytes --]

On 12.02.2012 at 13:00, Jens Rehpoehler wrote:
>>>  Hi Liste,
>>>
>>>  today i've got another problem.
>>>
>>>  ceph -w shows up with an inconsistent PG over night:
>>>
>>>  2012-02-10 08:38:48.701775    pg v441251: 1982 pgs: 1981 
>>> active+clean, 1
>>>  active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 
>>> 22345
>>>  GB avail
>>>  2012-02-10 08:38:49.702789    pg v441252: 1982 pgs: 1981 
>>> active+clean, 1
>>>  active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 
>>> 22345
>>>  GB avail
>>>
>>>  I've identified it with "ceph pg dump - | grep inconsistent
>>>
>>>  109.6    141    0    0    0    463820288    111780    111780
>>>  active+clean+inconsistent    485'7115    480'7301    [3,4]    [3,4]
>>>  485'7061    2012-02-10 08:02:12.043986
>>>
>>>  Now I've tried to repair it with: ceph pg repair 109.6
>>>
>>>  2012-02-10 08:35:52.276325 mon<- [pg,repair,109.6]
>>>  2012-02-10 08:35:52.276776 mon.1 ->  'instructing pg 109.6 on osd.3 to
>>>  repair' (0)
>>>
>>>  but i only get the following result:
>>>
>>>  2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455420 osd.3
>>>  10.10.10.8:6801/25980 6913 : [ERR] 109.6 osd.4: soid
>>>  1ef398ce/rb.0.0.0000000000bd/headsize 2736128 != known size 3145728
>>>  2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455426 osd.3
>>>  10.10.10.8:6801/25980 6914 : [ERR] 109.6 scrub 0 missing, 1 
>>> inconsistent
>>>  objects
>>>  2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455799 osd.3
>>>  10.10.10.8:6801/25980 6915 : [ERR] 109.6 scrub 1 errors
>>>
>>>  Can someone please explain me what to do in this case and how to 
>>> recover
>>>  the pg ?
>>
>> So the "fix" is just to truncate the file to the expected size, 3145728,
>> by finding it in the current/ directory.  The name/path will be slightly
>> weird; look for 'rb.0.0.0000000000bd'.
>>
>> The data is still suspect, though.  Did the ceph-osd restart or crash
>> recently?  I would do that, repair (it should succeed), and then fsck 
>> the
>> file system in that rbd image.
>>
>> We just fixed a bug that was causing transactions to leak across
>> checkpoint/snapshot boundaries.  That could be responsible for 
>> causing all
>> sorts of subtle corruptions, including this one.  It'll be included in
>> v0.42 (out next week).
>>
>> sage
>
> Hi Sage,
>
> no ... the osd didn't crash. I had to do some hardware maintenance
> and pushed it
> out of distribution with "ceph osd out 3". After a short while I used
> "/etc/init.d/ceph stop" on that osd.
> Then, after my work, I started ceph and pushed it back into the distribution
> with "ceph osd in 3".
>
> Could you please tell me if this is the right way to get an osd out 
> for maintenance? Is there
> anything else I should do to keep the data consistent?
>
> My structure is ->  3 MDS/MON servers on separate hardware nodes and 3
> OSD nodes, each with a total capacity
> of 8 TB. Journaling is done on a separate SSD per node. The whole
> thing is a data store for a kvm virtualisation
> farm. The farm accesses the data directly via rbd.
>
> Thank you
>
> Jens
>
>
>
>
Hi Sage,

just another addition:

root@fcmsmon0:~# ceph pg dump -|grep inconsi
109.6   141     0       0       0       463820288       111780  111780  
active+clean+inconsistent       558'14530       510'14829       [3,4]   
[3,4]   558'14515       2012-02-12 18:29:07.793725
84.2    279     0       0       0       722016776       111780  111780  
active+clean+inconsistent       558'22106       510'22528       [3,4]   
[3,4]   558'22089       2012-02-12 18:29:37.089054

The repair output for the new inconsistency is:

2012-02-12 18:29:23.933162   log 2012-02-12 18:29:20.936261 osd.3 
10.10.10.8:6800/12718 1868 : [ERR] 84.2 osd.4: soid 
da680ee2/rb.0.0.000000000000/headsize 2666496 != known size 3145728
2012-02-12 18:29:23.933162   log 2012-02-12 18:29:20.936274 osd.3 
10.10.10.8:6800/12718 1869 : [ERR] 84.2 repair 0 missing, 1 inconsistent 
objects
2012-02-12 18:29:23.933162   log 2012-02-12 18:29:20.937164 osd.3 
10.10.10.8:6800/12718 1870 : [ERR] 84.2 repair stat mismatch, got 
279/279 objects, 0/0 clones, 722016776/721537544 bytes.
2012-02-12 18:29:23.933162   log 2012-02-12 18:29:20.937206 osd.3 
10.10.10.8:6800/12718 1871 : [ERR] 84.2 repair 2 errors, 1 fixed

Please note that the osd hasn't been down in the last few days. The 
filesystem is under heavy load from more than 150 KVM VMs.

Could you also please explain how I may find the VM corresponding to 
the inconsistency, so I can do a filesystem check?

Thank you

Jens



[-- Attachment #2: jens_rehpoehler.vcf --]
[-- Type: text/x-vcard, Size: 317 bytes --]

begin:vcard
fn;quoted-printable:Jens Rehp=C3=B6hler
n;quoted-printable:Rehp=C3=B6hler;Jens
org:Filoo GmbH
adr:;;Tilsiter Str. 1;Langenberg;NRW;33449;Deutschland
email;internet:jens.rehpoehler@filoo.de
tel;work:+49-5248-1898412
tel;fax:+49-5248-189819
tel;cell:+49-151-54645798
url:www.filoo.de
version:2.1
end:vcard


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2012-02-17 18:13 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-10  7:43 Problem with inconsistent PG Jens Rehpöhler
2012-02-10 22:30 ` Sage Weil
2012-02-12 12:00 Jens Rehpoehler
2012-02-12 17:46 ` Jens Rehpoehler
2012-02-13 17:10   ` Sage Weil
2012-02-13 17:13 ` Sage Weil
2012-02-16 14:42   ` Oliver Francke
2012-02-16 17:17     ` Sage Weil
2012-02-16 17:53       ` Oliver Francke
2012-02-16 18:02         ` Sage Weil
2012-02-17 14:00           ` Oliver Francke
2012-02-17 17:54             ` Sage Weil
2012-02-17 18:13               ` Oliver Francke

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.