* unfound object problem
@ 2016-05-30  3:58 Rui Xie
  2016-06-06 21:14 ` Gregory Farnum
  2016-06-07  6:27 ` Mustafa Muhammad
  0 siblings, 2 replies; 7+ messages in thread
From: Rui Xie @ 2016-05-30  3:58 UTC (permalink / raw)
  To: ceph-devel

Hi

I found an unfound object problem in my test environment (hammer).
I suspect the reason is the wrong update of up_thru.

From the osdmap, up_thru becomes smaller than before at some epoch.  Some
old PGTemp messages with smaller epochs are prepared and executed, and
they change up_thru to a smaller epoch.
maybe_went_rw is then wrong for that interval.
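
A simplified illustration of why a regressed up_thru matters, written in
Python rather than taken from the Ceph source. It assumes the rule is
roughly: an interval may have gone read-write if the primary was already up
at the interval's first epoch and its up_thru reached that epoch. The
`Interval` class, the function signature, and the epoch numbers are all
hypothetical, and other conditions (such as having enough acting OSDs) are
ignored.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    first: int  # first osdmap epoch of the interval
    last: int   # last osdmap epoch of the interval


def maybe_went_rw(interval: Interval, primary_up_from: int, primary_up_thru: int) -> bool:
    """Simplified rule (an assumption, not Ceph's exact logic): the interval
    may have accepted writes only if the primary was up when it started and
    its recorded up_thru reached the interval's first epoch."""
    return primary_up_from <= interval.first and primary_up_thru >= interval.first


interval = Interval(first=100, last=110)

# up_thru was correctly recorded as 105: the interval may have gone rw.
correct = maybe_went_rw(interval, primary_up_from=90, primary_up_thru=105)

# A stale message pushed up_thru back to 95: the same interval now looks
# as if it could never have accepted writes.
regressed = maybe_went_rw(interval, primary_up_from=90, primary_up_thru=95)

print(correct, regressed)
```

With the regressed value, an interval that really did serve writes would be
treated as one that could not have.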

I think prepare_pgtemp should not change up_thru if the new value is
smaller than the current one, and duplicate PGTemp messages should not be
sent?

Is this a bug, or am I doing something wrong?

Thanks !

* Re: unfound object problem
  2016-05-30  3:58 unfound object problem Rui Xie
@ 2016-06-06 21:14 ` Gregory Farnum
  2016-06-06 22:07   ` Samuel Just
  2016-06-07  6:27 ` Mustafa Muhammad
  1 sibling, 1 reply; 7+ messages in thread
From: Gregory Farnum @ 2016-06-06 21:14 UTC (permalink / raw)
  To: Rui Xie, Samuel Just; +Cc: ceph-devel

I'm not sure if this came out of the same place or not, but Sam was
just talking last week about an issue with pgtemp updates that is at
least close to this bug. That one was resolved in the OSDMonitor, if
those are the up_thru locations you're talking about. :)
-Greg

* Re: unfound object problem
  2016-06-06 21:14 ` Gregory Farnum
@ 2016-06-06 22:07   ` Samuel Just
  2016-06-07  4:21     ` Rui Xie
  0 siblings, 1 reply; 7+ messages in thread
From: Samuel Just @ 2016-06-06 22:07 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Rui Xie, ceph-devel

I don't quite understand...  Can you explain the sequence of events in
more detail?
-Sam

* Re: unfound object problem
  2016-06-06 22:07   ` Samuel Just
@ 2016-06-07  4:21     ` Rui Xie
  2016-06-07 14:52       ` Samuel Just
  0 siblings, 1 reply; 7+ messages in thread
From: Rui Xie @ 2016-06-07  4:21 UTC (permalink / raw)
  To: Samuel Just; +Cc: Gregory Farnum, ceph-devel

Hi Sam

We do not check the pgtemp map_epoch in preprocess_pgtemp or prepare_pgtemp.
Old pgtemp messages with a smaller map_epoch are prepared, and they update
up_thru to a smaller value.

There are also a lot of duplicate pgtemp messages.
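
The kind of check described above can be sketched as follows. This is a
minimal Python model, not the OSDMonitor code and not the fix tracked in the
Ceph issue tracker: `apply_pgtemp`, the `up_thru` dictionary, and the OSD id
and epochs are hypothetical names and values chosen for illustration. It
only models the up_thru update and says nothing about the pg_temp mappings
a message carries.

```python
def apply_pgtemp(up_thru: dict, osd: int, map_epoch: int) -> bool:
    """Record map_epoch as the OSD's up_thru only if it moves forward.

    Returns True when up_thru was advanced, False when the message was
    stale or a duplicate and up_thru was left untouched.
    """
    current = up_thru.get(osd, 0)
    if map_epoch <= current:
        return False  # stale or duplicate: never lower up_thru
    up_thru[osd] = map_epoch
    return True


up_thru = {}

# Messages arrive for osd 3 at epochs 100 and 105, followed by a late copy
# of the epoch-100 message and a duplicate of the epoch-105 message.
results = [apply_pgtemp(up_thru, 3, epoch) for epoch in (100, 105, 100, 105)]

print(results, up_thru)
```

Without the comparison against the current value, the late epoch-100
message would move up_thru for osd 3 back from 105 to 100.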

* Re: unfound object problem
  2016-05-30  3:58 unfound object problem Rui Xie
  2016-06-06 21:14 ` Gregory Farnum
@ 2016-06-07  6:27 ` Mustafa Muhammad
  1 sibling, 0 replies; 7+ messages in thread
From: Mustafa Muhammad @ 2016-06-07  6:27 UTC (permalink / raw)
  To: ceph-devel

As Sam said in an earlier message, this may be caused by setting
sortbitwise at the end of an update. This happened on my cluster, and
unsetting it made everything go back to normal.
See http://tracker.ceph.com/issues/16113

Regards
Mustafa

* Re: unfound object problem
  2016-06-07  4:21     ` Rui Xie
@ 2016-06-07 14:52       ` Samuel Just
  2016-06-08  0:18         ` Samuel Just
  0 siblings, 1 reply; 7+ messages in thread
From: Samuel Just @ 2016-06-07 14:52 UTC (permalink / raw)
  To: Rui Xie; +Cc: Gregory Farnum, ceph-devel

That does sound like a bug; I'll try to take a look today.
-Sam

* Re: unfound object problem
  2016-06-07 14:52       ` Samuel Just
@ 2016-06-08  0:18         ` Samuel Just
  0 siblings, 0 replies; 7+ messages in thread
From: Samuel Just @ 2016-06-08  0:18 UTC (permalink / raw)
  To: Rui Xie; +Cc: Gregory Farnum, ceph-devel

http://tracker.ceph.com/issues/16185

So, it does seem to be a bug, and I've got a fix.  However, it's not
clear to me that it would result in an unfound object.  It seems like
it would result in a pg which should be down being allowed to peer or
an actually unfound object being erroneously considered ok.  I'm not
sure you've diagnosed the original issue correctly.  If you can
reproduce, you should enable logging (debug osd = 20, debug filestore
= 20, debug ms = 1) and go through the logs more carefully.
-Sam
