* problems to protect rbd from multiple simultaneous mapping
From: peng.hse @ 2017-03-06 14:08 UTC (permalink / raw)
  To: Sage Weil, jdurgin, ceph-devel

Hi Sage,

The recommended way to protect an rbd image from multiple simultaneous 
mappings is as follows (a rough sketch in code follows the list):

- identify the old rbd lock holder
- blacklist the old owner
- break the old rbd lock with "rbd lock remove"
- map the rbd image on the new host
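
For illustration, here is a hedged sketch of that sequence using the
Python rados/rbd bindings (the pool "rbd", the image name "myimage",
and the exact binding signatures are my assumptions; please check them
against your release before relying on this):

    import json
    import subprocess

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')           # assumed pool name

    with rbd.Image(ioctx, 'myimage') as image:  # assumed image name
        lockers = image.list_lockers()          # 1. identify the old holder
        if lockers:
            for client, cookie, addr in lockers['lockers']:
                # 2. blacklist the old owner on all OSDs
                cluster.mon_command(json.dumps({
                    'prefix': 'osd blacklist',
                    'blacklistop': 'add',
                    'addr': addr,
                }), b'')
                # 3. break the stale lock (the CLI equivalent is
                #    "rbd lock remove")
                image.break_lock(client, cookie)

    ioctx.close()
    cluster.shutdown()

    # 4. map the image on the new host; krbd mapping has no Python
    #    binding, so shell out to the CLI
    subprocess.check_call(['rbd', 'map', 'rbd/myimage'])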

However, I am wondering how we handle a situation with the following 
timeline:

 1. node1 locks the rbd image and issues IO; the IO is outstanding on 
    the OSDs, not yet committed or acknowledged to the client.

 2. node2 takes over the IO service because of a network partition, 
    successfully adds node1 to the blacklist on all OSDs, and resumes 
    the IO.

 3. assume the outstanding IO from step 1 and the IO from step 2 target 
    the same area of filesystem metadata on the rbd device. Step 2 
    successfully persists its data and replies to the client; then the 
    laggy IO from step 1 might overwrite and corrupt what was written 
    in step 2.

So, how do we prevent this kind of corruption from happening?



* Re: problems to protect rbd from multiple simultaneous mapping
From: Jason Dillaman @ 2017-03-06 23:47 UTC (permalink / raw)
  To: peng.hse; +Cc: Sage Weil, Josh Durgin, ceph-devel

On Mon, Mar 6, 2017 at 9:08 AM, peng.hse <peng.hse@xtaotech.com> wrote:
> 3. assume the outstanding IO from step 1 and the IO from step 2 target
>    the same area of filesystem metadata on the rbd device. Step 2
>    successfully persists its data and replies to the client; then the
>    laggy IO from step 1 might overwrite and corrupt what was written
>    in step 2.
>
> So, how do we prevent this kind of corruption from happening?

... but in step (2) you successfully blacklisted the client on node1
(i.e. it is not allowed to talk to the OSDs). Therefore, node1 cannot
overwrite any data written by node2.
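
For example, one way to double-check that the fence is actually in
place is to list the blacklist (a hedged sketch via the Python rados
binding; the same information comes from "ceph osd blacklist ls", and
which output channel carries the entries may vary by release):

    import json

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ret, out, errs = cluster.mon_command(
        json.dumps({'prefix': 'osd blacklist', 'blacklistop': 'ls'}), b'')
    # the listing should include node1's client address
    print(out.decode(), errs)
    cluster.shutdown()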

-- 
Jason


* Re: problems to protect rbd from multiple simultaneous mapping
From: peng.hse @ 2017-03-07  2:16 UTC (permalink / raw)
  To: dillaman; +Cc: Sage Weil, Josh Durgin, ceph-devel

What I mean is: the step-1 IO from node1 was received by the OSDs 
before the blacklist barrier, but it is still in progress after the 
barrier, so it might overwrite the data written by node2 and corrupt 
it. How do we avoid this situation?

On 2017-03-07 07:47, Jason Dillaman wrote:
> On Mon, Mar 6, 2017 at 9:08 AM, peng.hse <peng.hse@xtaotech.com> wrote:
>> 3. assume the outstanding IO from step 1 and the IO from step 2 target
>>    the same area of filesystem metadata on the rbd device. Step 2
>>    successfully persists its data and replies to the client; then the
>>    laggy IO from step 1 might overwrite and corrupt what was written
>>    in step 2.
>>
>> So, how do we prevent this kind of corruption from happening?
> ... but in step (2) you successfully blacklisted the client on node1
> (i.e. it is not allowed to talk to the OSDs). Therefore, node1 cannot
> overwrite any data written by node2.
>



* Re: problems to protect rbd from multiple simultaneous mapping
From: Jason Dillaman @ 2017-03-07  2:26 UTC (permalink / raw)
  To: peng.hse; +Cc: Sage Weil, Josh Durgin, ceph-devel

Each object is "owned" by a PG, and each PG operates on a given object
in order within a transaction. Therefore, if the OSDs received a write
op from node1 before the blacklist from node2, by definition it would
complete before node2's write ops for the same object could start. In
the reverse scenario, where the write op from node1 was in flight on
the network and somehow arrived after node2's blacklist and write op,
I would hope and expect that the OSD properly handles the blacklist
and drops the op.
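
For intuition, here is a minimal toy model (not Ceph code) of that
per-object ordering argument; "ToyPG" and its methods are made up
purely for illustration:

    class ToyPG:
        def __init__(self):
            self.blacklist = set()
            self.applied = []            # ops applied, in admission order

        def blacklist_add(self, client):
            # the blacklist update is serialized into the same op stream,
            # so anything admitted before it has already completed
            self.blacklist.add(client)

        def submit(self, client, op):
            if client in self.blacklist:
                return 'dropped'         # laggy op from the fenced node
            self.applied.append((client, op))
            return 'applied'

    pg = ToyPG()
    pg.submit('node1', 'write A v1')     # admitted before the fence
    pg.blacklist_add('node1')            # node2 fences node1
    pg.submit('node2', 'write A v2')     # node2 resumes IO
    pg.submit('node1', 'write A v3')     # late arrival: dropped
    assert pg.applied[-1] == ('node2', 'write A v2')   # no corruption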

On Mon, Mar 6, 2017 at 9:16 PM, peng.hse <peng.hse@xtaotech.com> wrote:
> What I mean is: the step-1 IO from node1 was received by the OSDs
> before the blacklist barrier, but it is still in progress after the
> barrier, so it might overwrite the data written by node2 and corrupt
> it. How do we avoid this situation?
>
>
> On 2017-03-07 07:47, Jason Dillaman wrote:
>>
>> On Mon, Mar 6, 2017 at 9:08 AM, peng.hse <peng.hse@xtaotech.com> wrote:
>>>
>>> 3. assume the outstanding IO from step 1 and the IO from step 2
>>> target the same area of filesystem metadata on the rbd device.
>>> Step 2 successfully persists its data and replies to the client;
>>> then the laggy IO from step 1 might overwrite and corrupt what was
>>> written in step 2.
>>>
>>> So, how do we prevent this kind of corruption from happening?
>>
>> ... but in step (2) you successfully blacklisted the client on node1
>> (i.e. it is not allowed to talk to the OSDs). Therefore, node1 cannot
>> overwrite any data written by node2.
>>
>



-- 
Jason

