* wip-proxy-write and (non-idempotent) client ops
@ 2015-01-19 16:50 Sage Weil
  2015-01-20  7:04 ` Wang, Zhiqiang
  0 siblings, 1 reply; 5+ messages in thread
From: Sage Weil @ 2015-01-19 16:50 UTC (permalink / raw)
  To: zhiqiang.wang, sjust, ceph-devel

Consider:

1- primary rx client delete
     proxy delete to base pool
2- primary initiate promote (list-snaps, copy-from)
3- primary rx delete reply
4- primary tx client reply
5- socket failure drops client reply
6- primary rx promote completion (enoent), writes a whiteout
7- client resends delete
8- primary replies with ENOENT

i.e., the problem seems to be that delete is not idempotent and we can't 
tell that the same client op is what triggered the delete.

We could special case delete since that is where this is noticeable, but I 
think the bigger problem is that the op history that is used for dup op 
detection is not preserved across the cache and base tier.  That is, this 
is another variation on this ticket:

	http://tracker.ceph.com/issues/8935

I have this sinking feeling we need to properly address that problem 
before we can do the write proxying...

sage

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: wip-proxy-write and (non-idempotent) client ops
  2015-01-19 16:50 wip-proxy-write and (non-idempotent) client ops Sage Weil
@ 2015-01-20  7:04 ` Wang, Zhiqiang
  2015-01-20 15:06   ` Sage Weil
  0 siblings, 1 reply; 5+ messages in thread
From: Wang, Zhiqiang @ 2015-01-20  7:04 UTC (permalink / raw)
  To: Sage Weil; +Cc: sjust, ceph-devel

Do we have any proposed solutions for this problem? Copy the needed info from base tier to cache tier during promotion? I see it has been there for over 6 months.


^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: wip-proxy-write and (non-idempotent) client ops
  2015-01-20  7:04 ` Wang, Zhiqiang
@ 2015-01-20 15:06   ` Sage Weil
  2015-01-21  2:16     ` Wang, Zhiqiang
  0 siblings, 1 reply; 5+ messages in thread
From: Sage Weil @ 2015-01-20 15:06 UTC (permalink / raw)
  To: Wang, Zhiqiang; +Cc: sjust, ceph-devel

On Tue, 20 Jan 2015, Wang, Zhiqiang wrote:
> Do we have any proposed solutions for this problem? Copy the needed info 
> from base tier to cache tier during promotion? I see it has been there 
> for over 6 months.

Yeah...

1. keep a list of osd_reqid_t's in each object_info_t and match against 
that for dup ops (i forget if the patch for this already went in?).  
there should probably be a tunable for the max list len and age cutoff.

2. preserve that list on copy-from when a flag is specified so that we 
preserve it for both promote and flush.

sage



^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: wip-proxy-write and (non-idempotent) client ops
  2015-01-20 15:06   ` Sage Weil
@ 2015-01-21  2:16     ` Wang, Zhiqiang
  2015-01-21  2:30       ` Sage Weil
  0 siblings, 1 reply; 5+ messages in thread
From: Wang, Zhiqiang @ 2015-01-21  2:16 UTC (permalink / raw)
  To: Sage Weil; +Cc: sjust, ceph-devel

Is it sufficient to preserve only the list of osd_reqid_t? That is enough to match dup ops, but it can't tell whether the op has already completed, been acked, or is still in progress.

However, maybe we could say these ops have completed, since they come from the base tier and we do an RWORDERED promotion. That is, all ops issued before the promotion was initiated have completed in the base tier, and all ops issued after it are requeued behind the promotion. Does that sound right?


^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: wip-proxy-write and (non-idempotent) client ops
  2015-01-21  2:16     ` Wang, Zhiqiang
@ 2015-01-21  2:30       ` Sage Weil
  0 siblings, 0 replies; 5+ messages in thread
From: Sage Weil @ 2015-01-21  2:30 UTC (permalink / raw)
  To: Wang, Zhiqiang; +Cc: sjust, ceph-devel

On Wed, 21 Jan 2015, Wang, Zhiqiang wrote:
> Is it sufficient to only preserve the list of osd_reqid_t? It's able to 
> match dup ops. But it can't tell if the op is already completed, acked 
> or still undergoing.
> 
> However, maybe we could say these ops have completed since they are from 
> the base tier and we just do a RWORDERED promotion. That is, all the ops 
> before initiating the promotion have completed in base tier, and all the 
> ops after initiating the promotion are requeued after the promotion. 
> Sounds right?

Yeah exactly.  Except I think there shouldn't be any write ops after the 
promotion starts since the cache tier won't do that (it will start 
blocking writes once a promotion is in progress).

Either way, I think it's

1- add vector<osd_reqid_t> to object_info_t, populate it on write, and 
check it for dups when we check the pg log.  make a config tunable and/or 
a pg_pool_t tunable to control how many to keep.

2- add it to the object_copy_data_t so that promote and flush can preserve 
it

?
sage




^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2015-01-21  2:30 UTC | newest]

Thread overview: 5+ messages
-- links below jump to the message on this page --
2015-01-19 16:50 wip-proxy-write and (non-idempotent) client ops Sage Weil
2015-01-20  7:04 ` Wang, Zhiqiang
2015-01-20 15:06   ` Sage Weil
2015-01-21  2:16     ` Wang, Zhiqiang
2015-01-21  2:30       ` Sage Weil
