* rados semantic changes
@ 2015-05-06 17:47 Sage Weil
  2015-05-08 23:18 ` Samuel Just
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2015-05-06 17:47 UTC (permalink / raw)
  To: sjust, ceph-devel

It sounds like we're kicking around two proposed changes:

1) A delete will never return -ENOENT; a delete on a non-existent object is a 
success.  This is important for cache tiering, since it allows us to skip 
checking the base tier for certain client requests.

2) Any write will implicitly create the object.  If a user wants such a 
dependency (so that they see ENOENT), they can put an assert_exists in the op 
vector.

I think what both of these amount to is that the writes have no read-side 
checks and will effectively never fail (except for true failure 
cases).  If the user wants some sort of failure, it will be explicit 
in the form of another read check in the op vector.
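The two changes can be illustrated with a toy model.  This is not librados code, just a sketch of op-vector semantics under the proposal, with hypothetical op names (assert_exists, write, delete):

```python
# Toy model of the proposed semantics: writes never fail on their own;
# failure is opted into by putting an explicit read check (assert_exists)
# in the op vector.  Op vectors apply atomically: all ops or none.

class ENOENT(Exception):
    pass

def exec_op_vector(store, oid, ops):
    """Apply an op vector atomically against a dict-based object store."""
    shadow = dict(store)                   # stage changes; commit only on success
    for op, *args in ops:
        if op == "assert_exists":
            if oid not in shadow:
                raise ENOENT(oid)          # explicit, opted-in failure
        elif op == "write":
            shadow[oid] = args[0]          # any write implicitly creates
        elif op == "delete":
            shadow.pop(oid, None)          # delete never returns -ENOENT
    store.clear()
    store.update(shadow)

store = {}
exec_op_vector(store, "foo", [("delete",)])          # no-op delete succeeds
exec_op_vector(store, "foo", [("write", b"data")])   # implicitly creates foo
try:
    exec_op_vector(store, "bar", [("assert_exists",), ("write", b"x")])
    failed = False
except ENOENT:
    failed = True
assert failed and "bar" not in store     # guard made the failure explicit
```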

Sam, is this what you're thinking?

It's a subtle but real change in the semantics of rados ops, but I 
think now would be the time to make it...

sage

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: rados semantic changes
  2015-05-06 17:47 rados semantic changes Sage Weil
@ 2015-05-08 23:18 ` Samuel Just
  2015-05-09  0:16   ` Gregory Farnum
  0 siblings, 1 reply; 11+ messages in thread
From: Samuel Just @ 2015-05-08 23:18 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

So, the problem is sequences of pipelined operations like:

(object does not exist)
1. [exists_guard, write]
2. [write]

Currently, if a client sends 1. and then 2. without waiting for 1. to complete, the guarantee is that even in the event of peering or a client crash the only visible orderings are [], [1], and [1, 2].  However, if both operations complete (with 1. returning ENOENT) and are then replayed, we will see a result of [2, 1]: 1. will be executed first, but since 2. already completed, exists_guard will not error out and the write will succeed; 2. will then return immediately with success since the pg log will contain an entry indicating that it already happened.  Delete seems to be a special case of this.
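The anomaly is easy to reproduce in a toy model.  Here the "pg log" is a dict from op id to result, and only committed mutations are logged (as in the current behavior described above); all names are illustrative, not actual Ceph code:

```python
# Toy reproduction of the replay anomaly: a guarded op that failed with
# ENOENT leaves no pg log entry, so a replay re-executes it against state
# produced by a later op, yielding the forbidden ordering [2, 1].

def run(store, pg_log, op_id, guard, data):
    if op_id in pg_log:                    # already committed: replay no-ops
        return pg_log[op_id]
    if guard and "obj" not in store:       # exists_guard fails with ENOENT...
        return "ENOENT"                    # ...and nothing is logged
    store["obj"] = data
    pg_log[op_id] = "OK"
    return "OK"

store, pg_log = {}, {}
# First pass: 1. [exists_guard, write] fails, 2. [write] creates the object.
assert run(store, pg_log, 1, guard=True,  data="v1") == "ENOENT"
assert run(store, pg_log, 2, guard=False, data="v2") == "OK"
# Replay after peering or a client crash, in order: 1.'s guard now passes
# because 2. already created the object, and 2. replays from the log.
assert run(store, pg_log, 1, guard=True,  data="v1") == "OK"
assert run(store, pg_log, 2, guard=False, data="v2") == "OK"
assert store["obj"] == "v1"                # 1.'s write landed last: [2, 1]
```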

For the more general problem, it seems like the objecter cannot pipeline rw ops with any kind of write (including other rw ops).  This means the objecter in the case above would hold 2. back until 1. completes and is not in danger of being re-sent.  We'll need some machinery in the objecter to handle this part.
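A minimal sketch of that objecter-side gating might look like the following.  The class and method names are hypothetical; the real Objecter machinery would differ, but the invariant is the one described above: an rw op is never in flight concurrently with any write on the same object, while plain writes may still pipeline with each other:

```python
from collections import defaultdict, deque

class Gate:
    """Hold back ops so rw ops never pipeline with writes on one object."""

    def __init__(self):
        self.w = defaultdict(int)      # in-flight plain writes per object
        self.rw = defaultdict(int)     # in-flight rw ops per object
        self.q = defaultdict(deque)    # held-back ops per object, in order

    def _blocked(self, oid, kind):
        if kind == "rw":
            return self.w[oid] > 0 or self.rw[oid] > 0
        return self.rw[oid] > 0        # plain write waits only on rw ops

    def submit(self, oid, kind, send):
        if self._blocked(oid, kind) or self.q[oid]:
            self.q[oid].append((kind, send))
        else:
            getattr(self, kind)[oid] += 1
            send()

    def complete(self, oid, kind):
        getattr(self, kind)[oid] -= 1
        while self.q[oid]:             # drain newly unblocked ops, in order
            k, send = self.q[oid][0]
            if self._blocked(oid, k):
                break
            self.q[oid].popleft()
            getattr(self, k)[oid] += 1
            send()

sent = []
g = Gate()
g.submit("foo", "w", lambda: sent.append("w1"))
g.submit("foo", "w", lambda: sent.append("w2"))    # writes pipeline together
g.submit("foo", "rw", lambda: sent.append("rw1"))  # held: writes in flight
assert sent == ["w1", "w2"]
g.complete("foo", "w")
g.complete("foo", "w")
assert sent == ["w1", "w2", "rw1"]   # rw released once writes drain
```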

For this to work, we need the pg log to record that an op marked as w has been processed regardless of the result.  We can do this for a particular op type marked as w by ensuring that it always succeeds, by writing a noop log entry recording the result, or by giving up and marking that op rw.

It seems to me that:
1) delete should always succeed.  We'll have to record a noop log entry to ensure that it is not replayed out of turn.
  - Or we can leave delete as it is and mark it rw.
2) omap and xattr set operations implicitly create the object and therefore always succeed.
3) omap remove operations are marked as rw so they can return ENOENT.
  - Otherwise, an omap remove operation on a non-existent object would have to add a noop entry to the log or implicitly create the object -- the latter would be particularly odd.
  - Or, we could record a noop log entry with the return code.
4) I don't think there are any current object classes which should be marked w, but they should be carefully audited.
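The "noop log entry with the return code" option in 3) can be sketched like this.  Again a toy model with illustrative names, not actual pg log code: every processed w op gets an entry recording its result, even when it mutated nothing, so a replay returns the recorded result instead of re-executing against newer state:

```python
# Sketch: the pg log records a result for every processed w op, so replay
# is idempotent even when the op originally failed with ENOENT.

def omap_rm(store, pg_log, op_id, oid, key):
    if op_id in pg_log:
        return pg_log[op_id]               # replay: return the logged result
    if oid not in store:
        pg_log[op_id] = "ENOENT"           # noop entry carrying the error
        return "ENOENT"
    store[oid].pop(key, None)
    pg_log[op_id] = "OK"
    return "OK"

store, pg_log = {}, {}
assert omap_rm(store, pg_log, 1, "foo", "k") == "ENOENT"
store["foo"] = {"k": 1}                    # object created by a later op
assert omap_rm(store, pg_log, 1, "foo", "k") == "ENOENT"   # replay is stable
assert store["foo"] == {"k": 1}            # the replay did not re-execute
```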

On the implementation side, we probably need new versions of the affected librados calls (omap_set2?, blind_delete?) since we are changing the return values and there may be older code which relies on this behavior.

Write-ordered reads also appear to be fundamentally broken in the current implementation for the same reason.  It seems like we'd have to handle that at the objecter level by marking the reads rw.

We could also find a way to relax the ordering guarantees even further, but I fear that that would make it excessively difficult to reason about librados operation ordering.

Thoughts?
-Sam

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



* Re: rados semantic changes
  2015-05-08 23:18 ` Samuel Just
@ 2015-05-09  0:16   ` Gregory Farnum
  2015-05-09  0:53     ` Samuel Just
  2015-05-11  2:31     ` Sage Weil
  0 siblings, 2 replies; 11+ messages in thread
From: Gregory Farnum @ 2015-05-09  0:16 UTC (permalink / raw)
  To: Samuel Just; +Cc: Sage Weil, ceph-devel

Do we have any tickets or something motivating these? I'm not quite
sure which of these problems are things you noticed, versus things
we've seen in the field, versus stuff that might make our lives easier
in the future.

That said, my votes so far are
On Fri, May 8, 2015 at 4:18 PM, Samuel Just <sjust@redhat.com> wrote:
> So, the problem is sequences of pipelined operations like:
>
> (object does not exist)
> 1. [exists_guard, write]
> 2. [write]
>
> Currently, if a client sends 1. and then 2. without waiting for 1. to complete, the guarantee is that even in the event of peering or a client crash the only visible orderings are [], [1], and [1, 2].  However, if both operations complete (with 1. returning ENOENT) and are then replayed, we will see a result of [2, 1].  1. will be executed first, but since 2. already completed, exists_guard will not error out, and the write will succeed.  2. will then return immediately with success since pg log will contain an entry indicating that it already happened.  Delete seems to be a special case of this.
>
> For the more general problem, it seems like the objecter cannot pipeline rw ops with any kind of write (including other rw ops).  This means the objecter in the case above would hold 2. back until 1. completes and is not in danger of being re-sent.  We'll need some machinery in the objecter to handle this part.

Although I think you mean it can't pipeline rw ops on the same object,
that still seems unpleasant — especially for our more annoying
operations that might touch multiple objects.

>
> For this to work, we need the pg log to record that an op marked as w has been processed regardless of the result.  We can do this for a particular op type marked as w by ensuring that it always succeeds, by writing a noop log entry recording the result, or by giving up and marking that op rw.
>
> It seems to me that:
> 1) delete should always succeed.  We'll have to record a noop log entry to ensure that it is not replayed out of turn.

yes

>   - Or we can leave delete as it is and mark it rw.
> 2) omap and xattr set operations implicitly create the object and therefore always succeed.

that's not current behavior? yes

> 3) omap remove operations are marked as rw so they can return ENOENT.
>   - Otherwise, an omap remove operation on a non-existing object would have to add a noop entry to the log or implicitly create the object -- the latter would be particularly odd.
>   - Or, we could record a noop log entry with the return code.

This one — I don't think log entries are very expensive, and I don't
think we want to serialize omap ops. In particular I think serializing
omap rm would be bad for rgw.

> 4) I don't think there are any current object classes which should be marked w, but they should be carefully audited.
>
> On the implementation side, we probably need new versions of the affected librados calls (omap_set2?, blind_delete?) since we are changing the return values and there may be older code which relies on this behavior.

But then they'd also be preserving broken behavior, right? Perhaps
it's time to do something interesting with our versioning.



* Re: rados semantic changes
  2015-05-09  0:16   ` Gregory Farnum
@ 2015-05-09  0:53     ` Samuel Just
  2015-05-09  4:20       ` Samuel Just
  2015-05-11  2:31     ` Sage Weil
  1 sibling, 1 reply; 11+ messages in thread
From: Samuel Just @ 2015-05-09  0:53 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, ceph-devel

Just some things I noticed.  It'll be relatively easy to reproduce in ceph_test_rados.  We haven't seen it only because the three existing rados users and ceph_test_rados don't happen to send problematic sequences.
-Sam



* Re: rados semantic changes
  2015-05-09  0:53     ` Samuel Just
@ 2015-05-09  4:20       ` Samuel Just
  0 siblings, 0 replies; 11+ messages in thread
From: Samuel Just @ 2015-05-09  4:20 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, ceph-devel

Oops, omap and xattr set do create the object already.
-Sam



* Re: rados semantic changes
  2015-05-09  0:16   ` Gregory Farnum
  2015-05-09  0:53     ` Samuel Just
@ 2015-05-11  2:31     ` Sage Weil
  2016-03-09 20:42       ` Sage Weil
  1 sibling, 1 reply; 11+ messages in thread
From: Sage Weil @ 2015-05-11  2:31 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Samuel Just, ceph-devel

On Fri, 8 May 2015, Gregory Farnum wrote:
> Do we have any tickets or something motivating these? I'm not quite
> sure which of these problems are things you noticed, versus things
> we've seen in the field, versus stuff that might make our lives easier
> in the future.
> 
> That said, my votes so far are
> On Fri, May 8, 2015 at 4:18 PM, Samuel Just <sjust@redhat.com> wrote:
> > So, the problem is sequences of pipelined operations like:
> >
> > (object does not exist)
> > 1. [exists_guard, write]
> > 2. [write]
> >
> > Currently, if a client sends 1. and then 2. without waiting for 1. to complete, the guarantee is that even in the event of peering or a client crash the only visible orderings are [], [1], and [1, 2].  However, if both operations complete (with 1. returning ENOENT) and are then replayed, we will see a result of [2, 1].  1. will be executed first, but since 2. already completed, exists_guard will not error out, and the write will succeed.  2. will then return immediately with success since pg log will contain an entry indicating that it already happened.  Delete seems to be a special case of this.
> >
> > For the more general problem, it seems like the objecter cannot pipeline rw ops with any kind of write (including other rw ops).  This means the objecter in the case above would hold 2. back until 1. completes and is not in danger of being re-sent.  We'll need some machinery in the objecter to handle this part.
> 
> Although I think you mean it can't pipeline rw ops on the same object,
> that still seems unpleasant — especially for our more annoying
> operations that might touch multiple objects.
> 
> >
> > For this to work, we need the pg log to record that an op marked as w has been processed regardless of the result.  We can do this for a particular op type marked as w by ensuring that it always succeeds, by writing a noop log entry recording the result, or by giving up and marking that op rw.
> >
> > It seems to me that:
> > 1) delete should always succeed.  We'll have to record a noop log entry to ensure that it is not replayed out of turn.
> 
> yes
> 
> >   - Or we can leave delete as it is and mark it rw.
> > 2) omap and xattr set operations implicitly create the object and therefore always succeed.
> 
> that's not current behavior? yes
> 
> > 3) omap remove operations are marked as rw so they can return ENOENT.
> >   - Otherwise, an omap remove operation on a non-existing object would have to add a noop entry to the log or implicitly create the object -- the latter would be particularly odd.
> >   - Or, we could record a noop log entry with the return code.
> 
> This one — I don't think log entries are very expensive, and I don't
> think we want to serialize omap ops. In particular I think serializing
> omap rm would be bad for rgw.

Yeah, agree on these.  It'll be pretty easy to log noop items.

We could also put a return value in those noop log entries (or, perhaps, 
any log entry).  If I'm following correctly, that will allow the client to 
pipeline RW ops.  I'm not sure that's the best idea in the general case 
(e.g., if we expect the test to fail frequently), but an op flag 
could indicate whether we want to record/log test failures (allowing 
the client to pipeline) or whether the objecter will plug the request 
stream for the object to keep things correct.

Actually, I'm not really sure we'll want to cram all that into the 
Objecter--it'll mean a hash_map by object that checks that there 
aren't already in-flight rw ops on the given object, etc., which may have 
a significant performance impact, all to guard against a sequence 
very few clients will ever generate.

sage




* Re: rados semantic changes
  2015-05-11  2:31     ` Sage Weil
@ 2016-03-09 20:42       ` Sage Weil
  2016-03-09 21:26         ` Gregory Farnum
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2016-03-09 20:42 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Samuel Just, ceph-devel

Resurrecting an old thread.

I think we really want to make these semantic changes to current rados 
ops (like delete) to make life better going forward.  Ideally shortly 
after jewel so that they have plenty of time to bake before K and L.

I'm wondering if the way to make this change visible to users is to 
(finally) rev librados to librados3.  We can take the opportunity to make 
any other pending cleanups to the public API as well...

sage



On Sun, 10 May 2015, Sage Weil wrote:

> On Fri, 8 May 2015, Gregory Farnum wrote:
> > Do we have any tickets or something motivating these? I'm not quite
> > sure which of these problems are things you noticed, versus things
> > we've seen in the field, versus stuff that might make our lives easier
> > in the future.
> > 
> > That said, my votes so far are
> > On Fri, May 8, 2015 at 4:18 PM, Samuel Just <sjust@redhat.com> wrote:
> > > So, the problem is sequences of piplined operations like:
> > >
> > > (object does not exist)
> > > 1. [exists_guard, write]
> > > 2. [write]
> > >
> > > Currently, if a client sends 1. and then 2. without waiting for 1. to complete, the guarantee is that even in the event of peering or a client crash the only visible orderings are [], [1], and [1, 2].  However, if both operations complete (with 1. returning ENOENT) and are then replayed, we will see a result of [2, 1].  1. will be executed first, but since 2. already completed, exists_guard will not error out, and the write will succeed.  2. will then return immediately with success since pg log will contain an entry indicating that it already happened.  Delete seems to be a special case of this.
> > >
> > > For the more general problem, it seems like the objecter cannot pipeline rw ops with any kind of write (including other rw ops).  This means the objecter in the case above would hold 2. back until 1. completes and is not in danger of being re-sent.  We'll need some machinery in the objecter to handle this part.
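The anomaly Sam describes can be condensed into a short simulation (a sketch only -- the `PG`/`pg_log` model below is a deliberately simplified, hypothetical stand-in for the real OSD code, which today logs only successful writes for dup detection):

```python
ENOENT = -2

class PG:
    """Toy placement group: executes op vectors, dedups replays via a log."""
    def __init__(self):
        self.objects = {}        # object name -> data
        self.pg_log = {}         # reqid -> result; only successes are logged

    def apply(self, reqid, obj, ops):
        if reqid in self.pg_log:              # replay of a logged op: cached result
            return self.pg_log[reqid]
        for op, arg in ops:
            if op == "assert_exists" and obj not in self.objects:
                return ENOENT                 # failed op leaves *no* log entry
            if op == "write":
                self.objects[obj] = arg
        self.pg_log[reqid] = 0
        return 0

pg = PG()
# Client pipelines op 1 and op 2 on a nonexistent object:
r1 = pg.apply(1, "foo", [("assert_exists", None), ("write", "A")])   # -> ENOENT
r2 = pg.apply(2, "foo", [("write", "B")])                            # -> 0, creates foo
# Peering forces a resend of both, in order:
r1b = pg.apply(1, "foo", [("assert_exists", None), ("write", "A")])  # now succeeds!
r2b = pg.apply(2, "foo", [("write", "B")])                           # dedup: cached 0
print(r1, r2, r1b, r2b, pg.objects["foo"])    # -2 0 0 0 A -- visible order [2, 1]
```

The replayed op 1 re-executes because its failure was never logged, and by then op 2 has created the object, so the guard passes and "A" wins: exactly the forbidden [2, 1] ordering.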
> > 
> > Although I think you mean it can't pipeline rw ops on the same object,
> > that still seems unpleasant -- especially for our more annoying
> > operations that might touch multiple objects.
> > 
> > >
> > > For this to work, we need the pg log to record that an op marked as w has been processed regardless of the result.  We can do this for a particular op type marked as w by ensuring that it always succeeds, by writing a noop log entry recording the result, or by giving up and marking that op rw.
> > >
> > > It seems to me that:
> > > 1) delete should always succeed.  We'll have to record a noop log entry to ensure that it is not replayed out of turn.
> > 
> > yes
> > 
> > >   - Or we can leave delete as it is and mark it rw.
> > > 2) omap and xattr set operations implicitly create the object and therefore always succeed.
> > 
> > that's not current behavior? yes
> > 
> > > 3) omap remove operations are marked as rw so they can return ENOENT.
> > >   - Otherwise, an omap remove operation on a non-existing object would have to add a noop entry to the log or implicitly create the object -- the latter would be particularly odd.
> > >   - Or, we could record a noop log entry with the return code.
> > 
> > This one -- I don't think log entries are very expensive, and I don't
> > think we want to serialize omap ops. In particular I think serializing
> > omap rm would be bad for rgw.
> 
> Yeah, agree on these.  It'll be pretty easy to log noop items.
> 
> We could also put a return value in those noop log entries (or, perhaps, 
> any log entry).  If I'm following correctly, that will allow the client to 
> pipeline RW ops.  I'm not sure that's the best idea in the general case 
> (e.g., if we expect the test to fail frequently), but an op flag 
> could indicate whether we want to record/log test failures (allowing 
> the client to pipeline) or whether the objecter will plug the request 
> stream for the object to keep things correct.
> 
> Actually, I'm not really sure we'll want to cram all that into the 
> Objecter--it'll mean a hash_map by object that checks that there 
> aren't already in-flight rw ops on the given object, etc., which may have 
> a significant performance impact, all to guard against a sequence 
> very few clients will ever generate.
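The noop-entry idea quoted above (record a log entry, with its return code, even for a failed op) can be illustrated by adjusting the same kind of toy pg-log model (hypothetical names, not Ceph code): once failures are logged too, a replay is answered from the log and out-of-order re-execution disappears.

```python
ENOENT = -2

class PG:
    def __init__(self):
        self.objects = {}
        self.pg_log = {}                      # reqid -> result, for *every* op

    def apply(self, reqid, obj, ops):
        if reqid in self.pg_log:              # replay: answered from the log
            return self.pg_log[reqid]
        result = 0
        for op, arg in ops:
            if op == "assert_exists" and obj not in self.objects:
                result = ENOENT               # failure still gets a noop entry
                break
            if op == "write":
                self.objects[obj] = arg
        self.pg_log[reqid] = result
        return result

pg = PG()
r1 = pg.apply(1, "foo", [("assert_exists", None), ("write", "A")])   # ENOENT, logged
r2 = pg.apply(2, "foo", [("write", "B")])                            # success
r1b = pg.apply(1, "foo", [("assert_exists", None), ("write", "A")])  # cached ENOENT
print(r1, r2, r1b, pg.objects["foo"])         # -2 0 -2 B -- ordering stays [1, 2]
```

The cost, as noted, is that every failed rw op now implies a replicated metadata write for the log entry.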
> 
> sage
> 
> 
> > 
> > > 4) I don't think there are any current object classes which should be marked w, but they should be carefully audited.
> > >
> > > On the implementation side, we probably need new versions of the affected librados calls (omap_set2?, blind_delete?) since we are changing the return values and there may be older code which relies on this behavior.
> > 
> > But then they'd also be preserving broken behavior, right? Perhaps
> > it's time to do something interesting with our versioning.
> > 
> > >
> > > Write ordered reads also appear to be fundamentally broken in the current implementation for the same reason.  It seems like we'd have to handle that at the objecter level by marking the reads rw.
> > >
> > > We could also find a way to relax the ordering guarantees even further, but I fear that that would make it excessively difficult to reason about librados operation ordering.
> > >
> > > Thoughts?
> > > -Sam
> > >
> > > ----- Original Message -----
> > > From: "Sage Weil" <sweil@redhat.com>
> > > To: sjust@redhat.com, ceph-devel@vger.kernel.org
> > > Sent: Wednesday, May 6, 2015 10:47:11 AM
> > > Subject: rados semantic changes
> > >
> > > It sounds like we're kicking around two proposed changes:
> > >
> > > 1) A delete will never -ENOENT; a delete on a non-existent object is a
> > > success.  This is important for cache tiering, and allowing us to skip
> > > checking the base tier for certain client requests.
> > >
> > > 2) Any write will implicitly create the object.  If we want such a
> > > dependency (so we see ENOENT), the user can put an assert_exists in the op
> > > vector.
> > >
> > > I think what both of these amount to is that the writes have no read-side
> > > checks and will effectively never fail (except for true failure
> > > cases).  If the user wants some sort of failure, it will be explicit
> > > in the form of another read check in the op vector.
> > >
> > > Sam, is this what you're thinking?
> > >
> > > It's a subtle but real change in the semantics of rados ops, but I
> > > think now would be the time to make it...
> > >
> > > sage
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > >
> > 
> > 
> 
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: rados semantic changes
  2016-03-09 20:42       ` Sage Weil
@ 2016-03-09 21:26         ` Gregory Farnum
  2016-03-09 21:47           ` Sage Weil
  0 siblings, 1 reply; 11+ messages in thread
From: Gregory Farnum @ 2016-03-09 21:26 UTC (permalink / raw)
  To: Sage Weil; +Cc: Samuel Just, ceph-devel

On Wed, Mar 9, 2016 at 12:42 PM, Sage Weil <sage@newdream.net> wrote:
> Resurrecting an old thread.
>
> I think we really want to make these semantic changes to current rados
> ops (like delete) to make life better going forward.  Ideally shortly
> after jewel so that they have plenty of time to bake before K and L.
>
> I'm wondering if the way to make this change visible to users is to
> (finally) rev librados to librados3.  We can take the opportunity to make
> any other pending cleanups to the public API as well...

Yep. I presume you're thinking of this because of
http://tracker.ceph.com/issues/14468? It looks like we didn't really
have any good solutions for that pipelining problem though; any new
suggestions?
-Greg

>
> sage
>
>
>
> On Sun, 10 May 2015, Sage Weil wrote:
>
>> On Fri, 8 May 2015, Gregory Farnum wrote:
>> > Do we have any tickets or something motivating these? I'm not quite
>> > sure which of these problems are things you noticed, versus things
>> > we've seen in the field, versus stuff that might make our lives easier
>> > in the future.
>> >
>> > That said, my votes so far are
>> > On Fri, May 8, 2015 at 4:18 PM, Samuel Just <sjust@redhat.com> wrote:
> >> > > So, the problem is sequences of pipelined operations like:
>> > >
>> > > (object does not exist)
>> > > 1. [exists_guard, write]
>> > > 2. [write]
>> > >
>> > > Currently, if a client sends 1. and then 2. without waiting for 1. to complete, the guarantee is that even in the event of peering or a client crash the only visible orderings are [], [1], and [1, 2].  However, if both operations complete (with 1. returning ENOENT) and are then replayed, we will see a result of [2, 1].  1. will be executed first, but since 2. already completed, exists_guard will not error out, and the write will succeed.  2. will then return immediately with success since pg log will contain an entry indicating that it already happened.  Delete seems to be a special case of this.
>> > >
>> > > For the more general problem, it seems like the objecter cannot pipeline rw ops with any kind of write (including other rw ops).  This means the objecter in the case above would hold 2. back until 1. completes and is not in danger of being re-sent.  We'll need some machinery in the objecter to handle this part.
>> >
>> > Although I think you mean it can't pipeline rw ops on the same object,
> >> > that still seems unpleasant -- especially for our more annoying
>> > operations that might touch multiple objects.
>> >
>> > >
>> > > For this to work, we need the pg log to record that an op marked as w has been processed regardless of the result.  We can do this for a particular op type marked as w by ensuring that it always succeeds, by writing a noop log entry recording the result, or by giving up and marking that op rw.
>> > >
>> > > It seems to me that:
>> > > 1) delete should always succeed.  We'll have to record a noop log entry to ensure that it is not replayed out of turn.
>> >
>> > yes
>> >
>> > >   - Or we can leave delete as it is and mark it rw.
>> > > 2) omap and xattr set operations implicitly create the object and therefore always succeed.
>> >
>> > that's not current behavior? yes
>> >
>> > > 3) omap remove operations are marked as rw so they can return ENOENT.
>> > >   - Otherwise, an omap remove operation on a non-existing object would have to add a noop entry to the log or implicitly create the object -- the latter would be particularly odd.
>> > >   - Or, we could record a noop log entry with the return code.
>> >
> >> > This one -- I don't think log entries are very expensive, and I don't
>> > think we want to serialize omap ops. In particular I think serializing
>> > omap rm would be bad for rgw.
>>
>> Yeah, agree on these.  It'll be pretty easy to log noop items.
>>
>> We could also put a return value in those noop log entries (or, perhaps,
>> any log entry).  If I'm following correctly, that will allow the client to
>> pipeline RW ops.  I'm not sure that's the best idea in the general case
>> (e.g., if we expect the test to fail frequently), but an op flag
>> could indicate whether we want to record/log test failures (allowing
>> the client to pipeline) or whether the objecter will plug the request
>> stream for the object to keep things correct.
>>
>> Actually, I'm not really sure we'll want to cram all that into the
>> Objecter--it'll mean a hash_map by object that checks that there
>> aren't already in-flight rw ops on the given object, etc., which may have
>> a significant performance impact, all to guard against a sequence
>> very few clients will ever generate.
>>
>> sage
>>
>>
>> >
>> > > 4) I don't think there are any current object classes which should be marked w, but they should be carefully audited.
>> > >
>> > > On the implementation side, we probably need new versions of the affected librados calls (omap_set2?, blind_delete?) since we are changing the return values and there may be older code which relies on this behavior.
>> >
>> > But then they'd also be preserving broken behavior, right? Perhaps
>> > it's time to do something interesting with our versioning.
>> >
>> > >
>> > > Write ordered reads also appear to be fundamentally broken in the current implementation for the same reason.  It seems like we'd have to handle that at the objecter level by marking the reads rw.
>> > >
>> > > We could also find a way to relax the ordering guarantees even further, but I fear that that would make it excessively difficult to reason about librados operation ordering.
>> > >
>> > > Thoughts?
>> > > -Sam
>> > >
>> > > ----- Original Message -----
>> > > From: "Sage Weil" <sweil@redhat.com>
>> > > To: sjust@redhat.com, ceph-devel@vger.kernel.org
>> > > Sent: Wednesday, May 6, 2015 10:47:11 AM
>> > > Subject: rados semantic changes
>> > >
>> > > It sounds like we're kicking around two proposed changes:
>> > >
>> > > 1) A delete will never -ENOENT; a delete on a non-existent object is a
>> > > success.  This is important for cache tiering, and allowing us to skip
>> > > checking the base tier for certain client requests.
>> > >
>> > > 2) Any write will implicitly create the object.  If we want such a
>> > > dependency (so we see ENOENT), the user can put an assert_exists in the op
>> > > vector.
>> > >
>> > > I think what both of these amount to is that the writes have no read-side
>> > > checks and will effectively never fail (except for true failure
>> > > cases).  If the user wants some sort of failure, it will be explicit
>> > > in the form of another read check in the op vector.
>> > >
>> > > Sam, is this what you're thinking?
>> > >
>> > > It's a subtle but real change in the semantics of rados ops, but I
>> > > think now would be the time to make it...
>> > >
>> > > sage
>> > > --
>> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> > > the body of a message to majordomo@vger.kernel.org
>> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> > >
>> > > --
>> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> > > the body of a message to majordomo@vger.kernel.org
>> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >
>> >
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: rados semantic changes
  2016-03-09 21:26         ` Gregory Farnum
@ 2016-03-09 21:47           ` Sage Weil
  2016-03-09 21:56             ` Gregory Farnum
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2016-03-09 21:47 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Samuel Just, ceph-devel

On Wed, 9 Mar 2016, Gregory Farnum wrote:
> On Wed, Mar 9, 2016 at 12:42 PM, Sage Weil <sage@newdream.net> wrote:
> > Resurrecting an old thread.
> >
> > I think we really want to make these semantic changes to current rados
> > ops (like delete) to make life better going forward.  Ideally shortly
> > after jewel so that they have plenty of time to bake before K and L.
> >
> > I'm wondering if the way to make this change visible to users is to
> > (finally) rev librados to librados3.  We can take the opportunity to make
> > any other pending cleanups to the public API as well...
> 
> Yep. I presume you're thinking of this because of
> http://tracker.ceph.com/issues/14468? It looks like we didn't really
> have any good solutions for that pipelining problem though; any new
> suggestions?

Yeah, I'm still not very happy with either alternative:

1) We persistently record the reqid and return value in the pg log.  This 
turns failed rw ops into a replicated (metadata) write, which sort of 
sucks.  It also means that we probably *wouldn't* store any reply payload, 
which means we lose the ability to have a failure return useful data 
(e.g., info about why it failed).

2) The objecter prevents rw ops from being pipelined.  This means a hash 
table in the objecter so that it transparently blocks subsequent requests 
to the same object.  Or,

3) librados users are expected to avoid pipelining.  We'd document it.  
They'd inevitably get it wrong and have very rare and hard to track down 
failures.

I guess I lean toward #2.  That's a bit different than what we were 
thinking a year ago on this thread...
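For concreteness, #2 might look something like the sketch below on the client side (invented names, and only the happy path -- the real Objecter would also have to cope with resends, linger ops, and cancellation): rw ops plug the per-object request stream until they complete and can no longer be replayed.

```python
import collections

class Objecter:
    """Toy model: hold back ops to an object while an rw op is in flight."""
    def __init__(self, submit):
        self.submit = submit                       # sends an op to the OSD
        self.blocked = collections.defaultdict(collections.deque)
        self.rw_in_flight = set()                  # objects with a pending rw op

    def op_submit(self, obj, op, is_rw):
        if obj in self.rw_in_flight:               # plug: queue behind the rw op
            self.blocked[obj].append((op, is_rw))
            return
        if is_rw:
            self.rw_in_flight.add(obj)
        self.submit(obj, op)

    def op_complete(self, obj):                    # called on final commit/ack
        self.rw_in_flight.discard(obj)
        while self.blocked[obj] and obj not in self.rw_in_flight:
            op, is_rw = self.blocked[obj].popleft()
            if is_rw:
                self.rw_in_flight.add(obj)
            self.submit(obj, op)

sent = []
o = Objecter(lambda obj, op: sent.append(op))
o.op_submit("foo", "op1-exists+write", is_rw=True)
o.op_submit("foo", "op2-write", is_rw=False)       # held back behind op 1
assert sent == ["op1-exists+write"]
o.op_complete("foo")                               # op 1 done; op 2 released
assert sent == ["op1-exists+write", "op2-write"]
```

The per-object map and extra bookkeeping on every submit is exactly the performance concern raised earlier in the thread.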

sage


> -Greg
> 
> >
> > sage
> >
> >
> >
> > On Sun, 10 May 2015, Sage Weil wrote:
> >
> >> On Fri, 8 May 2015, Gregory Farnum wrote:
> >> > Do we have any tickets or something motivating these? I'm not quite
> >> > sure which of these problems are things you noticed, versus things
> >> > we've seen in the field, versus stuff that might make our lives easier
> >> > in the future.
> >> >
> >> > That said, my votes so far are
> >> > On Fri, May 8, 2015 at 4:18 PM, Samuel Just <sjust@redhat.com> wrote:
> >> > > So, the problem is sequences of pipelined operations like:
> >> > >
> >> > > (object does not exist)
> >> > > 1. [exists_guard, write]
> >> > > 2. [write]
> >> > >
> >> > > Currently, if a client sends 1. and then 2. without waiting for 1. to complete, the guarantee is that even in the event of peering or a client crash the only visible orderings are [], [1], and [1, 2].  However, if both operations complete (with 1. returning ENOENT) and are then replayed, we will see a result of [2, 1].  1. will be executed first, but since 2. already completed, exists_guard will not error out, and the write will succeed.  2. will then return immediately with success since pg log will contain an entry indicating that it already happened.  Delete seems to be a special case of this.
> >> > >
> >> > > For the more general problem, it seems like the objecter cannot pipeline rw ops with any kind of write (including other rw ops).  This means the objecter in the case above would hold 2. back until 1. completes and is not in danger of being re-sent.  We'll need some machinery in the objecter to handle this part.
> >> >
> >> > Although I think you mean it can't pipeline rw ops on the same object,
> >> > that still seems unpleasant -- especially for our more annoying
> >> > operations that might touch multiple objects.
> >> >
> >> > >
> >> > > For this to work, we need the pg log to record that an op marked as w has been processed regardless of the result.  We can do this for a particular op type marked as w by ensuring that it always succeeds, by writing a noop log entry recording the result, or by giving up and marking that op rw.
> >> > >
> >> > > It seems to me that:
> >> > > 1) delete should always succeed.  We'll have to record a noop log entry to ensure that it is not replayed out of turn.
> >> >
> >> > yes
> >> >
> >> > >   - Or we can leave delete as it is and mark it rw.
> >> > > 2) omap and xattr set operations implicitly create the object and therefore always succeed.
> >> >
> >> > that's not current behavior? yes
> >> >
> >> > > 3) omap remove operations are marked as rw so they can return ENOENT.
> >> > >   - Otherwise, an omap remove operation on a non-existing object would have to add a noop entry to the log or implicitly create the object -- the latter would be particularly odd.
> >> > >   - Or, we could record a noop log entry with the return code.
> >> >
> >> > This one -- I don't think log entries are very expensive, and I don't
> >> > think we want to serialize omap ops. In particular I think serializing
> >> > omap rm would be bad for rgw.
> >>
> >> Yeah, agree on these.  It'll be pretty easy to log noop items.
> >>
> >> We could also put a return value in those noop log entries (or, perhaps,
> >> any log entry).  If I'm following correctly, that will allow the client to
> >> pipeline RW ops.  I'm not sure that's the best idea in the general case
> >> (e.g., if we expect the test to fail frequently), but an op flag
> >> could indicate whether we want to record/log test failures (allowing
> >> the client to pipeline) or whether the objecter will plug the request
> >> stream for the object to keep things correct.
> >>
> >> Actually, I'm not really sure we'll want to cram all that into the
> >> Objecter--it'll mean a hash_map by object that checks that there
> >> aren't already in-flight rw ops on the given object, etc., which may have
> >> a significant performance impact, all to guard against a sequence
> >> very few clients will ever generate.
> >>
> >> sage
> >>
> >>
> >> >
> >> > > 4) I don't think there are any current object classes which should be marked w, but they should be carefully audited.
> >> > >
> >> > > On the implementation side, we probably need new versions of the affected librados calls (omap_set2?, blind_delete?) since we are changing the return values and there may be older code which relies on this behavior.
> >> >
> >> > But then they'd also be preserving broken behavior, right? Perhaps
> >> > it's time to do something interesting with our versioning.
> >> >
> >> > >
> >> > > Write ordered reads also appear to be fundamentally broken in the current implementation for the same reason.  It seems like we'd have to handle that at the objecter level by marking the reads rw.
> >> > >
> >> > > We could also find a way to relax the ordering guarantees even further, but I fear that that would make it excessively difficult to reason about librados operation ordering.
> >> > >
> >> > > Thoughts?
> >> > > -Sam
> >> > >
> >> > > ----- Original Message -----
> >> > > From: "Sage Weil" <sweil@redhat.com>
> >> > > To: sjust@redhat.com, ceph-devel@vger.kernel.org
> >> > > Sent: Wednesday, May 6, 2015 10:47:11 AM
> >> > > Subject: rados semantic changes
> >> > >
> >> > > It sounds like we're kicking around two proposed changes:
> >> > >
> >> > > 1) A delete will never -ENOENT; a delete on a non-existent object is a
> >> > > success.  This is important for cache tiering, and allowing us to skip
> >> > > checking the base tier for certain client requests.
> >> > >
> >> > > 2) Any write will implicitly create the object.  If we want such a
> >> > > dependency (so we see ENOENT), the user can put an assert_exists in the op
> >> > > vector.
> >> > >
> >> > > I think what both of these amount to is that the writes have no read-side
> >> > > checks and will effectively never fail (except for true failure
> >> > > cases).  If the user wants some sort of failure, it will be explicit
> >> > > in the form of another read check in the op vector.
> >> > >
> >> > > Sam, is this what you're thinking?
> >> > >
> >> > > It's a subtle but real change in the semantics of rados ops, but I
> >> > > think now would be the time to make it...
> >> > >
> >> > > sage
> >> >
> >> >
> >>
> >>
> 
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: rados semantic changes
  2016-03-09 21:47           ` Sage Weil
@ 2016-03-09 21:56             ` Gregory Farnum
  2016-03-09 22:00               ` Yehuda Sadeh-Weinraub
  0 siblings, 1 reply; 11+ messages in thread
From: Gregory Farnum @ 2016-03-09 21:56 UTC (permalink / raw)
  To: Sage Weil; +Cc: Samuel Just, ceph-devel

On Wed, Mar 9, 2016 at 1:47 PM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 9 Mar 2016, Gregory Farnum wrote:
>> On Wed, Mar 9, 2016 at 12:42 PM, Sage Weil <sage@newdream.net> wrote:
>> > Resurrecting an old thread.
>> >
>> > I think we really want to make these semantic changes to current rados
>> > ops (like delete) to make life better going forward.  Ideally shortly
>> > after jewel so that they have plenty of time to bake before K and L.
>> >
>> > I'm wondering if the way to make this change visible to users is to
>> > (finally) rev librados to librados3.  We can take the opportunity to make
>> > any other pending cleanups to the public API as well...
>>
>> Yep. I presume you're thinking of this because of
>> http://tracker.ceph.com/issues/14468? It looks like we didn't really
>> have any good solutions for that pipelining problem though; any new
>> suggestions?
>
> Yeah, I'm still not very happy with either alternative:
>
> 1) We persistently record the reqid and return value in the pg log.  This
> turns failed rw ops into a replicated (metadata) write, which sort of
> sucks.  It also means that we probably *wouldn't* store any reply payload,
> which means we lose the ability to have a failure return useful data
> (e.g., info about why it failed).

This inability to return data on writes has pretty persistently sucked
for us... I wonder if we should be attacking it from that direction
instead. We just don't want pglog entries to get that large and are
worried about being able to reproduce the data on replay, right?
Perhaps we could add some kind of limited-size lookaside thing. Given
that RW ops *are* a write on success (whatever "success" means in the
op's context) I'm not so concerned about turning them into writes even
if they would have been a read. The other option is #2, which as you
note might have some serious performance implications on the client
side. :/
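A limited-size lookaside could be as simple as an LRU keyed by reqid (a rough sketch with invented names, not existing Ceph code): cache recent write results with their reply payloads, bounding the memory the pg log would otherwise have to carry; once an entry is evicted, a replay falls back to log-only dup detection with no payload.

```python
from collections import OrderedDict

class ReplyLookaside:
    def __init__(self, max_entries=128):
        self.max_entries = max_entries
        self.cache = OrderedDict()             # reqid -> (result, payload)

    def record(self, reqid, result, payload):
        self.cache[reqid] = (result, payload)
        self.cache.move_to_end(reqid)          # most recently recorded
        while len(self.cache) > self.max_entries:
            self.cache.popitem(last=False)     # evict the oldest entry

    def lookup(self, reqid):
        return self.cache.get(reqid)           # None once evicted: replay gets
                                               # dup detection but no payload

ls = ReplyLookaside(max_entries=2)
ls.record(1, 0, b"v1")
ls.record(2, 0, b"v2")
ls.record(3, 0, b"v3")                         # evicts reqid 1
print(ls.lookup(1), ls.lookup(3))              # None (0, b'v3')
```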
-Greg

>
> 2) The objecter prevents rw ops from being pipelined.  This means a hash
> table in the objecter so that it transparently blocks subsequent requests
> to the same object.  Or,

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: rados semantic changes
  2016-03-09 21:56             ` Gregory Farnum
@ 2016-03-09 22:00               ` Yehuda Sadeh-Weinraub
  0 siblings, 0 replies; 11+ messages in thread
From: Yehuda Sadeh-Weinraub @ 2016-03-09 22:00 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, Samuel Just, ceph-devel

On Wed, Mar 9, 2016 at 1:56 PM, Gregory Farnum <gfarnum@redhat.com> wrote:
> On Wed, Mar 9, 2016 at 1:47 PM, Sage Weil <sage@newdream.net> wrote:
>> On Wed, 9 Mar 2016, Gregory Farnum wrote:
>>> On Wed, Mar 9, 2016 at 12:42 PM, Sage Weil <sage@newdream.net> wrote:
>>> > Resurrecting an old thread.
>>> >
>>> > I think we really want to make these semantic changes to current rados
>>> > ops (like delete) to make life better going forward.  Ideally shortly
>>> > after jewel so that they have plenty of time to bake before K and L.
>>> >
>>> > I'm wondering if the way to make this change visible to users is to
>>> > (finally) rev librados to librados3.  We can take the opportunity to make
>>> > any other pending cleanups to the public API as well...
>>>
>>> Yep. I presume you're thinking of this because of
>>> http://tracker.ceph.com/issues/14468? It looks like we didn't really
>>> have any good solutions for that pipelining problem though; any new
>>> suggestions?
>>
>> Yeah, I'm still not very happy with either alternative:
>>
>> 1) We persistently record the reqid and return value in the pg log.  This
>> turns failed rw ops into a replicated (metadata) write, which sort of
>> sucks.  It also means that we probably *wouldn't* store any reply payload,
>> which means we lose the ability to have a failure return useful data
>> (e.g., info about why it failed).
>
> This inability to return data on writes has pretty persistently sucked
> for us... I wonder if we should be attacking it from that direction
> instead. We just don't want pglog entries to get that large and are
> worried about being able to reproduce the data on replay, right?
> Perhaps we could add some kind of limited-size lookaside thing. Given
> that RW ops *are* a write on success (whatever "success" means in the
> op's context) I'm not so concerned about turning them into writes even
> if they would have been a read. The other option is #2, which as you
> note might have some serious performance implications on the client
> side. :/

+1

Both in terms of usefulness for client applications, and in
performance implications. Maybe we can relax the guarantees of the
returned data on writes, so that it doesn't become an issue?

Yehuda

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2016-03-09 22:00 UTC | newest]

Thread overview: 11+ messages
-- links below jump to the message on this page --
2015-05-06 17:47 rados semantic changes Sage Weil
2015-05-08 23:18 ` Samuel Just
2015-05-09  0:16   ` Gregory Farnum
2015-05-09  0:53     ` Samuel Just
2015-05-09  4:20       ` Samuel Just
2015-05-11  2:31     ` Sage Weil
2016-03-09 20:42       ` Sage Weil
2016-03-09 21:26         ` Gregory Farnum
2016-03-09 21:47           ` Sage Weil
2016-03-09 21:56             ` Gregory Farnum
2016-03-09 22:00               ` Yehuda Sadeh-Weinraub
