* radosgw - stuck ops
From: GuangYang @ 2015-08-04  1:53 UTC (permalink / raw)
  To: Sadeh-Weinraub Yehuda, ceph-devel

Hi Yehuda,
Recently, on our pre-production clusters (running radosgw), we had an outage in which all radosgw worker threads got stuck and every client request resulted in a 500, because no worker thread was available to handle it.

What we observed on the cluster is that one PG was stuck in the *peering* state; as a result, every request hitting that PG occupied a worker thread indefinitely, and that gradually stuck all the workers.

The reason the PG got stuck in peering is still under investigation, but on the radosgw side I am wondering whether there is anything we can pursue to improve this case (more specifically, to keep an issue with 1 out of 8192 PGs from cascading into service unavailability across the entire cluster):

1. The first approach I can think of is to add a timeout at the objecter layer for each OP sent to an OSD. I think the complexity comes with WRITEs, that is, how we guarantee integrity if we abort at the objecter layer. But for immutable ops I think we can certainly do this, since at the upper layer we have already replied to the client with an error. (A rough sketch of such a timeout follows after this list.)
2. Shard the thread pool/work queue at radosgw, in which case a partial failure would (hopefully) affect only a subset of the worker threads and cause only a partial outage.
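To make option 1 concrete for the immutable case only, here is a minimal sketch (not rgw code): librados can already be asked to cancel OSD ops at the objecter layer via the rados_osd_op_timeout option, assuming a release that exposes it, and a timed-out read then surfaces as an error (typically -ETIMEDOUT) instead of blocking a worker forever. The pool name and object name below are just placeholders:

  // Sketch: objecter-level timeout for an immutable op (read); not rgw code.
  #include <rados/librados.hpp>
  #include <cerrno>
  #include <iostream>

  int main() {
    librados::Rados cluster;
    cluster.init("admin");                            // connect as client.admin
    cluster.conf_read_file(NULL);                     // default ceph.conf search path
    cluster.conf_set("rados_osd_op_timeout", "30");   // cancel OSD ops stuck > 30s
    if (cluster.connect() < 0)
      return 1;

    librados::IoCtx ioctx;
    cluster.ioctx_create(".rgw.buckets", ioctx);      // placeholder pool name

    librados::bufferlist bl;
    int r = ioctx.read("some_object", bl, 0, 0);      // immutable op: whole-object read
    if (r == -ETIMEDOUT)
      std::cerr << "read timed out; map this to an error toward the client" << std::endl;

    ioctx.close();
    cluster.shutdown();
    return 0;
  }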

What do you think?

Thanks,
Guang


* Re: radosgw - stuck ops
From: Samuel Just @ 2015-08-04 16:42 UTC (permalink / raw)
  To: Yehuda Sadeh-Weinraub
  Cc: GuangYang, Weil, Sage, Sadeh-Weinraub Yehuda, ceph-devel

What if instead the request had a marker that would cause the OSD to
reply with EAGAIN if the pg is unhealthy?
-Sam
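
To make the suggestion concrete, here is a rough, hypothetical sketch of the OSD-side check; the flag name, the stub types and the behaviour are invented for illustration and do not correspond to anything in the current code:

  // Hypothetical sketch only -- the flag and types below do not exist today.
  #include <cerrno>

  struct OpRequest { unsigned flags = 0; };                            // stub
  struct PG { bool active = false; bool is_active() const { return active; } };

  constexpr unsigned FLAG_FAIL_IF_INACTIVE = 1u << 0;                  // made-up op flag

  // Returns the error to send back immediately, or 0 to queue the op as usual.
  int handle_client_op(const OpRequest& op, const PG& pg) {
    if ((op.flags & FLAG_FAIL_IF_INACTIVE) && !pg.is_active())
      return -EAGAIN;   // the op was *not* executed; the client can retry or give up
    // ... normal path: queue the op on the PG and wait, as the OSD does today
    return 0;
  }

The appeal of replying before the op is queued is that the client knows the op never executed, which sidesteps the inconsistency concerns raised below about plain timeouts.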

On Tue, Aug 4, 2015 at 8:41 AM, Yehuda Sadeh-Weinraub
<ysadehwe@redhat.com> wrote:
>
>
> On Mon, Aug 3, 2015 at 6:53 PM, GuangYang <yguang11@outlook.com> wrote:
>>
>> Hi Yehuda,
>> Recently with our pre-production clusters (with radosgw), we had an outage
>> that all radosgw worker threads got stuck and all clients request resulted
>> in 500 because that there is no worker thread taking care of them.
>>
>> What we observed from the cluster, is that there was a PG stuck at
>> *peering* state, as a result, all requests hitting that PG would occupy a
>> worker thread infinitely and that gradually stuck all workers.
>>
>> The reason why the PG stuck at peering is still under investigation, but
>> radosgw side, I am wondering if we can pursue anything to improve such use
>> case (to be more specific, 1 out of 8192 PGs' issue cascading to a service
>> unavailable across the entire cluster):
>>
>> 1. The first approach I can think of is to add timeout at objecter layer
>> for each OP to OSD, I think the complexity comes with WRITE, that is, how do
>> we make sure the integrity if we abort at objecter layer. But for immutable
>> op, I think we certainly can do this, since at an upper layer, we already
>> reply back to client with an error.
>> 2. Do thread pool/working queue sharding  at radosgw, in which case,
>> partial failure would (hopefully) only impact partial of worker threads and
>> only cause a partial outage.
>>
>
> The problem with timeouts is that they are racy and can bring the system
> into inconsistent state. For example, an operation takes too long, rgw gets
> a timeout, but the operation actually completes on the osd. So rgw returns
> with an error, removes the tail and does not complete the write, whereas in
> practice the new head was already written and points at the newly removed
> tail. The index would still show as if the old version of the object was
> still there. I'm sure we can come up with some more scenarios that I'm not
> sure we could resolve easily.
> The problem with sharding is that for large enough objects they could end up
> writing to any pg, so I'm not sure how effective that would be.
> One solution that I can think of is to determine before the read/write
> whether the pg we're about to access is healthy (or has been unhealthy for a
> short period of time), and if not to cancel the request before sending the
> operation. This could mitigate the problem you're seeing at the expense of
> availability in some cases. We'd need to have a way to query pg health
> through librados which we don't have right now afaik.
> Sage / Sam, does that make sense, and/or possible?
>
> Yehuda
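
There is no per-PG health query in librados today, but as a rough illustration of the kind of check being discussed, a client can get a coarse, cluster-wide approximation ("are any PGs currently not active?") from a monitor command. This is a sketch with deliberately crude string matching, not something rgw would ship as-is:

  // Sketch: approximate "is any PG unhealthy?" via a mon command from librados.
  #include <rados/librados.hpp>
  #include <iostream>
  #include <string>

  int main() {
    librados::Rados cluster;
    cluster.init("admin");
    cluster.conf_read_file(NULL);
    if (cluster.connect() < 0)
      return 1;

    librados::bufferlist inbl, outbl;
    std::string outs;
    // "pg stat" returns a one-line summary such as
    // "... 8192 pgs: 8191 active+clean, 1 peering; ..."
    int r = cluster.mon_command("{\"prefix\": \"pg stat\"}", inbl, &outbl, &outs);
    if (r < 0) {
      std::cerr << "mon_command failed: " << r << std::endl;
    } else {
      std::string summary = outs;                       // status string, if any
      if (outbl.length())
        summary.append(outbl.c_str(), outbl.length());  // command output, if any
      bool unhealthy = summary.find("peering") != std::string::npos ||
                       summary.find("inactive") != std::string::npos;  // crude heuristic
      std::cout << summary << "\nunhealthy PGs present: " << unhealthy << std::endl;
    }
    cluster.shutdown();
    return 0;
  }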


* RE: radosgw - stuck ops
From: GuangYang @ 2015-08-04 16:48 UTC (permalink / raw)
  To: Yehuda Sadeh-Weinraub, Weil Sage, Just, Samuel
  Cc: Sadeh-Weinraub Yehuda, ceph-devel

Hi Yehuda,
Thanks for the quick response. My comments are inline.

Thanks,
Guang
________________________________
> Date: Tue, 4 Aug 2015 08:41:26 -0700 
> Subject: Re: radosgw - stuck ops 
> From: ysadehwe@redhat.com 
> To: yguang11@outlook.com; sweil@redhat.com; sjust@redhat.com 
> CC: yehuda@redhat.com; ceph-devel@vger.kernel.org 
> 
> 
> 
> On Mon, Aug 3, 2015 at 6:53 PM, GuangYang 
> <yguang11@outlook.com<mailto:yguang11@outlook.com>> wrote: 
> Hi Yehuda, 
> Recently with our pre-production clusters (with radosgw), we had an 
> outage that all radosgw worker threads got stuck and all clients 
> request resulted in 500 because that there is no worker thread taking 
> care of them. 
> 
> What we observed from the cluster, is that there was a PG stuck at 
> *peering* state, as a result, all requests hitting that PG would occupy 
> a worker thread infinitely and that gradually stuck all workers. 
> 
> The reason why the PG stuck at peering is still under investigation, 
> but radosgw side, I am wondering if we can pursue anything to improve 
> such use case (to be more specific, 1 out of 8192 PGs' issue cascading 
> to a service unavailable across the entire cluster): 
> 
> 1. The first approach I can think of is to add timeout at objecter 
> layer for each OP to OSD, I think the complexity comes with WRITE, that 
> is, how do we make sure the integrity if we abort at objecter layer. 
> But for immutable op, I think we certainly can do this, since at an 
> upper layer, we already reply back to client with an error. 
> 2. Do thread pool/working queue sharding at radosgw, in which case, 
> partial failure would (hopefully) only impact partial of worker threads 
> and only cause a partial outage. 
> 
> 
> The problem with timeouts is that they are racy and can bring the 
> system into inconsistent state. For example, an operation takes too 
> long, rgw gets a timeout, but the operation actually completes on the 
> osd. So rgw returns with an error, removes the tail and does not 
> complete the write, whereas in practice the new head was already 
> written and points at the newly removed tail. The index would still 
> show as if the old version of the object was still there. I'm sure we 
> can come up with some more scenarios that I'm not sure we could resolve 
> easily. 
Right, that is my concern as well. We will need to come up with a mechanism to
preserve integrity: each write should be all or nothing, never partial, even though
we have already replied to the client with a 500 error.
But that is a problem we probably need to deal with anyway; for example, in our
cluster, each time we detect this kind of availability issue we have to restart
all radosgw daemons to bring the service back, which can also leave some
inconsistent state behind.
I am thinking it might make sense to start with *immutable* requests, for example
bucket listing and object GET/HEAD. We can time those out as long as we also time
out toward the client. That should be much easier to implement and would solve part of the problem.
> The problem with sharding is that for large enough objects they could 
> end up writing to any pg, so I'm not sure how effective that would be.
Not sure about other radosgw use cases across the community, but for us, at the
moment, the 95th percentile of objects is stored as a single chunk, so sharding
should be effective for this kind of workload. That said, we should consider
supporting more general use cases; as a bottom line, it should not make things worse.
> One solution that I can think of is to determine before the read/write 
> whether the pg we're about to access is healthy (or has been unhealthy 
> for a short period of time), and if not to cancel the request before 
> sending the operation. This could mitigate the problem you're seeing at 
> the expense of availability in some cases. We'd need to have a way to 
> query pg health through librados which we don't have right now afaik. 
That sounds good. The only complexity I can think of is for large objects that
have several chunks: we will need to deal with the write issue as well, since each
chunk may be assigned to a different PG.
> Sage / Sam, does that make sense, and/or possible? 
> 
> Yehuda 


* Re: radosgw - stuck ops
From: Sage Weil @ 2015-08-04 16:55 UTC (permalink / raw)
  To: Yehuda Sadeh-Weinraub
  Cc: GuangYang, Just, Samuel, Sadeh-Weinraub Yehuda, ceph-devel


On Tue, 4 Aug 2015, Yehuda Sadeh-Weinraub wrote:
> On Mon, Aug 3, 2015 at 6:53 PM, GuangYang <yguang11@outlook.com> wrote:
>       Hi Yehuda,
>       Recently with our pre-production clusters (with radosgw), we had
>       an outage that all radosgw worker threads got stuck and all
>       clients request resulted in 500 because that there is no worker
>       thread taking care of them.
> 
>       What we observed from the cluster, is that there was a PG stuck
>       at *peering* state, as a result, all requests hitting that PG
>       would occupy a worker thread infinitely and that gradually stuck
>       all workers.
> 
>       The reason why the PG stuck at peering is still under
>       investigation, but radosgw side, I am wondering if we can pursue
>       anything to improve such use case (to be more specific, 1 out of
>       8192 PGs' issue cascading to a service unavailable across the
>       entire cluster):
> 
>       1. The first approach I can think of is to add timeout at
>       objecter layer for each OP to OSD, I think the complexity comes
>       with WRITE, that is, how do we make sure the integrity if we
>       abort at objecter layer. But for immutable op, I think we
>       certainly can do this, since at an upper layer, we already reply
>       back to client with an error.
>       2. Do thread pool/working queue sharding  at radosgw, in which
>       case, partial failure would (hopefully) only impact partial of
>       worker threads and only cause a partial outage.
> 
> 
> The problem with timeouts is that they are racy and can bring the system
> into inconsistent state. For example, an operation takes too long, rgw gets
> a timeout, but the operation actually completes on the osd. So rgw returns
> with an error, removes the tail and does not complete the write, whereas in
> practice the new head was already written and points at the newly removed
> tail. The index would still show as if the old version of the object was
> still there. I'm sure we can come up with some more scenarios that I'm not
> sure we could resolve easily.

Yeah, unless the entire request goes boom when we time out.  In 
that case, it'd look like a radosgw failure (and the head wouldn't get 
removed, etc.).

This could trivially be done by just setting the suicide timeouts on the 
rgw work queue, but in practice I think that just means all the requests 
will fail (even ones that were making progress at the time) *or* all of 
them will get retried and the list of 'hung' requests will continue to 
pile up (unless the original clients disconnect and the LB/proxy/whatever 
stops sending them to rgw?).
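
For reference, a minimal ceph.conf sketch of "setting the suicide timeouts on the rgw work queue", assuming the rgw op thread options available in this release; a non-zero suicide timeout makes the whole daemon assert out once any worker has been stuck past the threshold, which matches the blunt all-requests-fail behaviour described above. Section name and values are illustrative only:

  [client.radosgw.gateway]                 # example instance name
      rgw thread pool size          = 100
      rgw op thread timeout         = 600   # mark a worker unhealthy after 10 minutes
      rgw op thread suicide timeout = 900   # 0 disables; >0 aborts the whole process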

> The problem with sharding is that for large enough objects they could end up
> writing to any pg, so I'm not sure how effective that would be.

Yeah.

In practice, though, the hung threads aren't consuming any CPU... they're 
just blocked.  I wonder if rgw could go into a mode where it counts idle 
vs. progressing threads and expands the work queue so that work still gets 
done.  Then, once it hits some threshold, it realizes there's a backend 
hang, drains the requests that are making progress, and does an orderly restart.
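
A very rough sketch of that watchdog logic, with every name hypothetical (this is not existing rgw code): count the workers that have been blocked on a single request for too long, grow the pool while some of them still make progress, and trigger a drain-and-restart once (almost) all of them are stuck:

  // Hypothetical watchdog sketch -- not actual rgw code.
  #include <atomic>
  #include <cstdio>

  struct WorkerStats {
    std::atomic<int> total{0};
    std::atomic<int> blocked{0};   // stuck on one request longer than some threshold
  };

  enum class Action { None, GrowPool, DrainAndRestart };

  Action watchdog_tick(const WorkerStats& s, double restart_ratio = 0.9) {
    int total = s.total.load(), blocked = s.blocked.load();
    if (total == 0 || blocked == 0)
      return Action::None;
    if (blocked >= static_cast<int>(total * restart_ratio))
      return Action::DrainAndRestart;   // backend hang: drain live requests, restart
    return Action::GrowPool;            // a few stuck workers: add threads for now
  }

  int main() {
    WorkerStats s;
    s.total = 32;
    s.blocked = 30;
    std::printf("action=%d\n", static_cast<int>(watchdog_tick(s)));
    return 0;
  }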

Ideally, we'd have a way to 'kill' a single request process, in which case 
we could time it out and return the appropriate HTTP code to the 
front-end, but in lieu of that... :/

> One solution that I can think of is to determine before the read/write
> whether the pg we're about to access is healthy (or has been unhealthy for a
> short period of time), and if not to cancel the request before sending the
> operation. This could mitigate the problem you're seeing at the expense of
> availability in some cases. We'd need to have a way to query pg health
> through librados which we don't have right now afaik.
> Sage / Sam, does that make sense, and/or possible?

This seems mostly impossible because we don't know ahead of time which 
PG(s) a request is going to touch (it'll generally be a lot of them)?

sage


* Re: radosgw - stuck ops
From: Yehuda Sadeh-Weinraub @ 2015-08-04 16:58 UTC (permalink / raw)
  To: Sage Weil; +Cc: GuangYang, Just, Samuel, Sadeh-Weinraub Yehuda, ceph-devel

On Tue, Aug 4, 2015 at 9:55 AM, Sage Weil <sweil@redhat.com> wrote:
> On Tue, 4 Aug 2015, Yehuda Sadeh-Weinraub wrote:
>> On Mon, Aug 3, 2015 at 6:53 PM, GuangYang <yguang11@outlook.com> wrote:
>>       Hi Yehuda,
>>       Recently with our pre-production clusters (with radosgw), we had
>>       an outage that all radosgw worker threads got stuck and all
>>       clients request resulted in 500 because that there is no worker
>>       thread taking care of them.
>>
>>       What we observed from the cluster, is that there was a PG stuck
>>       at *peering* state, as a result, all requests hitting that PG
>>       would occupy a worker thread infinitely and that gradually stuck
>>       all workers.
>>
>>       The reason why the PG stuck at peering is still under
>>       investigation, but radosgw side, I am wondering if we can pursue
>>       anything to improve such use case (to be more specific, 1 out of
>>       8192 PGs' issue cascading to a service unavailable across the
>>       entire cluster):
>>
>>       1. The first approach I can think of is to add timeout at
>>       objecter layer for each OP to OSD, I think the complexity comes
>>       with WRITE, that is, how do we make sure the integrity if we
>>       abort at objecter layer. But for immutable op, I think we
>>       certainly can do this, since at an upper layer, we already reply
>>       back to client with an error.
>>       2. Do thread pool/working queue sharding  at radosgw, in which
>>       case, partial failure would (hopefully) only impact partial of
>>       worker threads and only cause a partial outage.
>>
>>
>> The problem with timeouts is that they are racy and can bring the system
>> into inconsistent state. For example, an operation takes too long, rgw gets
>> a timeout, but the operation actually completes on the osd. So rgw returns
>> with an error, removes the tail and does not complete the write, whereas in
>> practice the new head was already written and points at the newly removed
>> tail. The index would still show as if the old version of the object was
>> still there. I'm sure we can come up with some more scenarios that I'm not
>> sure we could resolve easily.
>
> Yeah, unless the entire request goes boom when we time out.  In
> that case, it'd look like a radosgw failure (and the head wouldn't get
> removed, etc.).
>
> This could trivially be done by just setting the suicide timeouts on the
> rgw work queue, but in practice I think that just means all the requests
> will fail (even ones that were making progress at the time) *or* all of
> them will get retried and the list of 'hung' requests will continue to
> pile up (unless the original clients disconnect and the LB/proxy/wahtever
> stops sending them to rgw?).
>
>> The problem with sharding is that for large enough objects they could end up
>> writing to any pg, so I'm not sure how effective that would be.
>
> Yeah.
>
> In practice, though, the hung threads aren't consuming any CPU.. they're
> just blocked.  I wonder if rgw could go into a mode where it counts idle
> vs progressing threads, and expands the work queue so that work still gets
> done.  Then/or once it hits some threshold it realizes there's a backend
> hang, it drains requests making progress, and does an orderly restart.
>
> Ideally, we'd have a way to 'kill' a single request process, in which case
> we could time it out and return the appropriate HTTP code to the
> front-end, but in lieu of that... :/
>
>> One solution that I can think of is to determine before the read/write
>> whether the pg we're about to access is healthy (or has been unhealthy for a
>> short period of time), and if not to cancel the request before sending the
>> operation. This could mitigate the problem you're seeing at the expense of
>> availability in some cases. We'd need to have a way to query pg health
>> through librados which we don't have right now afaik.
>> Sage / Sam, does that make sense, and/or possible?
>
> This seems mostly impossible because we don't know ahead of time which
> PG(s) a request is going to touch (it'll generally be a lot of them)?
>

Barring pgls() and such, each rados request that radosgw produces will
only touch a single pg, right?

Yehuda


* Re: radosgw - stuck ops
From: Sage Weil @ 2015-08-04 17:03 UTC (permalink / raw)
  To: Yehuda Sadeh-Weinraub
  Cc: GuangYang, Just, Samuel, Sadeh-Weinraub Yehuda, ceph-devel

On Tue, 4 Aug 2015, Yehuda Sadeh-Weinraub wrote:
> On Tue, Aug 4, 2015 at 9:55 AM, Sage Weil <sweil@redhat.com> wrote:
> >> One solution that I can think of is to determine before the read/write
> >> whether the pg we're about to access is healthy (or has been unhealthy for a
> >> short period of time), and if not to cancel the request before sending the
> >> operation. This could mitigate the problem you're seeing at the expense of
> >> availability in some cases. We'd need to have a way to query pg health
> >> through librados which we don't have right now afaik.
> >> Sage / Sam, does that make sense, and/or possible?
> >
> > This seems mostly impossible because we don't know ahead of time which
> > PG(s) a request is going to touch (it'll generally be a lot of them)?
> >
> 
> Barring pgls() and such, each rados request that radosgw produces will
> only touch a single pg, right?

Oh, yeah.  I thought you meant before each RGW request.  If it's at the 
rados level then yeah, you could avoid stuck pgs, although I think a 
better approach would be to make the OSD reply with -EAGAIN in that case 
so that you know the op didn't happen.  There would still be cases (though 
more rare) where you weren't sure if the op happened or not (e.g., when 
you send to osd A, it goes down, you resend to osd B, and then you get 
EAGAIN/timeout).

What would you do when you get that failure/timeout, though?  Is it 
practical to abort the rgw request handling completely?

sage


* Re: radosgw - stuck ops
From: Yehuda Sadeh-Weinraub @ 2015-08-04 17:14 UTC (permalink / raw)
  To: Sage Weil; +Cc: GuangYang, Just, Samuel, Sadeh-Weinraub Yehuda, ceph-devel

On Tue, Aug 4, 2015 at 10:03 AM, Sage Weil <sweil@redhat.com> wrote:
> On Tue, 4 Aug 2015, Yehuda Sadeh-Weinraub wrote:
>> On Tue, Aug 4, 2015 at 9:55 AM, Sage Weil <sweil@redhat.com> wrote:
>> >> One solution that I can think of is to determine before the read/write
>> >> whether the pg we're about to access is healthy (or has been unhealthy for a
>> >> short period of time), and if not to cancel the request before sending the
>> >> operation. This could mitigate the problem you're seeing at the expense of
>> >> availability in some cases. We'd need to have a way to query pg health
>> >> through librados which we don't have right now afaik.
>> >> Sage / Sam, does that make sense, and/or possible?
>> >
>> > This seems mostly impossible because we don't know ahead of time which
>> > PG(s) a request is going to touch (it'll generally be a lot of them)?
>> >
>>
>> Barring pgls() and such, each rados request that radosgw produces will
>> only touch a single pg, right?
>
> Oh, yeah.  I thought you meant before each RGW request.  If it's at the
> rados level then yeah, you could avoid stuck pgs, although I think a
> better approach would be to make the OSD reply with -EAGAIN in that case
> so that you know the op didn't happen.  There would still be cases (though
> more rare) where you weren't sure if the op happened or not (e.g., when
> you send to osd A, it goes down, you resend to osd B, and then you get
> EAGAIN/timeout).

If done on the client side then we should only make it apply to the
first request sent. Is it actually a problem if the osd triggered the
error?

>
> What would you do when you get that failure/timeout, though?  Is it
> practical to abort the rgw request handling completely?
>

It should be handled like any other error that happens during the transaction
(e.g., a client disconnection).

Yehuda


* Re: radosgw - stuck ops
From: Yehuda Sadeh-Weinraub @ 2015-08-04 17:14 UTC (permalink / raw)
  To: GuangYang; +Cc: Weil Sage, Just, Samuel, Sadeh-Weinraub Yehuda, ceph-devel

On Tue, Aug 4, 2015 at 9:48 AM, GuangYang <yguang11@outlook.com> wrote:
> Hi Yehuda,
> Thanks for the quick response. My comments inline..
>
> Thanks,
> Guang
> ________________________________
>> Date: Tue, 4 Aug 2015 08:41:26 -0700
>> Subject: Re: radosgw - stuck ops
>> From: ysadehwe@redhat.com
>> To: yguang11@outlook.com; sweil@redhat.com; sjust@redhat.com
>> CC: yehuda@redhat.com; ceph-devel@vger.kernel.org
>>
>>
>>
>> On Mon, Aug 3, 2015 at 6:53 PM, GuangYang
>> <yguang11@outlook.com<mailto:yguang11@outlook.com>> wrote:
>> Hi Yehuda,
>> Recently with our pre-production clusters (with radosgw), we had an
>> outage that all radosgw worker threads got stuck and all clients
>> request resulted in 500 because that there is no worker thread taking
>> care of them.
>>
>> What we observed from the cluster, is that there was a PG stuck at
>> *peering* state, as a result, all requests hitting that PG would occupy
>> a worker thread infinitely and that gradually stuck all workers.
>>
>> The reason why the PG stuck at peering is still under investigation,
>> but radosgw side, I am wondering if we can pursue anything to improve
>> such use case (to be more specific, 1 out of 8192 PGs' issue cascading
>> to a service unavailable across the entire cluster):
>>
>> 1. The first approach I can think of is to add timeout at objecter
>> layer for each OP to OSD, I think the complexity comes with WRITE, that
>> is, how do we make sure the integrity if we abort at objecter layer.
>> But for immutable op, I think we certainly can do this, since at an
>> upper layer, we already reply back to client with an error.
>> 2. Do thread pool/working queue sharding at radosgw, in which case,
>> partial failure would (hopefully) only impact partial of worker threads
>> and only cause a partial outage.
>>
>>
>> The problem with timeouts is that they are racy and can bring the
>> system into inconsistent state. For example, an operation takes too
>> long, rgw gets a timeout, but the operation actually completes on the
>> osd. So rgw returns with an error, removes the tail and does not
>> complete the write, whereas in practice the new head was already
>> written and points at the newly removed tail. The index would still
>> show as if the old version of the object was still there. I'm sure we
>> can come up with some more scenarios that I'm not sure we could resolve
>> easily.
> Right, that is my concern as well, we will need to come up with a mechanism to
> preserve integrity, like for each write, it should be all or nothing, not partial, though
> we already reply to client with a 500 error.
> But that is the problem we properly need to deal with anyway, for example, in our
> cluster, each time we detect this kind of availability issue, we will need to restart
> all radosgw daemons to bring it back, which has the possibility to leave some
> inconsistent state.

It's a different kind of inconsistency, one that we're built to recover from.

> I am thinking it might make sense to start with *immutable* requests, for example,
> bucket listing, object get/head, etc. We can timeout as long as we timeout with client.
> That should be much easier to implement and solve part of the problem.
>> The problem with sharding is that for large enough objects they could
>> end up writing to any pg, so I'm not sure how effective that would be.
> Not sure of other use cases with radosgw across the community, but for us, at
> time being, 95%tile of the objects are stored with one chunk, so that should be
> effective for this kind of work load, but yeah we should consider to support more
> general use cases. As a bottom line, that should not make things worse.

Yeah, the idea is to get the general case working.

>> One solution that I can think of is to determine before the read/write
>> whether the pg we're about to access is healthy (or has been unhealthy
>> for a short period of time), and if not to cancel the request before
>> sending the operation. This could mitigate the problem you're seeing at
>> the expense of availability in some cases. We'd need to have a way to
>> query pg health through librados which we don't have right now afaik.
> That sounds good. The only complexity I can think of is for large objects
> which has several chunks, we will need to deal with the write issue as well, since each chunk
> might assign to different PGs?

For larger objects, in theory we could get it to retry the write using a
different prefix. Not sure how easy that would be to implement, and it
obviously won't work for reads.

Yehuda


* Re: radosgw - stuck ops
From: Yehuda Sadeh-Weinraub @ 2015-08-04 17:15 UTC (permalink / raw)
  To: Samuel Just; +Cc: GuangYang, Weil, Sage, Sadeh-Weinraub Yehuda, ceph-devel

On Tue, Aug 4, 2015 at 9:42 AM, Samuel Just <sjust@redhat.com> wrote:
> What if instead the request had a marker that would cause the OSD to
> reply with EAGAIN if the pg is unhealthy?
> -Sam

That sounds like a good option. I'm not crazy about the specific error
code, though; I'm not sure we aren't abusing it already.

Yehuda

>
> On Tue, Aug 4, 2015 at 8:41 AM, Yehuda Sadeh-Weinraub
> <ysadehwe@redhat.com> wrote:
>>
>>
>> On Mon, Aug 3, 2015 at 6:53 PM, GuangYang <yguang11@outlook.com> wrote:
>>>
>>> Hi Yehuda,
>>> Recently with our pre-production clusters (with radosgw), we had an outage
>>> that all radosgw worker threads got stuck and all clients request resulted
>>> in 500 because that there is no worker thread taking care of them.
>>>
>>> What we observed from the cluster, is that there was a PG stuck at
>>> *peering* state, as a result, all requests hitting that PG would occupy a
>>> worker thread infinitely and that gradually stuck all workers.
>>>
>>> The reason why the PG stuck at peering is still under investigation, but
>>> radosgw side, I am wondering if we can pursue anything to improve such use
>>> case (to be more specific, 1 out of 8192 PGs' issue cascading to a service
>>> unavailable across the entire cluster):
>>>
>>> 1. The first approach I can think of is to add timeout at objecter layer
>>> for each OP to OSD, I think the complexity comes with WRITE, that is, how do
>>> we make sure the integrity if we abort at objecter layer. But for immutable
>>> op, I think we certainly can do this, since at an upper layer, we already
>>> reply back to client with an error.
>>> 2. Do thread pool/working queue sharding  at radosgw, in which case,
>>> partial failure would (hopefully) only impact partial of worker threads and
>>> only cause a partial outage.
>>>
>>
>> The problem with timeouts is that they are racy and can bring the system
>> into inconsistent state. For example, an operation takes too long, rgw gets
>> a timeout, but the operation actually completes on the osd. So rgw returns
>> with an error, removes the tail and does not complete the write, whereas in
>> practice the new head was already written and points at the newly removed
>> tail. The index would still show as if the old version of the object was
>> still there. I'm sure we can come up with some more scenarios that I'm not
>> sure we could resolve easily.
>> The problem with sharding is that for large enough objects they could end up
>> writing to any pg, so I'm not sure how effective that would be.
>> One solution that I can think of is to determine before the read/write
>> whether the pg we're about to access is healthy (or has been unhealthy for a
>> short period of time), and if not to cancel the request before sending the
>> operation. This could mitigate the problem you're seeing at the expense of
>> availability in some cases. We'd need to have a way to query pg health
>> through librados which we don't have right now afaik.
>> Sage / Sam, does that make sense, and/or possible?
>>
>> Yehuda


* RE: radosgw - stuck ops
From: GuangYang @ 2015-08-04 22:23 UTC (permalink / raw)
  To: Yehuda Sadeh-Weinraub, Weil Sage
  Cc: Just, Samuel, Sadeh-Weinraub Yehuda, ceph-devel

Thanks to Sage, Yehuda and Sam for the quick replies.

Given the discussion so far, let me summarize it into the following bullet points:

1> The first step we would like to pursue is to implement the following mechanism to avoid waiting forever on the radosgw side:
      1.1. radosgw - send the OP with a *fast_fail* flag
      1.2. OSD - reply with -EAGAIN if the PG is *inactive* and the *fast_fail* flag is set
      1.3. radosgw - upon receiving -EAGAIN, retry until a timeout interval is reached (probably with some back-off?); if it eventually fails, convert -EAGAIN to some other error code and pass it to the upper layer (see the sketch after this list).

2> In terms of managing radosgw's worker threads, I think we either pursue Sage's proposal (which could linearly increase the time it takes for all worker threads to get stuck, depending on how many threads we expand by), or simply try sharding the work queue (for which we already have some basic building blocks)?
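
Here is a minimal sketch of the retry loop in 1.3, assuming a hypothetical helper send_op_fast_fail() that stands in for an objecter call carrying the proposed *fast_fail* flag (none of this exists in the tree yet); the stub always returns -EAGAIN so the deadline path can be exercised:

  // Sketch of step 1.3 only; send_op_fast_fail() is a made-up stand-in.
  #include <algorithm>
  #include <cerrno>
  #include <chrono>
  #include <cstdio>
  #include <thread>

  int send_op_fast_fail() { return -EAGAIN; }   // stub: pretend the PG stays inactive

  // Retry -EAGAIN with back-off until the deadline, then return a final error
  // that the upper layer can map to an HTTP status for the client.
  int send_with_deadline(std::chrono::seconds deadline) {
    const auto start = std::chrono::steady_clock::now();
    auto delay = std::chrono::milliseconds(50);
    for (;;) {
      int r = send_op_fast_fail();
      if (r != -EAGAIN)
        return r;                                     // success, or a real error
      if (std::chrono::steady_clock::now() - start > deadline)
        return -ETIMEDOUT;                            // give up; caller picks the HTTP code
      std::this_thread::sleep_for(delay);
      delay = std::min(delay * 2, std::chrono::milliseconds(2000));   // back-off
    }
  }

  int main() {
    std::printf("result=%d\n", send_with_deadline(std::chrono::seconds(1)));
    return 0;
  }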

Can I start working on a patch for <1>, with <2> as a lower priority?

Thanks,
Guang
----------------------------------------
> Date: Tue, 4 Aug 2015 10:14:06 -0700
> Subject: Re: radosgw - stuck ops
> From: ysadehwe@redhat.com
> To: sweil@redhat.com
> CC: yguang11@outlook.com; sjust@redhat.com; yehuda@redhat.com; ceph-devel@vger.kernel.org
>
> On Tue, Aug 4, 2015 at 10:03 AM, Sage Weil <sweil@redhat.com> wrote:
>> On Tue, 4 Aug 2015, Yehuda Sadeh-Weinraub wrote:
>>> On Tue, Aug 4, 2015 at 9:55 AM, Sage Weil <sweil@redhat.com> wrote:
>>>>> One solution that I can think of is to determine before the read/write
>>>>> whether the pg we're about to access is healthy (or has been unhealthy for a
>>>>> short period of time), and if not to cancel the request before sending the
>>>>> operation. This could mitigate the problem you're seeing at the expense of
>>>>> availability in some cases. We'd need to have a way to query pg health
>>>>> through librados which we don't have right now afaik.
>>>>> Sage / Sam, does that make sense, and/or possible?
>>>>
>>>> This seems mostly impossible because we don't know ahead of time which
>>>> PG(s) a request is going to touch (it'll generally be a lot of them)?
>>>>
>>>
>>> Barring pgls() and such, each rados request that radosgw produces will
>>> only touch a single pg, right?
>>
>> Oh, yeah. I thought you meant before each RGW request. If it's at the
>> rados level then yeah, you could avoid stuck pgs, although I think a
>> better approach would be to make the OSD reply with -EAGAIN in that case
>> so that you know the op didn't happen. There would still be cases (though
>> more rare) where you weren't sure if the op happened or not (e.g., when
>> you send to osd A, it goes down, you resend to osd B, and then you get
>> EAGAIN/timeout).
>
> If done on the client side then we should only make it apply to the
> first request sent. Is it actually a problem if the osd triggered the
> error?
>
>>
>> What would you do when you get that failure/timeout, though? Is it
>> practical to abort the rgw request handling completely?
>>
>
> It should be like any error that happens through the transaction
> (e.g., client disconnection).
>
> Yehuda


* Re: radosgw - stuck ops
From: Yehuda Sadeh-Weinraub @ 2015-08-05 14:44 UTC (permalink / raw)
  To: GuangYang; +Cc: Weil Sage, Just, Samuel, Sadeh-Weinraub Yehuda, ceph-devel

On Tue, Aug 4, 2015 at 3:23 PM, GuangYang <yguang11@outlook.com> wrote:
> Thanks for Sage, Yehuda and Sam's quick reply.
>
> Given the discussion so far, could I summarize into the following bullet points:
>
> 1> The first step we would like to pursue is to implement the following mechanism to avoid infinite waiting at radosgw side:
>       1.1. radosgw - send OP with a *fast_fail* flag
>       1.2. OSD - reply with -EAGAIN if the PG is *inactive* and the *fast_fail* flag is set
>       1.3. radosgw - upon receiving -EAGAIN, retry till a timeout interval is reached (properly with some back-off?), and if it eventually fails, convert -EAGAIN to some other error code and passes to upper layer.

I'm not crazy about the 'fast_fail' name; maybe we can come up with a
more descriptive term. Also, I'm not 100% sure EAGAIN is the error we
want to see. Maybe the flag on the request could specify which error
code to return in this case?
I think it's a good plan to start with; we can adjust things later.

>
> 2> In terms of management of radosgw's worker threads, I think we either pursue Sage's proposal (which could linearly increase the time it takes to stuck all worker threads depending how many threads we expand), or simply try sharding work queue (which we already has some basic building block)?

The problem I see with that proposal (I missed it earlier, and am only
seeing it now) is that when the threads actually wake up, the system
could become unusable. In any case, it's probably a lower priority at
this point; we can rethink this area later.

Yehuda

>
> Can I start working on patch for <1> and then <2> as a lower priority?
>
> Thanks,
> Guang
> ----------------------------------------
>> Date: Tue, 4 Aug 2015 10:14:06 -0700
>> Subject: Re: radosgw - stuck ops
>> From: ysadehwe@redhat.com
>> To: sweil@redhat.com
>> CC: yguang11@outlook.com; sjust@redhat.com; yehuda@redhat.com; ceph-devel@vger.kernel.org
>>
>> On Tue, Aug 4, 2015 at 10:03 AM, Sage Weil <sweil@redhat.com> wrote:
>>> On Tue, 4 Aug 2015, Yehuda Sadeh-Weinraub wrote:
>>>> On Tue, Aug 4, 2015 at 9:55 AM, Sage Weil <sweil@redhat.com> wrote:
>>>>>> One solution that I can think of is to determine before the read/write
>>>>>> whether the pg we're about to access is healthy (or has been unhealthy for a
>>>>>> short period of time), and if not to cancel the request before sending the
>>>>>> operation. This could mitigate the problem you're seeing at the expense of
>>>>>> availability in some cases. We'd need to have a way to query pg health
>>>>>> through librados which we don't have right now afaik.
>>>>>> Sage / Sam, does that make sense, and/or possible?
>>>>>
>>>>> This seems mostly impossible because we don't know ahead of time which
>>>>> PG(s) a request is going to touch (it'll generally be a lot of them)?
>>>>>
>>>>
>>>> Barring pgls() and such, each rados request that radosgw produces will
>>>> only touch a single pg, right?
>>>
>>> Oh, yeah. I thought you meant before each RGW request. If it's at the
>>> rados level then yeah, you could avoid stuck pgs, although I think a
>>> better approach would be to make the OSD reply with -EAGAIN in that case
>>> so that you know the op didn't happen. There would still be cases (though
>>> more rare) where you weren't sure if the op happened or not (e.g., when
>>> you send to osd A, it goes down, you resend to osd B, and then you get
>>> EAGAIN/timeout).
>>
>> If done on the client side then we should only make it apply to the
>> first request sent. Is it actually a problem if the osd triggered the
>> error?
>>
>>>
>>> What would you do when you get that failure/timeout, though? Is it
>>> practical to abort the rgw request handling completely?
>>>
>>
>> It should be like any error that happens through the transaction
>> (e.g., client disconnection).
>>
>> Yehuda
>


* RE: radosgw - stuck ops
From: GuangYang @ 2015-08-11  0:55 UTC (permalink / raw)
  To: Yehuda Sadeh-Weinraub; +Cc: ceph-devel

Hi Yehuda,
On top of the changes in [1], I would propose another change that exposes the number of *stuck threads* via the admin socket, so that we can build something outside of Ceph to check whether all worker threads are stuck and, if so, restart the service.

We could also assert out if all workers are stuck, as an in-Ceph solution (or, more conservatively, use the 'hit_suicide_timeout' path to assert out when even a single thread is stuck).
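
To illustrate the counter part of this (all names hypothetical; this is not the actual AdminSocket API): each worker records when it started its current request, a thread counts as stuck once that request is older than some threshold, and an admin-socket command would simply dump stuck vs. total for the external watchdog to act on:

  // Hypothetical sketch of the stuck-thread counter, not actual rgw code.
  #include <chrono>
  #include <cstdio>
  #include <vector>

  using Clock = std::chrono::steady_clock;

  struct Worker {
    bool busy = false;               // currently handling a request
    Clock::time_point op_start{};    // when the current request began
  };

  int count_stuck(const std::vector<Worker>& workers, std::chrono::seconds limit) {
    int stuck = 0;
    const auto now = Clock::now();
    for (const auto& w : workers)
      if (w.busy && now - w.op_start > limit)
        ++stuck;
    return stuck;
  }

  int main() {
    std::vector<Worker> workers(32);
    // A hypothetical admin-socket handler would emit something like
    // {"stuck_threads": N, "total_threads": 32} for the external checker.
    std::printf("stuck=%d total=%zu\n",
                count_stuck(workers, std::chrono::seconds(60)), workers.size());
    return 0;
  }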

What do you think?

[1] https://github.com/ceph/ceph/pull/5501
Thanks,
Guang


----------------------------------------
> Date: Wed, 5 Aug 2015 07:44:34 -0700
> Subject: Re: radosgw - stuck ops
> From: ysadehwe@redhat.com
> To: yguang11@outlook.com
> CC: sweil@redhat.com; sjust@redhat.com; yehuda@redhat.com; ceph-devel@vger.kernel.org
>
> On Tue, Aug 4, 2015 at 3:23 PM, GuangYang <yguang11@outlook.com> wrote:
>> Thanks for Sage, Yehuda and Sam's quick reply.
>>
>> Given the discussion so far, could I summarize into the following bullet points:
>>
>> 1> The first step we would like to pursue is to implement the following mechanism to avoid infinite waiting at radosgw side:
>> 1.1. radosgw - send OP with a *fast_fail* flag
>> 1.2. OSD - reply with -EAGAIN if the PG is *inactive* and the *fast_fail* flag is set
>> 1.3. radosgw - upon receiving -EAGAIN, retry till a timeout interval is reached (properly with some back-off?), and if it eventually fails, convert -EAGAIN to some other error code and passes to upper layer.
>
> I'm not crazy about the 'fast_fail' name, maybe we can come up with a
> better describing term. Also, not 100% sure the EAGAIN is the error we
> want to see. Maybe the flag on the request could specify what would be
> the error code to return in this case?
> I think it's a good plan to start with, we can adjust things later.
>
>>
>> 2> In terms of management of radosgw's worker threads, I think we either pursue Sage's proposal (which could linearly increase the time it takes to stuck all worker threads depending how many threads we expand), or simply try sharding work queue (which we already has some basic building block)?
>
> The problem that I see with that proposal (missed it earlier, only
> seeing it now), is that when the threads actually wake up the system
> could become unusable. In any case, it's probably a lower priority at
> this point, we could rethink this area again later.
>
> Yehuda
>
>>
>> Can I start working on patch for <1> and then <2> as a lower priority?
>>
>> Thanks,
>> Guang
>> ----------------------------------------
>>> Date: Tue, 4 Aug 2015 10:14:06 -0700
>>> Subject: Re: radosgw - stuck ops
>>> From: ysadehwe@redhat.com
>>> To: sweil@redhat.com
>>> CC: yguang11@outlook.com; sjust@redhat.com; yehuda@redhat.com; ceph-devel@vger.kernel.org
>>>
>>> On Tue, Aug 4, 2015 at 10:03 AM, Sage Weil <sweil@redhat.com> wrote:
>>>> On Tue, 4 Aug 2015, Yehuda Sadeh-Weinraub wrote:
>>>>> On Tue, Aug 4, 2015 at 9:55 AM, Sage Weil <sweil@redhat.com> wrote:
>>>>>>> One solution that I can think of is to determine before the read/write
>>>>>>> whether the pg we're about to access is healthy (or has been unhealthy for a
>>>>>>> short period of time), and if not to cancel the request before sending the
>>>>>>> operation. This could mitigate the problem you're seeing at the expense of
>>>>>>> availability in some cases. We'd need to have a way to query pg health
>>>>>>> through librados which we don't have right now afaik.
>>>>>>> Sage / Sam, does that make sense, and/or possible?
>>>>>>
>>>>>> This seems mostly impossible because we don't know ahead of time which
>>>>>> PG(s) a request is going to touch (it'll generally be a lot of them)?
>>>>>>
>>>>>
>>>>> Barring pgls() and such, each rados request that radosgw produces will
>>>>> only touch a single pg, right?
>>>>
>>>> Oh, yeah. I thought you meant before each RGW request. If it's at the
>>>> rados level then yeah, you could avoid stuck pgs, although I think a
>>>> better approach would be to make the OSD reply with -EAGAIN in that case
>>>> so that you know the op didn't happen. There would still be cases (though
>>>> more rare) where you weren't sure if the op happened or not (e.g., when
>>>> you send to osd A, it goes down, you resend to osd B, and then you get
>>>> EAGAIN/timeout).
>>>
>>> If done on the client side then we should only make it apply to the
>>> first request sent. Is it actually a problem if the osd triggered the
>>> error?
>>>
>>>>
>>>> What would you do when you get that failure/timeout, though? Is it
>>>> practical to abort the rgw request handling completely?
>>>>
>>>
>>> It should be like any error that happens through the transaction
>>> (e.g., client disconnection).
>>>
>>> Yehuda
>>

