All of lore.kernel.org
 help / color / mirror / Atom feed
* dump_historic_ops, slow requests
@ 2015-10-12 21:22 Deneau, Tom
  2015-10-14  4:25 ` Gregory Farnum
  0 siblings, 1 reply; 2+ messages in thread
From: Deneau, Tom @ 2015-10-12 21:22 UTC (permalink / raw)
  To: ceph-devel

I have a small ceph cluster (3 nodes, 5 osds each, journals all just partitions
on the spinner disks) and I have noticed that when I hit it with a bunch of
rados bench clients all doing writes of large (40M objects) with --no-cleanup,
the rados bench commands seem to finish OK but I often get health warnings like
    HEALTH_WARN 4 requests are blocked > 32 sec;
                2 osds have slow requests 3 ops are blocked > 32.768 sec on osd.9
                1 ops are blocked > 32.768 sec on osd.10
                2 osds have slow requests
After a couple of minutes, health goes to HEALTH_OK.

But if I go to the node containing osd.10 for example and do dump_historic_ops
I do get lots of around 20-sec durations but nothing over 32 sec.

The 20-sec or so ops are always  "ack+ondisk+write+known_if_redirected"
with type_data = "commit sent: apply or cleanup"
and the following are typical event timings

                               initiated: 14:06:58.205937
                              reached_pg: 14:07:01.823288, gap=  3617.351
                                 started: 14:07:01.823359, gap=     0.071
               waiting for subops from 3: 14:07:01.855259, gap=    31.900
         commit_queued_for_journal_write: 14:07:03.132697, gap=  1277.438
          write_thread_in_journal_buffer: 14:07:03.143356, gap=    10.659
             journaled_completion_queued: 14:07:04.175863, gap=  1032.507
                               op_commit: 14:07:04.585040, gap=   409.177
                              op_applied: 14:07:04.589751, gap=     4.711
                sub_op_commit_rec from 3: 14:07:14.682925, gap= 10093.174
                             commit_sent: 14:07:14.683081, gap=     0.156
                                    done: 14:07:14.683119, gap=     0.038

Should I expect to see a historic op with duration greater than 32 sec?

-- Tom Deneau


^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: dump_historic_ops, slow requests
  2015-10-12 21:22 dump_historic_ops, slow requests Deneau, Tom
@ 2015-10-14  4:25 ` Gregory Farnum
  0 siblings, 0 replies; 2+ messages in thread
From: Gregory Farnum @ 2015-10-14  4:25 UTC (permalink / raw)
  To: Deneau, Tom; +Cc: ceph-devel

On Mon, Oct 12, 2015 at 2:22 PM, Deneau, Tom <tom.deneau@amd.com> wrote:
> I have a small ceph cluster (3 nodes, 5 osds each, journals all just partitions
> on the spinner disks) and I have noticed that when I hit it with a bunch of
> rados bench clients all doing writes of large (40M objects) with --no-cleanup,
> the rados bench commands seem to finish OK but I often get health warnings like
>     HEALTH_WARN 4 requests are blocked > 32 sec;
>                 2 osds have slow requests 3 ops are blocked > 32.768 sec on osd.9
>                 1 ops are blocked > 32.768 sec on osd.10
>                 2 osds have slow requests
> After a couple of minutes, health goes to HEALTH_OK.
>
> But if I go to the node containing osd.10 for example and do dump_historic_ops
> I do get lots of around 20-sec durations but nothing over 32 sec.
>
> The 20-sec or so ops are always  "ack+ondisk+write+known_if_redirected"
> with type_data = "commit sent: apply or cleanup"
> and the following are typical event timings
>
>                                initiated: 14:06:58.205937
>                               reached_pg: 14:07:01.823288, gap=  3617.351
>                                  started: 14:07:01.823359, gap=     0.071
>                waiting for subops from 3: 14:07:01.855259, gap=    31.900
>          commit_queued_for_journal_write: 14:07:03.132697, gap=  1277.438
>           write_thread_in_journal_buffer: 14:07:03.143356, gap=    10.659
>              journaled_completion_queued: 14:07:04.175863, gap=  1032.507
>                                op_commit: 14:07:04.585040, gap=   409.177
>                               op_applied: 14:07:04.589751, gap=     4.711
>                 sub_op_commit_rec from 3: 14:07:14.682925, gap= 10093.174
>                              commit_sent: 14:07:14.683081, gap=     0.156
>                                     done: 14:07:14.683119, gap=     0.038
>
> Should I expect to see a historic op with duration greater than 32 sec?

IIRC, the historic ops generally consist of up to the N longest
operations in the last Y time period, on a rolling basis. Something
like 20 ops from the last 6 minutes. So if you get a report of a
32-second blocked op and immediately go to the OSD, it should have
that op in either in-progress or historic ops.

One possibility is that if you've got peering happening, it's got ops
blocked on that but they get thrown out because the OSD turns out not
to be responsible for them.
-Greg

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2015-10-14  4:25 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-12 21:22 dump_historic_ops, slow requests Deneau, Tom
2015-10-14  4:25 ` Gregory Farnum

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.