All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Jim Schutt" <jaschut@sandia.gov>
To: Sage Weil <sage@newdream.net>
Cc: Gregory Farnum <gregory.farnum@dreamhost.com>,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: cosd multi-second stalls cause "wrongly marked me down"
Date: Fri, 8 Apr 2011 10:23:23 -0600	[thread overview]
Message-ID: <4D9F367B.1070904@sandia.gov> (raw)
In-Reply-To: <Pine.LNX.4.64.1102161649560.11150@cobra.newdream.net>

Hi Sage,

Sage Weil wrote:
> On Wed, 16 Feb 2011, Jim Schutt wrote:
>> On Wed, 2011-02-16 at 14:40 -0700, Gregory Farnum wrote:
>>> On Wednesday, February 16, 2011 at 1:25 PM, Jim Schutt wrote:
>>>> Hi,
>>>>
>>>> I've been testing v0.24.3 w/ 64 clients against
>>>> 1 mon, 1 mds, 96 osds. Under heavy write load I
>>>> see:
>>>>  [WRN] map e7 wrongly marked me down or wrong addr
>>>>
>>>> I was able to sort through the logs and discover that when
>>>> this happens I have large gaps (10 seconds or more) in osd
>>>> heatbeat processing. In those heartbeat gaps I've discovered
>>>> long periods (5-15 seconds) where an osd logs nothing, even
>>>> though I am running with debug osd/filestore/journal = 20.
>>>>
>>>> Is this a known issue?
>>> You're running on btrfs? 
>> Yep.
> 
> Are the cosd log files on the same btrfs volume as the btrfs data, or 
> elsewhere?  The heartbeat thread takes some pains to avoid any locks that 
> may be contented and do avoid any disk io, so in theory a btrfs stall 
> shouldn't affect anything.  We may have missed something.. do you have a 
> log showing this in action?

In the end, after all the various things I've tried, I think
that the root cause of this is relatively simple: I don't
have enough CPU cycles available on my servers to do the
amount of OSD processing required to service my client
load, given the number of OSDs per server I'm running.

With too much work and not enough cycles to do it, the
one real-time component of Ceph, heartbeat processing,
eventually must miss its deadline (no heartbeat "observed"
in osd_heartbeat_grace seconds), since it requires work
done by components (messengers, memory allocation system)
that don't provide real-time guarantees.

All of my experiences on this make perfect sense when
viewed from this perspective.

For example, when working with tcmalloc, I learned I
could compile it with CXXFLAGS=-DTCMALLOC_LARGE_PAGES,
which causes tcmalloc to allow objects up to 256k in
its thread caches, rather than the default 32k.  So I
used that in combination with a 256k stripe width, on
the theory that deallocating messages would mostly only
interact with the thread cache, but it didn't help.

When looking at thread stacks generated by my
Mutex::LockOrAbort trick with a 5 sec wait to acquire
the pipe_lock, I often saw threads waiting on the
DoutLocker mutex.  Since lots of Ceph debugging output
happens with other locks being held, debugging might
thus slow things down out of proportion to the processing
required to generate the log messages.  Yet, when I
configured no debugging, I saw no improvement; it might
be that things got a little worse.  This now makes sense
to me in light of my above hypothesis about not enough
available CPU cycles - there's still too much work to
do, even with no cycles spent on debugging output.

What I didn't see very often in my thread stacks were
stack frames from tcmalloc.  This doesn't make sense
to me if the memory allocation subsystem is the root
cause of my problem, but makes perfect sense if there's
not enough CPU cycles: not so much time is spent
deallocating memory, so it is caught in the act less
often by LockOrAbort.

What finally seemed to help at avoiding missed heartbeats
in my configuration was the following combination:
turning off debugging, running with these throttling paramters:
         osd client message size cap = 14000000
         client oc size =              14000000
         client oc max dirty =         35000000
         filestore queue max bytes =   35000000
         journal queue max bytes =     35000000
         ms dispatch throttle bytes =  14000000
         objector inflight op bytes =  35000000
and using a 512k stripe width.

Evidently keeping a relatively small amount of data in
flight, in smaller chunks, allowed heartbeat processing to
hit its mark more often.  But, it only delayed things, it
didn't solve the problem.  This makes sense to me if the
root cause is that I don't have enough CPU cycles available
per OSD, because I didn't change the offered load.

So, in the short term I guess I need to run fewer cosd
instances per server.

If my analysis above is correct, do you think anything
can be gained by running the heartbeat and heartbeat
dispatcher threads as SCHED_RR threads?  Since tick() runs
heartbeat_check(), that would also need to be SCHED_RR,
or the heartbeats could arrive on time, but not checked
until it was too late.

Anyway, please let me know what you think of the above.

Thanks -- Jim

> 
> sage
> 
> 


  parent reply	other threads:[~2011-04-08 16:23 UTC|newest]

Thread overview: 94+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-02-16 21:25 cosd multi-second stalls cause "wrongly marked me down" Jim Schutt
2011-02-16 21:37 ` Wido den Hollander
2011-02-16 21:51   ` Jim Schutt
2011-02-16 21:40 ` Gregory Farnum
2011-02-16 21:50   ` Jim Schutt
2011-02-17  0:50     ` Sage Weil
2011-02-17  0:54       ` Sage Weil
2011-02-17 15:46         ` Jim Schutt
2011-02-17 16:11           ` Sage Weil
2011-02-17 23:31             ` Jim Schutt
2011-02-18  7:13               ` Sage Weil
2011-02-18 17:04                 ` Jim Schutt
2011-02-18 17:15                 ` Gregory Farnum
2011-02-18 18:41                 ` Jim Schutt
2011-02-18 19:07                 ` Colin McCabe
2011-02-18 20:48                   ` Jim Schutt
2011-02-18 20:58                     ` Sage Weil
2011-02-18 21:09                       ` Jim Schutt
2011-03-09 16:02               ` Jim Schutt
2011-03-09 17:07                 ` Gregory Farnum
2011-03-09 18:36                   ` Jim Schutt
2011-03-09 19:37                     ` Gregory Farnum
2011-03-10 23:09                       ` Jim Schutt
2011-03-10 23:21                         ` Sage Weil
2011-03-10 23:32                           ` Jim Schutt
2011-03-10 23:40                             ` Sage Weil
2011-03-11 14:51                               ` Jim Schutt
2011-03-11 18:26                               ` Jim Schutt
2011-03-11 18:37                                 ` Jim Schutt
2011-03-11 18:37                                 ` Sage Weil
2011-03-11 18:51                                   ` Jim Schutt
2011-03-11 19:09                                     ` Gregory Farnum
2011-03-11 19:13                                       ` Yehuda Sadeh Weinraub
2011-03-11 19:17                                         ` Yehuda Sadeh Weinraub
2011-03-11 19:16                                       ` Jim Schutt
2011-03-11 21:13                                   ` Jim Schutt
2011-03-11 21:37                                     ` Sage Weil
2011-03-11 22:21                                       ` Jim Schutt
2011-03-11 22:26                                         ` Jim Schutt
2011-03-11 22:45                                           ` Sage Weil
2011-03-11 23:29                                             ` Jim Schutt
2011-03-30 21:26                                       ` Jim Schutt
2011-03-30 21:55                                         ` Sage Weil
2011-03-31 14:16                                           ` Jim Schutt
2011-03-31 16:25                                             ` Sage Weil
2011-03-31 17:00                                               ` Jim Schutt
2011-03-31 17:10                                                 ` Jim Schutt
2011-03-31 17:24                                                   ` Sage Weil
2011-03-31 18:08                                                     ` Jim Schutt
2011-03-31 18:41                                                       ` Sage Weil
2011-04-01 22:38                                                         ` Jim Schutt
2011-02-23 17:52             ` Jim Schutt
2011-02-23 18:12               ` Gregory Farnum
2011-02-23 18:54                 ` Sage Weil
2011-02-23 19:12                   ` Gregory Farnum
2011-02-23 19:23                 ` Jim Schutt
2011-02-23 20:27                   ` Gregory Farnum
2011-03-02  0:53                   ` Sage Weil
2011-03-02 15:21                     ` Jim Schutt
2011-03-02 17:10                       ` Sage Weil
2011-03-02 20:54                         ` Jim Schutt
2011-03-02 21:45                           ` Sage Weil
2011-03-02 21:59                             ` Jim Schutt
2011-03-02 22:57                               ` Jim Schutt
2011-03-02 23:20                                 ` Gregory Farnum
2011-03-02 23:25                                   ` Jim Schutt
2011-03-02 23:33                                     ` Gregory Farnum
2011-03-03  2:26                                 ` Colin McCabe
2011-03-03 20:03                                   ` Jim Schutt
2011-03-03 20:47                                     ` Jim Schutt
2011-03-03 20:55                                       ` Yehuda Sadeh Weinraub
2011-03-03 21:45                                         ` Jim Schutt
2011-03-03 22:22                                           ` Sage Weil
2011-03-03 22:34                                             ` Jim Schutt
2011-03-03 21:53                                         ` Colin McCabe
2011-03-03 23:06                                           ` Jim Schutt
2011-03-03 23:30                                             ` Colin McCabe
2011-03-03 23:37                                               ` Jim Schutt
2011-03-03  5:03                                 ` Sage Weil
2011-03-03 16:35                                   ` Jim Schutt
2011-03-03 17:28                                   ` Jim Schutt
2011-03-03 18:04                                     ` Sage Weil
2011-03-03 18:42                                       ` Jim Schutt
2011-03-03 18:51                                         ` Sage Weil
2011-03-03 19:39                                           ` Jim Schutt
2011-04-08 16:23       ` Jim Schutt [this message]
2011-04-08 20:50         ` Sage Weil
2011-04-08 22:11           ` Jim Schutt
2011-04-08 23:10             ` Colin McCabe
2011-04-11 14:41               ` Jim Schutt
2011-04-11 16:25                 ` Sage Weil
2011-04-11 20:14             ` Jim Schutt
2011-04-11 21:18             ` Jim Schutt
2011-04-11 23:23               ` Sage Weil

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4D9F367B.1070904@sandia.gov \
    --to=jaschut@sandia.gov \
    --cc=ceph-devel@vger.kernel.org \
    --cc=gregory.farnum@dreamhost.com \
    --cc=sage@newdream.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.