From: Mark Hills <Mark.Hills@framestore.com>
To: lustre-devel@lists.lustre.org
Subject: [Lustre-devel] Hangs with cgroup memory controller
Date: Wed, 27 Jul 2011 18:33:10 +0100 (BST)
Message-ID: <alpine.LFD.2.01.1107271820040.6411@sys880.ldn.framestore.com>
In-Reply-To: <BE324D9D-3D7B-4D9E-8F3D-316F259871FC@whamcloud.com>

On Wed, 27 Jul 2011, Andreas Dilger wrote:

> Two ideas come to mind. One is that the reason you are having difficulty 
> reproducing the problem is that it only happens after some fault 
> condition. Possibly you need the client to do recovery to an OST and 
> resend a bulk RPC, or resend due to a checksum error?

Is there an easy way to trigger some error cases like this?

> It might also be due to application IO types (e.g. mmap, direct IO, 
> pwrite, splice, etc).

Yes, of course. Although I didn't gather any statistics, there wasn't a 
clear standout application which was more affected than others.

> Possibly you can correlate reproducer cases with Lustre errors on the 
> console?

Back when I tried this last year on the production system, I wasn't able 
to see corresponding errors. But I don't have any of this data around any 
more.

I'd need to do some tests on the production system to capture one case.

> Lustre also has memory debugging that can be enabled, but without a 
> reasonably concise reproducer it would be difficult to log/analyze so 
> much data for hours of runtime.

If I am able to capture a case, is there a way to, for example, dump a 
list of Lustre pages still held by the client? And correlate these with 
the files in question?

What I am thinking is that I could stop the running processes and attempt 
to drain all the pages, and this could hopefully leave a small number of 
'bad' ones -- with the files in question I could at least help to identify 
the I/O type.
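
A minimal sketch of the drain step described above, assuming the cgroup v1
memory controller and a hypothetical cgroup directory such as
/cgroup/memory/job (the path and the function names are illustrative, not
taken from this thread). It reads memory.usage_in_bytes, writes to
memory.force_empty to ask the kernel to reclaim what it can, and reports
what is still charged afterwards. On kernels of this era force_empty is
meant to be used once the cgroup has no tasks, so the processes would need
to be stopped or moved out first. It does not identify which files the
leftover pages belong to.

```python
from pathlib import Path


def read_usage(cgroup_dir):
    """Return the bytes currently charged to the cgroup (cgroup v1 file)."""
    return int((Path(cgroup_dir) / "memory.usage_in_bytes").read_text().strip())


def drain_and_report(cgroup_dir, page_size=4096):
    """Ask the kernel to reclaim the cgroup's pages and report what remains.

    cgroup_dir is a hypothetical path to a cgroup v1 memory cgroup.
    page_size is an assumption (4 KiB); adjust it for the platform.
    """
    cg = Path(cgroup_dir)
    before = read_usage(cg)
    # Writing any value to memory.force_empty triggers reclaim of the
    # pages charged to this cgroup, as far as the kernel is able to.
    (cg / "memory.force_empty").write_text("0\n")
    after = read_usage(cg)
    return {
        "before_bytes": before,
        "after_bytes": after,
        "residual_pages": after // page_size,
    }


if __name__ == "__main__":
    import sys

    # Usage: python drain.py /cgroup/memory/job   (path is an example only)
    print(drain_and_report(sys.argv[1]))
```

A non-zero residual_pages after the drain would be the small set of 'bad'
pages to look at more closely.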

Thanks for your reply

-- 
Mark
