linux-kernel.vger.kernel.org archive mirror
From: Matthias Dahl <ml_linux-kernel@binary-island.eu>
To: Michal Hocko <mhocko@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
	linux-raid@vger.kernel.org, linux-mm@kvack.org,
	dm-devel@redhat.com, linux-kernel@vger.kernel.org,
	Mike Snitzer <snitzer@redhat.com>
Subject: Re: Page Allocation Failures/OOM with dm-crypt on software RAID10 (Intel Rapid Storage) with check/repair/sync
Date: Fri, 15 Jul 2016 09:11:02 +0200
Message-ID: <005574d77d3f5dbc2643044a1e2468dc@mail.ud19.udmedia.de>
In-Reply-To: <9074e82f-bf52-011e-8bd7-5731d2b0dcaa@I-love.SAKURA.ne.jp>

Hello...

I am rather persistent (stubborn?) when it comes to tracking down bugs,
if somehow possible... and it seems it paid off... somewhat. ;-)

So I ran quite a lot more tests and came up with something very
interesting: As long as the RAID is in sync (as in: sync_action=idle),
I cannot for the life of me trigger this issue -- the used memory
still explodes to most of the RAM but it oscillates back and forth.

I did very stupid things to stress the machine while dd was running as
usual on the dm-crypt device. I started a second dd instance with the
same parameters on the dm-crypt device. I wrote a simple program that
allocated a random amount of memory (up to 10 GiB), memset it, and after
a random amount of time released it again -- in a continuous loop. I
put heavy network stress on the machine... whatever I could think of.
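
For anyone who wants to reproduce the memory pressure part, here is a
minimal sketch of such a stress loop. It is not the original program:
the names touch_pages and stress_once, the 4096-byte page size and the
30 second maximum hold time are assumptions for illustration, and it
writes one byte per page instead of doing a full memset.

```python
import random
import time

PAGE_SIZE = 4096  # assumed page size; typical on x86_64


def touch_pages(buf, value=0xAA):
    """Write one byte per page so the kernel has to back the whole buffer."""
    for offset in range(0, len(buf), PAGE_SIZE):
        buf[offset] = value
    return len(buf)


def stress_once(max_bytes, max_hold_seconds, rng=random):
    """Allocate a random amount, touch it, hold it for a random time, drop it."""
    size = rng.randint(1, max_bytes)
    buf = bytearray(size)
    touch_pages(buf)
    time.sleep(rng.uniform(0.0, max_hold_seconds))
    del buf  # drop the only reference so the interpreter can free the buffer
    return size


if __name__ == "__main__":
    ten_gib = 10 * 1024 ** 3
    while True:  # continuous loop; stop with Ctrl-C
        stress_once(ten_gib, max_hold_seconds=30.0)
```

Run it next to the dd instances; each iteration picks a new size and a
new hold time, so the pressure on the page allocator keeps changing.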

No matter what, the issue did not trigger. And I repeated said tests
quite a few times over extended time periods (usually an hour or so).
Everything worked beautifully with nice speeds and no noticeable system
slow-downs/lag.

As soon as I issued a "check" to sync_action of the RAID device, it was
just a matter of a second until the OOM killer kicked in and all hell
broke loose again. And basically all of my tests were done while the
RAID device was syncing -- due to a very unfortunate series of events.
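
For reference, a small sketch of how sync_action can be read and set
through the md sysfs attribute. The device name "md0" and the helper
names are hypothetical, only the actions mentioned in this thread are
accepted here (md knows others), and writing the attribute needs root.

```python
from pathlib import Path

# Actions mentioned in this thread; md accepts others as well.
KNOWN_ACTIONS = ("idle", "check", "repair")


def sync_action_path(md_name, sysfs_root="/sys"):
    """Location of the sync_action attribute for one md array."""
    return Path(sysfs_root) / "block" / md_name / "md" / "sync_action"


def get_sync_action(md_name, sysfs_root="/sys"):
    """Return the current action, e.g. 'idle' when the array is in sync."""
    return sync_action_path(md_name, sysfs_root).read_text().strip()


def set_sync_action(md_name, action, sysfs_root="/sys"):
    """Request a new action such as 'check'; rejects anything unknown."""
    if action not in KNOWN_ACTIONS:
        raise ValueError(f"unsupported sync action: {action!r}")
    sync_action_path(md_name, sysfs_root).write_text(action + "\n")


if __name__ == "__main__":
    # "md0" is a placeholder; use the name of the actual array.
    print("before:", get_sync_action("md0"))
    set_sync_action("md0", "check")
    print("after:", get_sync_action("md0"))
```

The sysfs_root parameter only exists so the helpers can be tried against
a scratch directory before pointing them at a real array.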

I tried to repeat that same test with an external (USB3) connected disk
with a Linux s/w RAID10 over two partitions... but unfortunately that
behaves rather differently. I assume it is because it is connected
through USB and not SATA. While doing those tests on my RAID10 with the
4 internal SATA3 disks, you can see w/ free that the "used memory" does
explode to most of the RAM and then oscillates back and forth. With the
same test on the external disk though, that does not happen at all. The
used memory stays pretty much constant and only the buffers vary... but
most of the memory is still free in that case.
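
To log that oscillation instead of eyeballing free, something like the
following sketch could sample /proc/meminfo once per second. The "used"
figure here is only an approximation (MemTotal - MemFree - Buffers -
Cached); it is not guaranteed to match what a given version of free
prints, and the function names are made up for this example.

```python
import time


def parse_meminfo(text):
    """Turn the contents of /proc/meminfo into a {name: integer value} dict."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def used_kib(fields):
    """Approximate used memory in kB (assumption, see the note above)."""
    return (fields["MemTotal"] - fields["MemFree"]
            - fields["Buffers"] - fields["Cached"])


if __name__ == "__main__":
    while True:  # one sample per second; stop with Ctrl-C
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        print(f"used={used_kib(info)} kB buffers={info['Buffers']} kB")
        time.sleep(1)
```

Running this during the dd test on both setups gives a log that can be
compared side by side.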

I hope my persistence on the matter is not annoying and finally leads us
somewhere where the real issue hides.

Any suggestions, opinions and ideas are greatly appreciated as I have
pretty much exhausted mine at this time.

Last but not least: I switched my testing to an OpenSuSE Tumbleweed Live
system (x86_64 w/ kernel 4.6.3) as Rawhide w/ 4.7.0rcX behaves rather
strangely and is unstable at times.

Thanks,
Matthias

-- 
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu
  services: custom software [desktop, mobile, web], server administration

Thread overview: 18+ messages
2016-07-12  8:27 Page Allocation Failures/OOM with dm-crypt on software RAID10 (Intel Rapid Storage) Matthias Dahl
2016-07-12  9:50 ` Michal Hocko
2016-07-12 11:28   ` Matthias Dahl
2016-07-12 11:49     ` Michal Hocko
2016-07-12 11:59       ` Michal Hocko
2016-07-12 12:42       ` Matthias Dahl
2016-07-12 14:07         ` Michal Hocko
2016-07-12 14:56           ` Matthias Dahl
2016-07-13 11:21             ` Michal Hocko
2016-07-13 12:18               ` Michal Hocko
2016-07-13 13:18                 ` Matthias Dahl
2016-07-13 13:47                   ` Michal Hocko
2016-07-13 15:32                     ` Matthias Dahl
2016-07-13 16:24                       ` [dm-devel] " Ondrej Kozina
2016-07-13 18:24                         ` Matthias Dahl
2016-07-14 11:18                     ` Tetsuo Handa
2016-07-15  7:11                       ` Matthias Dahl [this message]
2016-07-18  7:24                         ` Page Allocation Failures/OOM with dm-crypt on software RAID10 (Intel Rapid Storage) with check/repair/sync Matthias Dahl
