Date: Tue, 12 Jul 2016 13:59:15 +0200
From: Michal Hocko
To: Matthias Dahl
Cc: linux-raid@vger.kernel.org, linux-mm@kvack.org, dm-devel@redhat.com, linux-kernel@vger.kernel.org
Subject: Re: Page Allocation Failures/OOM with dm-crypt on software RAID10 (Intel Rapid Storage)
Message-ID: <20160712115915.GG14586@dhcp22.suse.cz>
References: <02580b0a303da26b669b4a9892624b13@mail.ud19.udmedia.de> <20160712095013.GA14591@dhcp22.suse.cz> <20160712114920.GF14586@dhcp22.suse.cz>
In-Reply-To: <20160712114920.GF14586@dhcp22.suse.cz>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue 12-07-16 13:49:20, Michal Hocko wrote:
> On Tue 12-07-16 13:28:12, Matthias Dahl wrote:
> > Hello Michal...
> >
> > On 2016-07-12 11:50, Michal Hocko wrote:
> >
> > > This smells like file pages are stuck in the writeback somewhere and the
> > > anon memory is not reclaimable because you do not have any swap device.
> >
> > Not having a swap device shouldn't be a problem -- and in this case, it
> > would cause even more trouble as in disk i/o.
> >
> > What could cause the file pages to get stuck or stopped from being written
> > to the disk? And more importantly, what is so unique/special about the
> > Intel Rapid Storage that it happens (seemingly) exclusively with that
> > and not the normal Linux s/w raid support?
>
> I am not a storage expert (not to mention dm-crypt).
> But what those counters say is that the IO completion doesn't trigger,
> so the PageWriteback flag is still set. Such a page is obviously not
> reclaimable. So I would check the IO delivery path and focus on the
> potential dm-crypt involvement if you suspect this is a contributing
> factor.
>
> > Also, if the pages are not written to disk, shouldn't something error
> > out or slow dd down?
>
> Writers are normally throttled when we hit the dirty limit. You seem to have
> dirty_ratio set to 20% which is quite a lot considering how much memory
> you have.

And just to clarify: dirty_ratio refers to dirtyable memory, which is
free_pages + file_lru pages. In your case you have only 9% of the total
memory size dirty/writeback, but that is 90% of dirtyable memory. This is
quite possible if somebody consumes free_pages while racing with the
writer. The writer will get throttled, but the concurrent memory consumer
normally will not. So you can end up in this situation.

> If you get back to the memory info from the OOM killer report:
> [18907.592209] active_anon:110314 inactive_anon:295 isolated_anon:0
>  active_file:27534 inactive_file:819673 isolated_file:160
>  unevictable:13001 dirty:167859 writeback:651864 unstable:0
>  slab_reclaimable:177477 slab_unreclaimable:1817501
>  mapped:934 shmem:588 pagetables:7109 bounce:0
>  free:49928 free_pcp:45 free_cma:0
>
> The dirty+writeback is ~9%. What is more interesting, though, LRU
> pages are negligible relative to the memory size (~11%). Note the number
> of unreclaimable slab pages (~20%). Who is consuming those objects?
> Where is the remaining 70% of memory hiding?

-- 
Michal Hocko
SUSE Labs
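
For anyone who wants to redo the dirtyable-memory arithmetic from this
message, here is a minimal sketch in Python. It assumes the simplified
definition given above (dirtyable = free pages + file LRU pages); the
kernel's real calculation also accounts for reserves and other details,
so treat the result as a rough check only. The function names are
hypothetical helpers, not kernel or library APIs. The counters are copied
from the OOM killer report quoted above and are in pages.

```python
import re

# Counters copied from the OOM killer report quoted above (units: pages).
OOM_REPORT = """
active_anon:110314 inactive_anon:295 isolated_anon:0
active_file:27534 inactive_file:819673 isolated_file:160
unevictable:13001 dirty:167859 writeback:651864 unstable:0
slab_reclaimable:177477 slab_unreclaimable:1817501
mapped:934 shmem:588 pagetables:7109 bounce:0
free:49928 free_pcp:45 free_cma:0
"""


def parse_counters(report):
    """Return a dict mapping each 'name:value' counter to an int."""
    return {name: int(value) for name, value in re.findall(r"(\w+):(\d+)", report)}


def dirtyable_pages(c):
    """Simplified dirtyable memory (assumption): free + file LRU pages."""
    return c["free"] + c["active_file"] + c["inactive_file"]


def dirty_share_of_dirtyable(c):
    """Fraction of dirtyable memory that is dirty or under writeback."""
    return (c["dirty"] + c["writeback"]) / dirtyable_pages(c)


counters = parse_counters(OOM_REPORT)
print(f"dirtyable pages:    {dirtyable_pages(counters)}")
print(f"dirty + writeback:  {counters['dirty'] + counters['writeback']}")
print(f"share of dirtyable: {dirty_share_of_dirtyable(counters):.1%}")
```

With these counters the share comes out a little above 91%, in line with
the "90% of dirtyable memory" figure above. The total memory size is not
part of the report excerpt, so the ~9% of total memory is not recomputed
here.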