From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 3 May 2011 10:51:14 +1000
From: Dave Chinner <david@fromorbit.com>
To: Christian Kujau
Cc: Markus Trippelsdorf, LKML, xfs@oss.sgi.com, minchan.kim@gmail.com
Subject: Re: 2.6.39-rc4+: oom-killer busy killing tasks
Message-ID: <20110503005114.GE2978@dastard>
References: <20110427102824.GI12436@dastard> <20110428233751.GR12436@dastard> <20110429201701.GA13166@x4.trippels.de> <20110501080149.GD13542@dastard> <20110502121958.GA2978@dastard>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
User-Agent: Mutt/1.5.20 (2009-06-14)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>

On Mon, May 02, 2011 at 12:59:50PM -0700, Christian Kujau wrote:
> On Mon, 2 May 2011 at 22:19, Dave Chinner wrote:
> > Yes. Try 2 orders of magnitude as a start. i.e change it to 10000...
>
> I've run the -12 test with vfs_cache_pressure=200 and now the -13 test
> with vfs_cache_pressure=10000.
> The OOM killer still kicks in, but the machine seems to be more
> usable afterwards and does not get totally stuck:
>
> http://nerdbynature.de/bits/2.6.39-rc4/oom/
> - messages-12.txt.gz & slabinfo-12.txt.bz2
>   * oom-debug.sh invoked oom-killer at 01:27:11
>   * sysrq-w works until 01:27:08, but got killed by oom
>
> - messages-13.txt.gz & slabinfo-13.txt.bz2
>   * find invoked oom-killer at 08:44:07
>   * sysrq-w works until 08:45:48 (listing jbd2/hda6-8), then
>     my debug script got killed

So before the OOM killer kicks in, kswapd is stuck in
congestion_wait(), and after a number of oom-kills over a 5s period
it is still in congestion_wait(). 7s later it is still in
congestion_wait() and the oom-killer starts up again, with kswapd
still in congestion_wait() when the oom-killer stops again 3s later.

OK, so kswapd being stuck in congestion_wait() means it can only be
in balance_pgdat(), and it thinks that it is getting into trouble.
Looking at the OOM output:

  active_anon:7992 inactive_anon:8714 isolated_anon:0
  active_file:5995 inactive_file:73780 isolated_file:0
  unevictable:0 dirty:0 writeback:0 unstable:0
  free:35263 slab_reclaimable:182652 slab_unreclaimable:3224
  mapped:6929 shmem:199 pagetables:396 bounce:0
  DMA free:3436kB min:3532kB low:4412kB high:5296kB active_anon:0kB
    inactive_anon:0kB active_file:236kB inactive_file:248kB
    unevictable:0kB isolated(anon):0kB isolated(file):0kB
    present:780288kB mlocked:0kB dirty:0kB writeback:0kB mapped:8kB
    shmem:0kB slab_reclaimable:730608kB slab_unreclaimable:12896kB
    kernel_stack:1032kB pagetables:1584kB unstable:0kB bounce:0kB
    writeback_tmp:0kB pages_scanned:680 all_unreclaimable? yes
  lowmem_reserve[]: 0 0 508 508
  HighMem free:137616kB min:508kB low:1096kB high:1684kB
    active_anon:31968kB inactive_anon:34856kB active_file:23744kB
    inactive_file:294872kB unevictable:0kB isolated(anon):0kB
    isolated(file):0kB present:520192kB mlocked:0kB dirty:0kB
    writeback:0kB mapped:27708kB shmem:796kB slab_reclaimable:0kB
    slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB
    unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
    all_unreclaimable? no
  lowmem_reserve[]: 0 0 0 0

There are no isolated pages, so that means we aren't in the
congestion_wait() call related to having too many isolated pages.

We see that ZONE_DMA is all_unreclaimable and had 680 pages scanned.
ZONE_HIGHMEM had _zero_ pages scanned, which means it must be over
the high water marks for free memory, so no attempt is made to
reclaim from this zone. That means lru_pages is set to
zone_reclaimable_pages(ZONE_DMA), which at this point in time would
be:

  active_anon:0kB inactive_anon:0kB active_file:236kB inactive_file:248kB

i.e. about 484kB, or 121 pages.

To get all_unreclaimable set, the shrink_slab() call must have
returned zero to indicate it didn't free anything. So the first pass
through would have passed that to shrink_slab(), and assuming they
are all mapped pages we'd end up with nr_scanned = 242. For the XFS
inode cache with 600,000 reclaimable inodes, this would have
resulted in:

  max_pass = 600,000
  delta = 4 * 242 / 2      = 484
  delta = 484 * 600,000    = 290,400,000
  delta = 290,400,000 / (121 + 1) = 2,380,327

  shrinker->nr += delta
  if (shrinker->nr > max_pass * 2)
          shrinker->nr = max_pass * 2;    = 1,200,000

So shrinker->nr should be well above zero, even in the worst case.
The question is now: how on earth is it returning zero?
Two cases: if the shrinker returns -1, or because the cache is
growing:

        nr_before = (*shrinker->shrink)(shrinker, 0, gfp_mask);
        shrink_ret = (*shrinker->shrink)(shrinker, this_scan, gfp_mask);
        if (shrink_ret == -1)
                break;
        if (shrink_ret < nr_before)
                ret += nr_before - shrink_ret;

So, the first case will happen for XFS when:

        if (!(gfp_mask & __GFP_FS))
                return -1;

In most of the OOM-killer invocations, the stack trace is:

  out_of_memory+0x27c/0x360
  __alloc_pages_nodemask+0x6f8/0x708
  new_slab+0x1fc/0x234
  T.915+0x1f8/0x388
  kmem_cache_alloc+0x11c/0x124
  kmem_zone_alloc+0xa4/0x114
  xfs_inode_alloc+0x40/0x13c
  xfs_iget+0x2a8/0x620
  xfs_lookup+0xf8/0x114
  xfs_vn_lookup+0x5c/0xb0
  d_alloc_and_lookup+0x54/0x90
  do_lookup+0x248/0x2bc
  path_lookupat+0xfc/0x8f4
  do_path_lookup+0x34/0xac
  user_path_at+0x64/0xb4
  vfs_fstatat+0x58/0xbc
  sys_fstatat64+0x24/0x50
  ret_from_syscall+0x0/0x38

So we are not preventing reclaim via the gfp_mask. That leaves the
other case, where the number of reclaimable inodes is growing faster
than the shrinker is freeing them. I can't really see how that is
possible on a single-CPU machine without preempt enabled and,
apparently, no dirty inodes. Inode reclaim should not block (shrinker
or background), so there's something else going on here.

Can you run an event trace of all the XFS events during a find for
me? Don't do it over the entire subset of the filesystem - only
100,000 inodes is sufficient (i.e. kill the find once the xfs inode
cache slab reaches 100k inodes).

While still running the event trace, can you then drop the caches
(echo 3 > /proc/sys/vm/drop_caches) and check that the xfs inode
cache is emptied? If it isn't emptied, drop the caches again to see
if that empties it. If you could then post the event trace, I might
be able to see what is going strange with the shrinker and/or
reclaim.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com