From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 20 Oct 2010 13:57:17 +0800
From: Wu Fengguang
To: Torsten Kaiser
Cc: Neil Brown, Rik van Riel, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, "Li, Shaohua"
Subject: Re: Deadlock possibly caused by too_many_isolated.
Message-ID: <20101020055717.GA12752@localhost>
References: <20100915091118.3dbdc961@notabene> <4C90139A.1080809@redhat.com>
	<20100915122334.3fa7b35f@notabene> <20100915082843.GA17252@localhost>
	<20100915184434.18e2d933@notabene> <20101018151459.2b443221@notabene>
	<20101019101151.57c6dd56@notabene>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To:
User-Agent: Mutt/1.5.20 (2009-06-14)

On Tue, Oct 19, 2010 at 06:06:21PM +0800, Torsten Kaiser wrote:
> On Tue, Oct 19, 2010 at 10:43 AM, Torsten Kaiser wrote:
> > On Tue, Oct 19, 2010 at 1:11 AM, Neil Brown wrote:
> >> Yes, thanks for the report.
> >> This is a real bug exactly as you describe.
> >>
> >> This is how I think I will fix it, though it needs a bit of review and
> >> testing before I can be certain.
> >> Also I need to check raid10 etc to see if they can suffer too.
> >>
> >> If you can test it I would really appreciate it.
> >
> > I did test it, but while it seemed to fix the deadlock, the system
> > still became unusable.
> > The still-running "vmstat 1" showed that the swapout was still
> > progressing, but only in bursts of ~20k every 5 to 20 seconds.
> >
> > I also tried to additionally apply Wu's patch:
> > --- linux-next.orig/mm/vmscan.c	2010-10-13 12:35:14.000000000 +0800
> > +++ linux-next/mm/vmscan.c	2010-10-19 00:13:04.000000000 +0800
> > @@ -1163,6 +1163,13 @@ static int too_many_isolated(struct zone
> >  		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
> >  	}
> >
> > +	/*
> > +	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so that
> > +	 * they won't get blocked by normal ones and form circular deadlock.
> > +	 */
> > +	if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
> > +		inactive >>= 3;
> > +
> >  	return isolated > inactive;
> >
> > Either it did help somewhat, or I was luckier on my second try, but
> > this time I needed ~5 tries instead of only 2 to get the system mostly
> > stuck again. On the test run with Wu's patch the writeout pattern was
> > more stable, a burst of ~80kb every 20 seconds. But I would suspect
> > that the size of the bursts is rather random.
> >
> > I do have a complete SysRq+T dump from the first run; I can send it
> > to anyone who wants it.
> > (It's 190k, so I don't want to spam the list with it.)
>
> Does this call trace from the SysRq+T output violate the rule to only
> allocate one bio from bio_alloc() until it is submitted?
>
> [ 549.700038] Call Trace:
> [ 549.700038] [] schedule_timeout+0x144/0x200
> [ 549.700038] [] ? process_timeout+0x0/0x10
> [ 549.700038] [] io_schedule_timeout+0x42/0x60
> [ 549.700038] [] mempool_alloc+0x163/0x1b0
> [ 549.700038] [] ? autoremove_wake_function+0x0/0x40
> [ 549.700038] [] bio_alloc_bioset+0x39/0xf0
> [ 549.700038] [] bio_clone+0x1d/0x50
> [ 549.700038] [] make_request+0x23d/0x850
> [ 549.700038] [] ? mempool_alloc_slab+0x10/0x20
> [ 549.700038] [] ? process_timeout+0x0/0x10
> [ 549.700038] [] md_make_request+0xc3/0x220
> [ 549.700038] [] ? mempool_alloc+0xd9/0x1b0
> [ 549.700038] [] generic_make_request+0x1b3/0x370
> [ 549.700038] [] ? bio_alloc_bioset+0x56/0xf0
> [ 549.700038] [] submit_bio+0x5a/0xd0
> [ 549.700038] [] ? unlock_page+0x25/0x30
> [ 549.700038] [] swap_writepage+0x7e/0xc0
> [ 549.700038] [] shmem_writepage+0x1c9/0x240
> [ 549.700038] [] pageout+0x11b/0x270
> [ 549.700038] [] shrink_page_list+0x258/0x4d0
> [ 549.700038] [] shrink_inactive_list+0x187/0x310
> [ 549.700038] [] ? __wake_up_common+0x51/0x80
> [ 549.700038] [] ? cpumask_next_and+0x22/0x40
> [ 549.700038] [] shrink_zone+0x3e0/0x470
> [ 549.700038] [] try_to_free_pages+0x157/0x410
> [ 549.700038] [] __alloc_pages_nodemask+0x412/0x760
> [ 549.700038] [] alloc_pages_current+0x76/0xe0
> [ 549.700038] [] new_slab+0x1fd/0x2a0
> [ 549.700038] [] ? process_timeout+0x0/0x10
> [ 549.700038] [] __slab_alloc+0x111/0x540
> [ 549.700038] [] ? prepare_creds+0x21/0xb0
> [ 549.700038] [] kmem_cache_alloc+0x9b/0xa0
> [ 549.700038] [] prepare_creds+0x21/0xb0
> [ 549.700038] [] sys_setresgid+0x29/0x120
> [ 549.700038] [] system_call_fastpath+0x16/0x1b
> [ 549.700038]  ffff88011e125ea8 0000000000000046 ffff88011e125e08 ffffffff81073c59
> [ 549.700038]  0000000000012780 ffff88011ea905b0 ffff88011ea90808 ffff88011e125fd8
> [ 549.700038]  ffff88011ea90810 ffff88011e124010 0000000000012780 ffff88011e125fd8
>
> swap_writepage() uses get_swap_bio() which uses bio_alloc() to get one
> bio. That bio is then submitted, but the submit path seems to get into
> make_request from raid1.c, and that allocates a second bio from
> bio_alloc() via bio_clone().
>
> I am seeing this pattern (swap_writepage calling
> md_make_request/make_request and then getting stuck in mempool_alloc)
> more than 5 times in the SysRq+T output...
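Two mempool-backed bio allocations in flight from the same task, before the
first one has been handed to the lower device, looks like exactly what that
trace shows. Here is a toy model of the pattern (illustration only, not
kernel code; a one-element pool is used for simplicity):

#include <stdio.h>

static int pool_free = 1;	/* reserved elements still available */

static int pool_take(const char *caller)
{
	if (pool_free == 0) {
		printf("%s: pool empty -> would sleep in mempool_alloc()\n", caller);
		return -1;
	}
	pool_free--;
	printf("%s: took an element, %d left\n", caller, pool_free);
	return 0;
}

int main(void)
{
	/* swap_writepage() -> get_swap_bio() -> bio_alloc(): first bio */
	pool_take("get_swap_bio");

	/* raid1 make_request() -> bio_clone(): second bio taken while the
	 * first one is still held and not yet on its way to the disk */
	pool_take("raid1 bio_clone");

	return 0;
}

Once enough concurrent writers hold their first bio, the clone allocation in
make_request() is the one that ends up waiting, which matches the
mempool_alloc() frame under bio_clone() in the trace above.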
I bet the root cause is the failure of pool->alloc(__GFP_NORETRY) inside
mempool_alloc(), which can be fixed by this patch.

Thanks,
Fengguang
---
concurrent direct page reclaim problem

__GFP_NORETRY page allocations may fail when there are many concurrent
page-allocating tasks, even though the system is not really short of
memory.

The root cause is that a task first runs direct page reclaim to free some
pages from the LRU lists into the per-CPU page lists and the buddy system,
and then tries to get a free page from there. However, the pages it just
reclaimed may be consumed by other tasks before the direct-reclaim task
manages to grab one for itself.

Let's retry a bit harder.
--- linux-next.orig/mm/page_alloc.c	2010-10-20 13:44:50.000000000 +0800
+++ linux-next/mm/page_alloc.c	2010-10-20 13:50:54.000000000 +0800
@@ -1700,7 +1700,7 @@ should_alloc_retry(gfp_t gfp_mask, unsig
 				unsigned long pages_reclaimed)
 {
 	/* Do not loop if specifically requested */
-	if (gfp_mask & __GFP_NORETRY)
+	if (gfp_mask & __GFP_NORETRY && pages_reclaimed > (1 << (order + 12)))
 		return 0;
 
 	/*
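For context on why this helps the traces above: mempool_alloc() forces
__GFP_NORETRY on its backing allocation, and when that allocation fails and
the pool's reserved elements are exhausted, the caller sleeps in
io_schedule_timeout() waiting for an element to come back, which is exactly
where the stuck swap-out tasks sit. Below is a rough, self-contained
paraphrase of that control flow (stand-in types and helpers, not the
kernel's mm/mempool.c):

#include <stdio.h>
#include <stddef.h>

typedef unsigned int gfp_t;
#define __GFP_WAIT	0x10u
#define __GFP_NORETRY	0x1000u

struct pool {
	int curr_nr;			/* reserved elements still in the pool */
	void *(*alloc)(gfp_t gfp);	/* backing allocator (slab/page) */
};

static void *mempool_alloc_sketch(struct pool *pool, gfp_t gfp_mask)
{
	void *elem;

	/* mempool_alloc() adds __GFP_NORETRY itself; this is the allocation
	 * that the page_alloc.c change above makes retry reclaim harder
	 * instead of giving up after a single pass. */
	gfp_mask |= __GFP_NORETRY;

	elem = pool->alloc(gfp_mask);		/* 1) try the regular allocator */
	if (elem)
		return elem;

	if (pool->curr_nr > 0) {		/* 2) fall back to reserved elements */
		pool->curr_nr--;
		return pool;			/*    placeholder element */
	}

	if (!(gfp_mask & __GFP_WAIT))		/* 3) atomic callers just fail */
		return NULL;

	/* 4) pool empty and the allocator keeps failing: the kernel sleeps in
	 * io_schedule_timeout() and loops back to step 1 -- the state the
	 * stuck tasks are in.  This model stops here instead of looping. */
	printf("would sleep in io_schedule_timeout() and retry\n");
	return NULL;
}

/* backing allocator that always fails, as __GFP_NORETRY can under pressure */
static void *failing_alloc(gfp_t gfp) { (void)gfp; return NULL; }

int main(void)
{
	struct pool p = { .curr_nr = 0, .alloc = failing_alloc };

	if (!mempool_alloc_sketch(&p, __GFP_WAIT))
		printf("stuck: backing allocation failed and the pool is empty\n");
	return 0;
}

With the patch, step 1 keeps reclaiming until more than 1 << (order + 12)
pages have been reclaimed (4096 pages, i.e. about 16MB, for an order-0
request), so far fewer callers should fall through to step 4.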