From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: 
References: <20100915091118.3dbdc961@notabene> <4C90139A.1080809@redhat.com>
	<20100915122334.3fa7b35f@notabene> <20100915082843.GA17252@localhost>
	<20100915184434.18e2d933@notabene> <20101018151459.2b443221@notabene>
	<20101019101151.57c6dd56@notabene>
Date: Tue, 19 Oct 2010 12:06:21 +0200
Message-ID: 
Subject: Re: Deadlock possibly caused by too_many_isolated.
From: Torsten Kaiser 
To: Neil Brown 
Cc: Wu Fengguang , Rik van Riel , Andrew Morton , KOSAKI Motohiro ,
	KAMEZAWA Hiroyuki , "linux-kernel@vger.kernel.org" ,
	"linux-mm@kvack.org" , Li Shaohua 
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8BIT
Sender: linux-kernel-owner@vger.kernel.org
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Oct 19, 2010 at 10:43 AM, Torsten Kaiser wrote:
> On Tue, Oct 19, 2010 at 1:11 AM, Neil Brown wrote:
>> Yes, thanks for the report.
>> This is a real bug exactly as you describe.
>>
>> This is how I think I will fix it, though it needs a bit of review and
>> testing before I can be certain.
>> Also I need to check raid10 etc to see if they can suffer too.
>>
>> If you can test it I would really appreciate it.
>
> I did test it, but while it seemed to fix the deadlock, the system
> still became unusable.
> The still-running "vmstat 1" showed that swapout was still progressing,
> but only in bursts of ~20k every 5 to 20 seconds.
>
> I also tried to additionally add Wu's patch:
> --- linux-next.orig/mm/vmscan.c 2010-10-13 12:35:14.000000000 +0800
> +++ linux-next/mm/vmscan.c      2010-10-19 00:13:04.000000000 +0800
> @@ -1163,6 +1163,13 @@ static int too_many_isolated(struct zone
>                isolated = zone_page_state(zone, NR_ISOLATED_ANON);
>        }
>
> +       /*
> +        * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so that
> +        * they won't get blocked by normal ones and form circular deadlock.
> +        */
> +       if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
> +               inactive >>= 3;
> +
>        return isolated > inactive;
>
> Either it did help somewhat, or I was just luckier on my second try, but
> this time I needed ~5 tries instead of only 2 to get the system mostly
> stuck again. On the test run with Wu's patch the writeout pattern was
> more stable, a burst of ~80kb every 20 seconds, but I suspect that the
> size of the bursts is rather random.
>
> I do have a complete SysRq+T dump from the first run; I can send it
> to anyone who wants it.
> (It's 190k, so I don't want to spam the list with it.)
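
For reference, a minimal stand-alone model of what the quoted check does
once Wu's change is applied. This is only an illustration with a simplified
signature and made-up numbers; the real too_many_isolated() in mm/vmscan.c
works on the per-zone NR_ISOLATED_*/NR_INACTIVE_* counters:

#include <stdbool.h>
#include <stdio.h>

/* gfp_iofs == true models a caller with both __GFP_IO and __GFP_FS set */
static bool too_many_isolated(unsigned long isolated, unsigned long inactive,
			      bool gfp_iofs)
{
	/*
	 * Wu's change: normal (GFP_IOFS) reclaimers are throttled earlier,
	 * leaving headroom so GFP_NOIO/GFP_NOFS reclaimers are not blocked
	 * behind them.
	 */
	if (gfp_iofs)
		inactive >>= 3;

	return isolated > inactive;
}

int main(void)
{
	unsigned long inactive = 1000, isolated = 200;

	/* a normal direct reclaimer is throttled: 200 > 1000/8 */
	printf("GFP_KERNEL caller throttled: %d\n",
	       too_many_isolated(isolated, inactive, true));

	/* a GFP_NOIO caller (e.g. the swap writeout path) may still proceed */
	printf("GFP_NOIO caller throttled:   %d\n",
	       too_many_isolated(isolated, inactive, false));

	return 0;
}

Callers for which this returns true then wait in shrink_inactive_list()
until enough isolated pages are put back, which is the spot where the
circular wait discussed in this thread can form.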
Does this call trace from the SysRq+T dump show a violation of the rule
that only one bio should be allocated from bio_alloc() until it has been
submitted?

[ 549.700038] Call Trace:
[ 549.700038] [] schedule_timeout+0x144/0x200
[ 549.700038] [] ? process_timeout+0x0/0x10
[ 549.700038] [] io_schedule_timeout+0x42/0x60
[ 549.700038] [] mempool_alloc+0x163/0x1b0
[ 549.700038] [] ? autoremove_wake_function+0x0/0x40
[ 549.700038] [] bio_alloc_bioset+0x39/0xf0
[ 549.700038] [] bio_clone+0x1d/0x50
[ 549.700038] [] make_request+0x23d/0x850
[ 549.700038] [] ? mempool_alloc_slab+0x10/0x20
[ 549.700038] [] ? process_timeout+0x0/0x10
[ 549.700038] [] md_make_request+0xc3/0x220
[ 549.700038] [] ? mempool_alloc+0xd9/0x1b0
[ 549.700038] [] generic_make_request+0x1b3/0x370
[ 549.700038] [] ? bio_alloc_bioset+0x56/0xf0
[ 549.700038] [] submit_bio+0x5a/0xd0
[ 549.700038] [] ? unlock_page+0x25/0x30
[ 549.700038] [] swap_writepage+0x7e/0xc0
[ 549.700038] [] shmem_writepage+0x1c9/0x240
[ 549.700038] [] pageout+0x11b/0x270
[ 549.700038] [] shrink_page_list+0x258/0x4d0
[ 549.700038] [] shrink_inactive_list+0x187/0x310
[ 549.700038] [] ? __wake_up_common+0x51/0x80
[ 549.700038] [] ? cpumask_next_and+0x22/0x40
[ 549.700038] [] shrink_zone+0x3e0/0x470
[ 549.700038] [] try_to_free_pages+0x157/0x410
[ 549.700038] [] __alloc_pages_nodemask+0x412/0x760
[ 549.700038] [] alloc_pages_current+0x76/0xe0
[ 549.700038] [] new_slab+0x1fd/0x2a0
[ 549.700038] [] ? process_timeout+0x0/0x10
[ 549.700038] [] __slab_alloc+0x111/0x540
[ 549.700038] [] ? prepare_creds+0x21/0xb0
[ 549.700038] [] kmem_cache_alloc+0x9b/0xa0
[ 549.700038] [] prepare_creds+0x21/0xb0
[ 549.700038] [] sys_setresgid+0x29/0x120
[ 549.700038] [] system_call_fastpath+0x16/0x1b
[ 549.700038] ffff88011e125ea8 0000000000000046 ffff88011e125e08 ffffffff81073c59
[ 549.700038] 0000000000012780 ffff88011ea905b0 ffff88011ea90808 ffff88011e125fd8
[ 549.700038] ffff88011ea90810 ffff88011e124010 0000000000012780 ffff88011e125fd8

swap_writepage() uses get_swap_bio(), which uses bio_alloc() to get one bio.
That bio is then submitted, but the submit path ends up in make_request()
from raid1.c, which allocates a second bio from bio_alloc() via bio_clone().

I am seeing this pattern (swap_writepage() calling
md_make_request()/make_request() and then getting stuck in mempool_alloc())
more than 5 times in the SysRq+T output...
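
To make that pattern concrete, here is a tiny stand-alone model of it.
The names and the pool size are made up for illustration; in the kernel
the two allocations are bio_alloc() in get_swap_bio() and bio_clone() in
raid1's make_request(), both backed by mempools:

#include <stdio.h>

#define BIO_POOL_SIZE 1	/* assume the pool is nearly exhausted under memory pressure */

static int bios_free = BIO_POOL_SIZE;

/* stands in for mempool_alloc(); the real one sleeps instead of failing */
static int alloc_bio(void)
{
	if (bios_free == 0)
		return -1;
	bios_free--;
	return 0;
}

int main(void)
{
	/* swap_writepage() -> get_swap_bio() -> bio_alloc(): first bio */
	if (alloc_bio() == 0)
		printf("first bio allocated for the swap page\n");

	/*
	 * generic_make_request() -> md_make_request() -> make_request()
	 * -> bio_clone(): a second bio is needed while the first one is
	 * still held and not yet completed.
	 */
	if (alloc_bio() != 0)
		printf("second bio: pool empty, mempool_alloc() would sleep here\n");

	return 0;
}

If that sleeping allocation can only be satisfied by the completion of a
bio that is itself stuck behind the same pool, nothing ever completes,
which would match the traces above.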

Torsten