From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S933543Ab0JSIni (ORCPT ); Tue, 19 Oct 2010 04:43:38 -0400
Received: from mail-qy0-f181.google.com ([209.85.216.181]:46041 "EHLO
        mail-qy0-f181.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1754126Ab0JSIne convert rfc822-to-8bit (ORCPT );
        Tue, 19 Oct 2010 04:43:34 -0400
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=googlemail.com; s=gamma;
        h=mime-version:in-reply-to:references:date:message-id:subject:from:to
         :cc:content-type:content-transfer-encoding;
        b=l5U/sclHzsKq5/w+gLNRqTijuq7nSQVAVG64MpoGO1VpRuOGrnof8tIXTk1kL/ZrDy
         38o7LrsqEOEpvsSu4NogmjldjzAxI6HzccPlaMK4kD+tERXMdbcb8lOCLIuZPxursV8b
         j0+fkw0124FEosZwj7M5FBKklxrFBqcBrjiXM=
MIME-Version: 1.0
In-Reply-To: <20101019101151.57c6dd56@notabene>
References: <20100915091118.3dbdc961@notabene>
        <4C90139A.1080809@redhat.com>
        <20100915122334.3fa7b35f@notabene>
        <20100915082843.GA17252@localhost>
        <20100915184434.18e2d933@notabene>
        <20101018151459.2b443221@notabene>
        <20101019101151.57c6dd56@notabene>
Date: Tue, 19 Oct 2010 10:43:31 +0200
Message-ID:
Subject: Re: Deadlock possibly caused by too_many_isolated.
From: Torsten Kaiser
To: Neil Brown
Cc: Wu Fengguang, Rik van Riel, Andrew Morton, KOSAKI Motohiro,
        KAMEZAWA Hiroyuki, "linux-kernel@vger.kernel.org",
        "linux-mm@kvack.org", Li Shaohua
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8BIT
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Oct 19, 2010 at 1:11 AM, Neil Brown wrote:
> On Mon, 18 Oct 2010 12:58:17 +0200
> Torsten Kaiser wrote:
>
>> On Mon, Oct 18, 2010 at 6:14 AM, Neil Brown wrote:
>> > Testing shows that this patch seems to work.
>> > The test load (essentially kernbench) doesn't deadlock any more, though it
>> > does get bogged down thrashing in swap so it doesn't make a lot more
>> > progress :-)  I guess that is to be expected.
>>
>> I just noticed this thread, as your mail from today pushed it up.
>>
>> In your original mail you wrote: " I recently had a customer (running
>> 2.6.32) report a deadlock during very intensive IO with lots of
>> processes. " and " Some threads that are blocked there, hold some IO
>> lock (probably in the filesystem) and are trying to allocate memory
>> inside the block device (md/raid1 to be precise) which is allocating
>> with GFP_NOIO and has a mempool to fall back on."
>>
>> I recently had the same problem (intense IO due to swapstorm created
>> by 20 gcc processes hung my system) and after initially blaming the
>> workqueue changes in 2.6.36 Tejun Heo determined that my problem was
>> not the workqueues getting locked up, but that it was caused by an
>> exhausted mempool:
>> http://marc.info/?l=linux-kernel&m=128655737012549&w=2
>>
>> Instrumenting mm/mempool.c and retrying my workload showed that
>> fs_bio_set from fs/bio.c looked like the mempool to blame and the code
>> in drivers/md/raid1.c to be the misuser:
>> http://marc.info/?l=linux-kernel&m=128671179817823&w=2
>>
>> I was even able to reproduce this hang using only a normal RAID1
>> md device as swapspace and then using dd to fill a tmpfs until
>> swapping was needed:
>> http://marc.info/?l=linux-raid&m=128699402805191&w=2
>>
>> Looking back in the history of raid1.c and bio.c I found the following
>> interesting parts:
>>
>>  * the change to allocate more than one bio via bio_clone() is from
>> 2005, but it looks like it was OK back then, because at that point the
>> fs_bio_set was allocating 256 entries
>>  * in 2007 the size of the mempool was changed from 256 to only 2
>> entries (5972511b77809cb7c9ccdb79b825c54921c5c546 "A single unit is
>> enough, lets scale it down to 2 just to be on the safe side.")
>>  * only in 2009 the comment "To make this work, callers must never
>> allocate more than 1 bio at the time from this pool.
>> Callers that need to allocate more than 1 bio must always submit the
>> previously allocate bio for IO before attempting to allocate a new one.
>> Failure to do so can cause livelocks under memory pressure." was added
>> to bio_alloc(); that is the basis of my reasoning that raid1.c is
>> broken. (And such a comment was not added to bio_clone(), although both
>> calls use the same mempool.)
>>
>> So could someone please look into raid1.c to confirm or deny that
>> using multiple bio_clone() (one per drive) before submitting them
>> together could also cause such deadlocks?
>>
>> Thanks for looking
>>
>> Torsten
>
> Yes, thanks for the report.
> This is a real bug exactly as you describe.
>
> This is how I think I will fix it, though it needs a bit of review and
> testing before I can be certain.
> Also I need to check raid10 etc to see if they can suffer too.
>
> If you can test it I would really appreciate it.

I did test it, but while it seemed to fix the deadlock, the system
still became unusable.
The still-running "vmstat 1" showed that the swapout was still
progressing, but only in ~20k-sized bursts every 5 to 20 seconds.

I also tried adding Wu's patch on top:
--- linux-next.orig/mm/vmscan.c 2010-10-13 12:35:14.000000000 +0800
+++ linux-next/mm/vmscan.c 2010-10-19 00:13:04.000000000 +0800
@@ -1163,6 +1163,13 @@ static int too_many_isolated(struct zone
 		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
 	}
 
+	/*
+	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so that
+	 * they won't get blocked by normal ones and form circular deadlock.
+	 */
+	if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
+		inactive >>= 3;
+
 	return isolated > inactive;

Either it helped somewhat, or I was luckier on my second try, but this
time I needed ~5 tries instead of only 2 to get the system mostly
stuck again. On the test run with Wu's patch the writeout pattern was
more stable: a burst of ~80kb every 20 seconds.
But I would suspect that the size of the burst is rather random.

I do have a complete SysRq+T dump from the first run; I can send it to
anyone who wants it.
(It's 190k, so I don't want to spam the list with it.)

Torsten

> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index d44a50f..8122dde 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -784,7 +784,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
>        int i, targets = 0, disks;
>        struct bitmap *bitmap;
>        unsigned long flags;
> -       struct bio_list bl;
>        struct page **behind_pages = NULL;
>        const int rw = bio_data_dir(bio);
>        const unsigned long do_sync = (bio->bi_rw & REQ_SYNC);
> @@ -892,13 +891,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
>         * bios[x] to bio
>         */
>        disks = conf->raid_disks;
> -#if 0
> -       { static int first=1;
> -       if (first) printk("First Write sector %llu disks %d\n",
> -                         (unsigned long long)r1_bio->sector, disks);
> -       first = 0;
> -       }
> -#endif
>  retry_write:
>        blocked_rdev = NULL;
>        rcu_read_lock();
> @@ -956,14 +948,15 @@ static int make_request(mddev_t *mddev, struct bio * bio)
>            (behind_pages = alloc_behind_pages(bio)) != NULL)
>                set_bit(R1BIO_BehindIO, &r1_bio->state);
>
> -       atomic_set(&r1_bio->remaining, 0);
> +       atomic_set(&r1_bio->remaining, targets);
>        atomic_set(&r1_bio->behind_remaining, 0);
>
>        do_barriers = bio->bi_rw & REQ_HARDBARRIER;
>        if (do_barriers)
>                set_bit(R1BIO_Barrier, &r1_bio->state);
>
> -       bio_list_init(&bl);
> +       bitmap_startwrite(bitmap, bio->bi_sector, r1_bio->sectors,
> +                               test_bit(R1BIO_BehindIO, &r1_bio->state));
>        for (i = 0; i < disks; i++) {
>                struct bio *mbio;
>                if (!r1_bio->bios[i])
> @@ -995,30 +988,18 @@ static int make_request(mddev_t *mddev, struct bio * bio)
>                                atomic_inc(&r1_bio->behind_remaining);
>                }
>
> -               atomic_inc(&r1_bio->remaining);
> -
> -               bio_list_add(&bl, mbio);
> +               spin_lock_irqsave(&conf->device_lock, flags);
> +               bio_list_add(&conf->pending_bio_list, mbio);
> +               blk_plug_device(mddev->queue);
> +               spin_unlock_irqrestore(&conf->device_lock, flags);
>        }
>        kfree(behind_pages); /* the behind pages are attached to the bios now */
>
> -       bitmap_startwrite(bitmap, bio->bi_sector, r1_bio->sectors,
> -                               test_bit(R1BIO_BehindIO, &r1_bio->state));
> -       spin_lock_irqsave(&conf->device_lock, flags);
> -       bio_list_merge(&conf->pending_bio_list, &bl);
> -       bio_list_init(&bl);
> -
> -       blk_plug_device(mddev->queue);
> -       spin_unlock_irqrestore(&conf->device_lock, flags);
> -
>        /* In case raid1d snuck into freeze_array */
>        wake_up(&conf->wait_barrier);
>
>        if (do_sync)
>                md_wakeup_thread(mddev->thread);
> -#if 0
> -       while ((bio = bio_list_pop(&bl)) != NULL)
> -               generic_make_request(bio);
> -#endif
>
>        return 0;
>  }
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>