From mboxrd@z Thu Jan 1 00:00:00 1970
From: NeilBrown
Subject: Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
Date: Wed, 11 Oct 2017 08:20:56 +1100
Message-ID: <87vajmwvgn.fsf@notabene.neil.brown.name>
References: <150518076229.32691.13542756562323866921.stgit@noble>
 <87o9qe9p3j.fsf@notabene.neil.brown.name>
 <446747392.10694917.1505364915884.JavaMail.zimbra@redhat.com>
 <871sn9alrh.fsf@notabene.neil.brown.name>
 <393232447.10845976.1505375841983.JavaMail.zimbra@redhat.com>
 <87vaju18dc.fsf@notabene.neil.brown.name>
 <874lrc28x8.fsf@notabene.neil.brown.name>
 <1345780738.18087591.1507512089744.JavaMail.zimbra@redhat.com>
 <87a810zznc.fsf@notabene.neil.brown.name>
 <441ae9fe-fd73-2aac-8bb1-c64da28cda27@redhat.com>
 <871smczx4j.fsf@notabene.neil.brown.name>
Mime-Version: 1.0
Content-Type: multipart/signed; boundary="=-=-="; micalg=pgp-sha256; protocol="application/pgp-signature"
Return-path:
In-Reply-To:
Sender: linux-raid-owner@vger.kernel.org
To: Xiao Ni
Cc: linux-raid@vger.kernel.org
List-Id: linux-raid.ids

On Tue, Oct 10 2017, Xiao Ni wrote:

> On 10/09/2017 01:52 PM, NeilBrown wrote:
>> On Mon, Oct 09 2017, Xiao Ni wrote:
>>
>>> On 10/09/2017 12:57 PM, NeilBrown wrote:
>>>> It would if you had applied
>>>>     [PATCH 3/4] md: use mddev_suspend/resume instead of ->quiesce()
>>>>
>>>> Did you apply all 4 patches?
>>> Sorry, it's my mistake. I insmod the wrong module. I'll apply the four
>>> patches
>>> and do test again.
>>>> Thanks. I looks suspend_lo_store() is calling raid5_quiesce() directly
>>>> as you say - so a patch is missing.
>>> Yes, thanks for pointing about this.
>
> Hi Neil
>
> I applied the four patches and one patch "md: fix deadlock error in
> recent patch."
> There is a new stuck. It's stuck at suspend_hi_store this time. I add
> the calltrace
> as an attachment.
>
> I added some printk to print some information.
>
> [12695.993329] mddev suspend : 1
> [12695.996270] mddev ro : 0
> [12695.998790] mddev insync : 0
> [12696.001641] mddev active io: 1

You didn't tell me where (in the code) you printed this information.
That makes it hard to interpret.

If mddev->active_io is 1, then some thread must be in this range of code

	atomic_inc(&mddev->active_io);
	rcu_read_unlock();
	if (!mddev->pers->make_request(mddev, bio)) {
		atomic_dec(&mddev->active_io);
		wake_up(&mddev->sb_wait);
		goto check_suspended;
	}

	if (atomic_dec_and_test(&mddev->active_io) && mddev->suspended)
		wake_up(&mddev->sb_wait);

If that thread is blocked (which appears to be the case) it must be in
->make_request() because nothing else there blocks.

None of the threads you showed are in that code. But you didn't report
all the threads - only those which had printed warnings.

  echo t > /proc/sysrq-trigger

will produce the stack traces of *all* threads. That would be more
useful.

>
> Can it be:
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index b6b7a28..55e9280 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7777,7 +7777,7 @@ void md_check_recovery(struct mddev *mddev)
>          if (mddev->ro && !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
>                  return;
>          if ( ! (
> -               (mddev->flags & ~ (1<
> +               (mddev->flags & (mddev->external == 1 && ~
> (1<