From: Dan Williams <dan.j.williams@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Barret Rhoden <brho@google.com>,
	Matthew Wilcox <willy@infradead.org>,
	linux-nvdimm <linux-nvdimm@lists.01.org>
Subject: Re: [PATCH] dax: Fix deadlock in dax_lock_mapping_entry()
Date: Sat, 6 Oct 2018 11:04:49 -0700
Message-ID: <CAPcyv4hHLp=n5bKQ7O4zx4T5MNa1VhOQoVtw18jO_rwsp0mGdA@mail.gmail.com>
In-Reply-To: <20181005095415.GC9686@quack2.suse.cz>

On Fri, Oct 5, 2018 at 2:56 AM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 04-10-18 21:28:14, Dan Williams wrote:
> > On Thu, Oct 4, 2018 at 9:01 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > On Thu, Oct 4, 2018 at 7:52 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Thu, Oct 04, 2018 at 06:57:52PM -0700, Dan Williams wrote:
> > > > > On Thu, Oct 4, 2018 at 9:27 AM Jan Kara <jack@suse.cz> wrote:
> > > > > >
> > > > > > On Thu 27-09-18 11:22:22, Dan Williams wrote:
> > > > > > > On Thu, Sep 27, 2018 at 6:41 AM Jan Kara <jack@suse.cz> wrote:
> > > > > > > >
> > > > > > > > On Thu 27-09-18 06:28:43, Matthew Wilcox wrote:
> > > > > > > > > On Thu, Sep 27, 2018 at 01:23:32PM +0200, Jan Kara wrote:
> > > > > > > > > > When dax_lock_mapping_entry() has to sleep to obtain entry lock, it will
> > > > > > > > > > fail to unlock mapping->i_pages spinlock and thus immediately deadlock
> > > > > > > > > > against itself when retrying to grab the entry lock again. Fix the
> > > > > > > > > > problem by unlocking mapping->i_pages before retrying.
> > > > > > > > >
> > > > > > > > > It seems weird that xfstests doesn't provoke this ...
> > > > > > > >
> > > > > > > > The function currently gets called only from mm/memory-failure.c. And yes,
> > > > > > > > we are lacking DAX hwpoison error tests in fstests...
> > > > > > >
> > > > > > > I have an item on my backlog to port the ndctl unit test that does
> > > > > > > memory_failure() injection vs ext4 over to fstests. That said I've
> > > > > > > been investigating a deadlock on ext4 caused by this test. When I saw
> > > > > > > this patch I hoped it was root cause, but the test is still failing
> > > > > > > for me. Vishal is able to pass the test on his system, so the failure
> > > > > > > mode is timing dependent. I'm running this patch on top of -rc5 and
> > > > > > > still seeing the following deadlock.
> > > > > >
> > > > > > I went through the code but I don't see where the problem could be. How can
> > > > > > I run that test? Is KVM enough or do I need hardware with AEP dimms?
> > > > >
> > > > > KVM is enough... however, I have found a hack that makes the test pass:
> > > > >
> > > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > > index 52517f28e6f4..d7f035b1846e 100644
> > > > > --- a/mm/filemap.c
> > > > > +++ b/mm/filemap.c
> > > > > @@ -1668,6 +1668,9 @@ unsigned find_get_entries(struct address_space *mapping,
> > > > >                         goto repeat;
> > > > >                 }
> > > > >  export:
> > > > > +               if (iter.index < start)
> > > > > +                       continue;
> > > > > +
> > > > >                 indices[ret] = iter.index;
> > > > >                 entries[ret] = page;
> > > > >                 if (++ret == nr_entries)
> > > > >
> > > > > Is this a radix bug? I would never expect:
> > > > >
> > > > >     radix_tree_for_each_slot(slot, &mapping->i_pages, &iter, start)
> > > > >
> > > > > ...to return entries with an index < start. Without that change above
> > > > > we loop forever because dax_layout_busy_page() can't make forward
> > > > > progress. I'll dig into the radix code tomorrow, but maybe someone
> > > > > > else will beat me to it overnight.
> > > >
> > > > If 'start' is within a 2MB entry, iter.index can absolutely be less
> > > > than start.  I forget exactly what the radix tree code does, but I think
> > > > it returns index set to the canonical/base index of the entry.
> > >
> > > Ok, that makes sense. Then the bug is in dax_layout_busy_page() which
> > > needs to increment 'index' by the entry size. This might also explain
> > > why not every run sees it because you may get lucky and have a 4K
> > > entry.
> >
> > Hmm, no 2MB entry here.
> >
> > We go through the first find_get_entries and export:
> >
> >     export start: 0x0 index: 0x0 page: 0x822000a
> >     export start: 0x0 index: 0x200 page: 0xcc3801a
> >
> > Then dax_layout_busy_page sets 'start' to 0x201, and find_get_entries returns:
> >
> >     export start: 0x201 index: 0x200 page: 0xcc3801a
> >
> > ...forevermore.
>
> Are you sure there's not a 2MB entry starting at index 0x200? Because if
> there was, we'd get an infinite loop exactly as you describe in
> dax_layout_busy_page()

My debug code was buggy; these are 2MB entries. I'll send out a fix.
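
As a minimal sketch of why the lookup at start 0x201 keeps reporting
index 0x200: assuming 4K base pages, a 2MB entry covers 512 (0x200)
indices, and the reported index is the entry's canonical base rather
than the index that was asked for. The helper names below are
hypothetical and exist only for this illustration; they are not kernel
functions.

```c
#include <assert.h>

/* Assumption: 4K base pages, so a 2MB entry is order 9 (512 indices). */
#define TOY_PMD_ORDER 9

/* Number of page-cache indices covered by an entry of the given order. */
static unsigned long entry_nr_indices(unsigned int order)
{
	return 1UL << order;
}

/* Canonical (base) index of the order-sized entry that contains 'index'. */
static unsigned long entry_base_index(unsigned long index, unsigned int order)
{
	assert(order < sizeof(unsigned long) * 8);
	/* Round down to a multiple of the entry size (a power of two). */
	return index & ~(entry_nr_indices(order) - 1);
}
```

With these definitions, entry_base_index(0x201, TOY_PMD_ORDER) is 0x200,
which matches the "start: 0x201 index: 0x200" line in the trace above.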

> AFAICT. And it seems to me a lot of other places
> iterating over entries are borked in a similar way as they all assume that
> doing +1 to the current index is guaranteeing them forward progress. Now
> actual breakage resulting from this is limited as only DAX uses multiorder
> entries and thus not many of these iterators actually ever get called for
> radix tree with multiorder entries (e.g. tmpfs still inserts every 4k subpage
> of THP into the radix tree and iteration functions usually handle
> tail subpages in a special way). But this would really deserve larger
> cleanup.

Yeah, it's a subtle detail waiting to trip up new multi-order-radix
users. The shift reported in the iterator is 6 in this case.
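
The forward-progress problem discussed above can be sketched in
userspace. This is a toy model, not the radix tree code and not the
actual fix: toy_tree, toy_lookup_batch() and toy_scan() are hypothetical
names, and the two entries simply mirror the 0x0 and 0x200 entries from
the trace, assuming 4K pages and order-9 (2MB) entries. The lookup
reports an entry's base index even when 'start' falls inside it, so a
caller that advances by "last index + 1" never gets past the final
entry, while a caller that advances by the entry size does.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a multi-order entry covering (1 << order) indices. */
struct toy_entry {
	unsigned long base;	/* canonical (first) index of the entry */
	unsigned int order;	/* log2 of the number of indices covered */
};

/* Two 2MB entries, as in the trace: bases 0x0 and 0x200, order 9. */
static const struct toy_entry toy_tree[] = {
	{ 0x0,   9 },
	{ 0x200, 9 },
};
#define TOY_NR (sizeof(toy_tree) / sizeof(toy_tree[0]))

/*
 * Hypothetical batch lookup: collect every entry whose range extends
 * past 'start', including one that merely contains 'start'.
 */
static size_t toy_lookup_batch(unsigned long start,
			       const struct toy_entry **out, size_t max)
{
	size_t nr = 0;

	for (size_t i = 0; i < TOY_NR && nr < max; i++) {
		unsigned long end = toy_tree[i].base + (1UL << toy_tree[i].order);

		if (start < end)
			out[nr++] = &toy_tree[i];
	}
	return nr;
}

/*
 * Scan the toy tree. Returns the number of lookups needed to walk off
 * the end, or -1 if the scan is still stuck after max_calls lookups.
 */
static int toy_scan(int skip_whole_entry, int max_calls)
{
	const struct toy_entry *batch[TOY_NR];
	unsigned long index = 0;

	assert(max_calls > 0);
	for (int calls = 1; calls <= max_calls; calls++) {
		size_t nr = toy_lookup_batch(index, batch, TOY_NR);
		const struct toy_entry *last;

		if (nr == 0)
			return calls;	/* nothing left: scan finished */
		last = batch[nr - 1];
		/* Advance past the whole entry, or naively by one index. */
		index = skip_whole_entry ? last->base + (1UL << last->order)
					 : last->base + 1;
	}
	return -1;	/* no forward progress within max_calls */
}
```

Calling toy_scan(0, 1000) models the "+1" iteration and gives up
without finishing, because every lookup from 0x201 returns the entry at
0x200 again. Calling toy_scan(1, 1000) advances by the entry size and
finishes on the second lookup.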