From: Matthew Wilcox <willy@infradead.org>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Seema Pandit <seema.pandit@intel.com>,
	linux-nvdimm <linux-nvdimm@lists.01.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	stable <stable@vger.kernel.org>,
	Robert Barror <robert.barror@intel.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Jan Kara <jack@suse.cz>
Subject: Re: [PATCH] filesystem-dax: Disable PMD support
Date: Sun, 30 Jun 2019 08:23:24 -0700
Message-ID: <20190630152324.GA15900@bombadil.infradead.org>
In-Reply-To: <CAA9_cmcb-Prn6CnOx-mJfb9CRdf0uG9u4M1Vq1B1rKVemCD-Vw@mail.gmail.com>

On Sun, Jun 30, 2019 at 01:01:04AM -0700, Dan Williams wrote:
> On Sun, Jun 30, 2019 at 12:27 AM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Sat, Jun 29, 2019 at 9:03 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Thu, Jun 27, 2019 at 07:39:37PM -0700, Dan Williams wrote:
> > > > On Thu, Jun 27, 2019 at 12:59 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Thu, Jun 27, 2019 at 12:09:29PM -0700, Dan Williams wrote:
> > > > > > > This bug feels like we failed to unlock, or unlocked the wrong entry
> > > > > > > and this hunk in the bisected commit looks suspect to me. Why do we
> > > > > > > still need to drop the lock now that the radix_tree_preload() calls
> > > > > > > are gone?
> > > > > >
> > > > > > Nevermind, unmap_mapping_pages() takes a sleeping lock, but then I
> > > > > > wonder why we don't restart the lookup like the old implementation.
> > > > >
> > > > > We have the entry locked:
> > > > >
> > > > >                 /*
> > > > >                  * Make sure 'entry' remains valid while we drop
> > > > >                  * the i_pages lock.
> > > > >                  */
> > > > >                 dax_lock_entry(xas, entry);
> > > > >
> > > > >                 /*
> > > > >                  * Besides huge zero pages the only other thing that gets
> > > > >                  * downgraded are empty entries which don't need to be
> > > > >                  * unmapped.
> > > > >                  */
> > > > >                 if (dax_is_zero_entry(entry)) {
> > > > >                         xas_unlock_irq(xas);
> > > > >                         unmap_mapping_pages(mapping,
> > > > >                                         xas->xa_index & ~PG_PMD_COLOUR,
> > > > >                                         PG_PMD_NR, false);
> > > > >                         xas_reset(xas);
> > > > >                         xas_lock_irq(xas);
> > > > >                 }
> > > > >
> > > > > If something can remove a locked entry, then that would seem like the
> > > > > real bug.  Might be worth inserting a lookup there to make sure that it
> > > > > hasn't happened, I suppose?
> > > >
> > > > Nope, added a check, we do in fact get the same locked entry back
> > > > after dropping the lock.
> > > >
> > > > The deadlock revolves around the mmap_sem. One thread holds it for
> > > > read and then gets stuck indefinitely in get_unlocked_entry(). Once
> > > > that happens another rocksdb thread tries to mmap and gets stuck
> > > > trying to take the mmap_sem for write. Then all new readers, including
> > > > ps and top that try to access a remote vma, get queued behind
> > > > that write.
> > > >
> > > > It could also be the case that we're missing a wake up.
> > >
> > > OK, I have a Theory.
> > >
> > > get_unlocked_entry() doesn't check the size of the entry being waited for.
> > > So dax_iomap_pmd_fault() can end up sleeping waiting for a PTE entry,
> > > which is (a) foolish, because we know it's going to fall back, and (b)
> > > can lead to a missed wakeup because it's going to sleep waiting for
> > > the PMD entry to come unlocked.  Which it won't, unless there's a happy
> > > accident that happens to map to the same hash bucket.
> > >
> > > Let's see if I can steal some time this weekend to whip up a patch.
> >
> > Theory seems to have some evidence... I instrumented fs/dax.c to track
> > outstanding 'lock' entries and 'wait' events. At the time of the hang
> > we see no locks held and the waiter is waiting on a pmd entry:
> >
> > [ 4001.354334] fs/dax locked entries: 0
> > [ 4001.358425] fs/dax wait entries: 1
> > [ 4001.362227] db_bench/2445 index: 0x0 shift: 6
> > [ 4001.367099]  grab_mapping_entry+0x17a/0x260
> > [ 4001.371773]  dax_iomap_pmd_fault.isra.43+0x168/0x7a0
> > [ 4001.377316]  ext4_dax_huge_fault+0x16f/0x1f0
> > [ 4001.382086]  __handle_mm_fault+0x411/0x1390
> > [ 4001.386756]  handle_mm_fault+0x172/0x360
> 
> In fact, this naive fix is holding up so far:
> 
> @@ -215,7 +216,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
>          * queue to the start of that PMD.  This ensures that all offsets in
>          * the range covered by the PMD map to the same bit lock.
>          */
> -       if (dax_is_pmd_entry(entry))
> +       //if (dax_is_pmd_entry(entry))
>                 index &= ~PG_PMD_COLOUR;
>         key->xa = xas->xa;
>         key->entry_start = index;

Hah, that's a great naive fix!  Thanks for trying that out.

I think my theory was slightly mistaken, but your fix has the effect of
fixing the actual problem too.

The xas->xa_index for a PMD is going to be PMD-aligned (i.e. a multiple of
512), but xas_find_conflict() does _not_ adjust xa_index (... which I
really should have mentioned in the documentation).  So we go to sleep
on the PMD-aligned index instead of the index of the PTE.  Your patch
fixes this by using the PMD-aligned index for PTEs too.
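
To make the mismatch concrete, the key computation looks like this
(lightly paraphrased from fs/dax.c, so treat it as a sketch rather than
a verbatim copy):

static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
		void *entry, struct exceptional_entry_key *key)
{
	unsigned long hash;
	unsigned long index = xas->xa_index;

	/*
	 * If 'entry' is a PMD, align the 'index' that we use for the wait
	 * queue to the start of that PMD.  This ensures that all offsets in
	 * the range covered by the PMD map to the same bit lock.
	 */
	if (dax_is_pmd_entry(entry))
		index &= ~PG_PMD_COLOUR;
	key->xa = xas->xa;
	key->entry_start = index;

	/* Hash the (xarray, index) pair to pick one of the shared queues. */
	hash = hash_long((unsigned long)xas->xa ^ index, DAX_WAIT_TABLE_BITS);
	return wait_table + hash;
}

The PMD waiter arrives here with a PMD-aligned xa_index but a PTE
'entry', so key->entry_start ends up PMD-aligned; the task that later
unlocks the PTE computes its key from the PTE's own index, so the
wakeup's key never matches the waiter's.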

I'm trying to come up with a clean fix for this.  Clearly we
shouldn't wait for a PTE entry if we're looking for a PMD entry.
But what should get_unlocked_entry() return if it detects that case?
We could have it return an error code encoded as an internal entry,
like grab_mapping_entry() does.  Or we could have it return the _locked_
PTE entry, and have callers interpret that.
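
For the internal-entry flavour, I'm imagining something like the
untested sketch below.  dax_entry_order() is a hypothetical helper
(compare dax_is_pmd_entry()), and the error encoding mirrors what
grab_mapping_entry() already does with VM_FAULT_FALLBACK:

static void *get_unlocked_entry(struct xa_state *xas, unsigned int order)
{
	void *entry;
	struct wait_exceptional_entry_queue ewait;
	wait_queue_head_t *wq;

	init_wait(&ewait.wait);
	ewait.wait.func = wake_exceptional_entry_func;

	for (;;) {
		entry = xas_find_conflict(xas);
		if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
			return entry;
		/*
		 * Don't sleep waiting for a PTE entry when we want a
		 * PMD: we know we're going to fall back anyway.
		 * (dax_entry_order() is a hypothetical helper.)
		 */
		if (dax_entry_order(entry) < order)
			return xa_mk_internal(VM_FAULT_FALLBACK);
		if (!dax_is_locked(entry))
			return entry;

		wq = dax_entry_waitqueue(xas, entry, &ewait.key);
		prepare_to_wait_exclusive(wq, &ewait.wait,
					  TASK_UNINTERRUPTIBLE);
		xas_unlock_irq(xas);
		xas_reset(xas);
		schedule();
		finish_wait(wq, &ewait.wait);
		xas_lock_irq(xas);
	}
}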

At least get_unlocked_entry() is static, but it's got quite a few callers.
Trying to discern which ones might ask for a PMD entry is a bit tricky.
So this seems like a large patch which might have bugs.
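
On the caller side, grab_mapping_entry() would then propagate it with
something along these lines (equally hypothetical):

	entry = get_unlocked_entry(xas, order);
	if (xa_is_internal(entry)) {
		xas_unlock_irq(xas);
		return entry;	/* PMD fault path falls back to PTEs */
	}

and the PTE-only callers would just pass order 0, which can never
trigger the new case.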

Thoughts?