All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andreas Gruenbacher <agruenba@redhat.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	cluster-devel <cluster-devel@redhat.com>,
	Linux-MM <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>, Jan Kara <jack@suse.cz>
Subject: Re: [PATCH] [RFC] Trigger retry from fault vm operation
Date: Tue, 11 May 2021 16:59:59 +0200	[thread overview]
Message-ID: <CAHc6FU5LMhLfQO6wj8z0RD1Q3jv0reToP7=LSj5B-e50WYGnkA@mail.gmail.com> (raw)
In-Reply-To: <YJqQdKmBHz6oEqD1@casper.infradead.org>

On Tue, May 11, 2021 at 4:34 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Tue, May 11, 2021 at 04:01:13PM +0200, Andreas Gruenbacher wrote:
> > we have a locking problem in gfs2 that I don't have a proper solution for, so
> > I'm looking for suggestions.
> >
> > What's happening is that a page fault triggers during a read or write
> > operation, while we're holding a glock (the cluster-wide gfs2 inode
> > lock), and the page fault requires another glock.  We can recognize and
> > handle the case when both glocks are the same, but when the page fault requires
> > another glock, there is a chance that taking that other glock would deadlock.
>
> So we're looking at something like one file on a gfs2 filesystem being
> mmaped() and then doing read() or write() to another gfs2 file with the
> mmaped address being the passed to read()/write()?

Yes, those kinds of scenarios. Here's an example that Jan Kara came up with:

Two independent processes P1, P2. Two files F1, F2, and two mappings M1, M2
where M1 is a mapping of F1, M2 is a mapping of F2. Now P1 does DIO to F1
with M2 as a buffer, P2 does DIO to F2 with M1 as a buffer. They can race
like:

P1                                      P2
read()                                  read()
  gfs2_file_read_iter()                   gfs2_file_read_iter()
    gfs2_file_direct_read()                 gfs2_file_direct_read()
      locks glock of F1                       locks glock of F2
      iomap_dio_rw()                          iomap_dio_rw()
        bio_iov_iter_get_pages()                bio_iov_iter_get_pages()
          <fault in M2>                           <fault in M1>
            gfs2_fault()                            gfs2_fault()
              tries to grab glock of F2               tries to grab glock of F1

With cluster-wide locks, we can obviously end up with distributed
deadlock scenarios as well, of course.

> Have you looked at iov_iter_fault_in_readable() as a solution to
> your locking order?  That way, you bring the mmaped page in first
> (see generic_perform_write()).

Yes. The problem there is that we need to hold the inode glock from
->iomap_begin to ->iomap_end; that's what guarantees that the mapping
returned by ->iomap_begin remains valid.

> > When we realize that we may not be able to take the other glock in gfs2_fault,
> > we need to communicate that to the read or write operation, which will then
> > drop and re-acquire the "outer" glock and retry.  However, there doesn't seem
> > to be a good way to do that; we can only indicate that a page fault should fail
> > by returning VM_FAULT_SIGBUS or similar; that will then be mapped to -EFAULT.
> > We'd need something like VM_FAULT_RESTART that can be mapped to -EBUSY so that
> > we can tell the retry case apart from genuine -EFAULT errors.
>
> We do have VM_FAULT_RETRY ... does that retry at the wrong level?

There's also VM_FAULT_NOPAGE, but that only triggers a retry at the VM
level and doesn't propagate out far enough.

My impression is that VM_FAULT_RETRY is similar to VM_FAULT_NOPAGE
except that it allows the lock dropping optimization implemented in
maybe_unlock_mmap_for_io(). That error code can also only be used when
FAULT_FLAG_ALLOW_RETRY is set it seems. Correct me if I'm getting this
wrong.

Thanks,
Andreas


WARNING: multiple messages have this Message-ID (diff)
From: Andreas Gruenbacher <agruenba@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [PATCH] [RFC] Trigger retry from fault vm operation
Date: Tue, 11 May 2021 16:59:59 +0200	[thread overview]
Message-ID: <CAHc6FU5LMhLfQO6wj8z0RD1Q3jv0reToP7=LSj5B-e50WYGnkA@mail.gmail.com> (raw)
In-Reply-To: <YJqQdKmBHz6oEqD1@casper.infradead.org>

On Tue, May 11, 2021 at 4:34 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Tue, May 11, 2021 at 04:01:13PM +0200, Andreas Gruenbacher wrote:
> > we have a locking problem in gfs2 that I don't have a proper solution for, so
> > I'm looking for suggestions.
> >
> > What's happening is that a page fault triggers during a read or write
> > operation, while we're holding a glock (the cluster-wide gfs2 inode
> > lock), and the page fault requires another glock.  We can recognize and
> > handle the case when both glocks are the same, but when the page fault requires
> > another glock, there is a chance that taking that other glock would deadlock.
>
> So we're looking at something like one file on a gfs2 filesystem being
> mmaped() and then doing read() or write() to another gfs2 file with the
> mmaped address being the passed to read()/write()?

Yes, those kinds of scenarios. Here's an example that Jan Kara came up with:

Two independent processes P1, P2. Two files F1, F2, and two mappings M1, M2
where M1 is a mapping of F1, M2 is a mapping of F2. Now P1 does DIO to F1
with M2 as a buffer, P2 does DIO to F2 with M1 as a buffer. They can race
like:

P1                                      P2
read()                                  read()
  gfs2_file_read_iter()                   gfs2_file_read_iter()
    gfs2_file_direct_read()                 gfs2_file_direct_read()
      locks glock of F1                       locks glock of F2
      iomap_dio_rw()                          iomap_dio_rw()
        bio_iov_iter_get_pages()                bio_iov_iter_get_pages()
          <fault in M2>                           <fault in M1>
            gfs2_fault()                            gfs2_fault()
              tries to grab glock of F2               tries to grab glock of F1

With cluster-wide locks, we can obviously end up with distributed
deadlock scenarios as well, of course.

> Have you looked at iov_iter_fault_in_readable() as a solution to
> your locking order?  That way, you bring the mmaped page in first
> (see generic_perform_write()).

Yes. The problem there is that we need to hold the inode glock from
->iomap_begin to ->iomap_end; that's what guarantees that the mapping
returned by ->iomap_begin remains valid.

> > When we realize that we may not be able to take the other glock in gfs2_fault,
> > we need to communicate that to the read or write operation, which will then
> > drop and re-acquire the "outer" glock and retry.  However, there doesn't seem
> > to be a good way to do that; we can only indicate that a page fault should fail
> > by returning VM_FAULT_SIGBUS or similar; that will then be mapped to -EFAULT.
> > We'd need something like VM_FAULT_RESTART that can be mapped to -EBUSY so that
> > we can tell the retry case apart from genuine -EFAULT errors.
>
> We do have VM_FAULT_RETRY ... does that retry at the wrong level?

There's also VM_FAULT_NOPAGE, but that only triggers a retry at the VM
level and doesn't propagate out far enough.

My impression is that VM_FAULT_RETRY is similar to VM_FAULT_NOPAGE
except that it allows the lock dropping optimization implemented in
maybe_unlock_mmap_for_io(). That error code can also only be used when
FAULT_FLAG_ALLOW_RETRY is set it seems. Correct me if I'm getting this
wrong.

Thanks,
Andreas



  reply	other threads:[~2021-05-11 15:00 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-11 14:01 [PATCH] [RFC] Trigger retry from fault vm operation Andreas Gruenbacher
2021-05-11 14:01 ` [Cluster-devel] " Andreas Gruenbacher
2021-05-11 14:11 ` Matthew Wilcox
2021-05-11 14:11   ` [Cluster-devel] " Matthew Wilcox
2021-05-11 14:59   ` Andreas Gruenbacher [this message]
2021-05-11 14:59     ` Andreas Gruenbacher
2021-05-11 14:59     ` Andreas Gruenbacher

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAHc6FU5LMhLfQO6wj8z0RD1Q3jv0reToP7=LSj5B-e50WYGnkA@mail.gmail.com' \
    --to=agruenba@redhat.com \
    --cc=cluster-devel@redhat.com \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.