linux-kernel.vger.kernel.org archive mirror
* [PATCH] gfs2: Fix mmap + page fault deadlocks
@ 2021-07-01 20:42 Andreas Gruenbacher
  2021-07-01 21:41 ` Linus Torvalds
  0 siblings, 1 reply; 7+ messages in thread
From: Andreas Gruenbacher @ 2021-07-01 20:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Gruenbacher, Alexander Viro, cluster-devel, linux-kernel,
	Jan Kara, Matthew Wilcox

Hi Linus,

here's another attempt at fixing the mmap + page fault deadlocks we're
seeing on gfs2.  Still not ideal because get_user_pages_fast ignores the
current->pagefault_disabled flag, but at least this fixes the
easy-to-exploit problems.  Would you mind having a look?

For getting get_user_pages_fast changed to fix this properly, I'd need
help from the memory management folks.  With that, we could get rid of
the pagefault_disabled() checks in gfs2_page_mkwrite and gfs2_fault.

Thanks,
Andreas

--

In the .read_iter and .write_iter file operations, we're accessing
user-space memory while holding the inode's glock.  There's a possibility
that the memory is mapped to the same file, in which case we'd recurse on
the same glock.  More complex scenarios can involve multiple glocks,
processes, and cluster nodes.

This patch avoids these kinds of problems by disabling page faults during
read and write operations.  If a page fault occurs, we either end up with a
partial read or write, or with -EFAULT if nothing could be read or written.
In that case, we drop the outer glock, fault in the pages manually, and
repeat the operation.

The caveat is that direct I/O doesn't trigger page faults.  Instead, it
faults in user pages manually via bio_iov_iter_get_pages ->
iov_iter_get_pages -> get_user_pages_fast, which ignores the
current->pagefault_disabled flag.  The consequence is that we can end up in
a random .fault or .page_mkwrite vm operation with page faults disabled,
which can lead to the same kind of deadlock (directly if we end up in
gfs2_fault or gfs2_page_mkwrite; indirectly if we happen to end up in a
handler that ultimately depends on the glock we're holding).

I believe it would make sense to change get_user_pages_fast to check for
the pagefault_disabled flag.  Meanwhile, we can at least check if page
faults are disabled in gfs2_fault and gfs2_page_mkwrite to prevent gfs2
from directly faulting in pages when it shouldn't.

This case of recursive locking was originally reported by Jan Kara.
Disabling page faults was suggested by Linus, with help from Al Viro and
Matthew Wilcox.

Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 55ec1cadc9e6..5b1af227d0b9 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -433,6 +433,14 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 	loff_t size;
 	int err;
 
+	if (pagefault_disabled()) {
+		/*
+		 * Direct I/O; pagefault_disabled flag ignored by
+		 * iov_iter_get_pages -> get_user_pages_fast.
+		 */
+		return VM_FAULT_SIGBUS;
+	}
+
 	sb_start_pagefault(inode->i_sb);
 
 	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh);
@@ -561,6 +569,14 @@ static vm_fault_t gfs2_fault(struct vm_fault *vmf)
 	vm_fault_t ret;
 	int err;
 
+	if (pagefault_disabled()) {
+		/*
+		 * Direct I/O; pagefault_disabled flag ignored by
+		 * iov_iter_get_pages -> get_user_pages_fast.
+		 */
+		return VM_FAULT_SIGBUS;
+	}
+
 	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
 	err = gfs2_glock_nq(&gh);
 	if (err) {
@@ -776,25 +792,59 @@ static int gfs2_fsync(struct file *file, loff_t start, loff_t end,
 	return ret ? ret : ret1;
 }
 
+static bool gfs2_fault_in_pages(struct iov_iter *i, struct page ***pages,
+				unsigned int *nrpages)
+{
+	size_t start;
+	ssize_t size;
+
+	if (*pages)
+		return false;
+	size = iov_iter_get_pages_alloc(i, pages, 256 * PAGE_SIZE, &start);
+	if (size < 0)
+		return false;
+	*nrpages = DIV_ROUND_UP(start + size, PAGE_SIZE);
+	return true;
+}
+
+static void gfs2_release_pages(struct page **pages, unsigned int nrpages)
+{
+	unsigned int i;
+
+	for (i = 0; i < nrpages; i++)
+		put_page(pages[i]);
+	kvfree(pages);
+}
+
 static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
 				     struct gfs2_holder *gh)
 {
 	struct file *file = iocb->ki_filp;
 	struct gfs2_inode *ip = GFS2_I(file->f_mapping->host);
 	size_t count = iov_iter_count(to);
+	struct page **pages = NULL;
+	unsigned int nrpages;
 	ssize_t ret;
 
 	if (!count)
 		return 0; /* skip atime */
 
 	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+retry:
 	ret = gfs2_glock_nq(gh);
 	if (ret)
 		goto out_uninit;
 
+	pagefault_disable();
 	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0);
+	pagefault_enable();
 	gfs2_glock_dq(gh);
+	if (unlikely(ret == -EFAULT) &&
+	    gfs2_fault_in_pages(to, &pages, &nrpages))
+		goto retry;
 out_uninit:
+	if (unlikely(pages))
+		gfs2_release_pages(pages, nrpages);
 	gfs2_holder_uninit(gh);
 	return ret;
 }
@@ -807,6 +857,8 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	struct gfs2_inode *ip = GFS2_I(inode);
 	size_t len = iov_iter_count(from);
 	loff_t offset = iocb->ki_pos;
+	struct page **pages = NULL;
+	unsigned int nrpages;
 	ssize_t ret;
 
 	/*
@@ -818,6 +870,7 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	 * VFS does.
 	 */
 	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+retry:
 	ret = gfs2_glock_nq(gh);
 	if (ret)
 		goto out_uninit;
@@ -826,18 +879,27 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	if (offset + len > i_size_read(&ip->i_inode))
 		goto out;
 
+	pagefault_disable();
 	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0);
+	pagefault_enable();
 	if (ret == -ENOTBLK)
 		ret = 0;
 out:
 	gfs2_glock_dq(gh);
+	if (unlikely(ret == -EFAULT) &&
+	    gfs2_fault_in_pages(from, &pages, &nrpages))
+		goto retry;
 out_uninit:
+	if (unlikely(pages))
+		gfs2_release_pages(pages, nrpages);
 	gfs2_holder_uninit(gh);
 	return ret;
 }
 
 static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
+	struct page **pages = NULL;
+	unsigned int nrpages;
 	struct gfs2_inode *ip;
 	struct gfs2_holder gh;
 	size_t written = 0;
@@ -864,14 +926,22 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	}
 	ip = GFS2_I(iocb->ki_filp->f_mapping->host);
 	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
+retry:
 	ret = gfs2_glock_nq(&gh);
 	if (ret)
 		goto out_uninit;
+	pagefault_disable();
 	ret = generic_file_read_iter(iocb, to);
+	pagefault_enable();
 	if (ret > 0)
 		written += ret;
 	gfs2_glock_dq(&gh);
+	if (unlikely(ret == -EFAULT) &&
+	    gfs2_fault_in_pages(to, &pages, &nrpages))
+		goto retry;
 out_uninit:
+	if (unlikely(pages))
+		gfs2_release_pages(pages, nrpages);
 	gfs2_holder_uninit(&gh);
 	return written ? written : ret;
 }
@@ -880,11 +950,22 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *fro
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
+	struct page **pages = NULL;
+	unsigned int nrpages;
 	ssize_t ret;
 
+retry:
 	current->backing_dev_info = inode_to_bdi(inode);
+	pagefault_disable();
 	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
+	pagefault_enable();
 	current->backing_dev_info = NULL;
+	if (unlikely(ret == -EFAULT) &&
+	    gfs2_fault_in_pages(from, &pages, &nrpages))
+		goto retry;
+	if (unlikely(pages))
+		gfs2_release_pages(pages, nrpages);
+
 	return ret;
 }
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] gfs2: Fix mmap + page fault deadlocks
  2021-07-01 20:42 [PATCH] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
@ 2021-07-01 21:41 ` Linus Torvalds
  2021-07-01 21:52   ` Linus Torvalds
  2021-07-02  0:20   ` Andreas Gruenbacher
  0 siblings, 2 replies; 7+ messages in thread
From: Linus Torvalds @ 2021-07-01 21:41 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Alexander Viro, cluster-devel, Linux Kernel Mailing List,
	Jan Kara, Matthew Wilcox

On Thu, Jul 1, 2021 at 1:43 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> here's another attempt at fixing the mmap + page fault deadlocks we're
> seeing on gfs2.  Still not ideal because get_user_pages_fast ignores the
> current->pagefault_disabled flag

Of course get_user_pages_fast() ignores the pagefault_disabled flag,
because it doesn't do any page faults.

If you don't want to fall back to the "maybe do IO" case, you should
use the FOLL_FAST_ONLY flag - or get_user_pages_fast_only(), which
does that itself.
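
A minimal sketch of what using the fast-only variant looks like — get_user_pages_fast_only() is the real mm API; the surrounding buffer and error handling here are hypothetical:

```c
/*
 * Sketch only: get_user_pages_fast_only() never sleeps and never
 * takes page faults - it either pins pages that are already
 * resident or returns <= 0 immediately, so it is safe to call
 * while filesystem locks are held.
 */
struct page *pages[16];
unsigned long uaddr = user_buffer_address;	/* hypothetical */
int nr;

nr = get_user_pages_fast_only(uaddr, 16, FOLL_WRITE, pages);
if (nr <= 0) {
	/*
	 * Pages not resident (or not writable): the caller must
	 * drop its locks, fault the pages in manually, and retry.
	 * No IO has been started at this point.
	 */
	return -EFAULT;
}
/* nr pages are pinned: do the IO, then put_page() each of them. */
```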

> For getting get_user_pages_fast changed to fix this properly, I'd need
> help from the memory management folks.

I really don't think you need anything at all from the mm people,
because we already support that whole "fast only" case.

Also, I have to say that I think the direct-IO code is fundamentally
mis-designed. Why it is doing the page lookup _during_ the IO is a
complete mystery to me. Why wasn't that done ahead of time before the
filesystem took the locks it needed?

So what the direct-IO code _should_ do is to turn an ITER_IOVEC into a
ITER_KVEC by doing the page lookup ahead of time, and none of these
issues should even exist, and then the whole pagefault_disabled and/or
FOLL_FAST_ONLY would be a complete non-issue.

Is there any reason why that isn't what it does (other than historical baggage)?
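For concreteness, such a conversion could look roughly like this — a sketch only, under the assumption that the caller pre-allocates the page and bvec arrays; this helper does not exist in the kernel:

```c
/*
 * Hypothetical helper, not an existing kernel API: pin the user
 * pages of an ITER_IOVEC up front and re-point the iterator at
 * them as an ITER_BVEC, so the IO path that runs afterwards can
 * no longer take page faults.
 */
static ssize_t iter_pin_as_bvec(struct iov_iter *iter, unsigned int direction,
				struct page **pages, struct bio_vec *bv,
				unsigned int max_pages)
{
	size_t start, left;
	unsigned int i;
	ssize_t n;

	n = iov_iter_get_pages(iter, pages, max_pages * PAGE_SIZE,
			       max_pages, &start);
	if (n <= 0)
		return n;

	/* Build one bio_vec per pinned page. */
	for (i = 0, left = n; left; i++) {
		size_t seg = min_t(size_t, left, PAGE_SIZE - start);

		bv[i] = (struct bio_vec) {
			.bv_page = pages[i],
			.bv_offset = start,
			.bv_len = seg,
		};
		left -= seg;
		start = 0;
	}
	iov_iter_bvec(iter, direction, bv, i, n);
	return n;	/* caller must put_page() the pages after IO */
}
```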

               Linus


* Re: [PATCH] gfs2: Fix mmap + page fault deadlocks
  2021-07-01 21:41 ` Linus Torvalds
@ 2021-07-01 21:52   ` Linus Torvalds
  2021-07-02  0:20   ` Andreas Gruenbacher
  1 sibling, 0 replies; 7+ messages in thread
From: Linus Torvalds @ 2021-07-01 21:52 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Alexander Viro, cluster-devel, Linux Kernel Mailing List,
	Jan Kara, Matthew Wilcox

On Thu, Jul 1, 2021 at 2:41 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So what the direct-IO code _should_ do is to turn an ITER_IOVEC into a
> ITER_KVEC by doing the page lookup ahead of time

Actually, an ITER_BVEC, not ITER_KVEC. It wants a page array, not a
kernel pointer array.

But I hope people understood what I meant.

             Linus


* Re: [PATCH] gfs2: Fix mmap + page fault deadlocks
  2021-07-01 21:41 ` Linus Torvalds
  2021-07-01 21:52   ` Linus Torvalds
@ 2021-07-02  0:20   ` Andreas Gruenbacher
  2021-07-02  0:30     ` Linus Torvalds
  1 sibling, 1 reply; 7+ messages in thread
From: Andreas Gruenbacher @ 2021-07-02  0:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexander Viro, cluster-devel, Linux Kernel Mailing List,
	Jan Kara, Matthew Wilcox

On Thu, Jul 1, 2021 at 11:41 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jul 1, 2021 at 1:43 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > here's another attempt at fixing the mmap + page fault deadlocks we're
> > seeing on gfs2.  Still not ideal because get_user_pages_fast ignores the
> > current->pagefault_disabled flag
>
> Of course get_user_pages_fast() ignores the pagefault_disabled flag,
> because it doesn't do any page faults.
>
> If you don't want to fall back to the "maybe do IO" case, you should
> use the FOLL_FAST_ONLY flag - or get_user_pages_fast_only(), which
> does that itself.
>
> > For getting get_user_pages_fast changed to fix this properly, I'd need
> > help from the memory management folks.
>
> I really don't think you need anything at all from the mm people,
> because we already support that whole "fast only" case.

Yes, fair enough.

> Also, I have to say that I think the direct-IO code is fundamentally
> mis-designed. Why it is doing the page lookup _during_ the IO is a
> complete mystery to me. Why wasn't that done ahead of time before the
> filesystem took the locks it needed?

That would be inconvenient for reads when the number of bytes actually
read turns out to be much smaller than the buffer size: paging in the
entire buffer up front would be wasted work.

> So what the direct-IO code _should_ do is to turn an ITER_IOVEC into a
> ITER_KVEC by doing the page lookup ahead of time, and none of these
> issues should even exist, and then the whole pagefault_disabled and/or
> FOLL_FAST_ONLY would be a complete non-issue.
>
> Is there any reason why that isn't what it does (other than historical baggage)?

It turns out that there's an even deeper issue with keeping references
to user-space pages. Those references will essentially pin the glock
of the associated inode to the node. Moving a glock off a node
requires truncating the inode's page cache, but the page references
would prevent that. So we'd only end up with different kinds of
potential deadlocks.

If we could get iomap_dio_rw to use "fast only" mode when requested,
we could fault in the pages without keeping references, try the IO,
and repeat when necessary.
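The fault-in-without-references step could look roughly like this — a sketch, where fault_in_pages_readable() is the pagemap.h helper of the time, and gh/from are as in the patch above:

```c
/*
 * Sketch: populate the user buffer without taking page references,
 * so nothing pins the inode's page cache (and thus the glock).
 * fault_in_pages_readable() touches each page but holds no refs;
 * it returns 0 on success.  'buf' is the user pointer behind
 * 'from' and is illustrative here.
 */
gfs2_glock_dq(gh);				/* drop the glock first */
if (fault_in_pages_readable(buf, iov_iter_count(from)))
	return -EFAULT;				/* buffer truly inaccessible */
goto retry;					/* re-take the glock, retry the IO */
```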

Thanks a lot,
Andreas



* Re: [PATCH] gfs2: Fix mmap + page fault deadlocks
  2021-07-02  0:20   ` Andreas Gruenbacher
@ 2021-07-02  0:30     ` Linus Torvalds
  2021-07-02  0:44       ` Linus Torvalds
  0 siblings, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2021-07-02  0:30 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Alexander Viro, cluster-devel, Linux Kernel Mailing List,
	Jan Kara, Matthew Wilcox

On Thu, Jul 1, 2021 at 5:20 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> On Thu, Jul 1, 2021 at 11:41 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > Also, I have to say that I think the direct-IO code is fundamentally
> > mis-designed. Why it is doing the page lookup _during_ the IO is a
> > complete mystery to me. Why wasn't that done ahead of time before the
> > filesystem took the locks it needed?
>
> That would be inconvenient for reads, when the number of bytes read is
> much smaller than the buffer size and we won't need to page in the
> entire buffer.

What?

A file read will READ THE WHOLE BUFFER.

We're not talking pipes or ttys here. If you ask for X bytes, you'll
get X bytes.

Of course, if you ask for more data than the file has, that's another
thing, but who really does that with direct-IO? And if they do, why
should we care about their silly behavior?

Face it, right now direct-IO is *BUGGY* because of this, and you can
deadlock filesystems with it.

So tell me again how it's "inconvenient" to fix this bug, and fix the
bad direct-IO design?

              Linus


* Re: [PATCH] gfs2: Fix mmap + page fault deadlocks
  2021-07-02  0:30     ` Linus Torvalds
@ 2021-07-02  0:44       ` Linus Torvalds
  2021-07-02  1:14         ` Andreas Grünbacher
  0 siblings, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2021-07-02  0:44 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Alexander Viro, cluster-devel, Linux Kernel Mailing List,
	Jan Kara, Matthew Wilcox

On Thu, Jul 1, 2021 at 5:30 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> Of course, if you ask for more data than the file has, that's another
> thing, but who really does that with direct-IO? And if they do, why
> should we care about their silly behavior?

Now, if the issue is that people do IO for bigger areas than you have
memory for, then I think that's a chunking issue. I don't think the
ITER_IOVEC necessarily needs to be converted to an ITER_BVEC in one
single go. That would indeed be painful if somebody tries to do some
huge direct-IO when they just don't have the memory for it.

But the fact is, direct-IO has been an incredible pain-point for
decades, because it's (a) unusual, (b) buggy and (c) has very little
overall design and common code.

The three issues are likely intimately tied together.

The iomap code at least has tried to make for much more common code,
but I really think some direct-IO people should seriously reconsider
how they are doing things when there are fundamental deadlocks in the
design.

And I do think that a ITER_IOVEC -> ITER_BVEC conversion function
might be a really really good idea to solve this problem. There's even
a very natural chunking algorithm: try to do as much as possible with
get_user_pages_fast() - so that if you already *have* the memory, you
can do the full IO (or at least a big part of it).

And if get_user_pages_fast() only gives you a small area - or nothing
at all - you chunk it down aggressively, and realize that "oh, doing
direct-IO when user space is paged out might not be the most optimal
case".
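
That chunking strategy could be sketched as follows — get_user_pages_fast_only() is the real mm API; fault_in_chunk(), submit_dio() and the bookkeeping are made up for illustration:

```c
/*
 * Illustrative chunking loop: pin as much of the buffer as is
 * already resident; if nothing is, fault in a single page and
 * chunk down before continuing.
 */
while (pages_left) {
	int want = min(pages_left, MAX_CHUNK_PAGES);
	int nr = get_user_pages_fast_only(addr, want, gup_flags, pages);

	if (nr <= 0) {
		/* Nothing resident: fault in one page, chunk down. */
		if (fault_in_chunk(addr, PAGE_SIZE))
			return -EFAULT;
		continue;
	}
	submit_dio(pages, nr);		/* IO over what we already have */
	release_pages_after_io(pages, nr);
	addr += (unsigned long)nr * PAGE_SIZE;
	pages_left -= nr;
}
```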

              Linus


* Re: [PATCH] gfs2: Fix mmap + page fault deadlocks
  2021-07-02  0:44       ` Linus Torvalds
@ 2021-07-02  1:14         ` Andreas Grünbacher
  0 siblings, 0 replies; 7+ messages in thread
From: Andreas Grünbacher @ 2021-07-02  1:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Gruenbacher, Alexander Viro, cluster-devel,
	Linux Kernel Mailing List, Jan Kara, Matthew Wilcox

On Fri, Jul 2, 2021 at 2:48 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jul 1, 2021 at 5:30 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > Of course, if you ask for more data than the file has, that's another
> > thing, but who really does that with direct-IO? And if they do, why
> > should we care about their silly behavior?
>
> Now, if the issue is that people do IO for bigger areas than you have
> memory for, then I think that's a chunking issue. I don't think the
> ITER_IOVEC necessarily needs to be converted to an ITER_BVEC in one
> single go. That would indeed be painful if somebody tries to do some
> huge direct-IO when they just don't have the memory for it.
>
> But the fact is, direct-IO has been an incredible pain-point for
> decades, because it's (a) unusual, (b) buggy and (c) has very little
> overall design and common code.
>
> The three issues are likely intimately tied together.
>
> The iomap code at least has tried to make for much more common code,
> but I really think some direct-IO people should seriously reconsider
> how they are doing things when there are fundamental deadlocks in the
> design.
>
> And I do think that a ITER_IOVEC -> ITER_BVEC conversion function
> might be a really really good idea to solve this problem.

I've tried to explain above how keeping the user-space pages
referenced will just lead to different kinds of deadlocks. That is the
problem with this approach.

> There's even
> a very natural chunking algorithm: try to do as much as possible with
> get_user_pages_fast() - so that if you already *have* the memory, you
> can do the full IO (or at least a big part of it).
>
> And if get_user_pages_fast() only gives you a small area - or nothing
> at all - you chunk it down aggressively, and realize that "oh, doing
> direct-IO when user space is paged out might not be the most optimal
> case".
>
>               Linus

Thanks,
Andreas

