All of lore.kernel.org
 help / color / mirror / Atom feed
From: Al Viro <viro@ZenIV.linux.org.uk>
To: Miklos Szeredi <miklos@szeredi.hu>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Linux NFS list <linux-nfs@vger.kernel.org>,
	ceph-devel@vger.kernel.org, lustre-devel@lists.lustre.org,
	v9fs-developer@lists.sourceforge.net,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Jan Kara <jack@suse.cz>, Chris Wilson <chris@chris-wilson.co.uk>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Jeff Layton <jlayton@redhat.com>
Subject: Re: [PATCH v3 0/2] iov_iter: allow iov_iter_get_pages_alloc to allocate more pages per call
Date: Mon, 6 Feb 2017 03:05:32 +0000	[thread overview]
Message-ID: <20170206030532.GF13195@ZenIV.linux.org.uk> (raw)
In-Reply-To: <20170205220445.GE13195@ZenIV.linux.org.uk>

On Sun, Feb 05, 2017 at 10:04:45PM +0000, Al Viro wrote:

> Sure, you need to hit a fairly narrow window, especially if you are to
> cause damage in A, but AFAICS it's not impossible.  Consider e.g. the
> situation when you lose CPU on preempt on the way to memcpy(); in that
> case server might come back when A has incremented its stack footprint
> again.  Or A might end up taking a hardware interrupt and handling it
> on the normal kernel stack, etc.
> 
> Looks like *any* scenario where fuse_conn_abort() manages to run during
> that memcpy() has potential for that kind of trouble; any SMP box appears
> to be vulnerable, along with preempt UP...
> 
> Am I missing something that prevents that kind of problem?

For that matter, it doesn't have to be on-stack - e.g. fuse_get_link()
has kmalloc'ed buffer for destination, kfree'd upon failure.  Have the
damn thing lose the timeslice in fuse_copy_do() and you might very well
end up spraying user-supplied data over whatever ends up picking your
kfreed buffer.  That one could be reasonably dealt with if we switched
to page_alloc() and stuffed it into the ->pages[] instead...

Some observations regarding the arguments:
	* stack footprint is atrocious.  Consider e.g. fuse_mknod() - you
get 16 bytes of fuse_mknod_in + 120 bytes of struct fuse_args + 128 bytes
of fuse_entry_out.  All on stack, and that's on top of whatever the
callchain already has eaten, which might include e.g. nfsd stuff or
ecryptfs, etc.  Or fuse_get_parent(), for that matter, with 128 bytes of
fuse_entry_out + 120 bytes of fuse_args, both on stack.  This one is
guaranteed to have a nasty call chain - fuse_get_parent() <- reconnect_one()
<- reconnect_path() <- exportfs_decode_fh() (itself with a 256-byte array of
char on stack) <- nfsd_set_fh_dentry() <- fh_verify() <- a bunch of call
chains in nfsd.
	* "out" args (i.e. reply) are probably best dealt with by having
coallocated with request itself - some already are and the sizes tend
to be fixed and not too large (->get_link() is an exception, and it's
probably better handled as mentioned above).
	* "in" args (request) are in some cases easily dealt with by
coallocating with request, but there's a large class of situations where
we are passing dentry->d_name.name and then there's fuse_symlink().
The last one is ugly - potentially up to a page worth of data, coming
straight from method caller; usually it's a part of getname() result,
but e.g. ecryptfs might have it kmalloc'ed, nfsd - picked from sunrpc
request payload, etc.

	AFAICS, your argument applies to the requests that have
some page(s) locked until the request completion (unlock_page() either
by ->end() callback or in the originator of request).  If so, I would
rather mark those as "call request_end() early"; they seem to have
the non-page parts of args hosted in req->misc, so for them it's not
a problem.

	So how about this:

* explicit FR_END_IMMEDIATELY on read/write-related requests
* no FR_LOCKED flipping in lock_request()/unlock_request()
* modifying the call of end_requests() in fuse_abort_conn() so that it
would skip request_end() for everything that isn't marked FR_END_IMMEDIATELY
* make fuse_copy_pages() grab page references around the actual
fuse_copy_page() - grab req->waitq.lock, check FR_ABORTED, grab a page
reference in case it's not, drop req->waitq.lock and bugger off if FR_ABORTED
was set.  Adjust fuse_try_move_page() accordingly.

Do you see any problems with that approach for minimal fix?  If all requests
in need of FR_END_IMMEDIATELY turn out to have non-page part of args already
embedded into req->misc, it looks like this ought to suffice.  I probably
could post something along those lines tomorrow, if you see any serious
problems with that - please yell...

WARNING: multiple messages have this Message-ID (diff)
From: Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
To: Miklos Szeredi <miklos-sUDqSbJrdHQHWmgEVkV9KA@public.gmane.org>
Cc: Linux NFS list
	<linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>,
	Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Chris Wilson
	<chris-Y6uKTt2uX1cEflXRtASbqLVCufUGDwFn@public.gmane.org>,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	v9fs-developer-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Linus Torvalds
	<torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>,
	"Kirill A. Shutemov"
	<kirill.shutemov-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>,
	lustre-devel-aLEFhgZF4x6X6Mz3xDxJMA@public.gmane.org
Subject: Re: [PATCH v3 0/2] iov_iter: allow iov_iter_get_pages_alloc to allocate more pages per call
Date: Mon, 6 Feb 2017 03:05:32 +0000	[thread overview]
Message-ID: <20170206030532.GF13195@ZenIV.linux.org.uk> (raw)
In-Reply-To: <20170205220445.GE13195-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>

On Sun, Feb 05, 2017 at 10:04:45PM +0000, Al Viro wrote:

> Sure, you need to hit a fairly narrow window, especially if you are to
> cause damage in A, but AFAICS it's not impossible.  Consider e.g. the
> situation when you lose CPU on preempt on the way to memcpy(); in that
> case server might come back when A has incremented its stack footprint
> again.  Or A might end up taking a hardware interrupt and handling it
> on the normal kernel stack, etc.
> 
> Looks like *any* scenario where fuse_conn_abort() manages to run during
> that memcpy() has potential for that kind of trouble; any SMP box appears
> to be vulnerable, along with preempt UP...
> 
> Am I missing something that prevents that kind of problem?

For that matter, it doesn't have to be on-stack - e.g. fuse_get_link()
has kmalloc'ed buffer for destination, kfree'd upon failure.  Have the
damn thing lose the timeslice in fuse_copy_do() and you might very well
end up spraying user-supplied data over whatever ends up picking your
kfreed buffer.  That one could be reasonably dealt with if we switched
to page_alloc() and stuffed it into the ->pages[] instead...

Some observations regarding the arguments:
	* stack footprint is atrocious.  Consider e.g. fuse_mknod() - you
get 16 bytes of fuse_mknod_in + 120 bytes of struct fuse_args + 128 bytes
of fuse_entry_out.  All on stack, and that's on top of whatever the
callchain already has eaten, which might include e.g. nfsd stuff or
ecryptfs, etc.  Or fuse_get_parent(), for that matter, with 128 bytes of
fuse_entry_out + 120 bytes of fuse_args, both on stack.  This one is
guaranteed to have a nasty call chain - fuse_get_parent() <- reconnect_one()
<- reconnect_path() <- exportfs_decode_fh() (itself with a 256-byte array of
char on stack) <- nfsd_set_fh_dentry() <- fh_verify() <- a bunch of call
chains in nfsd.
	* "out" args (i.e. reply) are probably best dealt with by having
coallocated with request itself - some already are and the sizes tend
to be fixed and not too large (->get_link() is an exception, and it's
probably better handled as mentioned above).
	* "in" args (request) are in some cases easily dealt with by
coallocating with request, but there's a large class of situations where
we are passing dentry->d_name.name and then there's fuse_symlink().
The last one is ugly - potentially up to a page worth of data, coming
straight from method caller; usually it's a part of getname() result,
but e.g. ecryptfs might have it kmalloc'ed, nfsd - picked from sunrpc
request payload, etc.

	AFAICS, your argument applies to the requests that have
some page(s) locked until the request completion (unlock_page() either
by ->end() callback or in the originator of request).  If so, I would
rather mark those as "call request_end() early"; they seem to have
the non-page parts of args hosted in req->misc, so for them it's not
a problem.

	So how about this:

* explicit FR_END_IMMEDIATELY on read/write-related requests
* no FR_LOCKED flipping in lock_request()/unlock_request()
* modifying the call of end_requests() in fuse_abort_conn() so that it
would skip request_end() for everything that isn't marked FR_END_IMMEDIATELY
* make fuse_copy_pages() grab page references around the actual
fuse_copy_page() - grab req->waitq.lock, check FR_ABORTED, grab a page
reference in case it's not, drop req->waitq.lock and bugger off if FR_ABORTED
was set.  Adjust fuse_try_move_page() accordingly.

Do you see any problems with that approach for minimal fix?  If all requests
in need of FR_END_IMMEDIATELY turn out to have non-page part of args already
embedded into req->misc, it looks like this ought to suffice.  I probably
could post something along those lines tomorrow, if you see any serious
problems with that - please yell...

WARNING: multiple messages have this Message-ID (diff)
From: Al Viro <viro@ZenIV.linux.org.uk>
To: Miklos Szeredi <miklos-sUDqSbJrdHQHWmgEVkV9KA@public.gmane.org>
Cc: Linux NFS list
	<linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>,
	Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Chris Wilson
	<chris-Y6uKTt2uX1cEflXRtASbqLVCufUGDwFn@public.gmane.org>,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	v9fs-developer-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Linus Torvalds
	<torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>,
	"Kirill A. Shutemov"
	<kirill.shutemov-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>,
	lustre-devel-aLEFhgZF4x6X6Mz3xDxJMA@public.gmane.org
Subject: [lustre-devel] [PATCH v3 0/2] iov_iter: allow iov_iter_get_pages_alloc to allocate more pages per call
Date: Mon, 6 Feb 2017 03:05:32 +0000	[thread overview]
Message-ID: <20170206030532.GF13195@ZenIV.linux.org.uk> (raw)
In-Reply-To: <20170205220445.GE13195@ZenIV.linux.org.uk>

On Sun, Feb 05, 2017 at 10:04:45PM +0000, Al Viro wrote:

> Sure, you need to hit a fairly narrow window, especially if you are to
> cause damage in A, but AFAICS it's not impossible.  Consider e.g. the
> situation when you lose CPU on preempt on the way to memcpy(); in that
> case server might come back when A has incremented its stack footprint
> again.  Or A might end up taking a hardware interrupt and handling it
> on the normal kernel stack, etc.
> 
> Looks like *any* scenario where fuse_conn_abort() manages to run during
> that memcpy() has potential for that kind of trouble; any SMP box appears
> to be vulnerable, along with preempt UP...
> 
> Am I missing something that prevents that kind of problem?

For that matter, it doesn't have to be on-stack - e.g. fuse_get_link()
has kmalloc'ed buffer for destination, kfree'd upon failure.  Have the
damn thing lose the timeslice in fuse_copy_do() and you might very well
end up spraying user-supplied data over whatever ends up picking your
kfreed buffer.  That one could be reasonably dealt with if we switched
to page_alloc() and stuffed it into the ->pages[] instead...

Some observations regarding the arguments:
	* stack footprint is atrocious.  Consider e.g. fuse_mknod() - you
get 16 bytes of fuse_mknod_in + 120 bytes of struct fuse_args + 128 bytes
of fuse_entry_out.  All on stack, and that's on top of whatever the
callchain already has eaten, which might include e.g. nfsd stuff or
ecryptfs, etc.  Or fuse_get_parent(), for that matter, with 128 bytes of
fuse_entry_out + 120 bytes of fuse_args, both on stack.  This one is
guaranteed to have a nasty call chain - fuse_get_parent() <- reconnect_one()
<- reconnect_path() <- exportfs_decode_fh() (itself with a 256-byte array of
char on stack) <- nfsd_set_fh_dentry() <- fh_verify() <- a bunch of call
chains in nfsd.
	* "out" args (i.e. reply) are probably best dealt with by having
coallocated with request itself - some already are and the sizes tend
to be fixed and not too large (->get_link() is an exception, and it's
probably better handled as mentioned above).
	* "in" args (request) are in some cases easily dealt with by
coallocating with request, but there's a large class of situations where
we are passing dentry->d_name.name and then there's fuse_symlink().
The last one is ugly - potentially up to a page worth of data, coming
straight from method caller; usually it's a part of getname() result,
but e.g. ecryptfs might have it kmalloc'ed, nfsd - picked from sunrpc
request payload, etc.

	AFAICS, your argument applies to the requests that have
some page(s) locked until the request completion (unlock_page() either
by ->end() callback or in the originator of request).  If so, I would
rather mark those as "call request_end() early"; they seem to have
the non-page parts of args hosted in req->misc, so for them it's not
a problem.

	So how about this:

* explicit FR_END_IMMEDIATELY on read/write-related requests
* no FR_LOCKED flipping in lock_request()/unlock_request()
* modifying the call of end_requests() in fuse_abort_conn() so that it
would skip request_end() for everything that isn't marked FR_END_IMMEDIATELY
* make fuse_copy_pages() grab page references around the actual
fuse_copy_page() - grab req->waitq.lock, check FR_ABORTED, grab a page
reference in case it's not, drop req->waitq.lock and bugger off if FR_ABORTED
was set.  Adjust fuse_try_move_page() accordingly.

Do you see any problems with that approach for minimal fix?  If all requests
in need of FR_END_IMMEDIATELY turn out to have non-page part of args already
embedded into req->misc, it looks like this ought to suffice.  I probably
could post something along those lines tomorrow, if you see any serious
problems with that - please yell...

  reply	other threads:[~2017-02-06  3:05 UTC|newest]

Thread overview: 101+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-01-24 21:23 [PATCH] iov_iter: allow iov_iter_get_pages_alloc to allocate more pages per call Jeff Layton
2017-01-25 13:32 ` [PATCH v3 0/2] " Jeff Layton
2017-01-25 13:32   ` [PATCH v3 1/2] " Jeff Layton
2017-01-26 12:35     ` Jeff Layton
2017-01-27 13:24       ` [PATCH v4 0/2] " Jeff Layton
2017-01-27 13:24         ` [PATCH v4 1/2] " Jeff Layton
2017-01-27 13:24         ` [PATCH v4 2/2] ceph: switch DIO code to use iov_iter_get_pages_alloc Jeff Layton
2017-01-30 15:40           ` Jeff Layton
2017-01-30 15:40             ` Jeff Layton
2017-01-25 13:32   ` [PATCH v3 " Jeff Layton
2017-02-02  9:51   ` [PATCH v3 0/2] iov_iter: allow iov_iter_get_pages_alloc to allocate more pages per call Al Viro
2017-02-02  9:51     ` [lustre-devel] " Al Viro
2017-02-02  9:51     ` Al Viro
2017-02-02 10:56     ` Christoph Hellwig
2017-02-02 10:56       ` [lustre-devel] " Christoph Hellwig
2017-02-02 10:56       ` Christoph Hellwig
2017-02-02 11:16       ` Al Viro
2017-02-02 11:16         ` [lustre-devel] " Al Viro
2017-02-02 11:16         ` Al Viro
2017-02-02 13:00         ` Jeff Layton
2017-02-02 13:00           ` Jeff Layton
2017-02-03  7:29           ` Al Viro
2017-02-03  7:29             ` [lustre-devel] " Al Viro
2017-02-03  7:29             ` Al Viro
2017-02-03 18:29             ` Linus Torvalds
2017-02-03 18:29               ` [lustre-devel] " Linus Torvalds
2017-02-03 18:29               ` Linus Torvalds
2017-02-03 19:08               ` Al Viro
2017-02-03 19:08                 ` [lustre-devel] " Al Viro
2017-02-03 19:08                 ` Al Viro
2017-02-03 19:28                 ` Linus Torvalds
2017-02-03 19:28                   ` [lustre-devel] " Linus Torvalds
2017-02-03 19:28                   ` Linus Torvalds
2017-02-13  9:56                   ` Steve Capper
2017-02-13 21:40                     ` Linus Torvalds
2017-02-13 21:40                       ` [lustre-devel] " Linus Torvalds
2017-02-13 21:40                       ` Linus Torvalds
2017-02-03  7:49           ` Christoph Hellwig
2017-02-03  7:49             ` [lustre-devel] " Christoph Hellwig
2017-02-03  7:49             ` Christoph Hellwig
2017-02-03  8:54             ` Al Viro
2017-02-03  8:54               ` [lustre-devel] " Al Viro
2017-02-03  8:54               ` Al Viro
2017-02-03 11:09               ` Christoph Hellwig
2017-02-03 11:09                 ` [lustre-devel] " Christoph Hellwig
2017-02-03 11:09                 ` Christoph Hellwig
2017-02-02 14:48     ` Jan Kara
2017-02-02 14:48       ` [lustre-devel] " Jan Kara
2017-02-02 14:48       ` Jan Kara
2017-02-02 18:28       ` Al Viro
2017-02-02 18:28         ` [lustre-devel] " Al Viro
2017-02-02 18:28         ` Al Viro
2017-02-03 14:47         ` Jan Kara
2017-02-03 14:47           ` [lustre-devel] " Jan Kara
2017-02-03 14:47           ` Jan Kara
2017-02-04  3:08     ` Al Viro
2017-02-04  3:08       ` [lustre-devel] " Al Viro
2017-02-04  3:08       ` Al Viro
2017-02-04 19:26       ` Al Viro
2017-02-04 19:26         ` [lustre-devel] " Al Viro
2017-02-04 19:26         ` Al Viro
2017-02-04 22:12         ` Miklos Szeredi
2017-02-04 22:12           ` Miklos Szeredi
2017-02-04 22:11       ` Miklos Szeredi
2017-02-04 22:11         ` Miklos Szeredi
2017-02-05  1:51         ` Al Viro
2017-02-05  1:51           ` [lustre-devel] " Al Viro
2017-02-05  1:51           ` Al Viro
2017-02-05 20:15           ` Miklos Szeredi
2017-02-05 20:15             ` Miklos Szeredi
2017-02-05 21:01             ` Al Viro
2017-02-05 21:01               ` [lustre-devel] " Al Viro
2017-02-05 21:01               ` Al Viro
2017-02-05 21:19               ` Miklos Szeredi
2017-02-05 21:19                 ` Miklos Szeredi
2017-02-05 22:04                 ` Al Viro
2017-02-05 22:04                   ` [lustre-devel] " Al Viro
2017-02-05 22:04                   ` Al Viro
2017-02-05 22:04                   ` Al Viro
2017-02-06  3:05                   ` Al Viro [this message]
2017-02-06  3:05                     ` [lustre-devel] " Al Viro
2017-02-06  3:05                     ` Al Viro
2017-02-06  9:08                     ` Miklos Szeredi
2017-02-06  9:57                       ` Al Viro
2017-02-06  9:57                         ` [lustre-devel] " Al Viro
2017-02-06  9:57                         ` Al Viro
2017-02-06 14:18                         ` Miklos Szeredi
2017-02-07  7:19                           ` Al Viro
2017-02-07  7:19                             ` [lustre-devel] " Al Viro
2017-02-07  7:19                             ` Al Viro
2017-02-07 11:35                             ` Miklos Szeredi
2017-02-07 11:35                               ` Miklos Szeredi
2017-02-08  5:54                               ` Al Viro
2017-02-08  5:54                                 ` [lustre-devel] " Al Viro
2017-02-08  5:54                                 ` Al Viro
2017-02-08  9:53                                 ` Miklos Szeredi
2017-02-06  8:37                   ` Miklos Szeredi
2017-02-05 20:56           ` Al Viro
2017-02-05 20:56             ` [lustre-devel] " Al Viro
2017-02-05 20:56             ` Al Viro
2017-02-16 13:10     ` Jeff Layton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170206030532.GF13195@ZenIV.linux.org.uk \
    --to=viro@zeniv.linux.org.uk \
    --cc=ceph-devel@vger.kernel.org \
    --cc=chris@chris-wilson.co.uk \
    --cc=jack@suse.cz \
    --cc=jlayton@redhat.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=lustre-devel@lists.lustre.org \
    --cc=miklos@szeredi.hu \
    --cc=torvalds@linux-foundation.org \
    --cc=v9fs-developer@lists.sourceforge.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.