linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH] userfaultfd: support control over mm of remote PIDs
@ 2021-09-26 17:06 Nadav Amit
  2021-09-27  9:29 ` David Hildenbrand
  2021-10-13  2:18 ` Peter Xu
  0 siblings, 2 replies; 8+ messages in thread
From: Nadav Amit @ 2021-09-26 17:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Nadav Amit, Andrea Arcangeli,
	Mike Rapoport, Peter Xu

From: Nadav Amit <namit@vmware.com>

Non-cooperative mode is useful, but it is available only for forked
processes. Userfaultfd could also be useful to monitor, debug and manage
the memory of remote processes.

To support this mode, add a new flag, UFFD_REMOTE_PID, and an optional
second argument to the userfaultfd syscall. When the flag is set, the
second argument is taken to be a pidfd that refers to the process to be
monitored. Otherwise the second argument is ignored.

The syscall enforces that the caller has CAP_SYS_PTRACE to prevent
misuse of this feature.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Nadav Amit <namit@vmware.com>

---

I know that I have an RFC regarding the use of io_uring with userfaultfd.
I do intend to follow up on that RFC as well, but it requires some more work.
---
 fs/userfaultfd.c | 71 ++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 59 insertions(+), 12 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 003f0d31743e..cf44e1e13a03 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -2053,10 +2053,39 @@ static void init_once_userfaultfd_ctx(void *mem)
 	seqcount_spinlock_init(&ctx->refile_seq, &ctx->fault_pending_wqh.lock);
 }
 
-SYSCALL_DEFINE1(userfaultfd, int, flags)
+static int userfaultfd_get_remote_mm(struct userfaultfd_ctx *ctx, int pidfd)
 {
-	struct userfaultfd_ctx *ctx;
-	int fd;
+	struct task_struct *task;
+	struct pid *pid;
+	struct fd f;
+	int ret;
+
+	f = fdget(pidfd);
+	if (!f.file)
+		return -EBADF;
+
+	pid = pidfd_pid(f.file);
+
+	task = get_pid_task(pid, PIDTYPE_PID);
+	ret = -ESRCH;
+	if (!task)
+		goto err_out;
+
+	ctx->mm = task->mm;
+	mmgrab(ctx->mm);
+	put_task_struct(task);
+	ret = 0;
+out:
+	return ret;
+err_out:
+	fdput(f);
+	goto out;
+}
+
+SYSCALL_DEFINE2(userfaultfd, int, flags, int, pidfd)
+{
+	struct userfaultfd_ctx *ctx = NULL;
+	int ret;
 
 	if (!sysctl_unprivileged_userfaultfd &&
 	    (flags & UFFD_USER_MODE_ONLY) == 0 &&
@@ -2067,14 +2096,19 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
 		return -EPERM;
 	}
 
+	if ((flags & UFFD_REMOTE_PID) && !capable(CAP_SYS_PTRACE))
+		return -EPERM;
+
 	BUG_ON(!current->mm);
 
 	/* Check the UFFD_* constants for consistency.  */
+	BUILD_BUG_ON(UFFD_REMOTE_PID & UFFD_SHARED_FCNTL_FLAGS);
 	BUILD_BUG_ON(UFFD_USER_MODE_ONLY & UFFD_SHARED_FCNTL_FLAGS);
 	BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
 	BUILD_BUG_ON(UFFD_NONBLOCK != O_NONBLOCK);
 
-	if (flags & ~(UFFD_SHARED_FCNTL_FLAGS | UFFD_USER_MODE_ONLY))
+	if (flags & ~(UFFD_SHARED_FCNTL_FLAGS | UFFD_USER_MODE_ONLY |
+		      UFFD_REMOTE_PID))
 		return -EINVAL;
 
 	ctx = kmem_cache_alloc(userfaultfd_ctx_cachep, GFP_KERNEL);
@@ -2086,17 +2120,30 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
 	ctx->features = 0;
 	ctx->released = false;
 	atomic_set(&ctx->mmap_changing, 0);
-	ctx->mm = current->mm;
-	/* prevent the mm struct to be freed */
-	mmgrab(ctx->mm);
+	ctx->mm = NULL;
+
+	if (flags & UFFD_REMOTE_PID) {
+		/* the remote mm is grabbed by the following call */
+		ret = userfaultfd_get_remote_mm(ctx, pidfd);
+		if (ret)
+			goto err_out;
+	} else {
+		ctx->mm = current->mm;
+		/* prevent the mm struct to be freed */
+		mmgrab(ctx->mm);
+	}
 
-	fd = anon_inode_getfd_secure("[userfaultfd]", &userfaultfd_fops, ctx,
+	ret = anon_inode_getfd_secure("[userfaultfd]", &userfaultfd_fops, ctx,
 			O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS), NULL);
-	if (fd < 0) {
+	if (ret < 0)
+		goto err_out;
+out:
+	return ret;
+err_out:
+	if (ctx->mm)
 		mmdrop(ctx->mm);
-		kmem_cache_free(userfaultfd_ctx_cachep, ctx);
-	}
-	return fd;
+	kmem_cache_free(userfaultfd_ctx_cachep, ctx);
+	goto out;
 }
 
 static int __init userfaultfd_init(void)
-- 
2.25.1
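
For readers who want to see what a caller of the proposed interface might look like, here is a minimal userspace sketch. It is not part of the patch. The value of UFFD_REMOTE_PID is a placeholder, because the posted diff does not include the userfaultfd.h change that defines it, and the helper names remote_uffd_flags() and userfaultfd_remote() are made up for illustration. On a kernel without this RFC applied the call is expected to fail.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>      /* O_CLOEXEC, O_NONBLOCK for base_flags */
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* HYPOTHETICAL placeholder: the real value is not given in the posted diff. */
#ifndef UFFD_REMOTE_PID
#define UFFD_REMOTE_PID 2
#endif

/* Fallback for older libc headers; 434 is the number on x86-64 and arm64. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif

/* Combine the usual userfaultfd flags with the proposed remote flag. */
static int remote_uffd_flags(int base_flags)
{
	/* Mirrors the BUILD_BUG_ON in the patch: no overlap with fcntl flags. */
	assert((UFFD_REMOTE_PID & (O_CLOEXEC | O_NONBLOCK)) == 0);
	return base_flags | UFFD_REMOTE_PID;
}

/*
 * Ask for a userfaultfd that controls the mm of `pid`, as the RFC proposes.
 * Returns the new descriptor, or -1 with errno set.
 */
static int userfaultfd_remote(pid_t pid, int base_flags)
{
	int pidfd, uffd, saved_errno;

	/* The patch takes a pidfd, not a numeric PID, as the second argument. */
	pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0)
		return -1;

	uffd = (int)syscall(SYS_userfaultfd, remote_uffd_flags(base_flags), pidfd);

	/* The pidfd is only needed for the lookup; close it, keeping errno. */
	saved_errno = errno;
	close(pidfd);
	errno = saved_errno;
	return uffd;
}
```

A monitor with CAP_SYS_PTRACE would call userfaultfd_remote(target_pid, O_CLOEXEC | O_NONBLOCK) and then go through the usual UFFDIO_API and UFFDIO_REGISTER steps on the returned descriptor.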


* Re: [RFC PATCH] userfaultfd: support control over mm of remote PIDs
  2021-09-26 17:06 [RFC PATCH] userfaultfd: support control over mm of remote PIDs Nadav Amit
@ 2021-09-27  9:29 ` David Hildenbrand
  2021-09-27 10:19   ` Nadav Amit
  2021-10-13  2:18 ` Peter Xu
  1 sibling, 1 reply; 8+ messages in thread
From: David Hildenbrand @ 2021-09-27  9:29 UTC (permalink / raw)
  To: Nadav Amit, Andrew Morton
  Cc: linux-mm, linux-kernel, Nadav Amit, Andrea Arcangeli,
	Mike Rapoport, Peter Xu

On 26.09.21 19:06, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> Non-cooperative mode is useful but only for forked processes.
> Userfaultfd can be useful to monitor, debug and manage memory of remote
> processes.
> 
> To support this mode, add a new flag, UFFD_REMOTE_PID, and an optional
> second argument to the userfaultfd syscall. When the flag is set, the
> second argument is assumed to be the PID of the process that is to be
> monitored. Otherwise the flag is ignored.
> 
> The syscall enforces that the caller has CAP_SYS_PTRACE to prevent
> misuse of this feature.

What is supposed to happen if the target process intends to use uffd itself?

-- 
Thanks,

David / dhildenb


* Re: [RFC PATCH] userfaultfd: support control over mm of remote PIDs
  2021-09-27  9:29 ` David Hildenbrand
@ 2021-09-27 10:19   ` Nadav Amit
  2021-09-27 17:06     ` David Hildenbrand
  0 siblings, 1 reply; 8+ messages in thread
From: Nadav Amit @ 2021-09-27 10:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, linux-mm, linux-kernel, Andrea Arcangeli,
	Mike Rapoport, Peter Xu



> On Sep 27, 2021, at 2:29 AM, David Hildenbrand <david@redhat.com> wrote:
> 
> On 26.09.21 19:06, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> Non-cooperative mode is useful but only for forked processes.
>> Userfaultfd can be useful to monitor, debug and manage memory of remote
>> processes.
>> To support this mode, add a new flag, UFFD_REMOTE_PID, and an optional
>> second argument to the userfaultfd syscall. When the flag is set, the
>> second argument is assumed to be the PID of the process that is to be
>> monitored. Otherwise the flag is ignored.
>> The syscall enforces that the caller has CAP_SYS_PTRACE to prevent
>> misuse of this feature.
> 
> What supposed to happen if the target process intents to use uffd itself?

Thanks for the quick response.

First, sorry that I mistakenly dropped the changes to userfaultfd.h
that define UFFD_REMOTE_PID.

As for your question: there are standard ways to deal with such cases,
similarly to when a debugged program wants to use PTRACE. One way is
to block the userfaultfd syscall, using seccomp. Another way is to do
chaining using ptrace (although using ptrace for anything is
challenging).

It is also possible to tailor something specific to userfaultfd,
but I think seccomp is a good enough solution. I am open to suggestions.
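
As a rough illustration of the seccomp option, the sketch below installs a classic-BPF filter that makes userfaultfd(2) fail with EPERM and allows everything else. The function name block_userfaultfd() is made up for this example, and the filter deliberately skips the seccomp_data.arch check that a production filter should perform, so treat it as a starting point rather than a hardened policy.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Install a seccomp filter that denies userfaultfd(2) with EPERM and
 * allows every other syscall. Returns 0 on success, -1 with errno set.
 * NOTE: no architecture check is done here; a real filter should add one.
 */
static int block_userfaultfd(void)
{
	struct sock_filter filter[] = {
		/* Load the syscall number from struct seccomp_data. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* userfaultfd: fall through to the deny; else skip to allow. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_userfaultfd, 0, 1),
		BPF_STMT(BPF_RET | BPF_K,
			 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
		.filter = filter,
	};

	/* Needed so that an unprivileged process may install a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Since seccomp filters are inherited across fork() and execve(), a launcher could call block_userfaultfd() in the child just before exec'ing the program that is going to be monitored.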


* Re: [RFC PATCH] userfaultfd: support control over mm of remote PIDs
  2021-09-27 10:19   ` Nadav Amit
@ 2021-09-27 17:06     ` David Hildenbrand
  2021-09-27 20:08       ` Nadav Amit
  0 siblings, 1 reply; 8+ messages in thread
From: David Hildenbrand @ 2021-09-27 17:06 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrew Morton, linux-mm, linux-kernel, Andrea Arcangeli,
	Mike Rapoport, Peter Xu

On 27.09.21 12:19, Nadav Amit wrote:
> 
> 
>> On Sep 27, 2021, at 2:29 AM, David Hildenbrand <david@redhat.com> wrote:
>>
>> On 26.09.21 19:06, Nadav Amit wrote:
>>> From: Nadav Amit <namit@vmware.com>
>>> Non-cooperative mode is useful but only for forked processes.
>>> Userfaultfd can be useful to monitor, debug and manage memory of remote
>>> processes.
>>> To support this mode, add a new flag, UFFD_REMOTE_PID, and an optional
>>> second argument to the userfaultfd syscall. When the flag is set, the
>>> second argument is assumed to be the PID of the process that is to be
>>> monitored. Otherwise the flag is ignored.
>>> The syscall enforces that the caller has CAP_SYS_PTRACE to prevent
>>> misuse of this feature.
>>
>> What supposed to happen if the target process intents to use uffd itself?
> 
> Thanks for the quick response.
> 
> First, sorry that I mistakenly dropped the changes to userfaultfd.h
> that define UFFD_REMOTE_PID.

Didn't even notice it :)

> 
> As for your question: there are standard ways to deal with such cases,
> similarly to when a debugged program wants to use PTRACE. One way is
> to block the userfaultfd syscall, using seccomp. Another way is to do
> chaining using ptrace (although using ptrace for anything is
> challenging).
> 
> It is also possible to add tailor something specific to userfaultfd,
> but I think seccomp is a good enough solution. I am open to suggestions.

If we have something already in place to handle PTRACE, we'd better 
reuse what's already there. Thanks!

-- 
Thanks,

David / dhildenb


* Re: [RFC PATCH] userfaultfd: support control over mm of remote PIDs
  2021-09-27 17:06     ` David Hildenbrand
@ 2021-09-27 20:08       ` Nadav Amit
  2021-09-27 20:11         ` David Hildenbrand
  0 siblings, 1 reply; 8+ messages in thread
From: Nadav Amit @ 2021-09-27 20:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Linux-MM, Linux Kernel Mailing List,
	Andrea Arcangeli, Mike Rapoport, Peter Xu



> On Sep 27, 2021, at 10:06 AM, David Hildenbrand <david@redhat.com> wrote:
> 
> On 27.09.21 12:19, Nadav Amit wrote:
>>> On Sep 27, 2021, at 2:29 AM, David Hildenbrand <david@redhat.com> wrote:
>>> 
>>> On 26.09.21 19:06, Nadav Amit wrote:
>>>> From: Nadav Amit <namit@vmware.com>
>>>> Non-cooperative mode is useful but only for forked processes.
>>>> Userfaultfd can be useful to monitor, debug and manage memory of remote
>>>> processes.
>>>> To support this mode, add a new flag, UFFD_REMOTE_PID, and an optional
>>>> second argument to the userfaultfd syscall. When the flag is set, the
>>>> second argument is assumed to be the PID of the process that is to be
>>>> monitored. Otherwise the flag is ignored.
>>>> The syscall enforces that the caller has CAP_SYS_PTRACE to prevent
>>>> misuse of this feature.
>>> 
>>> What supposed to happen if the target process intents to use uffd itself?
>> Thanks for the quick response.
>> First, sorry that I mistakenly dropped the changes to userfaultfd.h
>> that define UFFD_REMOTE_PID.
> 
> Didn't even notice it :)
> 
>> As for your question: there are standard ways to deal with such cases,
>> similarly to when a debugged program wants to use PTRACE. One way is
>> to block the userfaultfd syscall, using seccomp. Another way is to do
>> chaining using ptrace (although using ptrace for anything is
>> challenging).
>> It is also possible to add tailor something specific to userfaultfd,
>> but I think seccomp is a good enough solution. I am open to suggestions.
> 
> If we have something already in place to handle PTRACE, we'd better reuse what's already there. Thanks!

Just to ensure we are on the same page: I meant that this is usually
left for the user application to handle. The 2 basic solutions are to
not expose userfaultfd to the monitored process (easy using seccomp)
or to chain the two monitors (hard using ptrace).

Since ptrace is hard, in theory we can have facilities to “hijack”
a context and “inject” a uffd event into another monitor. I just think
it is total overkill at this stage.

* Re: [RFC PATCH] userfaultfd: support control over mm of remote PIDs
  2021-09-27 20:08       ` Nadav Amit
@ 2021-09-27 20:11         ` David Hildenbrand
  0 siblings, 0 replies; 8+ messages in thread
From: David Hildenbrand @ 2021-09-27 20:11 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrew Morton, Linux-MM, Linux Kernel Mailing List,
	Andrea Arcangeli, Mike Rapoport, Peter Xu

On 27.09.21 22:08, Nadav Amit wrote:
> 
> 
>> On Sep 27, 2021, at 10:06 AM, David Hildenbrand <david@redhat.com> wrote:
>>
>> On 27.09.21 12:19, Nadav Amit wrote:
>>>> On Sep 27, 2021, at 2:29 AM, David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 26.09.21 19:06, Nadav Amit wrote:
>>>>> From: Nadav Amit <namit@vmware.com>
>>>>> Non-cooperative mode is useful but only for forked processes.
>>>>> Userfaultfd can be useful to monitor, debug and manage memory of remote
>>>>> processes.
>>>>> To support this mode, add a new flag, UFFD_REMOTE_PID, and an optional
>>>>> second argument to the userfaultfd syscall. When the flag is set, the
>>>>> second argument is assumed to be the PID of the process that is to be
>>>>> monitored. Otherwise the flag is ignored.
>>>>> The syscall enforces that the caller has CAP_SYS_PTRACE to prevent
>>>>> misuse of this feature.
>>>>
>>>> What supposed to happen if the target process intents to use uffd itself?
>>> Thanks for the quick response.
>>> First, sorry that I mistakenly dropped the changes to userfaultfd.h
>>> that define UFFD_REMOTE_PID.
>>
>> Didn't even notice it :)
>>
>>> As for your question: there are standard ways to deal with such cases,
>>> similarly to when a debugged program wants to use PTRACE. One way is
>>> to block the userfaultfd syscall, using seccomp. Another way is to do
>>> chaining using ptrace (although using ptrace for anything is
>>> challenging).
>>> It is also possible to add tailor something specific to userfaultfd,
>>> but I think seccomp is a good enough solution. I am open to suggestions.
>>
>> If we have something already in place to handle PTRACE, we'd better reuse what's already there. Thanks!
> 
> Just to ensure we are on the same page: I meant that this is usually
> left for the user application to handle. The 2 basic solutions are to
> not expose userfaultfd to the monitored process (easy using seccomp)
> or to chain the two monitors (hard using ptrace).

Yes, and I agree that the first approach then makes sense. Chaining
might be way too complicated to support.

As long as the kernel continues to work when a second monitor tries to
register (which I think is the case), that should be good enough.

> 
> Since ptrace is hard, in theory we can have facilities to “hijack”
> a context and “inject” uffd event to another monitor. I just think
> it is a total overkill at this stage.

Agreed


-- 
Thanks,

David / dhildenb


* Re: [RFC PATCH] userfaultfd: support control over mm of remote PIDs
  2021-09-26 17:06 [RFC PATCH] userfaultfd: support control over mm of remote PIDs Nadav Amit
  2021-09-27  9:29 ` David Hildenbrand
@ 2021-10-13  2:18 ` Peter Xu
  2021-10-13 16:02   ` Nadav Amit
  1 sibling, 1 reply; 8+ messages in thread
From: Peter Xu @ 2021-10-13  2:18 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrew Morton, linux-mm, linux-kernel, Nadav Amit,
	Andrea Arcangeli, Mike Rapoport

On Sun, Sep 26, 2021 at 10:06:37AM -0700, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> Non-cooperative mode is useful but only for forked processes.
> Userfaultfd can be useful to monitor, debug and manage memory of remote
> processes.
> 
> To support this mode, add a new flag, UFFD_REMOTE_PID, and an optional
> second argument to the userfaultfd syscall. When the flag is set, the
> second argument is assumed to be the PID of the process that is to be
> monitored. Otherwise the flag is ignored.
> 
> The syscall enforces that the caller has CAP_SYS_PTRACE to prevent
> misuse of this feature.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
> Cc: Peter Xu <peterx@redhat.com>
> Signed-off-by: Nadav Amit <namit@vmware.com>

From one point of view, I think this patch looks just like the other patch for
process_madvise on DONTNEED: the new interface definitely opens a new way to do
things, but IMHO it would be great to discuss some detailed scenarios in which
it lets us do better than the existing facilities.

The thing is that uffd already provides a mechanism for doing things like
customized swapping, so IMHO that is not something new that this patch brings
(nor is it something the DONTNEED patch brings), as I raised in the other
thread about umap.

So IMHO it would be great to have some elaboration on how the "remote"
capability could help us do things better (e.g., use cases that we may not be
able to solve by linking against another uffd-supporting library, or by
registering uffd and then calling fork()).

(I skipped the security side of things. As I replied in the other thread, I
 think I buy your point about depending on the PTRACE capability, and the
 examples you gave of ptrace() and process_vm_writev() are persuasive to me,
 but I am no expert on that.)

Thanks,

-- 
Peter Xu
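
For context on the existing "register uffd, then fork()" route mentioned above, here is a minimal sketch of its first step: opening a userfaultfd and negotiating UFFD_FEATURE_EVENT_FORK. The helper name uffd_open_fork_events() is made up for this example. Depending on the kernel version and the vm.unprivileged_userfaultfd sysctl, both the syscall and the fork-event feature may require privileges (the feature needs CAP_SYS_PTRACE on recent kernels), so the function may legitimately fail.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Open a userfaultfd for the calling process and request fork events.
 * Returns the descriptor, or -1 with errno set.
 */
static int uffd_open_fork_events(void)
{
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_EVENT_FORK,
	};
	int fd, saved_errno;

	fd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (fd < 0)
		return -1;

	/* Handshake with the kernel and enable non-cooperative fork events. */
	if (ioctl(fd, UFFDIO_API, &api) != 0) {
		saved_errno = errno;
		close(fd);
		errno = saved_errno;
		return -1;
	}
	return fd;
}
```

After this step the process would register its ranges with UFFDIO_REGISTER and call fork(). The monitor then reads a UFFD_EVENT_FORK message that carries a new userfaultfd for the child. A remote mode such as the one proposed here would be aimed at processes that were not started this way.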


* Re: [RFC PATCH] userfaultfd: support control over mm of remote PIDs
  2021-10-13  2:18 ` Peter Xu
@ 2021-10-13 16:02   ` Nadav Amit
  0 siblings, 0 replies; 8+ messages in thread
From: Nadav Amit @ 2021-10-13 16:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: Andrew Morton, linux-mm, linux-kernel, Andrea Arcangeli, Mike Rapoport



> On Oct 12, 2021, at 7:18 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> On Sun, Sep 26, 2021 at 10:06:37AM -0700, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Non-cooperative mode is useful but only for forked processes.
>> Userfaultfd can be useful to monitor, debug and manage memory of remote
>> processes.
>> 
>> To support this mode, add a new flag, UFFD_REMOTE_PID, and an optional
>> second argument to the userfaultfd syscall. When the flag is set, the
>> second argument is assumed to be the PID of the process that is to be
>> monitored. Otherwise the flag is ignored.
>> 
>> The syscall enforces that the caller has CAP_SYS_PTRACE to prevent
>> misuse of this feature.
>> 
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
>> Cc: Peter Xu <peterx@redhat.com>
>> Signed-off-by: Nadav Amit <namit@vmware.com>
> 
> I think this patch from one pov looks just likes the other patch of the
> process_madvise on DONTNEED - the new interface definitely opens new way to do
> things, however IMHO it would be great to discuss some detailed scenario that
> we can do with it better than the existing facilities.
> 
> The thing is uffd already provides some mechanism for doing things like
> customized swapping, so that's not something new IMHO that this patch brings
> (neither is what the DONTNEED patch brings), just like when I raised in the
> other thread about umap.
> 
> So IMHO it'll be great if there can be some elaboration on how the "remote"
> capability could help us do things better (e.g., use cases that we may not
> solve with linking against another uffd-supported library, or we can't do with
> register uffd then fork()).
> 
> (I skipped the security side of things, as I replied in the other thread that I
> think I buy in your point on depending on PTRACE capability and also the
> examples you gave on ptrace() and process_vm_writev() are persuasive to me,
> but no expert on that..)

Fair enough. Let me get back to you once I can provide more data.

For now, I just ask you to keep this patch in the back of your mind if any
other change to the userfaultfd syscall is proposed, to prevent a potential
conflict.
