From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1755107AbbJSVmZ (ORCPT ); Mon, 19 Oct 2015 17:42:25 -0400
Received: from mx1.redhat.com ([209.132.183.28]:42897 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1755086AbbJSVmU (ORCPT ); Mon, 19 Oct 2015 17:42:20 -0400
Date: Mon, 19 Oct 2015 23:42:16 +0200
From: Andrea Arcangeli
To: Patrick Donnelly
Cc: Andrew Morton , open list , linux-mm@kvack.org,
	qemu-devel@nongnu.org, kvm@vger.kernel.org, Pavel Emelyanov ,
	Sanidhya Kashyap , zhang.zhanghailiang@huawei.com,
	Linus Torvalds , "Kirill A. Shutemov" , Andres Lagar-Cavilla ,
	Dave Hansen , Paolo Bonzini , Rik van Riel , Mel Gorman ,
	Andy Lutomirski , Hugh Dickins , Peter Feiner ,
	"Dr. David Alan Gilbert" , Johannes Weiner , "Huangpeng (Peter)"
Subject: Re: [PATCH 0/7] userfault21 update
Message-ID: <20151019214216.GU19147@redhat.com>
References: <1434388931-24487-1-git-send-email-aarcange@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Hello Patrick,

On Mon, Oct 12, 2015 at 11:04:11AM -0400, Patrick Donnelly wrote:
> Hello Andrea,
>
> On Mon, Jun 15, 2015 at 1:22 PM, Andrea Arcangeli wrote:
> > This is an incremental update to the userfaultfd code in -mm.
>
> Sorry I'm late to this party. I'm curious how a ptrace monitor might
> use a userfaultfd to handle faults in all of its tracees. Is this
> possible without having each (newly forked) tracee "cooperate" by
> creating a userfaultfd and passing that to the tracer?

To make the non-cooperative usage work, userfaultfd also needs more
features to track fork() and mremap() syscalls and such, as the monitor
needs to be aware of modifications to the address space of each "mm" it
is managing, and of newly forked "mm"s as well.
So once we add those features a forked child won't need to call
userfaultfd itself, and the monitor still won't need to know about the
"pid". The uffd_msg already has padding to add the features you need
for that. Pavel invented and developed those features for the
non-cooperative usage to implement postcopy live migration of
containers. He posted a patchset on the lists too, but it probably
needs to be rebased on upstream.

The ptrace monitor thread can also fault into the userfault area if it
wants to (but only if it's not the userfault manager thread as well).
I didn't expect the ptrace monitor to want to be a userfault manager
too, though.

On a side note, the signals the ptrace monitor sends to the tracee
(SIGCONT|STOP included) will be executed by the tracee without waiting
for userfault resolution from the userfault manager only if the
tracee's userfault wasn't triggered in kernel context (and in a
non-cooperative usage that's not an assumption you can make). If the
tracee hits a userfault while running in kernel context, the userfault
manager must resolve the userfault before any signal (except SIGKILL of
course) can be executed by the tracee. Only SIGKILL is instantly
executed by all tracees, no matter whether the userfault happened in
kernel or user context. That may be another reason for not wanting the
ptrace monitor and the userfault manager in the same thread (they can
still run in two different threads of the same external process).

> Have you considered using one userfaultfd for an entire tree of
> processes (signaled through a flag)? Would not a process id included
> in the include/uapi/linux/userfaultfd.h:struct uffd_msg be sufficient
> to disambiguate faults?

I recently got a private email asking a corollary question about having
the faulting IP (instruction pointer) address in the uffd_msg. I
answered it, and I take the opportunity to quote my answer below, as
it's somewhat connected with your "pid" question and adds more context.

===
At times it's the kernel accessing the page (copy-user, get user pages),
for example when the buffer is a parameter to the write or read
syscalls. The IP address triggering the fault isn't necessarily a
userland address. Furthermore not even the pid is known, so you don't
know which process accessed it. userfaultfd only notifies userland that
a certain page is requested and must be mapped ASAP. You don't know why
or who touched it.
===

Now about adding the "pid": the association between "pid" and "mm"
isn't so strict in the kernel. You can tell which "pid"s share the same
"mm", but looking from userland you can't always tell which
"mm" (/process) a pid belongs to. At times async io threads or
vhost-net threads can impersonate the "mm" and in effect become part of
the process, and you'd get the random "pid"s of those kernel threads.
It could also be a ptrace that triggers a userfault, with a "pid" that
isn't part of the application, and the manager must still work
seamlessly no matter who or which "pid" triggered the userfault.

So overall, dealing with "pid"s doesn't sound very clean, as the same
kernel thread "pid" can impersonate multiple "mm"s and you wouldn't get
the information of which "mm" the "address" belongs to.

When userfaultfd() is called, it literally binds to the "mm" the
process is running on and it's pid agnostic. Then when a kernel thread
impersonating the "mm" faults into the "mm" with get_user_pages or
copy_user, or when a ptrace faults into the "mm", the userfault manager
won't even see the difference.
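
To make the point about the message contents concrete, here is a
minimal sketch (not taken from the patchset) that mirrors the layout of
struct uffd_msg as I understand it from
include/uapi/linux/userfaultfd.h around the time of this thread. It is
written in Python with ctypes only so it can be checked without kernel
headers. The class names and the parse_fault() helper are made up for
illustration, and the field layout and the 0x12 event value are quoted
from memory, so verify them against the header you build with.

```python
import ctypes

# Assumed value of UFFD_EVENT_PAGEFAULT in the uapi header.
UFFD_EVENT_PAGEFAULT = 0x12


class PageFault(ctypes.Structure):
    # Page-fault payload: fault flags and the faulting address, nothing else.
    _fields_ = [
        ("flags", ctypes.c_uint64),
        ("address", ctypes.c_uint64),
    ]


class Reserved(ctypes.Structure):
    # Padding that leaves room for future, larger event payloads.
    _fields_ = [
        ("reserved1", ctypes.c_uint64),
        ("reserved2", ctypes.c_uint64),
        ("reserved3", ctypes.c_uint64),
    ]


class Arg(ctypes.Union):
    _fields_ = [("pagefault", PageFault), ("reserved", Reserved)]


class UffdMsg(ctypes.Structure):
    # The kernel struct is packed; with these field types natural
    # alignment already yields the same 32-byte layout.
    _fields_ = [
        ("event", ctypes.c_uint8),
        ("reserved1", ctypes.c_uint8),
        ("reserved2", ctypes.c_uint16),
        ("reserved3", ctypes.c_uint32),
        ("arg", Arg),
    ]


def parse_fault(raw: bytes) -> dict:
    """Decode one message as it would be read() from a userfaultfd."""
    msg = UffdMsg.from_buffer_copy(raw)
    return {
        "event": msg.event,
        "flags": msg.arg.pagefault.flags,
        "address": msg.arg.pagefault.address,
    }
```

Under these assumptions a manager decoding a message gets an event
type, the fault flags and an address within the "mm" the userfaultfd is
bound to. Nothing in the payload identifies the faulting task, and the
reserved words are the padding mentioned above as room for the
fork()/mremap() tracking features.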

Thanks,
Andrea