From: Eric Blake <eblake@redhat.com>
To: Andrea Arcangeli <aarcange@redhat.com>,
	qemu-devel@nongnu.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-api@vger.kernel.org,
	Android Kernel Team <kernel-team@android.com>
Cc: Robert Love <rlove@google.com>, Dave Hansen <dave@sr71.net>,
	Jan Kara <jack@suse.cz>, Neil Brown <neilb@suse.de>,
	Stefan Hajnoczi <stefanha@gmail.com>,
	Andrew Jones <drjones@redhat.com>,
	Sanidhya Kashyap <sanidhya.gatech@gmail.com>,
	KOSAKI Motohiro <kosaki.motohiro@gmail.com>,
	Michel Lespinasse <walken@google.com>,
	Taras Glek <tglek@mozilla.com>,
	zhang.zhanghailiang@huawei.com,
	Pavel Emelyanov <xemul@parallels.com>,
	Hugh Dickins <hughd@google.com>, Mel Gorman <mgorman@suse.de>,
	Sasha Levin <sasha.levin@oracle.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	"Huangpeng (Peter)" <peter.huangpeng@huawei.com>,
	Andres Lagar-Cavilla <andreslc@google.com>,
	Christopher Covington <cov@codeaurora.org>,
	Anthony Liguori <anthony@codemonkey.ws>,
	Paolo Bonzini <pbonzini@redhat.com>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	Keith Packard <keithp@keithp.com>,
	Wenchao Xia <wenchaoqemu@gmail.com>,
	Juan Quintela <quintela@redhat.com>,
	Andy Lutomirski <luto@amacapital.net>,
	Minchan Kim <minchan@kernel.org>,
	Dmitry Adamushko <dmitry.adamushko@gmail.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Mike Hommey <mh@glandium.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Peter Feiner <pfeiner@google.com>
Subject: Re: [Qemu-devel] [PATCH 02/21] userfaultfd: linux/Documentation/vm/userfaultfd.txt
Date: Fri, 06 Mar 2015 08:39:30 -0700
Message-ID: <54F9CA32.3050407@redhat.com>
In-Reply-To: <1425575884-2574-3-git-send-email-aarcange@redhat.com>

On 03/05/2015 10:17 AM, Andrea Arcangeli wrote:
> Add documentation.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  Documentation/vm/userfaultfd.txt | 97 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 97 insertions(+)
>  create mode 100644 Documentation/vm/userfaultfd.txt

Just a grammar review (no analysis of technical correctness)

> 
> diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt
> new file mode 100644
> index 0000000..2ec296c
> --- /dev/null
> +++ b/Documentation/vm/userfaultfd.txt
> @@ -0,0 +1,97 @@
> += Userfaultfd =
> +
> +== Objective ==
> +
> +Userfaults allow to implement on demand paging from userland and more

s/to implement/the implementation of/
and maybe: s/on demand/on-demand/

> +generally they allow userland to take control various memory page
> +faults, something otherwise only the kernel code could do.
> +
> +For example userfaults allows a proper and more optimal implementation
> +of the PROT_NONE+SIGSEGV trick.
> +
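
For readers who have not met the PROT_NONE+SIGSEGV trick referred to
above, a minimal sketch follows. It is an illustration only and not part
of the patch: the names region, on_segv and prot_none_demo are
hypothetical, and the 0x5a fill stands in for whatever data a real pager
would fetch. The region is mapped PROT_NONE, and the SIGSEGV handler
makes each faulting page accessible and populates it on first touch.
Calling mprotect() and memset() from a signal handler is assumed to be
acceptable here, as it is in practice on Linux.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static unsigned char *region;      /* hypothetical lazily populated area */
static size_t region_len;
static size_t page_size;
static volatile sig_atomic_t faults_seen;

/* SIGSEGV handler: make the faulting page accessible and fill it in. */
static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)region;

    /* A fault outside our region is a genuine crash: give up. */
    if (addr < base || addr >= base + region_len)
        _exit(1);

    void *page = (void *)(addr & ~(uintptr_t)(page_size - 1));
    if (mprotect(page, page_size, PROT_READ | PROT_WRITE) != 0)
        _exit(1);
    memset(page, 0x5a, page_size);  /* stand-in for fetching real data */
    faults_seen++;
}

/* Returns the number of faults taken, or -1 on error. */
int prot_none_demo(void)
{
    struct sigaction sa, old;
    long psz = sysconf(_SC_PAGESIZE);

    assert(psz > 0);
    page_size = (size_t)psz;
    region_len = 4 * page_size;
    faults_seen = 0;

    region = mmap(NULL, region_len, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return -1;

    volatile unsigned char *p = region;
    int ok = p[0] == 0x5a;              /* first touch of page 0: faults */
    ok &= p[2 * page_size] == 0x5a;     /* first touch of page 2: faults */
    ok &= p[1] == 0x5a;                 /* page 0 again: no fault */

    sigaction(SIGSEGV, &old, NULL);     /* restore the previous handler */
    munmap(region, region_len);
    return ok ? (int)faults_seen : -1;
}
```

Each mprotect() call in the handler changes the protection of a
sub-range of the mapping, which is the per-page vma work that the Design
section of the quoted text says userfaults avoid.
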
> +== Design ==
> +
> +Userfaults are delivered and resolved through the userfaultfd syscall.
> +
> +The userfaultfd (aside from registering and unregistering virtual
> +memory ranges) provides for two primary functionalities:

s/provides for/provides/

> +
> +1) read/POLLIN protocol to notify an userland thread of the faults

s/an userland/a userland/ (remember, 'a unicorn gets an umbrella' - if
the 'u' is pronounced 'you' the correct article is 'a')

> +   happening
> +
> +2) various UFFDIO_* ioctls that can mangle over the virtual memory
> +   regions registered in the userfaultfd that allows userland to
> +   efficiently resolve the userfaults it receives via 1) or to mangle
> +   the virtual memory in the background

maybe: s/mangle/manage/2

> +
> +The real advantage of userfaults if compared to regular virtual memory
> +management of mremap/mprotect is that the userfaults in all their
> +operations never involve heavyweight structures like vmas (in fact the
> +userfaultfd runtime load never takes the mmap_sem for writing).
> +
> +Vmas are not suitable for page(or hugepage)-granular fault tracking

s/page(or hugepage)-granular/page- (or hugepage-) granular/

> +when dealing with virtual address spaces that could span
> +Terabytes. Too many vmas would be needed for that.
> +
> +The userfaultfd once opened by invoking the syscall, can also be
> +passed using unix domain sockets to a manager process, so the same
> +manager process could handle the userfaults of a multitude of
> +different process without them being aware about what is going on

s/process/processes/

> +(well of course unless they later try to use the userfaultfd themself

s/themself/themselves/

> +on the same region the manager is already tracking, which is a corner
> +case that would currently return -EBUSY).
> +
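
The descriptor passing described above relies on the ordinary
SCM_RIGHTS mechanism of unix domain sockets. A minimal sketch of that
mechanism follows, under the assumption that a userfaultfd is passed
like any other file descriptor; a pipe is used as a stand-in so the
example runs without userfaultfd support. The helper names send_fd,
recv_fd and fd_passing_demo are hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control-message buffer, aligned as struct cmsghdr requires. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send descriptor fd over the connected AF_UNIX socket sock. */
static int send_fd(int sock, int fd)
{
    char byte = 0;                      /* at least one data byte is sent */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;

    memset(&msg, 0, sizeof msg);
    memset(&u, 0, sizeof u);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof u.buf;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof fd);

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from sock; returns it, or -1 on error. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    int fd = -1;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof u.buf;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof fd);
    return fd;
}

/* Pass the read end of a pipe across a socketpair and read through it. */
int fd_passing_demo(void)
{
    int sv[2], pfd[2];
    char c = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0 || pipe(pfd) != 0)
        return -1;
    if (send_fd(sv[0], pfd[0]) != 0)
        return -1;
    int got = recv_fd(sv[1]);           /* a new fd for the same pipe */
    if (got < 0)
        return -1;
    if (write(pfd[1], "x", 1) != 1 || read(got, &c, 1) != 1)
        return -1;

    close(got); close(pfd[0]); close(pfd[1]); close(sv[0]); close(sv[1]);
    return c == 'x' ? 0 : -1;
}
```

In the scenario of the quoted text, the sender would be the process
that opened the userfaultfd and the receiver would be the manager
process, with the two ends of the socket living in different processes
rather than in one as here.
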
> +== API ==
> +
> +When first opened the userfaultfd must be enabled invoking the
> +UFFDIO_API ioctl specifying an uffdio_api.api value set to UFFD_API

s/an uffdio/a uffdio/

> +which will specify the read/POLLIN protocol userland intends to speak
> +on the UFFD. The UFFDIO_API ioctl if successful (i.e. if the requested
> +uffdio_api.api is spoken also by the running kernel), will return into
> +uffdio_api.bits and uffdio_api.ioctls two 64bit bitmasks of
> +respectively the activated feature bits below PAGE_SHIFT in the
> +userfault addresses returned by read(2) and the generic ioctl
> +available.
> +
> +Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
> +be invoked (if present in the returned uffdio_api.ioctls bitmask) to
> +register a memory range in the userfaultfd by setting the
> +uffdio_register structure accordingly. The uffdio_register.mode
> +bitmask will specify to the kernel which kind of faults to track for
> +the range (UFFDIO_REGISTER_MODE_MISSING would track missing
> +pages). The UFFDIO_REGISTER ioctl will return the
> +uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
> +userfaults on the range reigstered. Not all ioctls will necessarily be

s/reigstered/registered/

> +supported for all memory types depending on the underlying virtual
> +memory backend (anonymous memory vs tmpfs vs real filebacked
> +mappings).
> +
> +Userland can use the uffdio_register.ioctls to mangle the virtual

maybe s/mangle/manage/

> +address space in the background (to add or potentially also remove
> +memory from the userfaultfd registered range). This means an userfault

s/an/a/

> +could be triggering just before userland maps in the background the
> +user-faulted page. To avoid POLLIN resulting in an unexpected blocking
> +read (if the UFFD is not opened in nonblocking mode in the first
> +place), we don't allow the background thread to wake userfaults that
> +haven't been read by userland yet. If we would do that likely the
> +UFFDIO_WAKE ioctl could be dropped. This may change in the future
> +(with a UFFD_API protocol bumb combined with the removal of the

s/bumb/bump/

> +UFFDIO_WAKE ioctl) if it'll be demonstrated that it's a valid
> +optimization and worthy to force userland to use the UFFD always in
> +nonblocking mode if combined with POLLIN.
> +
> +userfaultfd is also a generic enough feature, that it allows KVM to
> +implement postcopy live migration (one form of memory externalization
> +consisting of a virtual machine running with part or all of its memory
> +residing on a different node in the cloud) without having to modify a
> +single line of KVM kernel code. Guest async page faults, FOLL_NOWAIT
> +and all other GUP features works just fine in combination with
> +userfaults (userfaults trigger async page faults in the guest
> +scheduler so those guest processes that aren't waiting for userfaults
> +can keep running in the guest vcpus).
> +
> +The primary ioctl to resolve userfaults is UFFDIO_COPY. That
> +atomically copies a page into the userfault registered range and wakes
> +up the blocked userfaults (unless uffdio_copy.mode &
> +UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to
> +UFFDIO_COPY.
> 
> 
> 
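
To make the API section above more concrete, here is a hedged sketch of
the UFFDIO_API, UFFDIO_REGISTER and UFFDIO_COPY sequence. It is not
taken from this RFC: it assumes the interface as later merged into
mainline and exposed by <linux/userfaultfd.h>, where the second
handshake field is named uffdio_api.features rather than the
uffdio_api.bits described in the quoted text. The function name
uffd_copy_demo and the 0x5a fill are hypothetical. To stay
single-threaded it populates the page in the background before anything
touches it, so no fault is ever delivered and the read/POLLIN side is
not exercised.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns 0 on success, 1 if userfaultfd cannot be opened, -1 on error. */
int uffd_copy_demo(void)
{
    long psz = sysconf(_SC_PAGESIZE);
    unsigned char *area = MAP_FAILED, *src = MAP_FAILED;
    int flags = O_CLOEXEC;
    int ret = -1;

#ifdef UFFD_USER_MODE_ONLY
    flags |= UFFD_USER_MODE_ONLY;   /* newer kernels: no privilege needed */
#endif
    assert(psz > 0);
    int uffd = (int)syscall(SYS_userfaultfd, flags);
    if (uffd < 0)
        return 1;                   /* not supported or not permitted */

    /* Handshake: announce the protocol version we intend to speak. */
    struct uffdio_api api;
    memset(&api, 0, sizeof api);
    api.api = UFFD_API;
    if (ioctl(uffd, UFFDIO_API, &api) != 0 ||
        !(api.ioctls & (1ULL << _UFFDIO_REGISTER)))
        goto out;

    area = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    src = mmap(NULL, psz, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED || src == MAP_FAILED)
        goto out;
    memset(src, 0x5a, psz);         /* contents to install in area */

    /* Track missing-page faults on the (still untouched) area. */
    struct uffdio_register reg;
    memset(&reg, 0, sizeof reg);
    reg.range.start = (unsigned long)area;
    reg.range.len = psz;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) != 0 ||
        !(reg.ioctls & (1ULL << _UFFDIO_COPY)))
        goto out;

    /* Atomically install the page; with mode 0 waiters would be woken. */
    struct uffdio_copy copy;
    memset(&copy, 0, sizeof copy);
    copy.dst = (unsigned long)area;
    copy.src = (unsigned long)src;
    copy.len = psz;
    copy.mode = 0;
    if (ioctl(uffd, UFFDIO_COPY, &copy) != 0 || copy.copy != psz)
        goto out;                   /* never touch area if the copy failed */

    /* The page is now present, so these reads do not fault. */
    if (area[0] == 0x5a && area[psz - 1] == 0x5a)
        ret = 0;
out:
    if (area != MAP_FAILED)
        munmap(area, psz);
    if (src != MAP_FAILED)
        munmap(src, psz);
    close(uffd);
    return ret;
}
```

A real user would normally add a thread that poll()s and read()s the
descriptor and answers each reported fault with UFFDIO_COPY. The sketch
deliberately bails out before touching the registered area if the copy
fails, since a fault on it with nobody reading the descriptor would
block forever.
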

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



  reply	other threads:[~2015-03-06 15:40 UTC|newest]

