All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Michał Mirosław" <emmir@google.com>
To: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Peter Xu <peterx@redhat.com>,
	David Hildenbrand <david@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andrei Vagin <avagin@gmail.com>,
	Danylo Mocherniuk <mdanylo@google.com>,
	Paul Gofman <pgofman@codeweavers.com>,
	Cyrill Gorcunov <gorcunov@gmail.com>,
	Mike Rapoport <rppt@kernel.org>, Nadav Amit <namit@vmware.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Shuah Khan <shuah@kernel.org>,
	Christian Brauner <brauner@kernel.org>,
	Yang Shi <shy828301@gmail.com>, Vlastimil Babka <vbabka@suse.cz>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Yun Zhou <yun.zhou@windriver.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Alex Sierra <alex.sierra@amd.com>,
	Matthew Wilcox <willy@infradead.org>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	"Gustavo A . R . Silva" <gustavoars@kernel.org>,
	Dan Williams <dan.j.williams@intel.com>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
	Greg KH <gregkh@linuxfoundation.org>,
	kernel@collabora.com
Subject: Re: [PATCH v28 5/6] mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
Date: Thu, 10 Aug 2023 21:26:10 +0200	[thread overview]
Message-ID: <CABb0KFGftHi1t3Pt8V3XvsG=+-hvfQMMteW9VXEPrRmfUdZZWA@mail.gmail.com> (raw)
In-Reply-To: <20230809061603.1969154-6-usama.anjum@collabora.com>

On Wed, 9 Aug 2023 at 08:16, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
> Add some explanation and method to use write-protection and written-to
> on memory range.
[...]
> --- a/Documentation/admin-guide/mm/pagemap.rst
> +++ b/Documentation/admin-guide/mm/pagemap.rst
> @@ -227,3 +227,67 @@ Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
>  always 12 at most architectures). Since Linux 3.11 their meaning changes
>  after first clear of soft-dirty bits. Since Linux 4.2 they are used for
>  flags unconditionally.
> +
> +Pagemap Scan IOCTL
> +==================
> +
> +The ``PAGEMAP_SCAN`` IOCTL on the pagemap file can be used to get or optionally
> +clear the info about page table entries. The following operations are supported
> +in this IOCTL:
> +- Get the information if the pages have Async Write-Protection enabled
> +  (``PAGE_IS_WPALLOWED``), have been written to (``PAGE_IS_WRITTEN``), file mapped
> +  (``PAGE_IS_FILE``), present (``PAGE_IS_PRESENT``), swapped (``PAGE_IS_SWAPPED``)
> +  or page has pfn zero (``PAGE_IS_PFNZERO``).

A recent addition -- PAGE_IS_HUGE -- is missing.

BTW, it could be easier to understand if the page categories were
separated from the operation description and listed so that each has
its own line and maybe a longer description where needed.

> +- Find pages which have been written to and/or write protect
> +  (atomic ``PM_SCAN_WP_MATCHING + PM_SCAN_CHECK_WPASYNC``) the pages atomically.
> +  The (``PM_SCAN_WP_MATCHING``) is used to WP the matched pages. The
> +  (``PM_SCAN_CHECK_WPASYNC``) aborts the operation if non-Async-Write-Protected
> +  pages are found.

The operation the IOCTL does now is: "scan the process page tables and
report memory ranges matching provided criteria '.
Flags extend the operation: "PM_SCAN_WP_MATCHING write protects the
memory reported" (it does it atomically, but this is just an
optimization, isn't it? A process could gather the ranges, WP them,
and then copy.)
"PM_SCAN_CHECK_WPASYNC" aborts the scan early if a non-WP-able
matching page is found.

> +The ``struct pm_scan_arg`` is used as the argument of the IOCTL.
> + 1. The size of the ``struct pm_scan_arg`` must be specified in the ``size``
> +    field. This field will be helpful in recognizing the structure if extensions
> +    are done later.
> + 2. The flags can be specified in the ``flags`` field. The ``PM_SCAN_WP_MATCHING``
> +    and ``PM_SCAN_CHECK_WPASYNC`` are the only added flags at this time. The get
> +    operation is optionally performed depending upon if the output buffer is
> +    provided or not.
> + 3. The range is specified through ``start`` and ``end``.
> + 4. The output buffer of ``struct page_region`` array and size is specified in
> +    ``vec`` and ``vec_len``.
> + 5. The optional maximum requested pages are specified in the ``max_pages``.
> + 6. The masks are specified in ``category_mask``, ``category_anyof_mask``,
> +    ``category_inverted`` and ``return_mask``.
> +    1.  To find if ``PAGE_IS_WRITTEN`` flag is set for pages which have
> +        ``PAGE_IS_FILE`` set and ``PAGE_IS_SWAPPED`` unset, ``category_mask``
> +        is set to ``PAGE_IS_FILE | PAGE_IS_SWAPPED``, ``category_inverted`` is
> +        set to ``PAGE_IS_SWAPPED`` and ``return_mask`` is set to ``PAGE_IS_WRITTEN``.
> +        The output buffer in ``vec`` and length must be specified in ``vec_len``.
> +    2. To find pages which have either ``PAGE_IS_FILE`` or ``PAGE_IS_SWAPPED``
> +       set, ``category_anyof_mask`` is set to  ``PAGE_IS_FILE | PAGE_IS_SWAPPED``.
> +    3. To find written pages and engage write protect, ``PAGE_IS_WRITTEN`` is
> +       specified in ``category_mask`` and ``return_mask``. In addition to
> +       specifying the output buffer in ``vec`` and length in ``vec_len``, the
> +       ``PM_SCAN_WP_MATCHING`` is specified in ``flags`` to perform write protect
> +       on the range as well.

Could this be rewritten as examples? E.g.:

Finding dirty file-backed pages:

struct pm_scan_arg arg = {
 .size = sizeof(arg),
  .flags = 0,
 ...
   .category_mask = ...,
   .return_mask = ...
};
ssize_t n = ioctl(..., &arg);

Find dirty pages and write protect them in the same call:

arg = { ... };
do {
  ... ioctl(...)
} while(...);

(The code snippets heavily commented.)

> +The ``PAGE_IS_WRITTEN`` flag can be considered as the better and correct

"as a better-performing alternative"

> +alternative of soft-dirty flag. It doesn't get affected by housekeeping chores
> +(VMA merging) of the kernel and hence the user can find the true soft-dirty pages
> +only.

This is still an optimization, e.g. in THP case there might be too
many pages reported?

> + This IOCTL adds the atomic way to find which pages have been written and
> +write protect those pages again. This kind of operation is needed to efficiently
> +find out which pages have changed in the memory.

This repeats the description of PM_SCAN_WP_MATCHING -- I suggest
removing this part.

> +To get information about which pages have been written to or optionally write
> +protect the pages, following must be performed first in order:

"PAGE_IS_WRITTEN" category is used with uffd write protect-enabled
ranges to implement memory dirty tracking in userspace:

> + 1. The userfaultfd file descriptor is created with ``userfaultfd`` syscall.
> + 2. The ``UFFD_FEATURE_WP_UNPOPULATED`` and ``UFFD_FEATURE_WP_ASYNC`` features
> +    are set by ``UFFDIO_API`` IOCTL.
> + 3. The memory range is registered with ``UFFDIO_REGISTER_MODE_WP`` mode
> +    through ``UFFDIO_REGISTER`` IOCTL.
> + 4. Then any part of the registered memory or the whole memory region must
> +    be write protected using ``PAGEMAP_SCAN`` IOCTL with flag ``PM_SCAN_OP_WP``
> +    or the ``UFFDIO_WRITEPROTECT`` IOCTL can be used. Both of these perform the
> +    same operation. The former is better in terms of performance.

I guess that the UFFD performance could be fixed? But this part refers
to the old PM_SCAN_OP_WP, so an updated example is needed.

> + 5. Now the ``PAGEMAP_SCAN`` IOCTL can be used to either just find pages which
> +    have been written to and/or optionally write protect the pages as well.

"find the pages written to since they were last write protected", but
this sounds contradicting: we look for pages that were WP but written
anyway. (IOW: marking write-protected is an implementation detail -
the ioctl is to find pages that changed since they were last marked.)
Maybe we should call the operation "marking CLEAN" or alike?

Best Regards
Michał Mirosław

  reply	other threads:[~2023-08-10 19:26 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-09  6:15 [PATCH v28 0/6] Implement IOCTL to get and optionally clear info about PTEs Muhammad Usama Anjum
2023-08-09  6:15 ` [PATCH v28 1/6] userfaultfd: UFFD_FEATURE_WP_ASYNC Muhammad Usama Anjum
2023-08-09  6:15 ` [PATCH v28 2/6] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs Muhammad Usama Anjum
2023-08-10 17:32   ` Michał Mirosław
2023-08-11 12:02     ` Muhammad Usama Anjum
2023-08-11 13:32       ` Michał Mirosław
2023-08-11 15:30         ` Muhammad Usama Anjum
2023-08-12 14:22           ` Michał Mirosław
2023-08-10 19:07   ` Andrei Vagin
2023-08-11 15:19     ` Muhammad Usama Anjum
2023-08-09  6:16 ` [PATCH v28 3/6] fs/proc/task_mmu: Add fast paths to get/clear PAGE_IS_WRITTEN flag Muhammad Usama Anjum
2023-08-09  6:16 ` [PATCH v28 4/6] tools headers UAPI: Update linux/fs.h with the kernel sources Muhammad Usama Anjum
2023-08-09  6:16 ` [PATCH v28 5/6] mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL Muhammad Usama Anjum
2023-08-10 19:26   ` Michał Mirosław [this message]
2023-08-11 16:52     ` Muhammad Usama Anjum
2023-08-09  6:16 ` [PATCH v28 6/6] selftests: mm: add pagemap ioctl tests Muhammad Usama Anjum

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CABb0KFGftHi1t3Pt8V3XvsG=+-hvfQMMteW9VXEPrRmfUdZZWA@mail.gmail.com' \
    --to=emmir@google.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex.sierra@amd.com \
    --cc=avagin@gmail.com \
    --cc=axelrasmussen@google.com \
    --cc=brauner@kernel.org \
    --cc=dan.j.williams@intel.com \
    --cc=david@redhat.com \
    --cc=gorcunov@gmail.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=gustavoars@kernel.org \
    --cc=kernel@collabora.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mdanylo@google.com \
    --cc=namit@vmware.com \
    --cc=pasha.tatashin@soleen.com \
    --cc=peterx@redhat.com \
    --cc=pgofman@codeweavers.com \
    --cc=rppt@kernel.org \
    --cc=shuah@kernel.org \
    --cc=shy828301@gmail.com \
    --cc=surenb@google.com \
    --cc=usama.anjum@collabora.com \
    --cc=vbabka@suse.cz \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    --cc=yun.zhou@windriver.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.