All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Michał Mirosław" <emmir@google.com>
To: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: "Peter Xu" <peterx@redhat.com>,
	"David Hildenbrand" <david@redhat.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Andrei Vagin" <avagin@gmail.com>,
	"Danylo Mocherniuk" <mdanylo@google.com>,
	"Paul Gofman" <pgofman@codeweavers.com>,
	"Cyrill Gorcunov" <gorcunov@gmail.com>,
	"Mike Rapoport" <rppt@kernel.org>,
	"Nadav Amit" <namit@vmware.com>,
	"Alexander Viro" <viro@zeniv.linux.org.uk>,
	"Shuah Khan" <shuah@kernel.org>,
	"Christian Brauner" <brauner@kernel.org>,
	"Yang Shi" <shy828301@gmail.com>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	"Yun Zhou" <yun.zhou@windriver.com>,
	"Suren Baghdasaryan" <surenb@google.com>,
	"Alex Sierra" <alex.sierra@amd.com>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Pasha Tatashin" <pasha.tatashin@soleen.com>,
	"Axel Rasmussen" <axelrasmussen@google.com>,
	"Gustavo A . R . Silva" <gustavoars@kernel.org>,
	"Dan Williams" <dan.j.williams@intel.com>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
	"Greg KH" <gregkh@linuxfoundation.org>,
	kernel@collabora.com, "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Subject: Re: [PATCH v28 2/6] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs
Date: Thu, 10 Aug 2023 19:32:33 +0200	[thread overview]
Message-ID: <CABb0KFGqDo8hFohqpXewoquyLVZUhG-bRHxpw_PYXzGW9wXofQ@mail.gmail.com> (raw)
In-Reply-To: <20230809061603.1969154-3-usama.anjum@collabora.com>

On Wed, 9 Aug 2023 at 08:16, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
> the info about page table entries. The following operations are supported
> in this ioctl:
> - Get the information if the pages have Async Write-Protection enabled
>   (``PAGE_IS_WPALLOWED``), have been written to (``PAGE_IS_WRITTEN``), file
>   mapped (``PAGE_IS_FILE``), present (``PAGE_IS_PRESENT``), swapped
>   (``PAGE_IS_SWAPPED``) or page has pfn zero (``PAGE_IS_PFNZERO``).
> - Find pages which have been written to and/or write protect
>   (atomic ``PM_SCAN_WP_MATCHING + PM_SCAN_CHECK_WPASYNC``) the pages
>   atomically. The (``PM_SCAN_WP_MATCHING``) is used to WP the matched
>   pages. The (``PM_SCAN_CHECK_WPASYNC``) aborts the operation if
>   non-Async-Write-Protected pages are found. Get is automatically performed
>   if output buffer is specified.
>
> This IOCTL can be extended to get information about more PTE bits. The
> entire address range passed by user [start, end) is scanned until either
> the user provided buffer is full or max_pages have been found.
>
> Reviewed-by: Andrei Vagin <avagin@gmail.com>
> Reviewed-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>

Still applies, thanks! Please find some mostly-polishing comments below.

> Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>

> ---
> Changes in v28:
> - Fix walk_end one last time after doing through testing
>
> Changes in v27:
> - Add PAGE_IS_HUGE

It seems to be missing from the commitmsg, though. But maybe listing
all the flags there is redundant, since a doc is coming anyway and the
values are listed in the .h?

[...]
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
[...]
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static unsigned long pagemap_thp_category(pmd_t pmd)
> +{
> +       unsigned long categories = PAGE_IS_HUGE;
> +
> +       if (pmd_present(pmd)) {
> +               categories |= PAGE_IS_PRESENT;
> +               if (!pmd_uffd_wp(pmd))
> +                       categories |= PAGE_IS_WRITTEN;
> +               if (is_zero_pfn(pmd_pfn(pmd)))
> +                       categories |= PAGE_IS_PFNZERO;
> +       } else if (is_swap_pmd(pmd)) {
> +               categories |= PAGE_IS_SWAPPED;
> +               if (!pmd_swp_uffd_wp(pmd))
> +                       categories |= PAGE_IS_WRITTEN;
> +       }
> +
> +       return categories;
> +}

I guess THPs can't be file-backed currently, but can we somehow mark
this assumption so it can be easily found if the capability arrives?

> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +#ifdef CONFIG_HUGETLB_PAGE
> +static unsigned long pagemap_hugetlb_category(pte_t pte)
> +{
> +       unsigned long categories = PAGE_IS_HUGE;
> +
> +       if (pte_present(pte)) {
> +               categories |= PAGE_IS_PRESENT;
> +               if (!huge_pte_uffd_wp(pte))
> +                       categories |= PAGE_IS_WRITTEN;
> +               if (!PageAnon(pte_page(pte)))
> +                       categories |= PAGE_IS_FILE;
> +               if (is_zero_pfn(pte_pfn(pte)))
> +                       categories |= PAGE_IS_PFNZERO;
> +       } else if (is_swap_pte(pte)) {
> +               categories |= PAGE_IS_SWAPPED;
> +               if (!pte_swp_uffd_wp_any(pte))
> +                       categories |= PAGE_IS_WRITTEN;
> +       }

BTW, can a HugeTLB page be file-backed and swapped out?

> +static void pagemap_scan_backout_range(struct pagemap_scan_private *p,
> +                                      unsigned long addr, unsigned long end,
> +                                      unsigned long walk_end_addr)
> +{
> +       struct page_region *cur_buf = &p->cur_buf;
> +
> +       if (cur_buf->start != addr)
> +               cur_buf->end = addr;
> +       else
> +               cur_buf->start = cur_buf->end = 0;
> +
> +       p->walk_end_addr = walk_end_addr;
> +       p->found_pages -= (end - addr) / PAGE_SIZE;
> +}

When would `walk_end_addr` be different from `end`? Maybe it would be
easier to understand if the `p->walk_end_addr` update was pushed to
the callers? (Actually: the one that wants to change it.)

> +static int pagemap_scan_output(unsigned long categories,
> +                              struct pagemap_scan_private *p,
> +                              unsigned long addr, unsigned long *end)
> +{
> +       unsigned long n_pages, total_pages;
> +       int ret = 0;
> +
> +       if (!p->vec_buf)
> +               return 0;
> +
> +       categories &= p->arg.return_mask;
> +
> +       n_pages = (*end - addr) / PAGE_SIZE;
> +       if (check_add_overflow(p->found_pages, n_pages, &total_pages) || //TODO
> +           total_pages > p->arg.max_pages) {

Re: TODO: Is there anything left to change here?

> +               size_t n_too_much = total_pages - p->arg.max_pages;
> +               *end -= n_too_much * PAGE_SIZE;
> +               n_pages -= n_too_much;
> +               ret = -ENOSPC;
> +       }
[...]

> +static int pagemap_scan_thp_entry(pmd_t *pmd, unsigned long start,
> +                                 unsigned long end, struct mm_walk *walk)
> +{
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +       struct pagemap_scan_private *p = walk->private;
> +       struct vm_area_struct *vma = walk->vma;
> +       unsigned long categories;
> +       spinlock_t *ptl;
> +       int ret = 0;
> +
> +       ptl = pmd_trans_huge_lock(pmd, vma);
> +       if (!ptl)
> +               return -ENOENT;
> +
> +       categories = p->cur_vma_category | pagemap_thp_category(*pmd);
> +
> +       if (!pagemap_scan_is_interesting_page(categories, p))
> +               goto out_unlock;
> +
> +       ret = pagemap_scan_output(categories, p, start, &end);
> +       if (start == end)
> +               goto out_unlock;
> +
> +       if (~p->arg.flags & PM_SCAN_WP_MATCHING)
> +               goto out_unlock;
> +       if (~categories & PAGE_IS_WRITTEN)
> +               goto out_unlock;
> +
> +       /*
> +        * Break huge page into small pages if the WP operation
> +        * need to be performed is on a portion of the huge page.

"needs to be performed on ..."

> +        */
> +       if (end != start + HPAGE_SIZE) {
> +               spin_unlock(ptl);
> +               split_huge_pmd(vma, pmd, start);
> +               pagemap_scan_backout_range(p, start, end, 0);
> +               /* Indicate to caller for processing these as normal pages */

Nit: "Report as if there was no THP." ?

> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> +                                 unsigned long end, struct mm_walk *walk)
> +{
> +       struct pagemap_scan_private *p = walk->private;
> +       struct vm_area_struct *vma = walk->vma;
> +       pte_t *pte, *start_pte;
> +       unsigned long addr;
> +       bool flush = false;
> +       spinlock_t *ptl;
> +       int ret;
> +
> +       arch_enter_lazy_mmu_mode();
> +
> +       ret = pagemap_scan_thp_entry(pmd, start, end, walk);
> +       if (ret != -ENOENT) {
> +               arch_leave_lazy_mmu_mode();
> +               return ret;
> +       }
> +
> +       ret = 0;
> +       start_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
> +       if (!pte) {
> +               arch_leave_lazy_mmu_mode();
> +               walk->action = ACTION_AGAIN;
> +               return 0;
> +       }
> +
> +       for (addr = start; addr != end; pte++, addr += PAGE_SIZE) {
> +               unsigned long categories = p->cur_vma_category |
> +                                          pagemap_page_category(p, vma, addr, ptep_get(pte));
> +               unsigned long next = addr + PAGE_SIZE;
> +
> +               if (!pagemap_scan_is_interesting_page(categories, p))
> +                       continue;
> +
> +               ret = pagemap_scan_output(categories, p, addr, &next);
> +               if (next == addr)
> +                       break;
> +
> +               if (~p->arg.flags & PM_SCAN_WP_MATCHING)
> +                       continue;
> +               if (~categories & PAGE_IS_WRITTEN)
> +                       continue;
> +
> +               make_uffd_wp_pte(vma, addr, pte);
> +               if (!flush) {
> +                       start = addr;
> +                       flush = true;
> +               }

A quick improvement:

/* instead of `flush` declaration */ unsigned long flush_end = 0;

if (!flush_end)
  start = addr;
flush_end = next;

> +       }
> +
> +       if (flush)
> +               flush_tlb_range(vma, start, addr);

And here:
if (flush_end)
  flush_tlb_range(vma, start, flush_end);

> +       pte_unmap_unlock(start_pte, ptl);
> +       arch_leave_lazy_mmu_mode();
> +
> +       cond_resched();
> +       return ret;
> +}

[...]
> +static long do_pagemap_scan(struct mm_struct *mm, unsigned long uarg)
> +{
> +       struct mmu_notifier_range range;
> +       struct pagemap_scan_private p;
> +       unsigned long walk_start;
> +       size_t n_ranges_out = 0;
> +       int ret;
> +
> +       memset(&p, 0, sizeof(p));
> +       ret = pagemap_scan_get_args(&p.arg, uarg);
> +       if (ret)
> +               return ret;
> +
> +       p.masks_of_interest = MASKS_OF_INTEREST(p.arg);

Please inline the macro as here is the only use of it.

[...]
> +       walk_start = p.arg.start;
> +       for (; walk_start < p.arg.end; walk_start = p.arg.walk_end) {

Nit: the initialization statement can now be part of the for().

> +               int n_out;
> +
> +               if (fatal_signal_pending(current)) {
> +                       ret = -EINTR;
> +                       break;
> +               }
> +
> +               ret = mmap_read_lock_killable(mm);
> +               if (ret)
> +                       break;
> +               ret = walk_page_range(mm, walk_start, p.arg.end,
> +                                     &pagemap_scan_ops, &p);
> +               mmap_read_unlock(mm);
> +
> +               n_out = pagemap_scan_flush_buffer(&p);
> +               if (n_out < 0)
> +                       ret = n_out;
> +               else
> +                       n_ranges_out += n_out;
> +
> +               if (!ret)
> +                       p.walk_end_addr = p.arg.end;
> +
> +               if (ret != -ENOSPC || p.arg.vec_len - 1 == 0 ||
> +                   p.found_pages == p.arg.max_pages)
> +                       break;

The second condition is equivalent to `p.arg.vec_len == 1`, but why is
that an ending condition? Isn't the last entry enough to gather one
more range? (The walk could have returned -ENOSPC due to buffer full
and after flushing it could continue with the last free entry.)

[...]
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -259,6 +259,7 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
>                 unsigned long cp_flags);
>
>  bool is_hugetlb_entry_migration(pte_t pte);
> +bool is_hugetlb_entry_hwpoisoned(pte_t pte);
>  void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
>
>  #else /* !CONFIG_HUGETLB_PAGE */
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index b7b56871029c5..1c9d38af1015e 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -305,4 +305,63 @@ typedef int __bitwise __kernel_rwf_t;
>  #define RWF_SUPPORTED  (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
>                          RWF_APPEND)
>
> +/* Pagemap ioctl */
> +#define PAGEMAP_SCAN   _IOWR('f', 16, struct pm_scan_arg)
> +
> +/* Bits are set in flags of the page_region and masks in pm_scan_args */

"Bitmasks provided in pm_scan_args masks and reported in
page_region.categories."

> +#define PAGE_IS_WPALLOWED      (1 << 0)
> +#define PAGE_IS_WRITTEN                (1 << 1)
> +#define PAGE_IS_FILE           (1 << 2)
> +#define PAGE_IS_PRESENT                (1 << 3)
> +#define PAGE_IS_SWAPPED                (1 << 4)
> +#define PAGE_IS_PFNZERO                (1 << 5)
> +#define PAGE_IS_HUGE           (1 << 6)
> +
> +/*
> + * struct page_region - Page region with flags
> + * @start:     Start of the region
> + * @end:       End of the region (exclusive)
> + * @categories:        PAGE_IS_* category bitmask for the region
> + */
> +struct page_region {
> +       __u64 start;
> +       __u64 end;
> +       __u64 categories;
> +};
> +
> +/* Flags for PAGEMAP_SCAN ioctl */
> +#define PM_SCAN_WP_MATCHING    (1 << 0)        /* Write protect the pages matched. */
> +#define PM_SCAN_CHECK_WPASYNC  (1 << 1)        /* Abort the scan when a non-WP-enabled page is found. */
> +
> +/*
> + * struct pm_scan_arg - Pagemap ioctl argument
> + * @size:              Size of the structure
> + * @flags:             Flags for the IOCTL
> + * @start:             Starting address of the region
> + * @end:               Ending address of the region
> + * @walk_end           Address where the scan stopped (written by kernel).
> + *                     walk_end == end informs that the scan completed on entire range.

Can we ensure this holds also for the tagged pointers?

> + * @vec:               Address of page_region struct array for output
> + * @vec_len:           Length of the page_region struct array
> + * @max_pages:         Optional limit for number of returned pages (0 = disabled)
> + * @category_inverted: PAGE_IS_* categories which values match if 0 instead of 1
> + * @category_mask:     Skip pages for which any category doesn't match
> + * @category_anyof_mask: Skip pages for which no category matches
> + * @return_mask:       PAGE_IS_* categories that are to be reported in `page_region`s returned
> + */
[...]

Best Regards
Michał Mirosław

  reply	other threads:[~2023-08-10 17:32 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-09  6:15 [PATCH v28 0/6] Implement IOCTL to get and optionally clear info about PTEs Muhammad Usama Anjum
2023-08-09  6:15 ` [PATCH v28 1/6] userfaultfd: UFFD_FEATURE_WP_ASYNC Muhammad Usama Anjum
2023-08-09  6:15 ` [PATCH v28 2/6] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs Muhammad Usama Anjum
2023-08-10 17:32   ` Michał Mirosław [this message]
2023-08-11 12:02     ` Muhammad Usama Anjum
2023-08-11 13:32       ` Michał Mirosław
2023-08-11 15:30         ` Muhammad Usama Anjum
2023-08-12 14:22           ` Michał Mirosław
2023-08-10 19:07   ` Andrei Vagin
2023-08-11 15:19     ` Muhammad Usama Anjum
2023-08-09  6:16 ` [PATCH v28 3/6] fs/proc/task_mmu: Add fast paths to get/clear PAGE_IS_WRITTEN flag Muhammad Usama Anjum
2023-08-09  6:16 ` [PATCH v28 4/6] tools headers UAPI: Update linux/fs.h with the kernel sources Muhammad Usama Anjum
2023-08-09  6:16 ` [PATCH v28 5/6] mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL Muhammad Usama Anjum
2023-08-10 19:26   ` Michał Mirosław
2023-08-11 16:52     ` Muhammad Usama Anjum
2023-08-09  6:16 ` [PATCH v28 6/6] selftests: mm: add pagemap ioctl tests Muhammad Usama Anjum

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CABb0KFGqDo8hFohqpXewoquyLVZUhG-bRHxpw_PYXzGW9wXofQ@mail.gmail.com \
    --to=emmir@google.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex.sierra@amd.com \
    --cc=avagin@gmail.com \
    --cc=axelrasmussen@google.com \
    --cc=brauner@kernel.org \
    --cc=dan.j.williams@intel.com \
    --cc=david@redhat.com \
    --cc=gorcunov@gmail.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=gustavoars@kernel.org \
    --cc=kernel@collabora.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mdanylo@google.com \
    --cc=mirq-linux@rere.qmqm.pl \
    --cc=namit@vmware.com \
    --cc=pasha.tatashin@soleen.com \
    --cc=peterx@redhat.com \
    --cc=pgofman@codeweavers.com \
    --cc=rppt@kernel.org \
    --cc=shuah@kernel.org \
    --cc=shy828301@gmail.com \
    --cc=surenb@google.com \
    --cc=usama.anjum@collabora.com \
    --cc=vbabka@suse.cz \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    --cc=yun.zhou@windriver.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.