From: "Michał Mirosław" <emmir@google.com>
To: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Peter Xu <peterx@redhat.com>,
David Hildenbrand <david@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Andrei Vagin <avagin@gmail.com>,
Danylo Mocherniuk <mdanylo@google.com>,
Paul Gofman <pgofman@codeweavers.com>,
Cyrill Gorcunov <gorcunov@gmail.com>,
Mike Rapoport <rppt@kernel.org>, Nadav Amit <namit@vmware.com>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Shuah Khan <shuah@kernel.org>,
Christian Brauner <brauner@kernel.org>,
Yang Shi <shy828301@gmail.com>, Vlastimil Babka <vbabka@suse.cz>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>,
Yun Zhou <yun.zhou@windriver.com>,
Suren Baghdasaryan <surenb@google.com>,
Alex Sierra <alex.sierra@amd.com>,
Matthew Wilcox <willy@infradead.org>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
Axel Rasmussen <axelrasmussen@google.com>,
"Gustavo A . R . Silva" <gustavoars@kernel.org>,
Dan Williams <dan.j.williams@intel.com>,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
Greg KH <gregkh@linuxfoundation.org>,
kernel@collabora.com
Subject: Re: [PATCH v18 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs
Date: Thu, 15 Jun 2023 16:52:40 +0200 [thread overview]
Message-ID: <CABb0KFEy_mRaT86TEOQ-BoTe_XOVw3Kp5VdzOfEEaiZJuT754g@mail.gmail.com> (raw)
In-Reply-To: <96b7cc00-d213-ad7d-1b48-b27f75b04d22@collabora.com>
On Thu, 15 Jun 2023 at 15:58, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
> I'll send next revision now.
> On 6/14/23 11:00 PM, Michał Mirosław wrote:
> > (A quick reply to answer open questions in case they help the next version.)
> >
> > On Wed, 14 Jun 2023 at 19:10, Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:
> >> On 6/14/23 8:14 PM, Michał Mirosław wrote:
> >>> On Wed, 14 Jun 2023 at 15:46, Muhammad Usama Anjum
> >>> <usama.anjum@collabora.com> wrote:
> >>>>
> >>>> On 6/14/23 3:36 AM, Michał Mirosław wrote:
> >>>>> On Tue, 13 Jun 2023 at 12:29, Muhammad Usama Anjum
> >>>>> <usama.anjum@collabora.com> wrote:
> > [...]
> >>>>>> + if (cur_buf->bitmap == bitmap &&
> >>>>>> + cur_buf->start + cur_buf->len * PAGE_SIZE == addr) {
> >>>>>> + cur_buf->len += n_pages;
> >>>>>> + p->found_pages += n_pages;
> >>>>>> + } else {
> >>>>>> + if (cur_buf->len && p->vec_buf_index >= p->vec_buf_len)
> >>>>>> + return -ENOMEM;
> >>>>>
> >>>>> Shouldn't this be -ENOSPC? -ENOMEM usually signifies that the kernel
> >>>>> ran out of memory when allocating, not that there is no space in a
> >>>>> user-provided buffer.
> >>>> There are 3 kinds of return values here:
> >>>> * PM_SCAN_FOUND_MAX_PAGES (1) ---> max_pages have been found. Abort the
> >>>> page walk from next entry
> >>>> * 0 ---> continue the page walk
> >>>> * -ENOMEM --> Abort the page walk from current entry, user buffer is full
> >>>> which is not error, but only a stop signal. This -ENOMEM is just
> >>>> differentiater from (1). This -ENOMEM is for internal use and isn't
> >>>> returned to user.
> >>>
> >>> But why ENOSPC is not good here? I was used before, I think.
> >> -ENOSPC is being returned in form of true error from
> >> pagemap_scan_hugetlb_entry(). So I'd to remove -ENOSPC from here as it
> >> wasn't true error here, it was only a way to abort the walk immediately.
> >> I'm liking the following erturn code from here now:
> >>
> >> #define PM_SCAN_BUFFER_FULL (-256)
> >
> > I guess this will be reworked anyway, but I'd prefer this didn't need
> > custom errors etc. If we agree to decoupling the selection and GET
> > output, it could be:
> >
> > bool is_interesting_page(p, flags); // this one does the
> > required/anyof/excluded match
> > size_t output_range(p, start, len, flags); // this one fills the
> > output vector and returns how many pages were fit
> >
> > In this setup, `is_interesting_page() && (n_out = output_range()) <
> > n_pages` means this is the final range, no more will fit. And if
> > `n_out == 0` then no pages fit and no WP is needed (no other special
> > cases).
> Right now, pagemap_scan_output() performs the work of both of these two
> functions. The part can be broken into is_interesting_pages() and we can
> leave the remaining part as it is.
>
> Saying that n_out < n_pages tells us the buffer is full covers one case.
> But there is case of maximum pages have been found and walk needs to be
> aborted.
This case is exactly what `n_out < n_pages` will cover (if scan_output
uses max_pages properly to limit n_out).
Isn't it that when the buffer is full we want to abort the scan always
(with WP if `n_out > 0`)?
> >>>>> For flags name: PM_REQUIRE_WRITE_ACCESS?
> >>>>> Or Is it intended to be checked only if doing WP (as the current name
> >>>>> suggests) and so it would be redundant as WP currently requires
> >>>>> `p->required_mask = PAGE_IS_WRITTEN`?
> >>>> This is intended to indicate that if userfaultfd is needed. If
> >>>> PAGE_IS_WRITTEN is mentioned in any of mask, we need to check if
> >>>> userfaultfd has been initialized for this memory. I'll rename to
> >>>> PM_SCAN_REQUIRE_UFFD.
> >>>
> >>> Why do we need that check? Wouldn't `is_written = false` work for vmas
> >>> not registered via uffd?
> >> UFFD_FEATURE_WP_ASYNC and UNPOPULATED needs to be set on the memory region
> >> for it to report correct written values on the memory region. Without UFFD
> >> WP ASYNC and UNPOUPULATED defined on the memory, we consider UFFD_WP state
> >> undefined. If user hasn't initialized memory with UFFD, he has no right to
> >> set is_written = false.
> >
> > How about calculating `is_written = is_uffd_registered() &&
> > is_uffd_wp()`? This would enable a user to apply GET+WP for the whole
> > address space of a process regardless of whether all of it is
> > registered.
> I wouldn't want to check if uffd is registered again and again. This is why
> we are doing it only once every walk in pagemap_scan_test_walk().
There is no need to do the checks repeatedly. If I understand the code
correctly, uffd registration is per-vma, so it can be communicated
from test_walk to entry/hole callbacks via a field in
pagemap_scan_private.
> >>> While here, I wonder if we really need to fail the call if there are
> >>> unknown bits in those masks set: if this bit set is expanded with
> >>> another category flags, a newer userspace run on older kernel would
> >>> get EINVAL even if the "treat unknown as 0" be what it requires.
> >>> There is no simple way in the API to discover what bits the kernel
> >>> supports. We could allow a no-op (no WP nor GET) call to help with
> >>> that and then rejecting unknown bits would make sense.
> >> I've not seen any examples of this. But I've seen examples of returning
> >> error if kernel doesn't support a feature. Each new feature comes with a
> >> kernel version, greater than this version support this feature. If user is
> >> trying to use advanced feature which isn't present in a kernel, we should
> >> return error and not proceed to confuse the user/kernel. In fact if we look
> >> at userfaultfd_api(), we return error immediately if feature has some bit
> >> set which kernel doesn't support.
> >
> > I think we should have a way of detecting the supported flags if we
> > don't want a forward compatibility policy for flags here. Maybe it
> > would be enough to allow all the no-op combinations for this purpose?
> Again I don't think UFFD is doing anything like this.
If it's cheap and easy to provide a user with a way to detect the
supported features - why not do it?
Best Regards
Michał Mirosław
next prev parent reply other threads:[~2023-06-15 14:53 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-06-13 10:29 [PATCH v18 0/5] Implement IOCTL to get and optionally clear info about PTEs Muhammad Usama Anjum
2023-06-13 10:29 ` [PATCH v18 1/5] userfaultfd: UFFD_FEATURE_WP_ASYNC Muhammad Usama Anjum
2023-06-13 10:29 ` [PATCH v18 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs Muhammad Usama Anjum
2023-06-13 16:11 ` kernel test robot
2023-06-13 22:36 ` Michał Mirosław
2023-06-14 13:46 ` Muhammad Usama Anjum
2023-06-14 15:14 ` Michał Mirosław
2023-06-14 17:09 ` Muhammad Usama Anjum
2023-06-14 18:00 ` Michał Mirosław
2023-06-15 13:58 ` Muhammad Usama Anjum
2023-06-15 14:52 ` Michał Mirosław [this message]
2023-06-15 14:58 ` Michał Mirosław
2023-06-15 15:16 ` Muhammad Usama Anjum
2023-06-15 20:00 ` Michał Mirosław
2023-06-16 6:37 ` Muhammad Usama Anjum
2023-06-15 15:11 ` Muhammad Usama Anjum
2023-06-15 20:07 ` Michał Mirosław
2023-06-16 6:57 ` Muhammad Usama Anjum
2023-06-19 8:16 ` Michał Mirosław
2023-06-20 11:15 ` Muhammad Usama Anjum
2023-06-20 22:05 ` Michał Mirosław
2023-06-21 4:44 ` Muhammad Usama Anjum
2023-06-21 13:24 ` Michał Mirosław
2023-06-14 4:09 ` Randy Dunlap
2023-06-14 11:36 ` Muhammad Usama Anjum
2023-06-13 10:29 ` [PATCH v18 3/5] tools headers UAPI: Update linux/fs.h with the kernel sources Muhammad Usama Anjum
2023-06-13 10:29 ` [PATCH v18 4/5] mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL Muhammad Usama Anjum
2023-06-14 4:20 ` Randy Dunlap
2023-06-14 11:27 ` Muhammad Usama Anjum
2023-06-13 10:29 ` [PATCH v18 5/5] selftests: mm: add pagemap ioctl tests Muhammad Usama Anjum
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CABb0KFEy_mRaT86TEOQ-BoTe_XOVw3Kp5VdzOfEEaiZJuT754g@mail.gmail.com \
--to=emmir@google.com \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=alex.sierra@amd.com \
--cc=avagin@gmail.com \
--cc=axelrasmussen@google.com \
--cc=brauner@kernel.org \
--cc=dan.j.williams@intel.com \
--cc=david@redhat.com \
--cc=gorcunov@gmail.com \
--cc=gregkh@linuxfoundation.org \
--cc=gustavoars@kernel.org \
--cc=kernel@collabora.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mdanylo@google.com \
--cc=namit@vmware.com \
--cc=pasha.tatashin@soleen.com \
--cc=peterx@redhat.com \
--cc=pgofman@codeweavers.com \
--cc=rppt@kernel.org \
--cc=shuah@kernel.org \
--cc=shy828301@gmail.com \
--cc=surenb@google.com \
--cc=usama.anjum@collabora.com \
--cc=vbabka@suse.cz \
--cc=viro@zeniv.linux.org.uk \
--cc=willy@infradead.org \
--cc=yun.zhou@windriver.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).