linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Muhammad Usama Anjum <usama.anjum@collabora.com>
To: "Michał Mirosław" <emmir@google.com>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>,
	Peter Xu <peterx@redhat.com>,
	David Hildenbrand <david@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andrei Vagin <avagin@gmail.com>,
	Danylo Mocherniuk <mdanylo@google.com>,
	Paul Gofman <pgofman@codeweavers.com>,
	Cyrill Gorcunov <gorcunov@gmail.com>,
	Mike Rapoport <rppt@kernel.org>, Nadav Amit <namit@vmware.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Shuah Khan <shuah@kernel.org>,
	Christian Brauner <brauner@kernel.org>,
	Yang Shi <shy828301@gmail.com>, Vlastimil Babka <vbabka@suse.cz>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Yun Zhou <yun.zhou@windriver.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Alex Sierra <alex.sierra@amd.com>,
	Matthew Wilcox <willy@infradead.org>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	"Gustavo A . R . Silva" <gustavoars@kernel.org>,
	Dan Williams <dan.j.williams@intel.com>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
	Greg KH <gregkh@linuxfoundation.org>,
	kernel@collabora.com
Subject: Re: [PATCH v11 4/7] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs
Date: Mon, 20 Mar 2023 11:08:41 +0500	[thread overview]
Message-ID: <6a452f0b-968e-11fa-530c-64eed955eb1f@collabora.com> (raw)
In-Reply-To: <CABb0KFH+Ho+grc5DXT7iWjnQH7T_o4y3BQj8ri5-wxcOvu12Bg@mail.gmail.com>

On 3/17/23 7:15 PM, Michał Mirosław wrote:
> On Fri, 17 Mar 2023 at 13:44, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>> On 3/17/23 2:28 AM, Michał Mirosław wrote:
>>> On Thu, 16 Mar 2023 at 18:53, Muhammad Usama Anjum
>>> <usama.anjum@collabora.com> wrote:
>>>> On 3/13/23 9:02 PM, Michał Mirosław wrote:
>>>>> On Thu, 9 Mar 2023 at 14:58, Muhammad Usama Anjum
>>>>> <usama.anjum@collabora.com> wrote:
>>> [...]
>>>>>> --- a/fs/proc/task_mmu.c
>>>>>> +++ b/fs/proc/task_mmu.c
>>> [...]
>>>>>> +static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
>>> [...]
>>>>> The `cur->len` test seems redundant: is it possible to have
>>>>> `cur->start == addr` in that case (I guess it would have to get
>>>>> `n_pages == 0` in an earlier invocation)?
>>>> No, both wouldn't work. cur->len == 0 means that it has only garbage. It is
>>>> essential to check the validity from cur->len before performing other
>>>> checks. Also cur->start can never be equal to addr as we are walking over
>>>> page addressing in serial manner. We want to see here if the current
>>>> address matches the previous data by finding the ending address of last
>>>> stored data (cur->start + cur->len * PAGE_SIZE).
>>>
>>> If cur->len == 0, then it doesn't matter if it gets merged or not - it
>>> can be filtered out during the flush (see below).
>>> [...]
>>>>>> +               } else if ((!p->vec_index) ||
>>>>>> +                          ((p->vec_index + 1) < p->vec_len)) {
>>>>>
>>>>> Can you explain this test? Why not just `p->vec_index < p->vec_len`? Or better:
>>>>>
>>>>> if (vec_index >= p->vec_len)
>>>>>     return -ENOSPC;
>>>>
>>>> No, it'll not work. Lets leave it as it is. :)
>>>>
>>>> It has gotten somewhat complex, but I don't have any other way to make it
>>>> simpler which works. First note the following points:
>>>> 1) We walk over 512 page or 1 thp at a time to not over allocate memory in
>>>> kernel (p->vec).
>>>> 2) We also want to merge the consecutive pages with the same flags into one
>>>> struct page_region. p->vec of current walk may merge with next walk. So we
>>>> cannot write to user memory until we find the results of the next walk.
>>>>
>>>> So most recent data is put into p->cur. When non-intersecting or mergeable
>>>> data is found, we move p->cur to p->vec[p->index] inside the page walk.
>>>> After the page walk, p->vec[0 to p->index] is moved to arg->vec. After all
>>>> the walks are over. We move the p->cur to arg->vec. It completes the data
>>>> transfer to user buffer.
>>> [...]
>>>> I'm so sorry that it has gotten this much complex. It was way simpler when
>>>> we were walking over all the memory in one go. But then we needed an
>>>> unbounded memory from the kernel which we don't want.
>>> [...]
>>>
>>> I've gone through and hopefully understood the code. I'm not sure this
>>> needs to be so complicated: when traversing a single PMD you can
>>> always copy p->cur to p->vec[p->vec_index++] because you can have at
>>> most pages_per_PMD non-merges (in the worst case the last page always
>>> is left in p->cur and whole p->vec is used). After each PMD p->vec
>>> needs a flush if p->vec_index > 0, skipping the dummy entry at front
>>> (len == 0; if present). (This is mostly how it is implemented now, but
>>> I propose to remove the "overflow" check and do the starting guard
>>> removal only every PMD.)
>> Sorry, unable to understand where to remove the guard?
> 
> Instead of checking for it in pagemap_scan_output() for each page you
> can skip it in do_pagemap_cmd() when doing the flush.
No, this cannot be done because in do_pagemap_cmd() we don't know that we
have space for new pages in the output buffer or not because the next page
may be aggregated to already present data.

> 
>>> BTW#2, I think the ENOSPC return in pagemap_scan_output() should
>>> happen later - only if the pages would match and that caused the count
>>> to exceed the limit. For THP n_pages should be truncated to the limit
>>> (and ENOSPC returned right away) only after the pages were verified to
>>> match.
>> We have 2 counters here:
>> * the p->max_pages optionally can be set to find out only N pages of
>> interest. So p->found_pages is counting this. We need to return early if
>> the page limit is complete.
>> * the p->vec_index keeps track of output buffer array size
> 
> I think I get how the limits are supposed to work, but I also think
> the implementation is not optimal. An example (assuming max_pages = 1
> and vec_len = 1):
>  - a matching page has been found
>  - a second - non-matching - is tried but results in immediate -ENOSPC.
> -> In this case I'd expect the early return to happen just after the
> first page is found so that non
> A similar problem occurs for hugepage: when the limit is hit (we found
>> = max_pages, n_pages is possibly truncated), but the scan continues
> until next page / PMD.
I'll try to check if I can optimize it. It seems like I should be able to
update this pretty easily by returning a negative status/error which
signifies that we have found the max_pages. Now just abort in sane way.

> 
> [...]
>>>>>> +       if (!arg->required_mask && !arg->anyof_mask &&
>>>>>> +           !arg->excluded_mask)
>>>>>> +               return false;
>>>>>
>>>>> Is there an assumption in the code that those checks are needed? I'd
>>>>> expect that no selection criteria makes a valid page set?
>>>> In my view, selection criterion must be specified for the ioctl to work. If
>>>> there is no criterio, user should go and read pagemap file directly. So the
>>>> assumption is that at least one selection criterion must be specified.
>>>
>>> Yes. I'm not sure we need to prevent multiple ways of doing the same
>>> thing. But doesn't pagemap reading lack the range aggregation feature?
>> Yeah, correct. But note that we are supporting only selective 4 flags in
>> this ioctl, not all pagemap flags. So it is useful for only those users who
>> depend only on these 4 flags. Out pagemap_ioctl interface is not so much
>> generic that we can cater anyone. Its interface is specific and we are
>> adding only those cases which are of our interest. So if someone wants
>> range aggregation from pagemap_ioctl, he'll need to add that flag in the
>> IOCTL first. When IOCTL support is added, he can specify the selection
>> criterion etc.
> 
> The available flag set is not a problem. An example usecase: dumping
> the memory state for debugging: ioctl(return_mask=ALL) returns a
> conveniently compact vector of ranges of pages that are actually used
> by the process (not only having reserved the virtual space). This is
> actually something that helps dumping processes with using tools like
> AddressSanitizer that create huge sparse mappings.
I don't know, we are adding more and more use cases as people are noticing
it. I've not thought about this use case. So I need more understanding
about it:
How should I identify "which pages are used"? Does use mean present and
swapped both? We we want to find present or swapped pages in other words
!pte_none pages and return in compact form, it can already be done by
ioctl(anyod_mask=PRESET | SWAPPED, return_mask=ALL).

> 
> Best Regards
> Michał Mirosław

-- 
BR,
Muhammad Usama Anjum


  reply	other threads:[~2023-03-20  6:09 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-09 13:57 [PATCH v11 0/7] Implement IOCTL to get and optionally clear info about PTEs Muhammad Usama Anjum
2023-03-09 13:57 ` [PATCH v11 1/7] userfaultfd: Add UFFD WP Async support Muhammad Usama Anjum
2023-03-16 19:20   ` Peter Xu
2023-03-17 14:00     ` Muhammad Usama Anjum
2023-03-21 12:21     ` Muhammad Usama Anjum
2023-03-21 19:25       ` Peter Xu
2023-03-23 15:43         ` Muhammad Usama Anjum
2023-03-09 13:57 ` [PATCH v11 2/7] userfaultfd: Define dummy uffd_wp_range() Muhammad Usama Anjum
2023-03-16  7:02   ` Mike Rapoport
2023-03-16 18:05     ` Muhammad Usama Anjum
2023-03-09 13:57 ` [PATCH v11 3/7] userfaultfd: update documentation to describe UFFD_FEATURE_WP_ASYNC Muhammad Usama Anjum
2023-03-09 13:57 ` [PATCH v11 4/7] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs Muhammad Usama Anjum
2023-03-13 16:02   ` Michał Mirosław
2023-03-16 17:53     ` Muhammad Usama Anjum
2023-03-16 21:28       ` Michał Mirosław
2023-03-17 12:43         ` Muhammad Usama Anjum
2023-03-17 14:15           ` Michał Mirosław
2023-03-20  6:08             ` Muhammad Usama Anjum [this message]
2023-03-15 15:55   ` Peter Xu
2023-03-15 16:54     ` Muhammad Usama Anjum
2023-03-15 19:53       ` Peter Xu
2023-03-16  5:17         ` Muhammad Usama Anjum
2023-03-09 13:57 ` [PATCH v11 5/7] tools headers UAPI: Update linux/fs.h with the kernel sources Muhammad Usama Anjum
2023-03-09 13:57 ` [PATCH v11 6/7] mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL Muhammad Usama Anjum
2023-03-09 13:57 ` [PATCH v11 7/7] selftests: mm: add pagemap ioctl tests Muhammad Usama Anjum
2023-03-09 19:58 ` [PATCH v11 0/7] Implement IOCTL to get and optionally clear info about PTEs Andrew Morton
2023-03-09 22:24   ` Muhammad Usama Anjum
2023-03-20 18:30   ` Andrei Vagin
2023-03-21 12:41     ` Mike Rapoport
2023-03-21 15:10       ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6a452f0b-968e-11fa-530c-64eed955eb1f@collabora.com \
    --to=usama.anjum@collabora.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex.sierra@amd.com \
    --cc=avagin@gmail.com \
    --cc=axelrasmussen@google.com \
    --cc=brauner@kernel.org \
    --cc=dan.j.williams@intel.com \
    --cc=david@redhat.com \
    --cc=emmir@google.com \
    --cc=gorcunov@gmail.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=gustavoars@kernel.org \
    --cc=kernel@collabora.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mdanylo@google.com \
    --cc=namit@vmware.com \
    --cc=pasha.tatashin@soleen.com \
    --cc=peterx@redhat.com \
    --cc=pgofman@codeweavers.com \
    --cc=rppt@kernel.org \
    --cc=shuah@kernel.org \
    --cc=shy828301@gmail.com \
    --cc=surenb@google.com \
    --cc=vbabka@suse.cz \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    --cc=yun.zhou@windriver.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).