Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

From: Muhammad Usama Anjum <usama.anjum@collabora.com>
To: Peter Xu <peterx@redhat.com>
Cc: "Muhammad Usama Anjum" <usama.anjum@collabora.com>,
	"David Hildenbrand" <david@redhat.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Michał Mirosław" <emmir@google.com>,
	"Andrei Vagin" <avagin@gmail.com>,
	"Danylo Mocherniuk" <mdanylo@google.com>,
	"Paul Gofman" <pgofman@codeweavers.com>,
	"Cyrill Gorcunov" <gorcunov@gmail.com>,
	"Alexander Viro" <viro@zeniv.linux.org.uk>,
	"Shuah Khan" <shuah@kernel.org>,
	"Christian Brauner" <brauner@kernel.org>,
	"Yang Shi" <shy828301@gmail.com>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	"Yun Zhou" <yun.zhou@windriver.com>,
	"Suren Baghdasaryan" <surenb@google.com>,
	"Alex Sierra" <alex.sierra@amd.com>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Pasha Tatashin" <pasha.tatashin@soleen.com>,
	"Mike Rapoport" <rppt@kernel.org>,
	"Nadav Amit" <namit@vmware.com>,
	"Axel Rasmussen" <axelrasmussen@google.com>,
	"Gustavo A . R . Silva" <gustavoars@kernel.org>,
	"Dan Williams" <dan.j.williams@intel.com>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
	"Greg KH" <gregkh@linuxfoundation.org>,
	kernel@collabora.com
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs
Date: Wed, 15 Feb 2023 15:03:09 +0500
Message-ID: <884f5aa6-5d12-eecc-ed71-7d653828ca20@collabora.com>
In-Reply-To: <Y+v2HJ8+3i/KzDBu@x1n>

On 2/15/23 1:59 AM, Peter Xu wrote:
[..]
>>>> static inline bool is_pte_written(pte_t pte)
>>>> {
>>>> 	if ((pte_present(pte) && pte_uffd_wp(pte)) ||
>>>> 	    (pte_swp_uffd_wp_any(pte)))
>>>> 		return false;
>>>> 	return (pte_present(pte) || is_swap_pte(pte));
>>>> }
>>>
>>> Could you explain why you don't want to return dirty for !present?  A page
>>> can be written then swapped out.  Don't you want to know that happened
>>> (from dirty tracking POV)?
>>>
>>> The code looks weird to me too..  We only have three types of ptes: (1)
>>> present, (2) swap, (3) none.
>>>
>>> Then, "(pte_present() || is_swap_pte())" is the same as !pte_none().  Is
>>> that what you're really looking for?
>> Yes, this is what I've been trying to do. I'll use !pte_none() to make it
>> simpler.
> 
> Ah I think I see what you wanted to do now.. But I'm afraid it won't work
> for all cases.
> 
> So IIUC the problem is anon pte can be empty, but since uffd-wp bit doesn't
> persist on anon (but none) ptes, then we got it lost and we cannot identify
> it from pages being written.  Your solution will solve problem for
> anonymous, but I think it'll break file memories.
> 
> Example:
> 
> Consider one shmem page that got mapped, write protected (using UFFDIO_WP
> ioctl), written again (removing uffd-wp bit automatically), then zapped.
> The pte will be pte_none() but it's actually written, afaiu.
> 
> Maybe it's time we should introduce UFFD_FEATURE_WP_ZEROPAGE, so we'll need
> to install pte markers for anonymous too (then it will work similarly like
> shmem/hugetlbfs, that we'll report writing to zero pages), then you'll
> need to have the new UFFD_FEATURE_WP_ASYNC depend on it.  With that I think
> you can keep using the old check and it should start to work.
> 
> Please let me know if my understanding is correct above.
Thank you for identifying it. Your understanding seems on point. I'll have
to read up on PTE markers. I'm looking at your patches about them [1]. Could
you point me to the "mm alignment sessions" discussion, either as a
presentation or as a transcript if one is available?
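
To make the two cases above concrete, here is a toy userspace model. It is
only a sketch: struct toy_pte, enum pte_kind and both helper names are
made up for illustration and are not kernel types or functions. It also
assumes that a pte which is none cannot carry the uffd-wp bit, and that a
marker entry exists only to carry that bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a page-table entry; NOT the kernel's pte_t. */
enum pte_kind {
	PTE_NONE,	/* nothing installed */
	PTE_PRESENT,	/* page is mapped */
	PTE_SWAP,	/* swap entry */
	PTE_MARKER,	/* hypothetical marker that only carries uffd-wp */
};

struct toy_pte {
	enum pte_kind kind;
	bool uffd_wp;
};

/* Check as discussed for v10: write-protected means clean, else !none. */
static bool written_not_none(struct toy_pte pte)
{
	if (pte.uffd_wp)
		return false;
	return pte.kind != PTE_NONE;
}

/* Check that relies only on the uffd-wp bit (needs markers on anon). */
static bool written_not_wp(struct toy_pte pte)
{
	/* Model assumption: a none pte cannot hold the uffd-wp bit. */
	assert(pte.kind != PTE_NONE || !pte.uffd_wp);
	return !pte.uffd_wp;
}
```

In this model, the shmem page from the example (mapped, write protected,
written, then zapped) ends up as { PTE_NONE, false }: written_not_none()
reports it as clean, while written_not_wp() reports it as written. An
untouched anon page that was protected through a marker is
{ PTE_MARKER, true } and both helpers report it as clean. Without the
marker it would also be { PTE_NONE, false }, which is why the bit-only
check needs markers for anonymous memory.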

> 
> I'll see whether I can quickly play with UFFD_FEATURE_WP_ZEROPAGE with some
> patch in the meantime.  That's something we wanted before too, when the app
> cares about zero pages on anon.  We used to populate the pages before doing
> ioctl(UFFDIO_WP) to make sure zero pages will be reported too, but that flag
> should be more efficient.
Is this discussion public? Which application were you looking into this
for? I'll dig in to see how I can contribute to it.

> 
>>
>>>
>>>>
>>>> static inline bool is_pmd_written(pmd_t pmd)
>>>> {
>>>> 	if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
>>>> 	    (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
>>>> 		return false;
>>>> 	return (pmd_present(pmd) || is_swap_pmd(pmd));
>>>> }
>>>
>>> [...]
>>>
>>>>>> +	bitmap = cur & p->return_mask;
>>>>>> +	if (cpy && bitmap) {
>>>>>> +		if ((prev->len) && (prev->bitmap == bitmap) &&
>>>>>> +		    (prev->start + prev->len * PAGE_SIZE == addr)) {
>>>>>> +			prev->len += len;
>>>>>> +			p->found_pages += len;
>>>>>> +		} else if (p->vec_index < p->vec_len) {
>>>>>> +			if (prev->len) {
>>>>>> +				memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
>>>>>> +				p->vec_index++;
>>>>>> +			}
>>>>>
>>>>> IIUC you can have:
>>>>>
>>>>>   int pagemap_scan_deposit(p)
>>>>>   {
>>>>>         if (p->vec_index >= p->vec_len)
>>>>>                 return -ENOSPC;
>>>>>
>>>>>         if (p->prev->len) {
>>>>>                 memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
>>>>>                 p->vec_index++;
>>>>>         }
>>>>>
>>>>>         return 0;
>>>>>   }
>>>>>
>>>>> Then call it here.  I think it can also be called below to replace
>>>>> export_prev_to_out().
>>>> No, this isn't possible. We fill up prev until the next range doesn't merge
>>>> with it. At that point, we put prev into the output buffer and the new range
>>>> is put into prev. Now that we have shifted to smaller page walks of <= 512
>>>> entries, we want to visit all ranges before finally putting prev into the
>>>> output. Sorry for this somewhat complex method. The problem is that we
>>>> want to merge consecutive matching regions into one entry in the output.
>>>> To achieve this across multiple different page walks, prev is used.
>>>>
>>>> Let's suppose we want to visit memory from 0x7FFF00000000 to 0x7FFF00400000,
>>>> which has a length of 1024 pages, and all of the memory has been written.
>>>> walk_page_range() will be called 2 times. In the first call, prev will be
>>>> set with a length of 512. In the second call, prev will be updated to 1024 as
>>>> the previous range stored in prev could be extended. After this, prev
>>>> will be stored to the user output buffer, consuming only 1 struct page_region.
>>>>
>>>> If we stored prev back to output memory in every walk_page_range() call, we
>>>> wouldn't get 1 struct page_region with length 1024. Instead we would get
>>>> 2 struct page_region elements with half the length each.
>>>
>>> I didn't mean to merge PREV for each pgtable walk.  What I meant is I think
>>> with such a pagemap_scan_deposit() you can rewrite it as:
>>>
>>> if (cpy && bitmap) {
>>>         if ((prev->len) && (prev->bitmap == bitmap) &&
>>>             (prev->start + prev->len * PAGE_SIZE == addr)) {
>>>                 prev->len += len;
>>>                 p->found_pages += len;
>>>         } else {
>>>                 if (pagemap_scan_deposit(p))
>>>                         return -ENOSPC;
>>>                 prev->start = addr;
>>>                 prev->len = len;
>>>                 prev->bitmap = bitmap;
>>>                 p->found_pages += len;
>>>         }
>>> }
>>>
>>> Then you can reuse pagemap_scan_deposit() when before returning to
>>> userspace, just to flush PREV to p->vec properly in a single helper.
>>> It also makes the code slightly easier to read.
>> Yeah, this would have worked as you have described. But in
>> pagemap_scan_output(), we are flushing prev to p->vec. But later in
>> export_prev_to_out() we need to flush prev to user_memory directly.
> 
> I think there's a loop to copy_to_user().  Could you use the new helper so
> the copy_to_user() loop will work without export_prev_to_out()?
> 
> I really hope we can get rid of export_prev_to_out().  Thanks,
I truly understand how you feel about export_prev_to_out(). It is really
difficult to understand. Even I had to work hard to come up with the
current code, which avoids consuming a lot of kernel memory while still
giving the user compact output. I could surely fold both of these into a
dirty-looking macro, but I'm unable to find a decent macro to replace them.
I think I'll add a comment somewhere to explain what's going on.
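
For reference, the merge-then-deposit scheme discussed above can be
modelled in userspace roughly as follows. This is only a sketch under
stated assumptions, not the patch's code: struct scan_state, scan_deposit()
and scan_output() are illustrative names, struct page_region is a
simplified stand-in, PAGE_SIZE is assumed to be 4 KiB, and the
copy_to_user() side is left out entirely. Unlike the helper quoted
earlier, scan_deposit() here checks for a pending region before checking
for space, so an empty prev never produces -ENOSPC.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL	/* assumption: 4 KiB pages */

/* Simplified stand-in for the patch's struct page_region. */
struct page_region {
	uint64_t start;		/* first address of the region */
	uint64_t len;		/* length in pages */
	uint64_t bitmap;	/* flags shared by every page in the region */
};

/* Hypothetical scan state; field names follow the quoted snippets. */
struct scan_state {
	struct page_region prev;	/* region still open for merging */
	struct page_region *vec;	/* output buffer */
	size_t vec_index;		/* entries already stored in vec */
	size_t vec_len;			/* capacity of vec */
	uint64_t found_pages;
};

/* Flush the pending region into vec. Returns 0 or -ENOSPC. */
static int scan_deposit(struct scan_state *p)
{
	if (!p->prev.len)
		return 0;		/* nothing pending */
	if (p->vec_index >= p->vec_len)
		return -ENOSPC;
	p->vec[p->vec_index++] = p->prev;
	p->prev.len = 0;
	return 0;
}

/* Record len pages at addr, extending prev when the range is contiguous. */
static int scan_output(struct scan_state *p, uint64_t addr, uint64_t len,
		       uint64_t bitmap)
{
	int ret;

	assert(len > 0);
	if (!bitmap)
		return 0;		/* nothing the caller asked for */
	if (p->prev.len && p->prev.bitmap == bitmap &&
	    p->prev.start + p->prev.len * PAGE_SIZE == addr) {
		p->prev.len += len;	/* contiguous and same flags: merge */
	} else {
		ret = scan_deposit(p);	/* close the old region first */
		if (ret)
			return ret;
		p->prev = (struct page_region){ addr, len, bitmap };
	}
	p->found_pages += len;
	return 0;
}
```

With the 1024-page example, two scan_output() calls of 512 pages each
(at 0x7FFF00000000 and 0x7FFF00200000) leave vec untouched and prev.len at
1024. A single scan_deposit() before returning to userspace then stores
one entry, which is the compact output the prev mechanism is meant to give.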


-- 
BR,
Muhammad Usama Anjum