linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Muhammad Usama Anjum <usama.anjum@collabora.com>
To: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: "Muhammad Usama Anjum" <usama.anjum@collabora.com>,
	"Andrei Vagin" <avagin@gmail.com>,
	"Danylo Mocherniuk" <mdanylo@google.com>,
	"Alex Sierra" <alex.sierra@amd.com>,
	"Alexander Viro" <viro@zeniv.linux.org.uk>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Axel Rasmussen" <axelrasmussen@google.com>,
	"Christian Brauner" <brauner@kernel.org>,
	"Cyrill Gorcunov" <gorcunov@gmail.com>,
	"Dan Williams" <dan.j.williams@intel.com>,
	"David Hildenbrand" <david@redhat.com>,
	"Greg KH" <gregkh@linuxfoundation.org>,
	"Gustavo A . R . Silva" <gustavoars@kernel.org>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Michał Mirosław" <emmir@google.com>,
	"Mike Rapoport" <rppt@kernel.org>,
	"Nadav Amit" <namit@vmware.com>,
	"Pasha Tatashin" <pasha.tatashin@soleen.com>,
	"Paul Gofman" <pgofman@codeweavers.com>,
	"Peter Xu" <peterx@redhat.com>, "Shuah Khan" <shuah@kernel.org>,
	"Suren Baghdasaryan" <surenb@google.com>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Yang Shi" <shy828301@gmail.com>,
	"Yun Zhou" <yun.zhou@windriver.com>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
	kernel@collabora.com
Subject: Re: fs/proc/task_mmu: Implement IOCTL for efficient page table scanning
Date: Fri, 21 Jul 2023 22:50:19 +0500	[thread overview]
Message-ID: <382f4435-2088-08ce-20e9-bc1a15050861@collabora.com> (raw)
In-Reply-To: <ZLpqzcyo2ZMXwtm4@qmqm.qmqm.pl>

On 7/21/23 4:23 PM, Michał Mirosław wrote:
> On Fri, Jul 21, 2023 at 03:48:22PM +0500, Muhammad Usama Anjum wrote:
>> On 7/21/23 12:28 AM, Michał Mirosław wrote:
>>> This is a massaged version of patch by Muhammad Usama Anjum [1]
>>> to illustrate my review comments and hopefully push the implementation
>>> efforts closer to conclusion. The changes are:
>> Thank you so much for this effort. I also want to reach conclusion. I'll
>> agree with all the changes which don't affect me. But some requirements
>> aren't being fulfilled with this current design.
>>
>>>
>>> 1. the API:
> [...]
>>>   b. rename match "flags" to 'page categories' everywhere - this makes
>>> 	it easier to differentiate the ioctl()s categorisation of pages
>>> 	from struct page flags;
>> I've no problem with it.
>>
>> #define PAGE_IS_WPASYNC		(1 << 0)
>> #define PAGE_IS_WRITTEN		(1 << 1)
>> You have another new flag PAGE_IS_WPASYNC. But there is no application of
>> PAGE_IS_WPASYNC. We must not add a flag which don't have any user.
> 
> Please see below.
> 
>>>   c. change {required + excluded} to {inverted + required}. This was
>>> 	rejected before, but I'd like to illustrate the difference.
>>> 	Old interface can be translated to the new by:
>>> 		categories_inverted = excluded_mask
>>> 		categories_mask = required_mask | excluded_mask
>>> 		categories_anyof_mask = anyof_mask
>>> 	The new way allows filtering by: A & (B | !C)
>>> 		categories_inverted = C
>>> 		categories_mask = A
>>> 		categories_anyof_mask = B | C
>> I'm still unable to get the idea of inverted masks. IIRC Andei had also not
>> supported/accepted this masking scheme. But I'll be okay with it if he
>> supports this masking.
> 
> Please note that the masks are not inverted -- the values are. Masks
> select which categories you want to filter on, and category_inverted
> invert the meaning of a match (match 0 instead of 1).
> 
>>>   d. change the ioctl to be a SCAN with optional WP. Addressing the
>>> 	original use-case, GetWriteWatch() can be implemented as:
>> As I've mentioned several times previously (without the name of
>> ResetWriteWatch()) that we need exclusive WP without GET. This could be
>> implemented with UFFDIO_WRITEPROTECT. But when we use UFFDIO_WRITEPROTECT,
>> we hit some special case and performance is very slow. So with UFFD WP
>> expert, Peter Xu we have decided to put exclusive WP in this IOCTL for
>> implementation of ResetWriteWatch().
>>
>> A lot of simplification of the patch is made possible because of not
>> keeping exclusive WP. (You have also written some quality code, more better.)
>>>
>>> 		memset(&args, 0, sizeof(args));
>>> 		args.start = lpBaseAddress;
>>> 		args.end = lpBaseAddress + dwRegionSize;
>>> 		args.max_pages = *lpdwCount;
>>> 		*lpdwGranularity = PAGE_SIZE;
>>> 		args.flags = PM_SCAN_CHECK_WPASYNC;
>>> 		if (dwFlags & WRITE_WATCH_FLAG_RESET)
>>> 			args.flags |= PM_SCAN_WP_MATCHED;
>>> 		args.categories_mask = PAGE_IS_WRITTEN;
>>> 		args.return_mask = PAGE_IS_WRITTEN;
> 
> For ResetWriteWatch() you would:
> 
> args.flags = PM_SCAN_WP_MATCHING;
> args.categories_mask = PAGE_IS_WPASYNC | PAGE_IS_WRITTEN;
> args.return_mask = 0;
> 
> Or (if you want to error out if the range doesn't have WP enabled):
> 
> args.flags = PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC;
> args.categories_mask = PAGE_IS_WRITTEN;
> args.return_mask = 0;
> 
> (PM_SCAN_CHECK_WPASYNC is effectively adding PAGE_IS_WPASYNC to the
> required categories.)
Right. But we don't want to perform GET in case of ResetWriteWatch(). It'll
be wasted effort to perform GET as well when we don't care about the dirty
status of the pages.


> 
> [...]
> 
>>> 2. the implementation:
>>>   a. gather the page-categorising and write-protecting code in one place;
>> Agreed.
>>
>>>   b. optimization: add whole-vma skipping for WP usecase;
>> I don't know who can benefit from it. Do you have any user in mind? When
>> the user come of this optimization, this can be added later.
> 
> This is for users of WP that want to ignore WP for non-registered ranges
> instead of erroring out on them. (I anticipate CRIU to use this.)
> 
>>>   c. extracted output limiting code to pagemap_scan_output();
>> If user passes half THP, current code wouldn't split huge page and WP the
>> whole THP. We would loose the dirty state of other half huge page. This is
>> bug. consoliding the output limiting code looks optimal, but we'll need to
>> same limiting code to detect if full THP hasn't been passed in case of THP
>> and HugeTLB.
> 
> For THP indeed - the code should check `end != start + HPAGE_SIZE`
> instead of `ret == -ENOSPC`.
Yeah, this need to be fixed.

> 
> For HugeTLB there is a check that returns EINVAL when trying to WP
> a partial page. I think I didn't change that part.
> 
>>>   d. extracted range coalescing to pagemap_scan_push_range();
>> My old pagemap_scan_output has become pagemap_scan_push_range().
> 
> Indeed. I did first push the max_pages check in, hence the 'extracting'
> later.
> 
>>>   e. extracted THP entry handling to pagemap_scan_thp_entry();
>> Good. But I didn't found value in seperating it just like other historic
>> pagemap code.
> 
> This is to avoid having to much indentation and long functions that do
> many things at once.
> 
>>>   f. added a shortcut for non-WP hugetlb scan; avoids conditional
>>> 	locking;
>> Yeah, some if conditions have been removed. But overall did couple of calls
>> to is_interesting and scan_output functions instead of just one.
> 
> Yes, there are now two pairs instead of one. I see that I haven't pushed
> the is_interesting calls into scan_output. This is now trivial:
> 	if (!interesting...) {
> 		*end = start;
> 		return 0;
> 	}
This can be the 3rd thing to fix.

Is it possible for you to fix the above mentioned 3 things and send the
patch for my testing:
1 Make GET optional
2 Detect partial THP WP operation and split
3 Optimization of moving this interesting logic to output() function

Please let me know if you cannot make the above fixes. I'll mix my patch
version and your patch and fix these things up.

> and could save some typing (but would need a different name for
> scan_output as it would do filter & output), but I'm not sure about
> readability.
> 
> Best Regards
> Michał Mirosław

-- 
BR,
Muhammad Usama Anjum

  reply	other threads:[~2023-07-21 17:50 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-13 10:14 [PATCH v25 0/5] Implement IOCTL to get and optionally clear info about PTEs Muhammad Usama Anjum
2023-07-13 10:14 ` [PATCH v25 1/5] userfaultfd: UFFD_FEATURE_WP_ASYNC Muhammad Usama Anjum
2023-07-13 10:14 ` [PATCH v25 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs Muhammad Usama Anjum
2023-07-17 17:26   ` Andrei Vagin
2023-07-18  8:18     ` Muhammad Usama Anjum
2023-07-18 16:08       ` Andrei Vagin
2023-07-18 16:27         ` Muhammad Usama Anjum
2023-07-13 10:14 ` [PATCH v25 3/5] tools headers UAPI: Update linux/fs.h with the kernel sources Muhammad Usama Anjum
2023-07-13 10:14 ` [PATCH v25 4/5] mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL Muhammad Usama Anjum
2023-07-13 10:14 ` [PATCH v25 5/5] selftests: mm: add pagemap ioctl tests Muhammad Usama Anjum
     [not found]   ` <a0b5c6776b2ed91f78a7575649f8b100e58bd3a9.1689881078.git.mirq-linux@rere.qmqm.pl>
2023-07-20 19:50     ` fs/proc/task_mmu: Implement IOCTL for efficient page table scanning Michał Mirosław
2023-07-20 21:12     ` kernel test robot
2023-07-21  2:56     ` kernel test robot
2023-07-21  4:27     ` Muhammad Usama Anjum
2023-07-21 14:49       ` Andrei Vagin
2023-07-21  5:43     ` kernel test robot
2023-07-21  7:18     ` kernel test robot
2023-07-21 10:48     ` Muhammad Usama Anjum
2023-07-21 11:23       ` Michał Mirosław
2023-07-21 17:50         ` Muhammad Usama Anjum [this message]
2023-07-22  0:22           ` Michał Mirosław
2023-07-22  0:24           ` [v2] " Michał Mirosław
2023-07-22 13:55             ` kernel test robot
2023-07-22 14:05             ` kernel test robot
2023-07-24 14:04             ` Muhammad Usama Anjum
2023-07-24 14:38               ` Michał Mirosław
2023-07-24 15:21                 ` Muhammad Usama Anjum
2023-07-24 16:10                   ` Michał Mirosław
2023-07-25  7:23                     ` Muhammad Usama Anjum
2023-07-25  9:09                       ` Muhammad Usama Anjum
2023-07-25  9:11                         ` [v3] " Muhammad Usama Anjum
2023-07-25 18:05                           ` Michał Mirosław
2023-07-26  8:34                             ` Muhammad Usama Anjum
2023-07-26 21:10                               ` Michał Mirosław
2023-07-26 23:06                                 ` Paul Gofman
2023-07-27 11:18                                   ` Michał Mirosław
2023-07-27 11:21                                     ` Michał Mirosław
2023-07-27 17:15                                     ` Paul Gofman
2023-07-27  8:03                                 ` Muhammad Usama Anjum
2023-07-27 11:26                                   ` Michał Mirosław
2023-07-27 11:31                                     ` Muhammad Usama Anjum
2023-07-21 11:29       ` Michał Mirosław
2023-07-21 17:51         ` Muhammad Usama Anjum
2023-08-26 13:07     ` kernel test robot
2023-07-18 16:05 ` [PATCH v25 0/5] Implement IOCTL to get and optionally clear info about PTEs Rogerio Alves

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=382f4435-2088-08ce-20e9-bc1a15050861@collabora.com \
    --to=usama.anjum@collabora.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex.sierra@amd.com \
    --cc=avagin@gmail.com \
    --cc=axelrasmussen@google.com \
    --cc=brauner@kernel.org \
    --cc=dan.j.williams@intel.com \
    --cc=david@redhat.com \
    --cc=emmir@google.com \
    --cc=gorcunov@gmail.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=gustavoars@kernel.org \
    --cc=kernel@collabora.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mdanylo@google.com \
    --cc=mirq-linux@rere.qmqm.pl \
    --cc=namit@vmware.com \
    --cc=pasha.tatashin@soleen.com \
    --cc=peterx@redhat.com \
    --cc=pgofman@codeweavers.com \
    --cc=rppt@kernel.org \
    --cc=shuah@kernel.org \
    --cc=shy828301@gmail.com \
    --cc=surenb@google.com \
    --cc=vbabka@suse.cz \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    --cc=yun.zhou@windriver.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).