linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Muhammad Usama Anjum <usama.anjum@collabora.com>
To: Peter Xu <peterx@redhat.com>
Cc: "Muhammad Usama Anjum" <usama.anjum@collabora.com>,
	"David Hildenbrand" <david@redhat.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Michał Mirosław" <emmir@google.com>,
	"Andrei Vagin" <avagin@gmail.com>,
	"Danylo Mocherniuk" <mdanylo@google.com>,
	"Paul Gofman" <pgofman@codeweavers.com>,
	"Cyrill Gorcunov" <gorcunov@gmail.com>,
	"Mike Rapoport" <rppt@kernel.org>,
	"Nadav Amit" <namit@vmware.com>,
	"Alexander Viro" <viro@zeniv.linux.org.uk>,
	"Shuah Khan" <shuah@kernel.org>,
	"Christian Brauner" <brauner@kernel.org>,
	"Yang Shi" <shy828301@gmail.com>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	"Yun Zhou" <yun.zhou@windriver.com>,
	"Suren Baghdasaryan" <surenb@google.com>,
	"Alex Sierra" <alex.sierra@amd.com>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Pasha Tatashin" <pasha.tatashin@soleen.com>,
	"Axel Rasmussen" <axelrasmussen@google.com>,
	"Gustavo A . R . Silva" <gustavoars@kernel.org>,
	"Dan Williams" <dan.j.williams@intel.com>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
	"Greg KH" <gregkh@linuxfoundation.org>,
	kernel@collabora.com
Subject: Re: [PATCH v11 4/7] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs
Date: Wed, 15 Mar 2023 21:54:40 +0500	[thread overview]
Message-ID: <3d2d1ba4-bfab-6b3d-f0d6-ae0920ebdcb0@collabora.com> (raw)
In-Reply-To: <ZBHqjBjj6nn1xeTM@x1n>

On 3/15/23 8:55 PM, Peter Xu wrote:
> On Thu, Mar 09, 2023 at 06:57:15PM +0500, Muhammad Usama Anjum wrote:
>> +	for (addr = start; !ret && addr < end; pte++, addr += PAGE_SIZE) {
>> +		pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>> +
>> +		is_writ = !is_pte_uffd_wp(*pte);
>> +		is_file = vma->vm_file;
>> +		is_pres = pte_present(*pte);
>> +		is_swap = is_swap_pte(*pte);
>> +
>> +		pte_unmap_unlock(pte, ptl);
>> +
>> +		ret = pagemap_scan_output(is_writ, is_file, is_pres, is_swap,
>> +					  p, addr, 1);
>> +		if (ret)
>> +			break;
>> +
>> +		if (PM_SCAN_OP_IS_WP(p) && is_writ &&
>> +		    uffd_wp_range(walk->mm, vma, addr, PAGE_SIZE, true) < 0)
>> +			ret = -EINVAL;
>> +	}
> 
> This is not real atomic..
> 
> Taking the spinlock for eacy pte is not only overkill but wrong in
> atomicity because the pte can change right after spinlock unlocked.
Let me explain. It seems like wrong, but it isn't. In my rigorous testing,
it didn't show any side-effect.  Here we are finding out if a page is
written. If page is written, only then we clear it. Lets look at the
different possibilities here:
- If a page isn't written, we'll not clear it.
- If a page is written and there isn't any race, we'll clear written-to
flag by write protecting it.
- If a page is written but before clearing it, data is written again to the
page. The page would remain written and we'll clear it.
- If a page is written but before clearing it, it gets write protected,
we'll still write protected it. There is double right protection here, but
no side-effect.

Lets turn this into a truth table for easier understanding. Here first
coulmn and thrid column represents this above code. 2nd column represents
any other thread interacting with the page.

If page is written/dirty	some other task interacts	wp_page
no				does nothing			no
no				writes to page			no
no				wp the page			no
yes				does nothing			yes
yes				write to page			yes
yes				wp the page			yes

As you can see there isn't any side-effect happening. We aren't over doing
the wp or under-doing the write-protect.

Even if we were doing something wrong here and I bring the lock over all of
this, the pages get become written or wp just after unlocking. It is
expected. This current implementation doesn't seem to be breaking this.

Is my understanding wrong somewhere here? Can you point out?

Previous to this current locking design were either buggy or slower when
multiple threads were working on same pages. Current implementation removes
the limitations:
- The memcpy inside pagemap_scan_output is happening with pte unlocked.
- We are only wp a page if we have noted this page to be dirty
- No mm write lock is required. Only read lock works fine just like
userfaultfd_writeprotect() takes only read lock.

There is only one con here that we are locking and unlocking the pte lock
again and again.

Please have a look at my explanation and let me know what do you think.

> 
> Unfortunately you also cannot reuse uffd_wp_range() because that's not
> atomic either, my fault here.  Probably I was thinking mostly from
> soft-dirty pov on batching the collect+reset.
> 
> You need to take the spin lock, collect whatever bits, set/clear whatever
> bits, only until then release the spin lock.
> 
> "Not atomic" means you can have some page got dirtied but you could miss
> it.  Depending on how strict you want, I think it'll break apps like CRIU
> if strict atomicity needed for migrating a process.  If we want to have a
> new interface anyway, IMHO we'd better do that in the strict way.
In my rigorous multi-threaded testing where a lots of threads are working
on same set of pages, we aren't losing even a single update. I can share
the test if you want.

> 
> Same comment applies to the THP handling (where I cut from the context).
> 

-- 
BR,
Muhammad Usama Anjum

  reply	other threads:[~2023-03-15 16:55 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-09 13:57 [PATCH v11 0/7] Implement IOCTL to get and optionally clear info about PTEs Muhammad Usama Anjum
2023-03-09 13:57 ` [PATCH v11 1/7] userfaultfd: Add UFFD WP Async support Muhammad Usama Anjum
2023-03-16 19:20   ` Peter Xu
2023-03-17 14:00     ` Muhammad Usama Anjum
2023-03-21 12:21     ` Muhammad Usama Anjum
2023-03-21 19:25       ` Peter Xu
2023-03-23 15:43         ` Muhammad Usama Anjum
2023-03-09 13:57 ` [PATCH v11 2/7] userfaultfd: Define dummy uffd_wp_range() Muhammad Usama Anjum
2023-03-16  7:02   ` Mike Rapoport
2023-03-16 18:05     ` Muhammad Usama Anjum
2023-03-09 13:57 ` [PATCH v11 3/7] userfaultfd: update documentation to describe UFFD_FEATURE_WP_ASYNC Muhammad Usama Anjum
2023-03-09 13:57 ` [PATCH v11 4/7] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs Muhammad Usama Anjum
2023-03-13 16:02   ` Michał Mirosław
2023-03-16 17:53     ` Muhammad Usama Anjum
2023-03-16 21:28       ` Michał Mirosław
2023-03-17 12:43         ` Muhammad Usama Anjum
2023-03-17 14:15           ` Michał Mirosław
2023-03-20  6:08             ` Muhammad Usama Anjum
2023-03-15 15:55   ` Peter Xu
2023-03-15 16:54     ` Muhammad Usama Anjum [this message]
2023-03-15 19:53       ` Peter Xu
2023-03-16  5:17         ` Muhammad Usama Anjum
2023-03-09 13:57 ` [PATCH v11 5/7] tools headers UAPI: Update linux/fs.h with the kernel sources Muhammad Usama Anjum
2023-03-09 13:57 ` [PATCH v11 6/7] mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL Muhammad Usama Anjum
2023-03-09 13:57 ` [PATCH v11 7/7] selftests: mm: add pagemap ioctl tests Muhammad Usama Anjum
2023-03-09 19:58 ` [PATCH v11 0/7] Implement IOCTL to get and optionally clear info about PTEs Andrew Morton
2023-03-09 22:24   ` Muhammad Usama Anjum
2023-03-20 18:30   ` Andrei Vagin
2023-03-21 12:41     ` Mike Rapoport
2023-03-21 15:10       ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3d2d1ba4-bfab-6b3d-f0d6-ae0920ebdcb0@collabora.com \
    --to=usama.anjum@collabora.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex.sierra@amd.com \
    --cc=avagin@gmail.com \
    --cc=axelrasmussen@google.com \
    --cc=brauner@kernel.org \
    --cc=dan.j.williams@intel.com \
    --cc=david@redhat.com \
    --cc=emmir@google.com \
    --cc=gorcunov@gmail.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=gustavoars@kernel.org \
    --cc=kernel@collabora.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mdanylo@google.com \
    --cc=namit@vmware.com \
    --cc=pasha.tatashin@soleen.com \
    --cc=peterx@redhat.com \
    --cc=pgofman@codeweavers.com \
    --cc=rppt@kernel.org \
    --cc=shuah@kernel.org \
    --cc=shy828301@gmail.com \
    --cc=surenb@google.com \
    --cc=vbabka@suse.cz \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    --cc=yun.zhou@windriver.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).