All of lore.kernel.org
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: "Muhammad Usama Anjum" <usama.anjum@collabora.com>,
	"Michał Mirosław" <emmir@google.com>,
	"Andrei Vagin" <avagin@gmail.com>,
	"Danylo Mocherniuk" <mdanylo@google.com>,
	"Alexander Viro" <viro@zeniv.linux.org.uk>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Suren Baghdasaryan" <surenb@google.com>,
	"Greg KH" <gregkh@linuxfoundation.org>,
	"Christian Brauner" <brauner@kernel.org>,
	"Peter Xu" <peterx@redhat.com>, "Yang Shi" <shy828301@gmail.com>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Zach O'Keefe" <zokeefe@google.com>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	"Gustavo A. R. Silva" <gustavoars@kernel.org>,
	"Dan Williams" <dan.j.williams@intel.com>,
	kernel@collabora.com,
	"Gabriel Krisman Bertazi" <krisman@collabora.com>,
	"Peter Enderborg" <peter.enderborg@sony.com>,
	"open list : KERNEL SELFTEST FRAMEWORK"
	<linux-kselftest@vger.kernel.org>,
	"Shuah Khan" <shuah@kernel.org>,
	"open list" <linux-kernel@vger.kernel.org>,
	"open list : PROC FILESYSTEM" <linux-fsdevel@vger.kernel.org>,
	"open list : MEMORY MANAGEMENT" <linux-mm@kvack.org>,
	"Paul Gofman" <pgofman@codeweavers.com>
Subject: Re: [PATCH v6 0/3] Implement IOCTL to get and/or the clear info about PTEs
Date: Wed, 9 Nov 2022 11:34:43 +0100	[thread overview]
Message-ID: <9c167d01-ef09-ec4e-b4a1-2fff62bf01fe@redhat.com> (raw)
In-Reply-To: <20221109102303.851281-1-usama.anjum@collabora.com>

On 09.11.22 11:23, Muhammad Usama Anjum wrote:
> Changes in v6:
> - Updated the interface and made cosmetic changes
> 
> Original Cover Letter in v5:
> Hello,
> 
> This patch series implements IOCTL on the pagemap procfs file to get the
> information about the page table entries (PTEs). The following operations
> are supported in this ioctl:
> - Get the information if the pages are soft-dirty, file mapped, present
>    or swapped.
> - Clear the soft-dirty PTE bit of the pages.
> - Get and clear the soft-dirty PTE bit of the pages atomically.
> 
> Soft-dirty PTE bit of the memory pages can be read by using the pagemap
> procfs file. The soft-dirty PTE bit for the whole memory range of the
> process can be cleared by writing to the clear_refs file. There are other
> methods to mimic this information entirely in userspace with poor
> performance:
> - The mprotect syscall and SIGSEGV handler for bookkeeping
> - The userfaultfd syscall with the handler for bookkeeping
> Some benchmarks can be seen here[1]. This series adds features that weren't
> present earlier:
> - There is no atomic get soft-dirty PTE bit status and clear operation
>    possible.
> - The soft-dirty PTE bit of only a part of memory cannot be cleared.
> 
> Historically, soft-dirty PTE bit tracking has been used in the CRIU
> project. The procfs interface is enough for finding the soft-dirty bit
> status and clearing the soft-dirty bit of all the pages of a process.
> We have the use case where we need to track the soft-dirty PTE bit for
> only specific pages on demand. We need this tracking and clear mechanism
> of a region of memory while the process is running to emulate the
> getWriteWatch() syscall of Windows. This syscall is used by games to
> keep track of dirty pages to process only the dirty pages.
> 
> The information related to pages if the page is file mapped, present and
> swapped is required for the CRIU project[2][3]. The addition of the
> required mask, any mask, excluded mask and return masks are also required
> for the CRIU project[2].
> 
> The IOCTL returns the addresses of the pages which match the specific masks.
> The page addresses are returned in struct page_region in a compact form.
> The max_pages is needed to support a use case where user only wants to get
> a specific number of pages. So there is no need to find all the pages of
> interest in the range when max_pages is specified. The IOCTL returns when
> the maximum number of the pages are found. The max_pages is optional. If
> max_pages is specified, it must be equal or greater than the vec_size.
> This restriction is needed to handle worse case when one page_region only
> contains info of one page and it cannot be compacted. This is needed to
> emulate the Windows getWriteWatch() syscall.
> 
> Some non-dirty pages get marked as dirty because of the kernel's
> internal activity (such as VMA merging as soft-dirty bit difference isn't
> considered while deciding to merge VMAs). The dirty bit of the pages is
> stored in the VMA flags and in the per page flags. If any of these two bits
> are set, the page is considered to be soft dirty. Suppose you have cleared
> the soft dirty bit of half of VMA which will be done by splitting the VMA
> and clearing soft dirty bit flag in the half VMA and the pages in it. Now
> kernel may decide to merge the VMAs again. So the half VMA becomes dirty
> again. This splitting/merging costs performance. The application receives
> a lot of pages which aren't dirty in reality but marked as dirty.
> Performance is lost again here. Also sometimes user doesn't want the newly
> allocated memory to be marked as dirty. PAGEMAP_NO_REUSED_REGIONS flag
> solves both the problems. It is used to not depend on the soft dirty flag
> in the VMA flags. So VMA splitting and merging doesn't happen. It only
> depends on the soft dirty bit of the individual pages. Thus by using this
> flag, there may be a scenerio such that the new memory regions which are
> just created, doesn't look dirty when seen with the IOCTL, but look dirty
> when seen from procfs. This seems okay as the user of this flag know the
> implication of using it.

Please separate that part out from the other changes; I am still not 
convinced that we want this and what the semantical implications are.

Let's take a look at an example: can_change_pte_writable()

	/* Do we need write faults for softdirty tracking? */
	if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
		return false;

We care about PTE softdirty tracking, if it is enabled for the VMA. 
Tracking is enabled if: vma_soft_dirty_enabled()

	/*
	 * Soft-dirty is kind of special: its tracking is enabled when
	 * the vma flags not set.
	 */
	return !(vma->vm_flags & VM_SOFTDIRTY);

Consequently, if VM_SOFTDIRTY is set, we are not considering the 
soft_dirty PTE bits accordingly.


I'd suggest moving forward without this controversial 
PAGEMAP_NO_REUSED_REGIONS functionality for now, and preparing it as a 
clear add-on we can discuss separately.

-- 
Thanks,

David / dhildenb


  parent reply	other threads:[~2022-11-09 10:35 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-09 10:23 [PATCH v6 0/3] Implement IOCTL to get and/or the clear info about PTEs Muhammad Usama Anjum
2022-11-09 10:23 ` [PATCH v6 1/3] fs/proc/task_mmu: update functions to clear the soft-dirty PTE bit Muhammad Usama Anjum
2022-11-09 10:23 ` [PATCH v6 2/3] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs Muhammad Usama Anjum
2022-11-09 23:54   ` Andrei Vagin
2022-11-11 10:10     ` Muhammad Usama Anjum
2022-11-10 17:58   ` kernel test robot
2022-11-11 17:13   ` kernel test robot
2022-11-11 17:53     ` Muhammad Usama Anjum
2022-11-18  1:32   ` kernel test robot
2022-12-12 20:42   ` Cyrill Gorcunov
2022-12-13 13:04     ` Muhammad Usama Anjum
2022-12-13 22:22       ` Cyrill Gorcunov
2022-11-09 10:23 ` [PATCH v6 3/3] selftests: vm: add pagemap ioctl tests Muhammad Usama Anjum
2022-11-09 10:34 ` David Hildenbrand [this message]
2022-11-11  7:08   ` [PATCH v6 0/3] Implement IOCTL to get and/or the clear info about PTEs Muhammad Usama Anjum
2022-11-14 15:46     ` David Hildenbrand
2022-11-21 15:00       ` Muhammad Usama Anjum
2022-11-21 15:55         ` David Hildenbrand
2022-11-30 11:42           ` Muhammad Usama Anjum
2022-11-30 12:10             ` David Hildenbrand
2022-12-05 15:29               ` Muhammad Usama Anjum
2022-12-05 15:39                 ` David Hildenbrand
2022-11-23 14:11 ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9c167d01-ef09-ec4e-b4a1-2fff62bf01fe@redhat.com \
    --to=david@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=avagin@gmail.com \
    --cc=brauner@kernel.org \
    --cc=dan.j.williams@intel.com \
    --cc=emmir@google.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=gustavoars@kernel.org \
    --cc=kernel@collabora.com \
    --cc=krisman@collabora.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mdanylo@google.com \
    --cc=peter.enderborg@sony.com \
    --cc=peterx@redhat.com \
    --cc=pgofman@codeweavers.com \
    --cc=shuah@kernel.org \
    --cc=shy828301@gmail.com \
    --cc=surenb@google.com \
    --cc=usama.anjum@collabora.com \
    --cc=vbabka@suse.cz \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    --cc=zokeefe@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.