[LSF/MM/BPF TOPIC] Generic page write protection

From: jglisse@redhat.com
To: lsf-pc@lists.linux-foundation.org
Cc: "Jérôme Glisse" <jglisse@redhat.com>,
	"Andrea Arcangeli" <aarcange@redhat.com>,
	linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org,
	linux-mm@kvack.org
Subject: [LSF/MM/BPF TOPIC] Generic page write protection
Date: Tue, 21 Jan 2020 18:32:22 -0800
Message-ID: <20200122023222.75347-1-jglisse@redhat.com>

From: Jérôme Glisse <jglisse@redhat.com>

Provide a generic way to write protect pages (à la KSM) to enable new mm
optimizations:
    - KSM (Kernel Samepage Merging) to deduplicate pages (for file
      backed pages too, not only anonymous memory like today)
    - NUMA page duplication (read only duplication) across multiple
      different physical pages. For instance shared library code
      having a copy on each NUMA node. Or, in cases like GPU/FPGA,
      duplicating memory read only inside the local device memory.
    ...

Note that this write protection is intended to be breakable at any time,
within a reasonable delay (like KSM today), so that we never block
anything that needs to write to the page for longer than necessary.

The goal is to provide a mechanism that works for both anonymous and
file backed memory. For this we need a pointer inside struct page.
For anonymous memory KSM uses the anon_vma field, which corresponds
to the mapping field for file backed pages.
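
As a rough illustration of why the mapping field is the natural home for
this pointer: the kernel already overloads page->mapping by tagging its
low bits (PAGE_MAPPING_ANON and PAGE_MAPPING_KSM in
include/linux/page-flags.h). The userspace sketch below models that
tagging with mock types. struct page_model and all the helper names are
made up for illustration, this is not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Low-bit tags, mirroring the values in include/linux/page-flags.h. */
#define PAGE_MAPPING_ANON    0x1UL
#define PAGE_MAPPING_MOVABLE 0x2UL
#define PAGE_MAPPING_KSM     (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
#define PAGE_MAPPING_FLAGS   (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)

/* Hypothetical stand-ins for the kernel structures. */
struct address_space { int id; };
struct anon_vma { int id; };
struct ksm_private { int id; };          /* whatever KSM hangs off the page */
struct page_model { uintptr_t mapping; };

/* File backed page: plain pointer, no tag bits set. */
static void set_file_mapping(struct page_model *p, struct address_space *m)
{
    p->mapping = (uintptr_t)m;
}

/* Anonymous page: anon_vma pointer tagged with the ANON bit. */
static void set_anon_mapping(struct page_model *p, struct anon_vma *av)
{
    p->mapping = (uintptr_t)av | PAGE_MAPPING_ANON;
}

/* KSM page: private pointer tagged with both low bits. */
static void set_ksm_mapping(struct page_model *p, struct ksm_private *k)
{
    p->mapping = (uintptr_t)k | PAGE_MAPPING_KSM;
}

static int page_is_anon(const struct page_model *p)
{
    return (p->mapping & PAGE_MAPPING_ANON) != 0;
}

static int page_is_ksm(const struct page_model *p)
{
    return (p->mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_KSM;
}

/* Returns the address_space only when the field really holds one. */
static struct address_space *page_file_mapping(const struct page_model *p)
{
    if (p->mapping & PAGE_MAPPING_FLAGS)
        return NULL;
    return (struct address_space *)p->mapping;
}
```

Once the field holds a tagged write protection pointer, the
address_space can no longer be read back from the page, which is why
file backed code paths would have to get the mapping from somewhere
else.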

So to allow generic write protection for file backed pages we need to
avoid relying on the struct page mapping field in the various kernel
code paths that use it today.

The page->mapping field is used in 5 different ways:
 [1]- Functions operating on a file: we can get the mapping from the file
      (the issue here is that we might need to pass the file down the
      call stack).

 [2]- Core/arch mm functions: those do not care about the file (if they
      do then it means they are vma related and we can get the mapping
      from the vma). Those functions only want to be able to walk all
      the ptes pointing to the page (for instance memory compaction,
      memory reclaim, ...). We can provide the exact same functionality
      for write protected pages (like KSM does today).

 [3]- Block layer when I/O fails. This depends on the fs. For instance,
      for filesystems that use buffer_head we can update buffer_head to
      store the mapping instead of the block_device, as we can get the
      block_device from the mapping but not the mapping from the
      block_device.

      So solving this is mostly filesystem specific, but I have not seen
      any fs that could not be updated properly so that the block layer
      can report I/O failures without relying on page->mapping.

 [4]- Debugging (mostly procfs/sysfs files to dump memory states). Those
      do not need the mapping per se, we just need to report page states
      (and thus write protection information if the page is write
      protected).

 [5]- GUP (get user pages): if something calls GUP in write mode then we
      need to break write protection (like KSM today). A GUPed page
      should not be write protected, as we do not know what the GUPer is
      doing with the page.

Most of the patchset deals with [1], [2] and [3] ([4] and [5] are mostly
trivial).

For [1] we only need to pass the mapping down to all fs and vfs callback
functions (this is mostly achieved with coccinelle). Roughly speaking the
patches are generated with the following pseudo code:

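/* Add a mapping parameter to func and fix up every call site. */
add_mapping_parameter(func)
{
    function_add_parameter(func, mapping);

    for_each_function_calling (caller, func) {
        calling_add_parameter(caller, func, mapping);

        /* Caller already has a mapping or a file to take it from. */
        if (function_parameters_contains(caller, mapping|file))
            continue;

        /* Otherwise the caller needs the parameter too: recurse. */
        add_mapping_parameter(caller);
    }
}

/* Entry point: convert every fs function that uses page->mapping. */
passdown_mapping()
{
    for_each_function_in_fs (func, fs_functions) {
        /* Skip functions that never touch page->mapping. */
        if (!function_body_contains(func, page->mapping))
            continue;

        /* Skip functions that can already reach the mapping. */
        if (function_parameters_contains(func, mapping|file))
            continue;

        add_mapping_parameter(func);
    }
}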
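
To make the effect of that transformation concrete, here is a small
before/after sketch in plain userspace C. The structures are minimal
mocks and the function names (mark_dirty_before, mark_dirty_after,
write_end_after) are invented for this example. They are not taken from
the patchset.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal mock types, just enough to show the shape of the change. */
struct address_space { int nr_dirty; };
struct file { struct address_space *f_mapping; };
struct page { struct address_space *mapping; };

/* Before: the leaf function digs the mapping out of the page. */
static void mark_dirty_before(struct page *page)
{
    page->mapping->nr_dirty++;
}

/*
 * After: the mapping is an explicit parameter, so the function no
 * longer reads page->mapping. The page parameter is kept, as the real
 * function would still operate on the page.
 */
static void mark_dirty_after(struct address_space *mapping,
                             struct page *page)
{
    (void)page;
    mapping->nr_dirty++;
}

/*
 * The caller already receives a file, so it can supply the mapping
 * itself and the recursion in add_mapping_parameter() stops here.
 */
static void write_end_after(struct file *file, struct page *page)
{
    mark_dirty_after(file->f_mapping, page);
}
```

With the "after" form, the call chain keeps working even when
page->mapping has been repurposed, because nothing below the file
based entry point reads it anymore.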

For [2] KSM is generalized and extended so that both anonymous and file
backed pages can be handled by a common write protected page case.

For [3] it depends on the filesystem (filesystems that use buffer_head
are easily handled by storing the mapping in the buffer_head struct).
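
A minimal sketch of the buffer_head idea, again with userspace mocks.
The field name b_mapping and the helper names are assumptions made for
this example, and the error reporting is reduced to a single int where
the kernel would record the error on the mapping. The pointer chain
mapping->host->i_sb->s_bdev uses real field names but is simplified.

```c
#include <assert.h>
#include <stddef.h>

/* Mock versions of the kernel structures involved. */
struct block_device { int dev; };
struct super_block { struct block_device *s_bdev; };
struct inode { struct super_block *i_sb; };
struct address_space {
    struct inode *host;
    int io_error;                    /* stand-in for the mapping error state */
};

/* Hypothetical buffer_head that stores the mapping, not the bdev. */
struct buffer_head_model {
    struct address_space *b_mapping; /* assumed field name */
};

/* The block device is still reachable by walking from the mapping. */
static struct block_device *bh_bdev(const struct buffer_head_model *bh)
{
    return bh->b_mapping->host->i_sb->s_bdev;
}

/* I/O completion can report a failure without touching page->mapping. */
static void bh_end_io_error(struct buffer_head_model *bh)
{
    bh->b_mapping->io_error = 1;
}
```

The design point is the asymmetry stated above: mapping to block_device
is a pointer walk, while block_device to mapping has no unique answer.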

To avoid any regression risk the page->mapping field is left intact, as
today, for non write protected pages. This means that if you do not use
the page write protection mechanism then nothing can regress. This is
achieved by using a helper function that takes the mapping from the
context (current function parameter, see above on how functions are
updated) and the struct page. If the page is not write protected then it
uses the mapping from the struct page (just like today). The only
difference between before and after the patchset is that all fs
functions that need the mapping for a page now also get it as a
parameter, but only use the parameter mapping pointer if the page is
write protected.
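
The helper described above could look roughly like the following
userspace sketch. The name fs_page_mapping, the write_protected flag
and struct page_model are all hypothetical. How the real patchset
detects a write protected page is not specified here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct address_space { int id; };

/* Mock page: the flag stands in for the real write protection test. */
struct page_model {
    struct address_space *mapping;
    bool write_protected;
};

/*
 * Pick the mapping to use for a page. For a regular page this is
 * exactly page->mapping, so behavior is unchanged from today. Only
 * for a write protected page is the mapping passed down by the
 * caller used instead.
 */
static struct address_space *
fs_page_mapping(struct address_space *ctx_mapping,
                const struct page_model *page)
{
    if (!page->write_protected)
        return page->mapping;
    return ctx_mapping;
}
```

Note that for a regular page the context mapping is ignored entirely,
which is what keeps the non write protected path identical to the
current code even if a caller passes down a wrong mapping.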

Note also that even once confidence is high that we always pass the
correct mapping down each call stack, I do not believe it means we will
be able to get rid of the struct page mapping field.

I posted a patchset before [*1] and I intend to post an updated patchset
before LSF/MM/BPF. I also talked about this at LSF/MM 2018 [*2]. I still
believe this is a topic that warrants a discussion with FS/MM and
block device folks.

[*1] https://lwn.net/Articles/751050/
     https://cgit.freedesktop.org/~glisse/linux/log/?h=generic-write-protection-rfc
[*2] https://lwn.net/Articles/752564/
