Re: [LSF/MM/BPF TOPIC] Generic page write protection

From: Jerome Glisse <jglisse@redhat.com>
To: Gao Xiang <hsiangkao@aol.com>
Cc: lsf-pc@lists.linux-foundation.org,
	Andrea Arcangeli <aarcange@redhat.com>,
	linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [LSF/MM/BPF TOPIC] Generic page write protection
Date: Tue, 21 Jan 2020 21:21:18 -0800
Message-ID: <20200122052118.GE76712@redhat.com>
In-Reply-To: <20200122042832.GA6542@hsiangkao-HP-ZHAN-66-Pro-G1>

On Wed, Jan 22, 2020 at 12:28:39PM +0800, Gao Xiang wrote:
> Hi Jérôme,
> 
> On Tue, Jan 21, 2020 at 06:32:22PM -0800, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > 
> 
> <snip>
> 
> > 
> > To avoid any regression risks the page->mapping field is left intact as
> > today for non write protect pages. This means that if you do not use the
> > page write protection mechanism then it can not regress. This is achieve
> > by using an helper function that take the mapping from the context
> > (current function parameter, see above on how function are updated) and
> > the struct page. If the page is not write protected then it uses the
> > mapping from the struct page (just like today). The only difference
> > between before and after the patchset is that all fs functions that do
> > need the mapping for a page now also do get it as a parameter but only
> > use the parameter mapping pointer if the page is write protected.
> > 
> > Note also that i do not believe that once confidence is high that we
> > always passdown the correct mapping down each callstack, it does not
> > mean we will be able to get rid of the struct page mapping field.
> 
> This feature is awesome and I might have some premature words here...
> 
> In short, are you suggesting completely getting rid of all way to access
> mapping directly from struct page (other than by page->private or something
> else like calling trace)?

No, all accesses to page->mapping are replaced by:
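    struct address_space *fs_page_mapping(struct page *page,
                                          struct address_space *mapping)
    {
        /* Common case: not write protected, use page->mapping as today. */
        if (likely(!PageIsWriteProtected(page)))
            return page->mapping;
        /* Write protected: trust the mapping passed down by the caller. */
        return mapping;
    }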

All functions that were doing a direct dereference are updated to use this
helper. If the function already has the mapping in its context then it is
easy (there are a lot of places like that, because a file, inode or
mapping is available from the function context).

If a function does not have a file, inode or mapping in its context then a
new mapping parameter is added to that function and all call sites are
updated (and this does recurse, i.e. if a call site does not have a file,
inode or mapping then a mapping parameter is added to it too ...).
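
To make the conversion above concrete, here is a minimal userspace sketch.
The struct layouts, the write_protected field and the example_* functions
are hypothetical stand-ins for illustration only; fs_page_mapping() follows
the helper shown earlier in this mail, without the branch hint.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel structures (hypothetical, trimmed). */
struct address_space { int id; };
struct inode { struct address_space *i_mapping; };
struct page {
    struct address_space *mapping;
    bool write_protected;           /* stand-in for the real page flag */
};

bool PageIsWriteProtected(const struct page *page)
{
    return page->write_protected;
}

/* Same logic as the helper in the mail, minus the branch hint. */
struct address_space *fs_page_mapping(struct page *page,
                                      struct address_space *mapping)
{
    if (!PageIsWriteProtected(page))
        return page->mapping;
    return mapping;
}

/*
 * Leaf function that used to read page->mapping directly. It had no file,
 * inode or mapping in its context, so it gains a mapping parameter.
 */
int example_leaf(struct address_space *mapping, struct page *page)
{
    return fs_page_mapping(page, mapping)->id;
}

/* Call site: it has an inode, so it passes the mapping down from there. */
int example_caller(struct inode *inode, struct page *page)
{
    return example_leaf(inode->i_mapping, page);
}
```

For a page that is not write protected the parameter is ignored and the
result is whatever page->mapping says, which is the no-regression property
described above.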

This takes care of all fs code. The mm code is split between code that
deals with a vma, where we can get the mapping from the vma, and mm code
that just wants to walk all the CPU ptes pointing to the page. In this
latter case we just need to provide CPU pte walkers for write protected
pages (like KSM does today).

The block device code only needs the mapping on io error, and there are
different strategies depending on the individual fs. fs using buffer_head
can easily be updated. For others there are different solutions, and they
can be updated one at a time with a tailored solution.

> I'm not sure if all cases can be handled without page->mapping easily (or
> handled effectively) since mapping field could also be used to indicate/judge
> truncated pages or some other filesystem specific states (okay, I think there
> could be some replacement, but it seems a huge project...)

I forgot to talk about truncate: all places that test for truncate are
updated to:
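    bool fs_page_is_truncated(struct page *page,
                              struct address_space *mapping)
    {
        /* Common case: not write protected, same test as today. */
        if (likely(!PageIsWriteProtected(page)))
            return !page->mapping || mapping != page->mapping;
        /* Write protected: ask the common write protect code. */
        return wp_page_is_protected(page, mapping);
    }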

Where wp_page_is_protected() will use common write protect mm code
(look at mm/ksm.c as it will be mostly that) to determine if the page
has been truncated. Also code doing truncation will have to special
case write protected pages but that's easy enough.
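
A rough userspace sketch of how the two branches of the truncate test could
behave. Everything except the names fs_page_is_truncated() and
wp_page_is_protected() is an assumption: struct wp_info, the wp field and
example_wp_truncate() are hypothetical, and the body of
wp_page_is_protected() is only one guess at what the common write protect
code might do (it is not taken from mm/ksm.c).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WP_MAX_MAPPINGS 4

struct address_space { int id; };

/* Hypothetical record of the mappings still sharing a write protected page. */
struct wp_info {
    struct address_space *mappings[WP_MAX_MAPPINGS];
    size_t count;
};

struct page {
    struct address_space *mapping;  /* valid only when not write protected */
    struct wp_info *wp;             /* hypothetical, set when write protected */
};

bool PageIsWriteProtected(const struct page *page)
{
    return page->wp != NULL;
}

/*
 * Name taken from the mail; the body is an assumption. Following the mail's
 * usage, it returns true when the page no longer belongs to @mapping.
 */
bool wp_page_is_protected(const struct page *page,
                          const struct address_space *mapping)
{
    for (size_t i = 0; i < page->wp->count; i++)
        if (page->wp->mappings[i] == mapping)
            return false;
    return true;
}

bool fs_page_is_truncated(const struct page *page,
                          const struct address_space *mapping)
{
    if (!PageIsWriteProtected(page))
        return !page->mapping || mapping != page->mapping;
    return wp_page_is_protected(page, mapping);
}

/* Truncation side: drop @mapping from the write protected page's record. */
void example_wp_truncate(struct page *page, const struct address_space *mapping)
{
    struct wp_info *wp = page->wp;

    for (size_t i = 0; i < wp->count; i++) {
        if (wp->mappings[i] == mapping) {
            /* Swap the last entry into the freed slot. */
            wp->mappings[i] = wp->mappings[--wp->count];
            return;
        }
    }
}
```

The point of the sketch is only that truncating one mapping must not make
the page look truncated to the other mappings that still share it.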

> Currently, page->private is a per-page user-defined field, yet I don't think
> it could always be used as a pointer pointing to some structure. It can be
> simply used to store some unsigned long values for some kinds of filesystem
> pages as well...

For fs that use buffer_head i change the buffer_head struct to store the
mapping and not the block_device. For other fs it will depend on the
individual fs, but i am not changing page->private; i might only change
the struct that page->private points to for that specific fs.
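
As an illustration of the buffer_head idea, a minimal userspace sketch. The
field name b_mapping, the bdev shortcut inside struct address_space, the
wb_err field and the example_* functions are all hypothetical; the real
structures carry many more fields and the real route from a mapping to its
block device is longer.

```c
#include <assert.h>
#include <stddef.h>

struct block_device { int dev; };
struct address_space {
    struct block_device *bdev;  /* hypothetical shortcut to the backing device */
    int wb_err;                 /* stand-in for the mapping's error state */
};
struct page { struct address_space *mapping; };

/* Trimmed buffer_head: b_mapping is a hypothetical field replacing b_bdev. */
struct buffer_head {
    struct page *b_page;
    struct address_space *b_mapping;
};

/* The block device is still reachable, now through the mapping. */
struct block_device *example_bh_bdev(const struct buffer_head *bh)
{
    return bh->b_mapping->bdev;
}

/* io error path: record the error without touching bh->b_page->mapping. */
void example_bh_write_error(struct buffer_head *bh, int err)
{
    bh->b_mapping->wb_err = err;
}
```

With the mapping stored in the buffer_head, the io error path no longer
depends on page->mapping, so it keeps working when the page is write
protected.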

> 
> It might some ineffective to convert such above usage to individual per-page
> structure pointers --- from cacheline or extra memory overhead view...
> 
> So I think at least there could be some another way to get its content
> source (inode or sub-inode granularity, a reverse way) effectively...
> by some field in struct page directly or indirectly...
> 
> I agree that the usage of page->mapping field is complicated for now.
> I'm looking forward some unique way to mark the page type for a filesystem
> to use (inode or fs internal special pages) or even extend to analymous
> pages [1]. However, it seems a huge project to keep from some regression...

Note that page->mapping stays _untouched_ if the page is not write
protected, so there is no memory lookup overhead; the only overhead is the
extra branch to test if the page is write protected or not.

So if you do not use the write protection feature then you can not
regress, i.e. page->mapping is untouched and that's what gets used, like
it is today. So it can not regress unless i make a stupid mistake, but
that's what review is for ;)).

> 
> I'm interested in related stuffs, some conclusion and I saw the article of
> LSF/MM 2018 although my English isn't good...
> 
> If something wrong, please kindly point out...
> 
> [1] https://lore.kernel.org/r/20191030172234.GA7018@hsiangkao-HP-ZHAN-66-Pro-G1

Missed that thread, thank you for the pointer, i have some reading to do :)

Cheers,
Jérôme Glisse