From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Jerome Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	Andrea Arcangeli <aarcange@redhat.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	"Theodore Ts'o" <tytso@mit.edu>, Tejun Heo <tj@kernel.org>,
	Josef Bacik <jbacik@fb.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Jeff Layton <jlayton@redhat.com>
Subject: Re: [RFC PATCH 00/79] Generic page write protection and a solution to page waitqueue
Date: Wed, 18 Apr 2018 09:20:55 -0700	[thread overview]
Message-ID: <20180418162054.GA5197@magnolia> (raw)
In-Reply-To: <20180418155429.GA3476@redhat.com>

On Wed, Apr 18, 2018 at 11:54:30AM -0400, Jerome Glisse wrote:
> On Wed, Apr 18, 2018 at 04:13:37PM +0200, Jan Kara wrote:
> > Hello,
> > 
> > so I finally got to this :)
> > 
> > On Wed 04-04-18 15:17:50, jglisse@redhat.com wrote:
> > > From: Jérôme Glisse <jglisse@redhat.com>
> 
> [...]
> 
> > > ----------------------------------------------------------------------
> > > The Why?
> > > 
> > > I have two objectives: duplicate memory read-only across nodes and/or
> > > devices, and work around PCIE atomic limitations. More on each of
> > > those objectives below. I also want to put forward that it can solve
> > > the page wait list issue, i.e. give each page its own wait list and
> > > thus avoid the long wait list traversal latency recently reported [1].
> > > 
> > > It also allows KSM for file-backed pages (truly generic KSM, even
> > > between anonymous and file-backed pages). I am not sure how useful
> > > this can be; it was not an objective I pursued, just a free side
> > > effect (see below).
> > 
> > I know some people (Matthew Wilcox?) wanted to do something like KSM for
> > file pages - not all virtualization schemes use overlayfs and e.g. if you
> > use reflinks (essentially shared on-disk extents among files) for your
> > container setup, you could save significant amounts of memory with the
> > ability to share pages in page cache among files that are reflinked.
> 
> Yes, I believe there are still use cases where KSM with file-backed pages
> makes sense; I am just not familiar enough with those workloads to know
> how big a deal this is.

Imagine container farms that deploy the base OS image via cp --reflink.
This would be a huge win for btrfs/xfs, but we've all been too terrified
of the memory manager to try it. :)

For those following along at home, we had a track at LSFMM2017 (and
hallway BoFs about this in previous years):
https://lwn.net/Articles/717950/

/me starts to look at this big series, having sent his own yesterday. :)

--D

> > > [1] https://groups.google.com/forum/#!topic/linux.kernel/Iit1P5BNyX8
> > > 
> > > ----------------------------------------------------------------------
> > > Per-page wait list; so long, page_waitqueue()!
> > > 
> > > Not implemented in this RFC, but the logic and pseudo code are at the
> > > bottom of this email.
> > > 
> > > When there is contention on a struct page lock bit, the caller trying
> > > to lock the page adds itself to a waitqueue. The issue here is that
> > > multiple pages share the same wait queue, and on a large system with
> > > a lot of RAM we can quickly end up with a long list of waiters for
> > > different pages (or for the same page) on the same list [1].
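> > > 
> > > For reference, the shared queue scheme today looks roughly like the
> > > following (simplified from mm/filemap.c); every page hashes into one
> > > of a small fixed set of queues, so waiters on unrelated pages land on
> > > the same list:
> > > 
> > >     #define PAGE_WAIT_TABLE_BITS 8
> > >     #define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
> > >     static wait_queue_head_t page_wait_table[PAGE_WAIT_TABLE_SIZE];
> > > 
> > >     /* All pages hashing to the same bucket share one waiter list. */
> > >     static wait_queue_head_t *page_waitqueue(struct page *page)
> > >     {
> > >         return &page_wait_table[hash_ptr(page, PAGE_WAIT_TABLE_BITS)];
> > >     }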
> > > 
> > > The present patchset virtually kills all places that need to access
> > > the page->mapping field; only a handful are left, namely for testing
> > > page truncation and for vmscan. The former can be removed if we reuse
> > > the PG_waiters flag for a new PG_truncate flag set on truncation; then
> > > we can virtually kill all dereferences of page->mapping (this patchset
> > > proves it is doable). NOTE THIS DOES NOT MEAN THAT MAPPING IS FREE TO
> > > BE USED BY ANYONE TO STORE WHATEVER IN STRUCT PAGE. SORRY, NO!
> > 
> > It is interesting that you can get rid of page->mapping uses in most
> > places. For page reclaim (vmscan) you'll still need a way to get from a
> > page to an address_space so that you can reclaim the page, so you can
> > hardly get rid of page->mapping completely. But you're right that with
> > such limited use the transition could be more complex / expensive.
> 
> The idea for vmscan is that you either have the regular mapping pointer
> stored in page->mapping, or a pointer to a special struct which has a
> function pointer to a reclaim/walker function (like rmap_walk_ksm).
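> 
> A rough sketch of that (names here are placeholders, not code from the
> series; PAGE_MAPPING_SPECIAL would be a new tag bit alongside the
> existing PAGE_MAPPING_ANON/PAGE_MAPPING_KSM low-bit tags):
> 
>     struct page_special {
>         /* called by vmscan instead of the regular rmap walk */
>         int (*rmap_walk)(struct page *page, struct rmap_walk_control *rwc);
>     };
> 
>     static inline bool page_mapping_is_special(struct page *page)
>     {
>         return (unsigned long)page->mapping & PAGE_MAPPING_SPECIAL;
>     }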
> 
> > What I wonder though is what the cost of this is (in terms of code size
> > and speed) - propagating the mapping down the stack costs something... Also
> > in terms of maintainability, code readability suffers a bit.
> 
> I haven't checked that yet; I will. I was not so concerned because in the
> vast majority of places there is already a struct address_space on the
> stack frame (i.e. a local variable in the function being called), so
> moving it to a function argument shouldn't change that. However, as I
> expect this will be merged over multiple kernel release cycles, the
> intermediary steps will see an increase in stack size. The code size
> should only grow marginally, I expect. I will provide numbers with my
> next posting after LSF/MM.
> 
> 
> > This could be helped though. In some cases it seems we just use the
> > mapping because it was easily available but could get away without it.
> > In other cases (e.g. a lot of fs/buffer.c) we could make the bh ->
> > mapping transition easy by storing the mapping in struct buffer_head -
> > possibly it could replace the b_bdev pointer, as we could get to that
> > from the mapping with a bit of magic and pointer chasing, and accessing
> > b_bdev is not very performance critical. OTOH such optimizations make
> > rather complex patches out of mostly mechanical replacement, so I can
> > see why you didn't go that route.
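> > 
> > The b_bdev replacement could look something like this (just a sketch;
> > b_mapping is a hypothetical field replacing b_bdev):
> > 
> >     static inline struct block_device *bh_bdev(struct buffer_head *bh)
> >     {
> >         struct inode *inode = bh->b_mapping->host;
> > 
> >         /* bdev mappings point at the bdev inode itself; for file
> >          * mappings the superblock knows the device */
> >         return S_ISBLK(inode->i_mode) ? I_BDEV(inode) : inode->i_sb->s_bdev;
> >     }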
> 
> I am willing to do the buffer_head change. I remember considering it, but
> I don't remember why I didn't do it (I failed to take a note of that).
> 
> 
> > Overall I think you'd need to make a good benchmarking comparison showing
> > how much this helps some real workloads (your motivation) and also how
> > other loads on lower end machines are affected.
> 
> Do you have any specific benchmark you would like to see? My list was:
>   https://github.com/01org/lkp-tests
>   https://github.com/gormanm/mmtests
>   https://github.com/akopytov/sysbench/
>   http://git.infradead.org/users/dhowells/unionmount-testsuite.git
> 
> The workloads I care about are CUDA workloads. We are still working on
> the OpenCL open source stack, but I don't expect we will have something
> that shows the same performance improvement with OpenCL as soon as with
> CUDA.
> 
> > > ----------------------------------------------------------------------
> > > The What?
> > > 
> > > The aim of this patch series is to introduce generic page write
> > > protection for any kind of regular page in a process (private
> > > anonymous or backed by a regular file). This feature already exists,
> > > in one form, for private anonymous pages, as part of KSM (Kernel
> > > Samepage Merging).
> > > 
> > > So this patch series is twofold. First it factors out the page write
> > > protection of KSM into a generic write protection mechanism, of which
> > > KSM becomes the first user. Then it adds support for regular
> > > file-backed page memory (regular file or shared memory aka shmem). To
> > > achieve this I need to cut the dependency a lot of code has on
> > > page->mapping, so I can set page->mapping to point to a special
> > > structure when a page is write protected.
> > 
> > So I'm interested in this write protection mechanism but I didn't find much
> > about it in the series. How does it work? I can see KSM writeprotects pages
> > in page tables so that works for userspace mappings but what about
> > in-kernel users modifying pages - e.g. pages in page cache carrying
> > filesystem metadata do get modified a lot like this.
> 
> So I only care about pages which are mmaped into a process address space.
> At first I only want to intercept CPU write access through mmap of a
> file, but I also intend to extend the write syscall to "fault" on the
> write protection, i.e. it will call a callback to unprotect the page,
> allowing the write protector to take proper action while the write
> syscall is happening.
> 
> I am afraid truly generic write protection for metadata pages is a bit
> out of scope of what I am doing. However, the mechanism I am proposing
> can be extended for that too. The issue is that every place that wants
> to write to those pages needs to be converted to something where the
> write happens inside a write_begin/write_end section (mmap and CPU ptes
> give this implicitly through the page fault, and so does the write
> syscall). Basically there is a need to make sure that writes and write
> protection can be ordered against one another without complex locking.
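> 
> The write syscall hook could look something like this (pure sketch, none
> of these helpers exist yet):
> 
>     /* called from ->write_begin, before the page content is touched */
>     static int wp_write_fault(struct page *page,
>                               struct address_space *mapping)
>     {
>         if (likely(!page_is_write_protected(page)))
>             return 0;
>         /* let the protector (KSM, ...) break sharing or drop protection */
>         return page_wp_ops(page)->unprotect(page, mapping);
>     }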
> 
> 
> > > ----------------------------------------------------------------------
> > > The How?
> > > 
> > > The cornerstone assumption in this patch series is that page->mapping
> > > is always the same as vma->vm_file->f_mapping (modulo when a page is
> > > truncated). The one exception concerns swapping with an NFS file.
> > > 
> > > Am I fundamentally wrong in my assumption?
> > 
> > AFAIK you're right.
> > 
> > > I believe this is a doable plan because virtually all places know the
> > > address_space a page belongs to, or someone in the call chain does.
> > > Hence this patchset is all about passing down that information. The
> > > only exception I am aware of is page reclamation (vmscan), but this
> > > can be handled as a special case, as there we are not interested in
> > > the page mapping per se but in reclaiming memory.
> > > 
> > > Once you have both the struct page and the mapping (without relying
> > > on the struct page to get the latter), you can use the mapping as a
> > > unique key to look up the page->private/page->index value. So all
> > > dereferences of those fields become:
> > >     page_offset(page) -> page_offset(page, mapping)
> > >     page_buffers(page) -> page_buffers(page, mapping)
> > > 
> > > Note that this only needs special handling for write protected pages,
> > > i.e. it is the same as before if the page is not write protected; it
> > > just adds a test each time code calls either helper.
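> > > 
> > > For instance page_offset() would become something like this (sketch;
> > > the wp_* helpers are placeholders):
> > > 
> > >     static inline loff_t page_offset(struct page *page,
> > >                                      struct address_space *mapping)
> > >     {
> > >         if (unlikely(page_is_write_protected(page)))
> > >             return (loff_t)wp_lookup_index(page, mapping) << PAGE_SHIFT;
> > >         return (loff_t)page->index << PAGE_SHIFT;
> > >     }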
> > > 
> > > Sinful function (all existing usages are removed in this patchset):
> > >     page_mapping(page)
> > > 
> > > You can also use the page's buffer head as a unique key. So the
> > > following helpers are added (though I do not use them):
> > >     page_mapping_with_buffers(page, (struct buffer_head *)bh)
> > >     page_offset_with_buffers(page, (struct buffer_head *)bh)
> > > 
> > > A write protected page has page->mapping pointing to a structure,
> > > like struct rmap_item for KSM. This structure has a list entry for
> > > each unique combination:
> > >     struct write_protect {
> > >         struct list_head *mappings; /* write_protect_mapping list */
> > >         ...
> > >     };
> > > 
> > >     struct write_protect_mapping {
> > >         struct list_head list;
> > >         struct address_space *mapping;
> > >         unsigned long offset;
> > >         unsigned long private;
> > >         ...
> > >     };
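> > > 
> > > Looking up the entry for a given mapping is then a simple list walk
> > > (sketch):
> > > 
> > >     static struct write_protect_mapping *
> > >     wp_lookup(struct write_protect *wp, struct address_space *mapping)
> > >     {
> > >         struct write_protect_mapping *wpm;
> > > 
> > >         list_for_each_entry(wpm, wp->mappings, list)
> > >             if (wpm->mapping == mapping)
> > >                 return wpm;
> > >         return NULL;
> > >     }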
> > 
> > Ouch, the fact that we could share a page as data storage for several
> > inode+offset combinations that do not share underlying storage just
> > looks viciously twisted ;) But is it really useful enough to warrant the
> > complications? In particular I'm afraid that filesystems expect
> > consistency between their internal state (attached to page->private) and
> > page state (e.g. page->flags), and when there are multiple internal
> > states attached to the same page this could easily go wrong...
> 
> So at first I want to limit this to write protection (not KSM), and thus
> page->flags will stay consistent (i.e. a page is only ever associated
> with a single mapping). For KSM, yes, page->flags can be problematic, but
> there we can assume that the page is clean (and uptodate) and not under
> writeback. So the problematic flags for KSM are:
>   - private (page_has_buffers() or PagePrivate (nfs, metadata, ...))
>   - private_2 (FsCache)
>   - mappedtodisk
>   - swapcache
>   - error
> 
> The idea again would be a PageFlagsWithMapping(page, mapping), so that
> for a non-KSM write protected page you test the usual page->flags, while
> for a KSM page you find the flag value using the mapping as a lookup
> index. Usually those flags are seldom changed/accessed, so again the
> overhead (ignoring code size) would only be for pages which are KSM.
> Maybe KSM will not make sense because of the perf overhead it adds to
> page->flags access (I don't think so, but I haven't tested this).
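> 
> For the dirty flag that could look like this (sketch; assumes a flags
> word in struct write_protect_mapping and a page_write_protect() helper
> that fetches the struct behind page->mapping):
> 
>     static inline bool PageDirtyWithMapping(struct page *page,
>                                             struct address_space *mapping)
>     {
>         struct write_protect_mapping *wpm;
> 
>         if (likely(!page_is_write_protected(page)))
>             return PageDirty(page);
>         wpm = wp_lookup(page_write_protect(page), mapping);
>         return test_bit(PG_dirty, &wpm->flags);
>     }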
> 
> 
> Thank you for taking the time to read over all this.
> 
> Cheers,
> Jérôme

Thread overview: 104+ messages

2018-04-04 19:17 [RFC PATCH 00/79] Generic page write protection and a solution to page waitqueue jglisse
2018-04-04 19:17 ` [RFC PATCH 04/79] pipe: add inode field to struct pipe_inode_info jglisse
2018-04-04 19:17 ` [RFC PATCH 05/79] mm/swap: add an helper to get address_space from swap_entry_t jglisse
2018-04-04 19:17 ` [RFC PATCH 06/79] mm/page: add helpers to dereference struct page index field jglisse
2018-04-04 19:17 ` [RFC PATCH 07/79] mm/page: add helpers to find mapping give a page and buffer head jglisse
2018-04-04 19:17 ` [RFC PATCH 08/79] mm/page: add helpers to find page mapping and private given a bio jglisse
2018-04-04 19:17 ` [RFC PATCH 09/79] fs: add struct address_space to read_cache_page() callback argument jglisse
2018-04-04 19:17 ` [RFC PATCH 20/79] fs: add struct address_space to write_cache_pages() " jglisse
2018-04-04 19:17 ` [RFC PATCH 22/79] fs: add struct inode to block_read_full_page() arguments jglisse
2018-04-04 19:17 ` [RFC PATCH 24/79] fs: add struct inode to nobh_writepage() arguments jglisse
2018-04-04 19:18 ` [RFC PATCH 26/79] fs: add struct address_space to mpage_readpage() arguments jglisse
2018-04-04 19:18 ` [RFC PATCH 27/79] fs: add struct address_space to fscache_read*() callback arguments jglisse
2018-04-04 19:18 ` [RFC PATCH 28/79] fs: introduce page_is_truncated() helper jglisse
2018-04-04 19:18 ` [RFC PATCH 29/79] fs/block: add struct address_space to bdev_write_page() arguments jglisse
2018-04-04 19:18 ` [RFC PATCH 30/79] fs/block: add struct address_space to __block_write_begin() arguments jglisse
2018-04-04 19:18 ` [RFC PATCH 31/79] fs/block: add struct address_space to __block_write_begin_int() args jglisse
2018-04-04 19:18 ` [RFC PATCH 32/79] fs/block: do not rely on page->mapping get it from the context jglisse
2018-04-04 19:18 ` [RFC PATCH 33/79] fs/journal: add struct super_block to jbd2_journal_forget() arguments jglisse
2018-04-04 19:18 ` [RFC PATCH 34/79] fs/journal: add struct inode to jbd2_journal_revoke() arguments jglisse
2018-04-04 19:18 ` [RFC PATCH 35/79] fs/buffer: add struct address_space and struct page to end_io callback jglisse
2018-04-04 19:18 ` [RFC PATCH 36/79] fs/buffer: add struct super_block to bforget() arguments jglisse
2018-04-04 19:18 ` [RFC PATCH 37/79] fs/buffer: add struct super_block to __bforget() arguments jglisse
2018-04-04 19:18 ` [RFC PATCH 38/79] fs/buffer: add first buffer flag for first buffer_head in a page jglisse
2018-04-04 19:18 ` [RFC PATCH 39/79] fs/buffer: add struct address_space to clean_page_buffers() arguments jglisse
2018-04-04 19:18 ` [RFC PATCH 50/79] fs: stop relying on mapping field of struct page, get it from context jglisse
2018-04-04 19:18 ` [RFC PATCH 51/79] " jglisse
2018-04-04 19:18 ` [RFC PATCH 52/79] fs/buffer: use _page_has_buffers() instead of page_has_buffers() jglisse
2018-04-04 19:18 ` [RFC PATCH 63/79] mm/page: convert page's index lookup to be against specific mapping jglisse
2018-04-04 19:18 ` [RFC PATCH 64/79] mm/buffer: use _page_has_buffers() instead of page_has_buffers() jglisse
2018-04-04 19:18 ` [RFC PATCH 65/79] mm/swap: add struct swap_info_struct swap_readpage() arguments jglisse
2018-04-04 19:18 ` [RFC PATCH 68/79] mm/vma_address: convert page's index lookup to be against specific mapping jglisse
2018-04-04 19:18 ` [RFC PATCH 69/79] fs/journal: add struct address_space to jbd2_journal_try_to_free_buffers() arguments jglisse
2018-04-04 19:18 ` [RFC PATCH 70/79] mm: add struct address_space to mark_buffer_dirty() jglisse
2018-04-04 19:18 ` [RFC PATCH 71/79] mm: add struct address_space to set_page_dirty() jglisse
2018-04-04 19:18 ` [RFC PATCH 72/79] mm: add struct address_space to set_page_dirty_lock() jglisse
2018-04-04 19:18 ` [RFC PATCH 73/79] mm: pass down struct address_space to set_page_dirty() jglisse
2018-04-04 19:18 ` [RFC PATCH 74/79] mm/page_ronly: add config option for generic read only page framework jglisse
2018-04-04 19:18 ` [RFC PATCH 75/79] mm/page_ronly: add page read only core structure and helpers jglisse
2018-04-04 19:18 ` [RFC PATCH 76/79] mm/ksm: have ksm select PAGE_RONLY config jglisse
2018-04-04 19:18 ` [RFC PATCH 77/79] mm/ksm: hide set_page_stable_node() and page_stable_node() jglisse
2018-04-04 19:18 ` [RFC PATCH 78/79] mm/ksm: rename PAGE_MAPPING_KSM to PAGE_MAPPING_RONLY jglisse
2018-04-04 19:18 ` [RFC PATCH 79/79] mm/ksm: set page->mapping to page_ronly struct instead of stable_node jglisse
2018-04-18 14:13 ` [RFC PATCH 00/79] Generic page write protection and a solution to page waitqueue Jan Kara
2018-04-18 15:54   ` Jerome Glisse
2018-04-18 16:20     ` Darrick J. Wong [this message]
2018-04-19 10:32     ` Jan Kara
2018-04-19 14:52       ` Jerome Glisse
2018-04-20 19:57 ` Tim Chen
2018-04-20 22:19   ` Jerome Glisse
2018-04-20 23:48     ` Tim Chen