linux-fsdevel.vger.kernel.org archive mirror
From: Miklos Szeredi <mszeredi@redhat.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Boaz Harrosh <boazh@netapp.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Ric Wheeler <rwheeler@redhat.com>,
	Steve French <smfrench@gmail.com>,
	Steven Whitehouse <swhiteho@redhat.com>,
	Jeff Moyer <jmoyer@redhat.com>, Sage Weil <sweil@redhat.com>,
	Jan Kara <jack@suse.cz>, Amir Goldstein <amir73il@gmail.com>,
	Andy Rudof <andy.rudoff@intel.com>,
	Anna Schumaker <Anna.Schumaker@netapp.com>,
	Amit Golander <Amit.Golander@netapp.com>,
	Sagi Manole <sagim@netapp.com>,
	Shachar Sharon <Shachar.Sharon@netapp.com>
Subject: Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
Date: Wed, 14 Mar 2018 12:31:22 +0100	[thread overview]
Message-ID: <CAOssrKdffjmUPpsnzyZudtk9kvT7mXfgMG4fzJvFRQqo5Li79A@mail.gmail.com> (raw)
In-Reply-To: <20180314111750.GA29631@bombadil.infradead.org>

On Wed, Mar 14, 2018 at 12:17 PM, Matthew Wilcox <willy@infradead.org> wrote:
> On Wed, Mar 14, 2018 at 09:20:57AM +0100, Miklos Szeredi wrote:
>> On Tue, Mar 13, 2018 at 7:56 PM, Matthew Wilcox <willy@infradead.org> wrote:
>> > On Tue, Mar 13, 2018 at 07:15:46PM +0200, Boaz Harrosh wrote:
>> >> On a call to mmap an mmap provider (like an FS) can put
>> >> this flag on vma->vm_flags.
>> >>
>> >> This tells the kernel that the vma will be used from a single
>> >> core only, and therefore invalidation of its PTE(s) need not be
>> >> propagated to every CPU.
>> >>
>> >> The motivation for this flag is the ZUFS project, where we want
>> >> to optimally map user-application buffers into a user-mode server,
>> >> execute the operation, and efficiently unmap them.
>> >
>> > I've been looking at something similar, and I prefer my approach,
>> > although I'm not nearly as far along with my implementation as you are.
>> >
>> > My approach is also to add a vm_flags bit, tentatively called VM_NOTLB.
>> > The page fault handler refuses to insert any TLB entries into the process
>> > address space.  But follow_page_mask() will return the appropriate struct
>> > page for it.  This should be enough for O_DIRECT accesses to work as
>> > you'll get the appropriate scatterlists built.
>> >
>> > I suspect Boaz has already done a lot of thinking about this and doesn't
>> > need the explanation, but here's how it looks for anyone following along
>> > at home:
>> >
>> > Process A calls read().
>> > Kernel allocates a page cache page for it and calls the filesystem through
>> >   ->readpages (or ->readpage).
>> > Filesystem calls the managing process to get the data for that page.
>> > Managing process draws a pentagram and summons Beelzebub (or runs Perl;
>> >   whichever you find more scary).
>> > Managing process notifies the filesystem that the page is now full of data.
>> > Filesystem marks the page as being Uptodate and unlocks it.
>> > Process was waiting on the page lock, wakes up and copies the data from the
>> >   page cache into userspace.  read() is complete.
>> >
>> > What we're concerned about here is what to do after the managing process
>> > tells the kernel that the read is complete.  Clearly allowing the managing
>> > process continued access to the page is Bad as the page may be freed by the
>> > page cache and then reused for something else.  Doing a TLB shootdown is
>> > expensive.  So Boaz's approach is to have the process promise that it won't
>> > have any other thread look at it.  My approach is to never allow the page
>> > to have load/store access from userspace; it can only be passed to other
>> > system calls.
>>
>> This all seems to revolve around the fact that userspace fs server
>> process needs to copy something into userspace client's buffer, right?
>>
>> Instead of playing with memory mappings, why not just tell the kernel
>> *what* to copy?
>>
>> While in theory not as generic, I don't see any real limitations (you
>> don't actually need the current contents of the buffer in the read
>> case and vice versa in the write case).
>>
>> And we already have an interface for this: splice(2).  What am I
>> missing?  What's the killer argument in favor of the above messing
>> with TLB caches etc., instead of just letting the kernel do the dirty
>> work?
>
> Great question.  You're completely right that the question is how to tell
> the kernel what to copy.  The problem is that splice() can only write to
> the first page of a pipe.  So you need one pipe per outstanding request,
> which can easily turn into thousands of file descriptors.  If we enhanced
> splice() so it could write to any page in a pipe, then I think splice()
> would be the perfect interface.

Don't know your use case, but AFAICT zufs will have one queue per CPU.
Having one pipe per CPU doesn't sound too bad.

But yeah, there's plenty of room for improvement in the splice
interface.  Just needs a killer app like this :)

Thanks,
Miklos


Thread overview: 39+ messages
2018-03-13 17:14 [RFC 0/7] first draft of ZUFS - the Kernel part Boaz Harrosh
2018-03-13 17:15 ` [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU Boaz Harrosh
2018-03-13 18:56   ` Matthew Wilcox
2018-03-14  8:20     ` Miklos Szeredi
2018-03-14 11:17       ` Matthew Wilcox
2018-03-14 11:31         ` Miklos Szeredi [this message]
2018-03-14 11:45           ` Matthew Wilcox
2018-03-14 14:49             ` Miklos Szeredi
2018-03-14 14:57               ` Matthew Wilcox
2018-03-14 15:39                 ` Miklos Szeredi
     [not found]                   ` <CAON-v2ygEDCn90C9t-zadjsd5GRgj0ECqntQSDDtO_Zjk=KoVw@mail.gmail.com>
2018-03-14 16:48                     ` Matthew Wilcox
2018-03-14 21:41       ` Boaz Harrosh
2018-03-15  8:47         ` Miklos Szeredi
2018-03-15 15:27           ` Boaz Harrosh
2018-03-15 15:34             ` Matthew Wilcox
2018-03-15 15:58               ` Boaz Harrosh
2018-03-15 16:10             ` Miklos Szeredi
2018-03-15 16:30               ` Boaz Harrosh
2018-03-15 20:42                 ` Miklos Szeredi
2018-04-25 12:21                   ` Boaz Harrosh
2018-05-07 10:46                     ` Miklos Szeredi
2018-03-13 17:17 ` [RFC 2/7] fs: Add the ZUF filesystem to the build + License Boaz Harrosh
2018-03-13 20:16   ` Andreas Dilger
2018-03-14 17:21     ` Boaz Harrosh
2018-03-15  4:21       ` Andreas Dilger
2018-03-15 13:58         ` Boaz Harrosh
2018-03-13 17:18 ` [RFC 3/7] zuf: Preliminary Documentation Boaz Harrosh
2018-03-13 20:32   ` Randy Dunlap
2018-03-14 18:01     ` Boaz Harrosh
2018-03-14 19:16       ` Randy Dunlap
2018-03-13 17:22 ` [RFC 4/7] zuf: zuf-rootfs && zuf-core Boaz Harrosh
2018-03-13 17:36   ` Boaz Harrosh
2018-03-14 12:56     ` Nikolay Borisov
2018-03-14 18:34       ` Boaz Harrosh
2018-03-13 17:25 ` [RFC 5/7] zus: Devices && mounting Boaz Harrosh
2018-03-13 17:38   ` Boaz Harrosh
2018-03-13 17:28 ` [RFC 6/7] zuf: Filesystem operations Boaz Harrosh
2018-03-13 17:39   ` Boaz Harrosh
2018-03-13 17:32 ` [RFC 7/7] zuf: Write/Read && mmap implementation Boaz Harrosh
