From: Miklos Szeredi <mszeredi@redhat.com>
To: Boaz Harrosh <boazh@netapp.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Ric Wheeler <rwheeler@redhat.com>,
	Steve French <smfrench@gmail.com>,
	Steven Whitehouse <swhiteho@redhat.com>,
	Jeff Moyer <jmoyer@redhat.com>, Sage Weil <sweil@redhat.com>,
	Jan Kara <jack@suse.cz>, Amir Goldstein <amir73il@gmail.com>,
	Andy Rudoff <andy.rudoff@intel.com>,
	Anna Schumaker <Anna.Schumaker@netapp.com>,
	Amit Golander <Amit.Golander@netapp.com>,
	Sagi Manole <sagim@netapp.com>,
	Shachar Sharon <Shachar.Sharon@netapp.com>
Subject: Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
Date: Thu, 15 Mar 2018 09:47:49 +0100	[thread overview]
Message-ID: <CAOssrKf+KJfr8anKZFqTwLNO85Fkfrw7=ZpXYi53uT++PqADbA@mail.gmail.com> (raw)
In-Reply-To: <b49772ef-e96e-af22-ba6d-f91a26389fab@netapp.com>

On Wed, Mar 14, 2018 at 10:41 PM, Boaz Harrosh <boazh@netapp.com> wrote:
> On 14/03/18 10:20, Miklos Szeredi wrote:
>> On Tue, Mar 13, 2018 at 7:56 PM, Matthew Wilcox <willy@infradead.org> wrote:
>>> On Tue, Mar 13, 2018 at 07:15:46PM +0200, Boaz Harrosh wrote:
>>>> On a call to mmap an mmap provider (like an FS) can put
>>>> this flag on vma->vm_flags.
>>>>
>>>> This tells the Kernel that the vma will be used from a single
>>>> core only, and therefore invalidation of PTE(s) does not need to
>>>> be scheduled across all CPUs.
>>>>
>>>> The motivation for this flag is the ZUFS project, where we want
>>>> to optimally map user-application buffers into a user-mode server,
>>>> execute the operation, and efficiently unmap.
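(For anyone following along: a minimal sketch of how a filesystem's ->mmap
might opt a vma in.  VM_LOCAL_CPU is only the flag proposed by this RFC, and
the function and vm_ops names below are invented, not taken from the actual
ZUFS patches.)

#include <linux/fs.h>
#include <linux/mm.h>

static int zufs_server_mmap(struct file *file, struct vm_area_struct *vma)
{
	/* The kernel driver, not the application, opts the vma in.
	 * VM_PFNMAP because the pages will be punched in via vm_insert_pfn(). */
	vma->vm_flags |= VM_LOCAL_CPU | VM_PFNMAP;
	vma->vm_ops = &zufs_local_vm_ops;	/* hypothetical vm_operations_struct */
	return 0;
}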
>>>
>>> I've been looking at something similar, and I prefer my approach,
>>> although I'm not nearly as far along with my implementation as you are.
>>>
>>> My approach is also to add a vm_flags bit, tentatively called VM_NOTLB.
>>> The page fault handler refuses to insert any TLB entries into the process
>>> address space.  But follow_page_mask() will return the appropriate struct
>>> page for it.  This should be enough for O_DIRECT accesses to work as
>>> you'll get the appropriate scatterlists built.
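(Again only a sketch: VM_NOTLB exists nowhere upstream, and the hook point and
function name below are invented to illustrate the idea, not Matthew's actual
work in progress.)

#include <linux/mm.h>

/* In the fault path: a VM_NOTLB vma never gets PTEs installed, hence no TLB
 * entries; direct load/store from userspace simply fails, while
 * follow_page_mask() still hands back the struct page for get_user_pages()
 * and O_DIRECT to build their scatterlists. */
static int vm_notlb_fault(struct vm_fault *vmf)
{
	if (vmf->vma->vm_flags & VM_NOTLB)
		return VM_FAULT_SIGBUS;
	return 0;
}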
>>>
>>> I suspect Boaz has already done a lot of thinking about this and doesn't
>>> need the explanation, but here's how it looks for anyone following along
>>> at home:
>>>
>>> Process A calls read().
>>> Kernel allocates a page cache page for it and calls the filesystem through
>>>   ->readpages (or ->readpage).
>>> Filesystem calls the managing process to get the data for that page.
>>> Managing process draws a pentagram and summons Beelzebub (or runs Perl;
>>>   whichever you find more scary).
>>> Managing process notifies the filesystem that the page is now full of data.
>>> Filesystem marks the page as being Uptodate and unlocks it.
>>> Process was waiting on the page lock, wakes up and copies the data from the
>>>   page cache into userspace.  read() is complete.
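(A minimal sketch of the filesystem side of that sequence; zufs_readpage() and
zufs_dispatch_read() are invented names, only SetPageUptodate() and
unlock_page() are real kernel APIs.)

#include <linux/fs.h>
#include <linux/pagemap.h>

static int zufs_readpage(struct file *file, struct page *page)
{
	/* ask the user-mode server to fill this (locked) page */
	int err = zufs_dispatch_read(file, page);

	if (!err)
		SetPageUptodate(page);
	unlock_page(page);	/* wakes the process waiting in read() */
	return err;
}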
>>>
>>> What we're concerned about here is what to do after the managing process
>>> tells the kernel that the read is complete.  Clearly allowing the managing
>>> process continued access to the page is Bad as the page may be freed by the
>>> page cache and then reused for something else.  Doing a TLB shootdown is
>>> expensive.  So Boaz's approach is to have the process promise that it won't
>>> have any other thread look at it.  My approach is to never allow the page
>>> to have load/store access from userspace; it can only be passed to other
>>> system calls.
>>
>
> Hi Matthew, Hi Miklos
>
> Thank you for looking at this.
> I'm answering both Matthew's and Miklos's entire threads, by trying to explain
> something that you might not have completely wrapped your heads around yet.
>
> Matthew first
>
> Please note that in the ZUFS system there are no page faults involved at all
> (God no, this is like +40us minimum and I'm fighting to shave off 13us).
>
> In ZUF-to-ZUS communication
> command comes in:
> A1 we punch in the pages at the per-core-VMA before they are used,
> A2 we then return to user-space, access these pages once.
>    (without any page faults)
> A3 Then return to kernel and punch in a drain page at that spot
>
> New command comes in:
> B1 we punch in the pages at the same per-core-VMA before they are used,
> B2 Return to user-space, access these new pages once.
> B3 Then return to kernel and punch in a drain page at that spot
>
> Actually I could skip A3/B3 altogether, but in testing after my patch
> it did not cost anything, so I like the extra easiness. (Otherwise there
> is a dance I need to do when the app or server crashes and files start
> to close: I need to scan VMAs and zap them.)
>
> The current mm mapping code (at insert_pfn) will fail at B1 above, because
> it wants to see a ZERO (empty) spot before inserting a new pte.
> What the mm code wants is that at A3 I call
>         zap_vma_ptes(vma)
>
> This is because if the spot was not ZERO it means there was a previous
> mapping there, and some other core might have cached that entry in its
> TLB, so when I punch in this new value the other core could access
> the old page while this core is accessing the new page.
> (TLB invalidation is a per-core operation, which is why zap_vma_ptes
>  needs to schedule work on all cores so each can invalidate its own TLB.)
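(A caller-side sketch of that cycle; zuf_punch_in_page() is an invented name,
and the actual RFC tweaks the mm side rather than open-coding this in the
caller, but it shows what VM_LOCAL_CPU is meant to avoid.  vm_insert_pfn()
and zap_vma_ptes() are the real APIs being discussed.)

#include <linux/mm.h>

static int zuf_punch_in_page(struct vm_area_struct *vma,
			     unsigned long addr, unsigned long pfn)
{
	if (!(vma->vm_flags & VM_LOCAL_CPU))
		/* The old pte may be cached in other cores' TLBs, so it
		 * must be zapped first -- an all-CPU TLB shootdown. */
		zap_vma_ptes(vma, addr, PAGE_SIZE);

	/* B1: punch in the new page; with VM_LOCAL_CPU only this core's
	 * TLB can ever hold a stale translation. */
	return vm_insert_pfn(vma, addr, pfn);
}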
>
> And this is the whole difference between the two tests above: with the new
> (one-liner) code I do not call zap_vma_ptes().
>
> Please note that the VM_LOCAL_CPU flag is not set by the application (the zus
> server) but by the Kernel driver, telling the Kernel that it has enforced an
> API such that we access only from a single core, so please allow me B1 because
> I know what I'm doing. (Also we do put some trust in zus, because it has our
> filesystem data and because we wrote it ;-))
>
> I understand your approach, where you say "The PTE table is just a global
> communicator of pages but is not really mapped into any process, i.e. never
> faulted into any core's local TLB" (the Kernel access of that memory is done
> on a Kernel address through another TLB), and that is why zap_vma_ptes(vma)
> can be skipped.
> So is this not the same thing? Your flag says no TLB has cached this PTE;
> my flag says only this core has cached this PTE. We both ask
> "So please skip the zap_vma_ptes(vma) stage for me".
>
> I think you might be able to use my flag for your system. It is only a small
> part of what you need, next to all the "get the page from the PTE" handling and
> so on. But the "please skip zap_vma_ptes(vma)" part is this patch here, no?
>
> BTW I did not understand at all what your project is trying to solve.
> Please send me some notes about it; I want to see if they might fit
> after all.
>
>> This all seems to revolve around the fact that userspace fs server
>> process needs to copy something into userspace client's buffer, right?
>>
>> Instead of playing with memory mappings, why not just tell the kernel
>> *what* to copy?
>>
>> While in theory not as generic, I don't see any real limitations (you
>> don't actually need the current contents of the buffer in the read
>> case and vice versa in the write case).
>>
>
> This is not so easy, for many reasons. It was actually my first approach,
> which I pursued for a while but dropped in favor of the easier-to-implement
> and more general approach.
>
> Note that we actually do that in the implementation of mmap. There is
> a ZUS_OP_GET_BLOCK which returns a dpp_t of a page to map into the
> application's VM. We could just copy it at that point.
>
> We have some app buffers arriving with pointers local to one VM (the app's),
> and then we want to copy them to another app's buffers. How do you do that?
> So yes, you need get_user_pages() so they can be accessed from the kernel, then
> switch to the second VM and receive pointers there. These need to be dpp_t, like
> the games I play, or, in the app context, copy_user_to_page.
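(A minimal sketch of that get_user_pages() step; pin_and_fill() is an invented
helper, error handling is trimmed, and the real copy is replaced by a stand-in.
The caller is assumed to hold the mm's mmap_sem for read.)

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>

static int pin_and_fill(unsigned long user_addr)
{
	struct page *pg;
	long pinned;

	/* Pin the client's buffer page so it stays put and can be
	 * reached from kernel context later. */
	pinned = get_user_pages(user_addr & PAGE_MASK, 1, FOLL_WRITE,
				&pg, NULL);
	if (pinned != 1)
		return -EFAULT;

	memset(kmap(pg), 0, PAGE_SIZE);	/* stand-in for the real copy */
	kunmap(pg);
	set_page_dirty(pg);
	put_page(pg);
	return 0;
}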
>
> But that API was not enough for me, because this is good with pmem.
> But what if I actually want it from disk or the network?

Yeah, that's the interesting part.  We want direct-io into the client
(the app) memory, with the userspace filesystem acting as a traffic
controller.

With your scheme it's like:

- get_user_pages
- map pages into server address space
- send request to server
- server does direct-io read from network/disk fd into mapped buffer
- server sends reply
- done

This could be changed to
- get_user_pages
- insert pages into pipe
- send request to server
- server "reverse splices" buffers from  pipe to network/disk fd
- server sends reply
- done

The two are basically the same, except we got rid of the unnecessary
userspace mapping.

Okay, the "reverse splice" or "rsplice" operation is yet to be
defined.  It would be like splice, except it passes an empty buffer
from the pipe into an operation that uses it to fill the buffer
(RSPLICE is to SPLICE as READ is to WRITE).

For the write operation the normal splice(2) would be used in the same
way: straightforward passing of the user buffer directly to the underlying
device without a memory copy ever being done.
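(A userspace sketch of that write path; serve_write() and the fd names are
invented, splice(2) is the real syscall doing the work.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

static ssize_t serve_write(int req_pipe_fd, int backing_fd,
			   off_t off, size_t len)
{
	loff_t out_off = off;

	/* Move the client's pinned pages straight from the request pipe
	 * into the backing fd -- no copy into server memory. */
	return splice(req_pipe_fd, NULL, backing_fd, &out_off, len,
		      SPLICE_F_MOVE);
}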

See what I'm getting at?

> 1. Allocate a vma per core
> 2. call vm_insert_pfn
>  .... Do something
> 3. vm_insert_pfn(NULL) (before this patch zap_vma_ptes())
>
> It is all very simple, really. For me it is the opposite. It is
> "Why mess around with dual_port_pointers, caching, and copy
>  lifetime rules, when you can just call vm_insert_pfn?"

Because you normally gain nothing by going through the server address space.

Mapping to server address space has issues, like allowing access from
the server to the full page containing the buffer, which might well be a
security issue.

Also, in the direct-io from network/disk case the userspace address
will again be translated to a page in the kernel, so it's just going
back and forth between representations using the page tables, which
likely even results in a measurable performance loss.

Again, what's the advantage of mapping to server address space?

Thanks,
Miklos


Thread overview: 39+ messages
2018-03-13 17:14 [RFC 0/7] first draft of ZUFS - the Kernel part Boaz Harrosh
2018-03-13 17:15 ` [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU Boaz Harrosh
2018-03-13 18:56   ` Matthew Wilcox
2018-03-14  8:20     ` Miklos Szeredi
2018-03-14 11:17       ` Matthew Wilcox
2018-03-14 11:31         ` Miklos Szeredi
2018-03-14 11:45           ` Matthew Wilcox
2018-03-14 14:49             ` Miklos Szeredi
2018-03-14 14:57               ` Matthew Wilcox
2018-03-14 15:39                 ` Miklos Szeredi
     [not found]                   ` <CAON-v2ygEDCn90C9t-zadjsd5GRgj0ECqntQSDDtO_Zjk=KoVw@mail.gmail.com>
2018-03-14 16:48                     ` Matthew Wilcox
2018-03-14 21:41       ` Boaz Harrosh
2018-03-15  8:47         ` Miklos Szeredi [this message]
2018-03-15 15:27           ` Boaz Harrosh
2018-03-15 15:34             ` Matthew Wilcox
2018-03-15 15:58               ` Boaz Harrosh
2018-03-15 16:10             ` Miklos Szeredi
2018-03-15 16:30               ` Boaz Harrosh
2018-03-15 20:42                 ` Miklos Szeredi
2018-04-25 12:21                   ` Boaz Harrosh
2018-05-07 10:46                     ` Miklos Szeredi
2018-03-13 17:17 ` [RFC 2/7] fs: Add the ZUF filesystem to the build + License Boaz Harrosh
2018-03-13 20:16   ` Andreas Dilger
2018-03-14 17:21     ` Boaz Harrosh
2018-03-15  4:21       ` Andreas Dilger
2018-03-15 13:58         ` Boaz Harrosh
2018-03-13 17:18 ` [RFC 3/7] zuf: Preliminary Documentation Boaz Harrosh
2018-03-13 20:32   ` Randy Dunlap
2018-03-14 18:01     ` Boaz Harrosh
2018-03-14 19:16       ` Randy Dunlap
2018-03-13 17:22 ` [RFC 4/7] zuf: zuf-rootfs && zuf-core Boaz Harrosh
2018-03-13 17:36   ` Boaz Harrosh
2018-03-14 12:56     ` Nikolay Borisov
2018-03-14 18:34       ` Boaz Harrosh
2018-03-13 17:25 ` [RFC 5/7] zus: Devices && mounting Boaz Harrosh
2018-03-13 17:38   ` Boaz Harrosh
2018-03-13 17:28 ` [RFC 6/7] zuf: Filesystem operations Boaz Harrosh
2018-03-13 17:39   ` Boaz Harrosh
2018-03-13 17:32 ` [RFC 7/7] zuf: Write/Read && mmap implementation Boaz Harrosh
