From: Miklos Szeredi
Date: Wed, 14 Mar 2018 12:31:22 +0100
Subject: Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
To: Matthew Wilcox
Cc: Boaz Harrosh, linux-fsdevel, Ric Wheeler, Steve French,
	Steven Whitehouse, Jeff Moyer, Sage Weil, Jan Kara, Amir Goldstein,
	Andy Rudoff, Anna Schumaker, Amit Golander, Sagi Manole, Shachar Sharon

On Wed, Mar 14, 2018 at 12:17 PM, Matthew Wilcox wrote:
> On Wed, Mar 14, 2018 at 09:20:57AM +0100, Miklos Szeredi wrote:
>> On Tue, Mar 13, 2018 at 7:56 PM, Matthew Wilcox wrote:
>> > On Tue, Mar 13, 2018 at 07:15:46PM +0200, Boaz Harrosh wrote:
>> >> On a call to mmap, an mmap provider (like an FS) can put
>> >> this flag on vma->vm_flags.
>> >>
>> >> This tells the kernel that the vma will be used from a single
>> >> core only, and therefore invalidation of its PTE(s) need not be
>> >> scheduled across all CPUs.
>> >>
>> >> The motivation for this flag is the ZUFS project, where we want
>> >> to optimally map user-application buffers into a user-mode server,
>> >> execute the operation, and efficiently unmap.
>> >
>> > I've been looking at something similar, and I prefer my approach,
>> > although I'm not nearly as far along with my implementation as you are.
>> >
>> > My approach is also to add a vm_flags bit, tentatively called VM_NOTLB.
>> > The page fault handler refuses to insert any TLB entries into the process
>> > address space.  But follow_page_mask() will return the appropriate struct
>> > page for it.  This should be enough for O_DIRECT accesses to work, as
>> > you'll get the appropriate scatterlists built.
>> >
>> > I suspect Boaz has already done a lot of thinking about this and doesn't
>> > need the explanation, but here's how it looks for anyone following along
>> > at home:
>> >
>> > Process A calls read().
>> > Kernel allocates a page cache page for it and calls the filesystem through
>> > ->readpages (or ->readpage).
>> > Filesystem calls the managing process to get the data for that page.
>> > Managing process draws a pentagram and summons Beelzebub (or runs Perl;
>> > whichever you find more scary).
>> > Managing process notifies the filesystem that the page is now full of data.
>> > Filesystem marks the page as being Uptodate and unlocks it.
>> > Process was waiting on the page lock, wakes up and copies the data from the
>> > page cache into userspace.  read() is complete.
>> >
>> > What we're concerned about here is what to do after the managing process
>> > tells the kernel that the read is complete.  Clearly allowing the managing
>> > process continued access to the page is Bad, as the page may be freed by
>> > the page cache and then reused for something else.  Doing a TLB shootdown
>> > is expensive.  So Boaz's approach is to have the process promise that it
>> > won't have any other thread look at it.  My approach is to never allow the
>> > page to have load/store access from userspace; it can only be passed to
>> > other system calls.
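
For anyone following along, here is a minimal sketch of the provider side
of Boaz's scheme.  VM_LOCAL_CPU is the flag proposed in this series; the
zufs_* names and the empty vm_operations are purely illustrative, not taken
from the actual patches:

	#include <linux/fs.h>
	#include <linux/mm.h>

	/* Illustrative only: the real fault handling is elided. */
	static const struct vm_operations_struct zufs_vm_ops = { };

	static int zufs_server_mmap(struct file *file, struct vm_area_struct *vma)
	{
		/*
		 * The server promises this mapping is only ever touched from a
		 * single core, so invalidating its PTEs can skip the cross-CPU
		 * TLB shootdown.
		 */
		vma->vm_flags |= VM_LOCAL_CPU;
		vma->vm_ops = &zufs_vm_ops;
		return 0;
	}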
>>
>> This all seems to revolve around the fact that the userspace fs server
>> process needs to copy something into the userspace client's buffer, right?
>>
>> Instead of playing with memory mappings, why not just tell the kernel
>> *what* to copy?
>>
>> While in theory not as generic, I don't see any real limitations (you
>> don't actually need the current contents of the buffer in the read
>> case, and vice versa in the write case).
>>
>> And we already have an interface for this: splice(2).  What am I
>> missing?  What's the killer argument in favor of the above messing
>> with TLB caches etc., instead of just letting the kernel do the dirty
>> work?
>
> Great question.  You're completely right that the question is how to tell
> the kernel what to copy.  The problem is that splice() can only write to
> the first page of a pipe.  So you need one pipe per outstanding request,
> which can easily turn into thousands of file descriptors.  If we enhanced
> splice() so it could write to any page in a pipe, then I think splice()
> would be the perfect interface.

Don't know your use case, but AFAICT zufs will have one queue per CPU.
Having one pipe per CPU doesn't sound too bad.

But yeah, there's plenty of room for improvement in the splice
interface.  It just needs a killer app like this :)

Thanks,
Miklos
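
P.S. Below is a minimal userspace sketch of the one-pipe-per-CPU idea,
using nothing but vmsplice(2) and splice(2).  It assumes a zufs-like
device fd that accepts spliced replies (the way fuse's /dev/fuse does);
that device, and the send_reply() helper, are hypothetical:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/uio.h>
	#include <unistd.h>

	/* One pipe per worker/CPU, created once (e.g. with pipe2()) and reused. */
	static int reply_pipe[2];

	static int send_reply(int dev_fd, const void *buf, size_t len)
	{
		struct iovec iov = {
			.iov_base = (void *)buf,
			.iov_len  = len,
		};

		/* Hand our pages to the kernel without copying them in userspace... */
		if (vmsplice(reply_pipe[1], &iov, 1, SPLICE_F_GIFT) < 0)
			return -1;

		/*
		 * ...and let the kernel move/copy them on to the waiting client,
		 * analogous to fuse's splice path.  Short splices are ignored
		 * here to keep the sketch small.
		 */
		if (splice(reply_pipe[0], NULL, dev_fd, NULL, len, SPLICE_F_MOVE) < 0)
			return -1;

		return 0;
	}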