Subject: Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
To: Miklos Szeredi
References: <443fea57-f165-6bed-8c8a-0a32f72b9cd2@netapp.com> <20180313185658.GB21538@bombadil.infradead.org>
CC: Matthew Wilcox, linux-fsdevel, Ric Wheeler, Steve French, Steven Whitehouse, Jefff moyer, Sage Weil, Jan Kara, Amir Goldstein, Andy Rudof, Anna Schumaker, Amit Golander, Sagi Manole, Shachar Sharon
From: Boaz Harrosh
Message-ID: <07cda3e5-c911-a49b-fceb-052f8ca57e66@netapp.com>
Date: Thu, 15 Mar 2018 17:27:09 +0200

On 15/03/18 10:47, Miklos Szeredi wrote:
> On Wed, Mar 14, 2018 at 10:41 PM, Boaz Harrosh wrote:
>> On 14/03/18 10:20, Miklos Szeredi wrote:
>>> On Tue, Mar 13, 2018 at 7:56 PM, Matthew Wilcox wrote:
>>>> On Tue, Mar 13, 2018 at 07:15:46PM +0200, Boaz Harrosh wrote:
>>>>> On a call to mmap an mmap provider (like an FS) can put
>>>>> this flag on vma->vm_flags.
>>>>>
>>>>> This tells the Kernel that the vma will be used from a single
>>>>> core only and therefore invalidation of PTE(s) need not a
>>>>> wide CPU scheduling
>>>>>
>>>>> The motivation of this flag is the ZUFS project where we want
>>>>> to optimally map user-application buffers into a user-mode-server
>>>>> execute the operation and efficiently unmap.
>>>>
>>>> I've been looking at something similar, and I prefer my approach,
>>>> although I'm not nearly as far along with my implementation as you are.
>>>>
>>>> My approach is also to add a vm_flags bit, tentatively called VM_NOTLB.
>>>> The page fault handler refuses to insert any TLB entries into the process
>>>> address space.
>>>> But follow_page_mask() will return the appropriate struct
>>>> page for it. This should be enough for O_DIRECT accesses to work as
>>>> you'll get the appropriate scatterlists built.
>>>>
>>>> I suspect Boaz has already done a lot of thinking about this and doesn't
>>>> need the explanation, but here's how it looks for anyone following along
>>>> at home:
>>>>
>>>> Process A calls read().
>>>> Kernel allocates a page cache page for it and calls the filesystem through
>>>> ->readpages (or ->readpage).
>>>> Filesystem calls the managing process to get the data for that page.
>>>> Managing process draws a pentagram and summons Beelzebub (or runs Perl;
>>>> whichever you find more scary).
>>>> Managing process notifies the filesystem that the page is now full of data.
>>>> Filesystem marks the page as being Uptodate and unlocks it.
>>>> Process was waiting on the page lock, wakes up and copies the data from the
>>>> page cache into userspace. read() is complete.
>>>>
>>>> What we're concerned about here is what to do after the managing process
>>>> tells the kernel that the read is complete. Clearly allowing the managing
>>>> process continued access to the page is Bad as the page may be freed by the
>>>> page cache and then reused for something else. Doing a TLB shootdown is
>>>> expensive. So Boaz's approach is to have the process promise that it won't
>>>> have any other thread look at it. My approach is to never allow the page
>>>> to have load/store access from userspace; it can only be passed to other
>>>> system calls.
>>>
>>
>> Hi Matthew, Hi Miklos
>>
>> Thank you for looking at this.
>> I'm answering both Matthew an Miklos's all thread, by trying to explain
>> something that you might not have completely wrapped around yet.
>>
>> Matthew first
>>
>> Please note that in the ZUFS system there are no page-faults at all involved
>> (God no, this is like +40us minimum and I'm fighting to shave off 13us)
>>
>> In ZUF-to-ZUS communication
>> command comes in:
>> A1 we punch in the pages at the per-core-VMA before they are used,
>> A2 we then return to user-space, access these pages once.
>> (without any page faults)
>> A3 Then return to kernel and punch in a drain page at that spot
>>
>> New command comes in:
>> B1 we punch in the pages at the same per-core-VMA before they are used,
>> B2 Return to user-space, access these new pages once.
>> B3 Then return to kernel and punch in a drain page at that spot
>>
>> Actually I could skip A3/B3 all together but in testing after my patch
>> it did not cost at all, so I like the extra easiness (Because otherwise
>> there is a dance I need to do when app or server crash and files start
>> to close I need to scan VMAs and zap them)
>>
>> Current mm's mapping code (at insert_pfn) will fail at B1 above. Because
>> it wants to see a ZERO empty spot before inserting a new pte.
>> What the mm code wants is that I call
>> A3 - zap_vma_ptes(vma)
>>
>> This is because if the spot was not ZERO it means there was a previous
>> mapping there. And some other core might have cached that entry at the
>> TLB. so when I punch in this new value the other core could access
>> the old page while this core is accessing the new page.
>> (TLB-invalidate is a single core command and is why zap_vma_ptes
>> needs to schedule all cores to each call TLB-invalidate)
>>
>> And this is the all difference between the two testes above. That I do not
>> zap_vma_ptes With the new (one liner) code.
>>
>> Please Note that the VM_LOCAL_CPU flag is not set by the application (zus Server)
>> but by the Kernel driver, telling the Kernel that it has enforced such an API
>> that we access from a single CORE so please allow me B1 because I know what I'm
>> doing.
>> (Also we do put some trust into zus because it has our filesystem data
>> and because we wrote it ;-))
>>
>> I understand your approach where you say "The PTE table is just a global
>> communicator of pages but is not really mapped into any process .i.e never
>> faulted into any core's local-TLB" (The Kernel access of that memory is done
>> on a Kernel address at another TLB). And is why I can get away from
>> zap_vma_ptes(vma).
>> So is this not the same thing? your flag says no one TLB cached this PTE
>> my flag says only-this-core-cached this PTE. We both ask
>> "So please skip the zap_vma_ptes(vma) stage for me"
>>
>> I think you might be able to use my flag for your system. Is only a small
>> part of what you need with the all "Get the page from the PTE at" and
>> so on. But the "please skip zap_vma_ptes(vma)" part is this patch here, No?
>>
>> BTW I did not at all understand what is your project trying to solve.
>> please send me some Notes about it I want to see if they might fit
>> after all
>>
>>> This all seems to revolve around the fact that userspace fs server
>>> process needs to copy something into userspace client's buffer, right?
>>>
>>> Instead of playing with memory mappings, why not just tell the kernel
>>> *what* to copy?
>>>
>>> While in theory not as generic, I don't see any real limitations (you
>>> don't actually need the current contents of the buffer in the read
>>> case and vica verse in the write case).
>>>
>>
>> This is not so easy, for many reasons. It was actually my first approach
>> which I pursued for a while but dropped it for the easier to implement
>> and more general approach.
>>
>> Note that we actually do that in the implementation of mmap. There is
>> a ZUS_OP_GET_BLOCK which returns a dpp_t of a page to map into the
>> application's VM. We could just copy it at that point
>>
>> We have some app buffers arriving with pointers local to one VM (the app)
>> and then we want to copy them to another app buffers.
>> How do you do that?
>> So yes you need to get_user_pages() so they can be accessed from kernel, switch
>> to second VM then receive pointers there. These need to be dpp_t like the games
>> I do, or - In the app context copy_user_to_page.
>>
>> But that API was not enough for me. Because this is good with pmem.
>> But what if I actually want it from disk or network.
>
> Yeah, that's the interesting part. We want direct-io into the client
> (the app) memory, with the userspace filesystem acting as a traffic
> controller.
>

(Yes, we want that, but it is not the only thing we want. We also want pmem,
and other Server-available data.)

> With your scheme it's like:
>
> - get_user_pages
> - map pages into server address space
> - send request to server
> - server does direct-io read from network/disk fd into mapped buffer
> - server sends reply
> - done
>
> This could be changed to
> - get_user_pages
> - insert pages into pipe
> - send request to server
> - server "reverse splices" buffers from pipe to network/disk fd

This can never properly translate. Even a simple file on disk
is linear for the app (unaligned buffer) but is scattered over
multiple blocks on disk. Yes, perhaps networking can somewhat work
if you pre/post-pend the headers you need.
But then you restrict everything to direct-IO semantics, especially the APP.
With my system you can do zero copy for any kind of application.

And this assumes networking or some device, which means going back
to the Kernel; under ZUFS rules you must then return -ASYNC to the zuf
and complete in a background ASYNC thread. This is an order of magnitude
higher latency than what I showed here.
And what about the SYNC copy from Server to APP? With a pipe you
are forcing me to go back to the Kernel to execute the copy, which
means two more crossings. This will double the round trips.

> - server sends reply
> - done
>
> The two are basically the same, except we got rid of the unnecessary
> userspace mapping.
>
> Okay, the "reverse splice" or "rsplice" operation is yet to be
> defined. It would be like splice, except it passes an empty buffer
> from the pipe into an operation that uses it to fill the buffer
> (RSPLICE is to SPLICE as READ is to WRITE).
>

Exactly: another trip back to the Kernel.

> For write operation the normal splice(2) would be used in the same
> way, straightforward passing of user buffer directly to underlying
> device without memory copy ever being done.
>

Here too, this can work for direct-access semantics with pointers and sizes
aligned, but it cannot satisfy POSIX.

> See what I'm getting at?
>

Do you see my points?
- Zero-copy on all POSIX APIs, including *no* page-cache.
- A single kernel-UM transition.
- Synchronous, extremely low latency.

>> 1. Allocate a vma per core
>> 2. call vm_insert_pfn
>> .... Do something
>> 3. vm_insert_pfn(NULL) (before this patch zap_vma_ptes())
>>
>> It is all very simple really. For me it is opposite. It is
>> "Why mess around with dual_port_pointers, caching, and copy
>> life time rules, when you can just call vm_insert_pfn"
>
> Because you normally gain nothing by going through the server address space.
>
> Mapping to server address space has issues, like allowing access from
> server to full page containing the buffer, which might well be a
> security issue.
>

Not really. There is already a high level of trust between the APP and the
filesystem Server, which owns all of the APP's data. A compromised Server can
do lots and lots of bad things before a bug trashes the unaligned tails of a
buffer.
(And even then, the Server only has access to IO buffers in the short window
of the IO execution. Once the IO returns, this access is disconnected.)

> Also with the direct-io from network/disk case the userspace address
> will again be translated to a page in the kernel so it's just going
> back and forth between representations using the page tables, which
> likely even results in a measurable performance loss.
>

You mean another get_user_pages()?
This is not so bad: we have all these pages already paged in and the page
table is HOT because we just now set it. In my tests I did not notice any
such slowness. Again, this is not my typical workload, and this extra
get_user_pages() is not a high priority for me.

But if you really care about it and you measure real slowness because of
that extra get_user_pages(), what we can do is this:
I already have a zuf-internal object describing the request, including the
app's mapped pages and sizes. We can implement a splice() operation on the
zuf driver, as a target, to supply the array of pages already obtained by
the first get_user_pages() above.

> Again, what's the advantage of mapping to server address space?
>

See above. In my case, 90% of the time the data is already at the Server
application, a memcpy_nt away. If I want to support that mode of a single
trip to user-land, this is the way. If you are positive you have 2 trips
minimum and the target is in the kernel, you are already too slow and there
is not much of a difference.

So the advantage is the extra choice, which for me is 90% of the workload.

> Thanks,
> Miklos
>

Thanks
Boaz
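
For anyone following along at home, the A1-A3 / B1-B3 sequence quoted in this thread can be modeled with a small userspace toy. This is only a sketch under stated assumptions and is not kernel code: `toy_vma`, `toy_insert`, `toy_zap` and `toy_run_commands` are hypothetical names, the single `pte` field stands in for the per-core VMA slot, and the `shootdowns` counter merely stands in for the cross-CPU cost attributed to zap_vma_ptes() above. Any local-only TLB invalidate is assumed and not counted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model only; none of these names are kernel APIs. */
#define TOY_PFN_NONE 0UL

struct toy_vma {
    unsigned long pte;    /* the single per-core slot; 0 means empty */
    bool local_cpu;       /* stands in for the proposed VM_LOCAL_CPU flag */
    unsigned shootdowns;  /* counts simulated cross-CPU TLB flushes */
};

/* Models the zap step: clear the slot and pay for a flush on every core. */
static void toy_zap(struct toy_vma *vma)
{
    assert(vma != NULL);
    vma->pte = TOY_PFN_NONE;
    vma->shootdowns++;
}

/*
 * Models the insert step: overwriting a live slot is refused unless the
 * vma promises single-core use, in which case no other core can hold a
 * stale translation and the overwrite is allowed.
 */
static int toy_insert(struct toy_vma *vma, unsigned long pfn)
{
    assert(vma != NULL);
    if (vma->pte != TOY_PFN_NONE && !vma->local_cpu)
        return -EBUSY;
    vma->pte = pfn;
    return 0;
}

/*
 * Runs n commands through the same slot and returns the number of
 * simulated cross-CPU flushes, or -1 if an insert was refused.
 */
static int toy_run_commands(bool local_cpu, unsigned n)
{
    struct toy_vma vma = { TOY_PFN_NONE, local_cpu, 0 };
    const unsigned long drain_pfn = 1; /* hypothetical drain page */

    for (unsigned i = 0; i < n; i++) {
        /* Step 1: punch in this command's page. */
        if (toy_insert(&vma, 100UL + i) != 0)
            return -1;
        /* Step 2: the server touches the page once (not modeled). */
        /* Step 3: punch in the drain page, or zap when the flag is absent. */
        if (local_cpu) {
            if (toy_insert(&vma, drain_pfn) != 0)
                return -1;
        } else {
            toy_zap(&vma);
        }
    }
    return (int)vma.shootdowns;
}
```

In this model, calling `toy_run_commands(false, n)` pays one simulated flush per command, while `toy_run_commands(true, n)` pays none, which is the whole difference the flag is asking for. The toy says nothing about real latencies.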