Date: Tue, 7 Oct 2014 17:52:47 +0200
From: Andrea Arcangeli
To: Linus Torvalds
Cc: "Dr. David Alan Gilbert", qemu-devel@nongnu.org, KVM list,
    Linux Kernel Mailing List, linux-mm, Linux API,
    Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
    Mel Gorman, Andy Lutomirski, Andrew Morton, Sasha Levin,
    Hugh Dickins, Peter Feiner, Christopher Covington,
    Johannes Weiner, Android Kernel Team, Robert Love,
    Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
    KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
    "Huangpeng (Peter)", Isaku Yamahata, Anthony Liguori,
    Stefan Hajnoczi, Wenchao Xia, Andrew Jones, Juan Quintela
Subject: Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
Message-ID: <20141007155247.GD2342@redhat.com>
References: <1412356087-16115-1-git-send-email-aarcange@redhat.com>
    <1412356087-16115-11-git-send-email-aarcange@redhat.com>
    <20141006085540.GD2336@work-vm>
    <20141006164156.GA31075@redhat.com>
    <20141007141913.GC2342@redhat.com>
In-Reply-To: <20141007141913.GC2342@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Oct 07, 2014 at 04:19:13PM +0200, Andrea Arcangeli wrote:
> mremap like interface, or file+commands protocol interface.
> I tend to like mremap more, that's why I opted for a
> remap_anon_pages syscall kept orthogonal to the userfaultfd
> functionality (remap_anon_pages could be also used standalone as an
> accelerated mremap in some circumstances) but nothing prevents to
> just embed the same mechanism

Sorry for the self followup, but something else comes to mind to
elaborate on this further.

In terms of interfaces, the most efficient one I could think of, to
minimize kernel enter/exit, would be to append the "source address" of
the data received from the network transport to the
userfaultfd_write() command (by appending 8 bytes to the wakeup
command). That said, mixing the mechanism to be notified about
userfaults with the mechanism to resolve a userfault looks like a
complication to me. I kind of liked keeping the userfaultfd protocol
very simple and doing just its thing. The userfaultfd doesn't need to
know how the userfault was resolved; even mremap would work
theoretically (until we run out of vmas). I thought it was simpler to
keep it that way. However, if we want to resolve the fault with a
"write()" syscall, this may be the most efficient way to do it: we're
already doing a write() into the pseudofd to wake up the page fault,
and that write contains the destination address, so I just need to
append the source address to the wakeup command.

I probably grossly overestimated the benefits of resolving the
userfault with a zerocopy page move, sorry. So if we entirely drop the
zerocopy behavior and the TLB flush of the old page like you
suggested, the way to keep the userfaultfd mechanism decoupled from
the userfault resolution mechanism would be to implement an
atomic-copy syscall. That would then work for SIGBUS userfaults too,
without requiring a pseudofd. It would then be enough to call
mcopy_atomic(userfault_addr,tmp_addr,len), with the only constraint
being that len must be a multiple of PAGE_SIZE.
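
To make the len constraint above concrete, here is a minimal
userspace sketch. The helper names mcopy_atomic_check_len() and
mcopy_atomic_round_len() are hypothetical and not part of any posted
patch; the page size is passed in by the caller and assumed to be a
power of two, as PAGE_SIZE is.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only: checks the one
 * constraint stated for mcopy_atomic(), that len is a multiple of
 * the page size. Returns 0 if the length is acceptable, -EINVAL
 * otherwise. page_size is assumed to be a power of two.
 */
int mcopy_atomic_check_len(size_t len, size_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	/* Any low bits set means len is not a whole number of pages. */
	if (len & (page_size - 1))
		return -EINVAL;
	return 0;
}

/*
 * Hypothetical helper: round a received payload size up to whole
 * pages, so the temporary buffer length already satisfies the
 * constraint. Overflow for very large len is not handled here.
 */
size_t mcopy_atomic_round_len(size_t len, size_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return (len + page_size - 1) & ~(page_size - 1);
}
```

A caller would presumably size tmp_addr with the rounding helper
before receiving the data, so that the later copy never fails the
length check.
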
Of course mcopy_atomic wouldn't page fault or call GUP into the
destination address (it can't, otherwise the in-flight partial copy
would be visible to the process, breaking the atomicity of the copy),
but it would fill in the pte/trans_huge_pmd with the same strict
behavior that remap_anon_pages currently has (in turn it would by
design bypass the VM_USERFAULT check and be ideal for resolving
userfaults).

mcopy_atomic could then also be extended to tmpfs, and it would work
without requiring the source page to be a tmpfs page too, and without
having to convert page types on the fly.

If I add mcopy_atomic, the patch in subject (10/17) can of course be
dropped, so it'd be even less intrusive than the current
remap_anon_pages, and it would require zero TLB flushes during its
runtime (it would just require an atomic copy).

So should I try to embed a mcopy_atomic inside userfault_write, or can
I expose it to userland as a standalone new syscall? Or should I do
something different? Comments?

Thanks,
Andrea
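
For the embedded variant, the wakeup command with the source address
appended could look roughly like the sketch below. The struct and
function names are hypothetical, and the layout is an assumption made
only for illustration: it supposes the wakeup command carries a single
64-bit destination address, which the message above does not spell
out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical layout, for illustration only: the wakeup command is
 * assumed to carry one 64-bit destination address, with the 64-bit
 * source address appended after it.
 */
struct uffd_wake_copy_cmd {
	uint64_t dst;	/* faulting (destination) address to wake up */
	uint64_t src;	/* appended 8 bytes: where the received data sits */
};

static_assert(sizeof(struct uffd_wake_copy_cmd) == 16,
	      "two 64-bit addresses, no padding");

/* Serialize the command into buf; returns the number of bytes to write(). */
size_t uffd_wake_copy_encode(unsigned char *buf, uint64_t dst, uint64_t src)
{
	struct uffd_wake_copy_cmd cmd = { .dst = dst, .src = src };

	memcpy(buf, &cmd, sizeof(cmd));
	return sizeof(cmd);
}

/* Read the appended source address back out of a serialized command. */
uint64_t uffd_wake_copy_src(const unsigned char *buf)
{
	uint64_t src;

	memcpy(&src, buf + offsetof(struct uffd_wake_copy_cmd, src),
	       sizeof(src));
	return src;
}
```

Under these assumptions a single write() of the encoded buffer would
both name the fault to wake and say where the data is, whereas the
standalone variant would need an mcopy_atomic() call followed by the
existing wakeup write.
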