From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754182AbcEBPX1 (ORCPT ); Mon, 2 May 2016 11:23:27 -0400 Received: from mx1.redhat.com ([209.132.183.28]:41419 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751623AbcEBPXK (ORCPT ); Mon, 2 May 2016 11:23:10 -0400 Date: Mon, 2 May 2016 17:23:07 +0200 From: Andrea Arcangeli To: "Kirill A. Shutemov" Cc: Alex Williamson , kirill.shutemov@linux.intel.com, linux-kernel@vger.kernel.org, "linux-mm@kvack.org" Subject: Re: [BUG] vfio device assignment regression with THP ref counting redesign Message-ID: <20160502152307.GA12310@redhat.com> References: <20160428102051.17d1c728@t450s.home> <20160428181726.GA2847@node.shutemov.name> <20160428125808.29ad59e5@t450s.home> <20160428232127.GL11700@redhat.com> <20160429005106.GB2847@node.shutemov.name> <20160428204542.5f2053f7@ul30vt.home> <20160429070611.GA4990@node.shutemov.name> <20160429163444.GM11700@redhat.com> <20160502104119.GA23305@node.shutemov.name> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160502104119.GA23305@node.shutemov.name> User-Agent: Mutt/1.6.0 (2016-04-01) X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.39]); Mon, 02 May 2016 15:23:09 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, May 02, 2016 at 01:41:19PM +0300, Kirill A. Shutemov wrote: > I don't think this would work correctly. Let's check one of callers: > > static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, > unsigned long address, pte_t *page_table, pmd_t *pmd, > spinlock_t *ptl, pte_t orig_pte) > __releases(ptl) > { > ... > if (reuse_swap_page(old_page)) { > /* > * The page is all ours. Move it to our anon_vma so > * the rmap code will not search our parent or siblings. > * Protected against the rmap code by the page lock. > */ > page_move_anon_rmap(old_page, vma, address); > unlock_page(old_page); > return wp_page_reuse(mm, vma, address, page_table, ptl, > orig_pte, old_page, 0, 0); > } > > The first thing to notice is that old_page can be a tail page here > therefore page_move_anon_rmap() should be able to handle this after you > patch, which it doesn't. Agreed, that's an implementation error and easy to fix. > But I think there's a bigger problem. > > Consider the following situation: after split_huge_pmd() we have > pte-mapped THP, fork() comes and now the pages is shared between two > processes. Child process munmap()s one half of the THP page, parent > munmap()s the other half. > > IIUC, afther that page_trans_huge_mapcount() would give us 1 as all 4k > subpages have mapcount exactly one. Fault in the child would trigger > do_wp_page() and reuse_swap_page() returns true, which would lead to > page_move_anon_rmap() tranferring the whole compound page to child's > anon_vma. That's not correct. > > We should at least avoid page_move_anon_rmap() for compound pages there. So (compound_head() missing aside) the calculation I was doing is correct with regard to taking over the page and marking the pagetable read-write instead of triggering a COW and breaking the pinning, but it's not right only in terms of calling page_move_anon_rmap? The child or parent would then lose visibility on its ptes if the compound page is moved to the local vma->anon_vma. The fix should be just to change page_trans_huge_mapcount() to return two refcounts, one "hard" for the pinning, and one "soft" for the rmap which will be the same as total_mapcount. The runtime cost will remain the same, so a fix can be easy for this one too. > Other thing I would like to discuss is if there's a problem on vfio side. > To me it looks like vfio expects guarantee from get_user_pages() which it > doesn't provide: obtaining pin on the page doesn't guarantee that the page > is going to remain mapped into userspace until the pin is gone. > > Even with THP COW regressing fixed, vfio would stay fragile: any > MADV_DONTNEED/fork()/mremap()/whatever what would make vfio expectation > broken. vfio must run as root, it will take care of not doing such things, it just needs a way to prevent the page to be moved so it can DMA into it and mlock is not enough. This clearly has to be caused by a get_user_pages(write=0) or by a serialized fork/exec() while a longstanding page pin is being held (and to be safe fork/exec had to be serialized in a way that the parent process wouldn't write to the pinned page until after exec has run in the child, or it's already racy no matter what kernel). I agree it's somewhat fragile, the problem here is that the THP refcounting change made it even weaker than it already was. Ideally the MMU notifier invalidate should be used instead of pinning the page, that would make it 100% robust and it wouldn't even pin the page at all. However we can't send an MMU notifier invalidate to an IOMMU because next time the IOMMU non-present physical address is used it would kill the app. Some new IOMMU can raise an exception synchronously that we could use to implement a IOMMU secondary MMU page fault to make the MMU notifier model work with IOMMUs too, but that's not feasible with most IOMMU out there that raises an unrecoverable asynchronous exception instead and can't implement a proper "IOMMU page fault". Furthermore the speed of the invalidate may not be optimal with IOMMUs which would then be an added cost to pay for swapping and memory migration. This is anyway a regression of the previous guarantees a pin would provide, if we want to bring back the old semantics of a page pin, I think fixing both places like I attempted to do (modulo two implementation bugs) is better than fixing only the THP case. If instead leave things as is, and we weaken the semantics of a page pin, the alternative to deal with the even weakened semantics inside the vfio code, is to use get_user_pages with write=1 forced and then it'll probably work also with current upstream (unless it's fork/exec, but I don't think it is, MADV_DONTFORK would be recommended anyway for usages like this with vfio if fork can ever run and there are threads in the parent, even O_DIRECT generates data corruption without MADV_DONTFORK in such conditions for similar reasons).