From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754836AbcEBSDW (ORCPT ); Mon, 2 May 2016 14:03:22 -0400 Received: from mx1.redhat.com ([209.132.183.28]:37326 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754782AbcEBSDK (ORCPT ); Mon, 2 May 2016 14:03:10 -0400 Date: Mon, 2 May 2016 20:03:07 +0200 From: Andrea Arcangeli To: "Kirill A. Shutemov" Cc: Alex Williamson , kirill.shutemov@linux.intel.com, linux-kernel@vger.kernel.org, "linux-mm@kvack.org" Subject: Re: [BUG] vfio device assignment regression with THP ref counting redesign Message-ID: <20160502180307.GB12310@redhat.com> References: <20160428181726.GA2847@node.shutemov.name> <20160428125808.29ad59e5@t450s.home> <20160428232127.GL11700@redhat.com> <20160429005106.GB2847@node.shutemov.name> <20160428204542.5f2053f7@ul30vt.home> <20160429070611.GA4990@node.shutemov.name> <20160429163444.GM11700@redhat.com> <20160502104119.GA23305@node.shutemov.name> <20160502152307.GA12310@redhat.com> <20160502160042.GC24419@node.shutemov.name> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160502160042.GC24419@node.shutemov.name> User-Agent: Mutt/1.6.0 (2016-04-01) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, May 02, 2016 at 07:00:42PM +0300, Kirill A. Shutemov wrote: > Sounds correct, but code is going to be ugly :-/ Now if a page is not shared in the parent, it is already in the local anon_vma. The only thing we could lose here is a pmd split in the child caused by swapping and then parent releases the page, child reuse it but it stays in the anon_vma of the parent. It doesn't sound like a major concern. What we could to improve this though, is to do a rmap walk after the physical split_huge_page succeeded, to relocate the page->mapping to the local vma->anon_vma of the child if page_mapcount() is 1 before releasing the (root) anon_vma lock. If a page got a pmd split it'll be a candidate for a physical split if any of the ptes has been unmapped. If it wasn't because of THP in tmpfs the old THP refcounting overall I think would have been simpler, it never had issues like these, but having a single model for all THP sounds much easier to maintain over time, instead of dealing with totally different models and rules and locking for every filesystem and MM part. This is why I hope we'll soon leverage all this work in tmpfs too. With tmpfs being able to map the compound THP with both ptes and pmds is mandatory, the old refcounting had a too big constraint to make compound THP work in tmpfs. > I didn't say we shouldn't fix the problem on THP side. But the attitude > "get_user_pages() would magically freeze page tables" worries me. It doesn't need to freeze page tables, it only needs to prevent the pages to be freed. In fact this bug cannot generate random kernel corruption no matter what, but then userland view of the memory will go out of sync and it can generate data corruption to userland (in RAM or hardware device DMA). What THP refcounting did is just to change some expectation on the userland side in terms of when the view on the pinned pages would get lost and replaced by copies, depending what userland did. A process to invoke page pinning must have root (or enough capabilities anyway), so it's somewhat connected to the kernel behavior, it's non standard. However issues like this are userland visible even to not privileged tasks, in fact it's strongly recommended to use MADV_DONTFORK on the get_user_pages addresses, if a program is using get_user_pages/O_DIRECT+fork+threads to avoid silent data corruption. > Agreed. I just didn't see the two-refcounts solution. If you didn't do it already or if you're busy with something else, I can change the patch to the two refcount solution, which should restore the old semantics without breaking rmap.