Date: Fri, 4 Dec 2020 16:52:33 -0400
From: Jason Gunthorpe
To: Daniel Jordan
Cc: Pavel Tatashin, Alex Williamson, LKML, linux-mm, Andrew Morton,
 Vlastimil Babka, Michal Hocko, David Hildenbrand, Oscar Salvador,
 Dan Williams, Sasha Levin, Tyler Hicks, Joonsoo Kim,
 mike.kravetz@oracle.com, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
 Mel Gorman, Matthew Wilcox, David Rientjes, John Hubbard
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone
Message-ID: <20201204205233.GF5487@ziepe.ca>
References: <20201202052330.474592-1-pasha.tatashin@soleen.com>
 <20201202052330.474592-7-pasha.tatashin@soleen.com>
 <20201202163507.GL5487@ziepe.ca> <20201203010809.GQ5487@ziepe.ca>
 <20201203141729.GS5487@ziepe.ca> <87360lnxph.fsf@oracle.com>
In-Reply-To: <87360lnxph.fsf@oracle.com>

On Fri, Dec 04, 2020 at 03:05:46PM -0500, Daniel Jordan wrote:
> Jason Gunthorpe writes:
>
> > On Wed, Dec 02, 2020 at 08:34:32PM -0500, Pavel Tatashin wrote:
> >> What I meant is the users of the interface do it incrementally not in
> >> large chunks. For example:
> >>
> >> vfio_pin_pages_remote
> >>   vaddr_get_pfn
> >>     ret = pin_user_pages_remote(mm, vaddr, 1, flags |
> >>           FOLL_LONGTERM, page, NULL, NULL);
> >> 1 -> pin only one page at a time
> >
> > I don't know why vfio does this, it is why it is so ridiculously
> > slow, at least.
> Well Alex can correct me, but I went digging and a comment from the
> first type1 vfio commit says the iommu API didn't promise to unmap
> subpages of previous mappings, so doing page at a time gave flexibility
> at the cost of inefficiency.

iommu restrictions are not related to gup. vfio needs to get the page
list from the page tables as efficiently as possible, then break it up
into whatever chunks the iommu wants.

vfio must maintain a page list to call unpin_user_pages() anyhow, so it
makes a lot of sense to assemble the page list up front and then do the
iommu configuration, instead of trying to do both things one page at a
time.

It would be smart to rebuild vfio to use scatter lists to store the
page list and then break the sgl into pages for iommu configuration.
SGLs will consume a lot less memory for the usual case of THPs backing
the VFIO registrations.

ib_umem_get() has an example of how to code this. I've been thinking we
could make this a common API, and it could be further optimized.

> Yesterday I tried optimizing vfio to skip gup calls for tail pages after
> Matthew pointed out this same issue to me by coincidence last week.

Please don't just hack up vfio like this. Everyone needs faster gup; we
really need to solve this in the core code. Plus this is tricky: vfio
is already using follow_pfn wrongly, and drivers should not be open
coding MM stuff.

> Currently debugging, but if there's a fundamental reason this won't work
> on the vfio side, it'd be nice to know.

AFAIK there is no guarantee that just because you see a compound head,
the remaining pages in the page tables are actually the tail pages.
This is only true sometimes, for instance if an entire huge page is
placed in a page table level. I believe Ralph pointed to some case
where we might break a huge page from PMD to PTEs and then later COW
one of the PTEs.
In this case the compound head will be visible, but the page map will
be non-contiguous and the page flags on each 4k entry will be
different. Only GUP's page walkers know that the compound page is
actually at a PMD level and can safely apply the 'everything is the
same' optimization.

The solution here is to make core gup faster, especially for the cases
where it is returning huge pages. We can approach this by:

 - Batching the compound & tail page acquisition for higher page
   levels, eg gup fast does this already, look at record_subpages();
   gup slow needs it too

 - Batching unpin for compound & tail pages, the opposite of the
   'refs' arg for try_grab_compound_head()

 - Devising some API where get_user_pages can directly return
   contiguous groups of pages to avoid the memory traffic

 - Reducing the cost of a FOLL_LONGTERM pin, eg here is a start:
   https://lore.kernel.org/linux-mm/0-v1-5551df3ed12e+b8-gup_dax_speedup_jgg@nvidia.com

And CMA should get some similar treatment. Scanning the output page
list multiple times is slow.

I would like to get to a point where the main GUP walker functions can
output in more formats than just a page array. For instance, directly
constructing and chaining a biovec or sgl would dramatically improve
performance and decrease memory consumption. Being able to write in
hmm_range_fault's pfn&flags output format would delete a whole bunch
of duplicated code.

Jason