Subject: Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
To: Christian König, Jason Gunthorpe
Cc: David Airlie, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm@kvack.org, Andrew Morton
From: Thomas Hellström (Intel)
Message-ID: <15da5784-96ca-25e5-1485-3ce387ee6695@shipmail.org>
Date: Thu, 25 Mar 2021 10:51:35 +0100

On 3/25/21 9:27 AM, Christian König wrote:
> On 25.03.21 08:48, Thomas Hellström (Intel) wrote:
>>
>> On 3/25/21 12:14 AM, Jason Gunthorpe wrote:
>>> On Wed, Mar 24, 2021 at 09:07:53PM +0100, Thomas Hellström (Intel)
>>> wrote:
>>>> On 3/24/21 7:31 PM, Christian König wrote:
>>>>>
>>>>> On 24.03.21 17:38, Jason Gunthorpe wrote:
>>>>>> On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellström (Intel)
>>>>>> wrote:
>>>>>>> On 3/24/21 2:48 PM, Jason Gunthorpe wrote:
>>>>>>>> On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström
>>>>>>>> (Intel) wrote:
>>>>>>>>
>>>>>>>>>> In an ideal world the creation/destruction of page
>>>>>>>>>> table levels would be dynamic at this point, like THP.
>>>>>>>>> Hmm, but I'm not sure what problem we're trying to solve
>>>>>>>>> by changing the interface in this way?
>>>>>>>> We are trying to make a sensible driver API to deal with huge
>>>>>>>> pages.
>>>>>>>>> Currently if the core vm requests a huge pud, we give it
>>>>>>>>> one, and if we can't or don't want to (because of
>>>>>>>>> dirty-tracking, for example, which is always done on
>>>>>>>>> 4K page-level) we just return VM_FAULT_FALLBACK, and the
>>>>>>>>> fault is retried at a lower level.
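
[For concreteness, the fallback pattern described above looks roughly
like the below on the driver side. This is only a sketch: the bo_*()
helpers are made up here, and the real TTM fault handler does
considerably more.]

#include <linux/huge_mm.h>
#include <linux/mm.h>
#include <linux/pfn_t.h>

static vm_fault_t bo_huge_fault(struct vm_fault *vmf,
                                enum page_entry_size pe_size)
{
        pfn_t pfn;

        /*
         * Dirty tracking is always done on 4K page-level, so never
         * insert huge entries for dirty-tracked bos.
         * bo_is_dirty_tracked() is a hypothetical helper.
         */
        if (bo_is_dirty_tracked(vmf->vma))
                return VM_FAULT_FALLBACK;

        /*
         * bo_huge_pfn() (hypothetical) returns 0 only if the backing
         * memory at the fault offset is contiguous and PMD-aligned.
         */
        if (pe_size != PE_SIZE_PMD || bo_huge_pfn(vmf, &pfn) != 0)
                return VM_FAULT_FALLBACK;

        /*
         * Insert a single huge PMD entry. On VM_FAULT_FALLBACK the
         * core vm retries the fault at the next smaller level.
         */
        return vmf_insert_pfn_pmd(vmf, pfn,
                                  vmf->flags & FAULT_FLAG_WRITE);
}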
>>>>>>>> Well, my thought would be to move the pte related stuff into
>>>>>>>> vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.
>>>>>>>>
>>>>>>>> I don't know if the locking works out, but it feels cleaner
>>>>>>>> that the driver tells the vmf how big a page it can stuff in,
>>>>>>>> not the vm telling the driver to stuff in a certain size page
>>>>>>>> which it might not want to do.
>>>>>>>>
>>>>>>>> Some devices want to work on an in-between page size like 64k
>>>>>>>> so they can't form 2M pages but they can stuff 64k of 4K pages
>>>>>>>> in a batch on every fault.
>>>>>>> Hmm, yes, but we would in that case be limited anyway to insert
>>>>>>> ranges smaller than or equal to the fault size to avoid
>>>>>>> extensive and possibly unnecessary checks for contiguous memory.
>>>>>> Why? The insert function is walking the page tables, it just
>>>>>> updates things as they are. It learns the arrangement for free
>>>>>> while doing the walk.
>>>>>>
>>>>>> The device has to always provide consistent data, if it overlaps
>>>>>> into pages that are already populated that is fine so long as it
>>>>>> isn't changing their addresses.
>>>>>>
>>>>>>> And then if we can't support the full fault size, we'd need to
>>>>>>> either presume a size and alignment of the next level or search
>>>>>>> for contiguous memory in both directions around the fault
>>>>>>> address, perhaps unnecessarily as well.
>>>>>> You don't really need to care about levels, the device should be
>>>>>> faulting in the largest memory regions it can within its
>>>>>> efficiency.
>>>>>>
>>>>>> If it works on 4M pages then it should be faulting 4M pages. The
>>>>>> page size of the underlying CPU doesn't really matter much other
>>>>>> than some tuning to impact how the device's allocator works.
>>>> Yes, but then we'd be adding a lot of complexity into this function
>>>> that is already provided by the current interface for DAX, for
>>>> little or no gain, at least in the drm/ttm setting. Please think of
>>>> the following situation: You get a fault, you do an extensive
>>>> time-consuming scan of your VRAM buffer object into which the fault
>>>> goes and determine you can fault 1GB. Now you hand it to
>>>> vmf_insert_range() and because the user-space address is
>>>> misaligned, or already partly populated because of a previous
>>>> eviction, you can only fault single pages, and you end up faulting
>>>> a full GB of single pages perhaps for a one-time small update.
>>> Why would "you can only fault single pages" ever be true? If you
>>> have 1GB of pages then the vmf_insert_range should allocate enough
>>> page table entries to consume it, regardless of alignment.
>>
>> Ah yes, what I meant was you can only insert PTE size entries, either
>> because of misalignment or because the page-table is already
>> pre-populated with pmd size page directories, which you can't remove
>> with only the read side of the mmap lock held.
>
> Please explain that further. Why do we need the mmap lock to insert
> PMDs but not when inserting PTEs?

We don't. But once you've inserted a PMD directory you can't remove it
unless you have the mmap lock (and probably also the i_mmap_lock in
write mode). That for example means that if you have a VRAM region
mapped with huge PMDs, and then it gets evicted, and you happen to read
a byte from it when it's evicted and therefore populate the full region
with PTEs pointing to system pages, you can't go back to huge PMDs
again without a munmap() in between.

>
>>> And why shouldn't DAX switch to this kind of interface anyhow? It is
>>> basically exactly the same problem. The underlying filesystem block
>>> size is *not* necessarily aligned to the CPU page table sizes and
>>> DAX would benefit from better handling of this mismatch.
>>
>> First, I think we must sort out what "better handling" means. This is
>> my takeout of the discussion so far:
>>
>> Claimed pros of vmf_insert_range():
>> * We get an interface that doesn't require knowledge of CPU page
>> table entry level sizes.
>> * We get the best efficiency when we look at what the GPU driver
>> provides. (I disagree on this one.)
>>
>> Claimed cons:
>> * A new implementation that may get complicated, particularly if it
>> involves modifying all of the DAX code.
>> * The driver would have to know about those sizes anyway to get
>> alignment right (applies to DRM, because we mmap buffer objects, not
>> physical address ranges, but not to DAX AFAICT).
>
> I don't think so. We could just align all buffers to their next power
> of two in size. Since we have plenty of offset space that shouldn't
> matter much.

It's not offset space like in drm fake offsets, but virtual address
space. But I guess we have plenty of that as well.

>
> Apart from that I still don't fully get why we need this in the first
> place.

Because virtual huge page address boundaries need to be aligned with
physical huge page address boundaries, and mmap can happen before bos
are populated, so you have no way of knowing how physical huge page
address boundaries are laid out in the buffer object unless you define
a rule for how that should be done. Meaning whatever scheme you use for
the virtual address space you need to apply to linear VRAM as well, and
that's a more scarce resource.

The scheme used today is that buffers that are > PMD size align bo VRAM
on PMD-size boundaries if possible (and similarly for virtual
addresses). Buffers that are > PUD size align to PUD-size boundaries.
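
[Expressed as code, that rule is roughly the below. A simplified
sketch only, not the actual drm/ttm helper, and reading "> PMD size"
as ">=", since a bo of exactly huge-page size is the interesting case:]

/*
 * Start alignment for a bo's VRAM placement, and for its virtual
 * address, picked from the bo size so that physical and virtual
 * huge page boundaries can coincide.
 */
static unsigned long bo_huge_alignment(unsigned long bo_size)
{
        if (bo_size >= PUD_SIZE)        /* e.g. 1G on x86-64 */
                return PUD_SIZE;
        if (bo_size >= PMD_SIZE)        /* e.g. 2M on x86-64 */
                return PMD_SIZE;
        return PAGE_SIZE;
}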
>
>> * We lose efficiency, because we are prepared to spend an extra
>> effort on alignment and contiguity checks when we know we can insert
>> a huge page table entry, but not if we know we can't.
>
> I don't think so either. See, we don't need any extra effort for the
> alignment nor the handling, it actually becomes much cheaper as far
> as I can see.

We have those checks there today. Think buddy allocator, bos with
system pages. They are only executed when we know we can insert a huge
page today. With the new interface we'd do them always.

>
> In other words when you have a fault you don't care about the
> faulting address that much, you only use it to determine the memory
> segment to map.
>
> Then this whole memory segment is mapped into the address space of
> the user application.
>
> It can of course happen that we need to fiddle with addresses and
> sizes because userspace only mmaps a fraction of the underlying
> buffer, but in reality we never do this.
>
>> * We lose efficiency because we might unnecessarily prefault a
>> number of PTE size page-table entries (really a special case of the
>> above one).
>
> I really don't see that either. When a buffer is accessed by the CPU
> it is in > 90% of all cases completely accessed. Not faulting in full
> ranges is just optimizing for a really unlikely case here.

It might be that you're right, but are all drivers wanting to use this
like drm in this respect? Using the interface to fault in a 1G range
in the hope it could map it to a huge pud may unexpectedly consume and
populate some 16+ MB of page tables.

To me, keeping the current interface for flexibility and adding an
optional huge-page-table-entry-aware prefaulting helper, perhaps not
restricted to contiguous ranges, for drivers that think it's a good
idea to prefault given the right conditions, and unconditionally for
stuff like remap_pfn_range(), sounds reasonable.
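
[As a strawman, such a helper could look like the below. Name,
signature and exact semantics are entirely hypothetical:]

/*
 * Hypothetical: prefault up to npages page-table entries covering
 * the given pfns around the faulting address, opportunistically
 * using huge entries where the pfns are contiguous and suitably
 * aligned, and falling back to PTEs elsewhere.
 */
vm_fault_t vmf_prefault_range(struct vm_fault *vmf, pfn_t *pfns,
                              unsigned long npages, pgprot_t prot);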

/Thomas

>
>>
>> Now in the context of quickly fixing a critical bug, the choice IMHO
>> becomes easy.
>
> Well, for quick fixing this I would rather disable huge pages for now.
>
> Regards,
> Christian.
>
>>
>>>
>>>> On top of this, unless we want to do the walk trying increasingly
>>>> smaller sizes of vmf_insert_xxx(), we'd have to use
>>>> apply_to_page_range() and teach it about transhuge page table
>>>> entries, because pagewalk.c can't be used (it can't populate page
>>>> tables). That also means apply_to_page_range() needs to be
>>>> complicated with page table locks since transhuge pages aren't
>>>> stable and can be zapped and refaulted under us while we do the
>>>> walk.
>>> I didn't say it would be simple :) But we also need to stop hacking
>>> around the sides of all this huge page stuff and come up with
>>> sensible APIs that drivers can actually implement correctly.
>>> Exposing drivers to specific kinds of page levels really feels like
>>> the wrong level of abstraction.
>>
>> I generally agree. But for the last sentence I think the potential
>> gain must be carefully weighed against the efficiency arguments.
>>
>>>
>>> Once we start doing this we should do it everywhere, the
>>> io_remap_pfn stuff should be able to create huge special IO pages
>>> as well, for instance.
>>
>> I agree here as well. Here we can be more aggressive as the
>> contiguous range is already known and we IIRC hold the mmap lock in
>> write mode.
>>
>>>> On top of this, the user-space address allocator needs to know how
>>>> large gpu pages are aligned in buffer objects to have a reasonable
>>>> chance of aligning with CPU huge page boundaries, which is a
>>>> requirement to be able to insert a huge CPU page table entry, so
>>>> the driver would basically need the drm helper that can do this
>>>> alignment anyway.
>>> Don't you have this problem anyhow?
>>
>> Yes, but it sort of defeats the simplicity argument of the proposed
>> interface change.
>>
>> /Thomas