From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
To: Christian König, Jason Gunthorpe
Cc: David Airlie, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm@kvack.org, Andrew Morton
References: <314fc020-d243-dbf0-acb3-ecfcc9c2443c@shipmail.org> <20210323163715.GJ2356281@nvidia.com> <5824b731-ca6a-92fd-e314-d986b6a7b101@shipmail.org> <20210324122430.GW2356281@nvidia.com> <20210324124127.GY2356281@nvidia.com> <6c9acb90-8e91-d8af-7abd-e762d9a901aa@shipmail.org> <20210324134833.GE2356281@nvidia.com> <0b984f96-00fb-5410-bb16-02e12b2cc024@shipmail.org> <20210324163812.GJ2356281@nvidia.com> <08f19e80-d6cb-8858-0c5d-67d2e2723f72@amd.com>
From: Thomas Hellström (Intel)
Message-ID: <730eb2ff-ba98-2393-6d42-61735e3c6b83@shipmail.org>
Date: Wed, 24 Mar 2021 21:07:53 +0100
In-Reply-To: <08f19e80-d6cb-8858-0c5d-67d2e2723f72@amd.com>

On 3/24/21 7:31 PM, Christian König wrote:
>
> Am 24.03.21 um 17:38 schrieb Jason Gunthorpe:
>> On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellström (Intel)
>> wrote:
>>> On 3/24/21 2:48 PM, Jason Gunthorpe wrote:
>>>> On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström (Intel)
>>>> wrote:
>>>>
>>>>>> In an ideal world the creation/destruction of page table levels
>>>>>> would be dynamic at this point, like THP.
>>>>> Hmm, but I'm not sure what problem we're trying to solve by
>>>>> changing the interface in this way?
>>>> We are trying to make a sensible driver API to deal with huge pages.
>>>>> Currently if the core vm requests a huge pud, we give it one, and
>>>>> if we can't or don't want to (because of dirty-tracking, for
>>>>> example, which is always done on 4K page-level) we just return
>>>>> VM_FAULT_FALLBACK, and the fault is retried at a lower level.
>>>> Well, my thought would be to move the pte related stuff into
>>>> vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.
>>>>
>>>> I don't know if the locking works out, but it feels cleaner that the
>>>> driver tells the vmf how big a page it can stuff in, not the vm
>>>> telling the driver to stuff in a certain size page which it might not
>>>> want to do.
>>>>
>>>> Some devices want to work on an in-between page size like 64k so they
>>>> can't form 2M pages but they can stuff 64k of 4K pages in a batch on
>>>> every fault.
>>> Hmm, yes, but we would in that case be limited anyway to insert ranges
>>> smaller than and equal to the fault size, to avoid extensive and
>>> possibly unnecessary checks for contiguous memory.
>> Why? The insert function is walking the page tables, it just updates
>> things as they are. It learns the arrangement for free while doing the
>> walk.
>>
>> The device has to always provide consistent data; if it overlaps into
>> pages that are already populated that is fine, so long as it isn't
>> changing their addresses.
>>
>>> And then if we can't support the full fault size, we'd need to
>>> either presume a size and alignment of the next level or search for
>>> contiguous memory in both directions around the fault address,
>>> perhaps unnecessarily as well.
>> You don't really need to care about levels; the device should be
>> faulting in the largest memory regions it can within its efficiency.
>>
>> If it works on 4M pages then it should be faulting 4M pages. The page
>> size of the underlying CPU doesn't really matter much, other than some
>> tuning to impact how the device's allocator works.

Yes, but then we'd be adding a lot of complexity into this function that
is already provided by the current interface for DAX, for little or no
gain, at least in the drm/ttm setting. Please think of the following
situation: You get a fault, you do an extensive, time-consuming scan of
your VRAM buffer object into which the fault goes, and determine you can
fault 1GB.
Now you hand it to vmf_insert_range() and, because the
user-space address is misaligned, or already partly populated because of
a previous eviction, you can only fault single pages, and you end up
faulting a full GB of single pages, perhaps for a one-time small update.

On top of this, unless we want to do the walk trying increasingly
smaller sizes of vmf_insert_xxx(), we'd have to use
apply_to_page_range() and teach it about transhuge page table entries,
because pagewalk.c can't be used (it can't populate page tables). That
also means apply_to_page_range() needs to be complicated with page table
locks, since transhuge pages aren't stable and can be zapped and
refaulted under us while we do the walk.

On top of this, the user-space address allocator needs to know how large
GPU pages are aligned in buffer objects to have a reasonable chance of
aligning with CPU huge page boundaries, which is a requirement to be
able to insert a huge CPU page table entry, so the driver would
basically need the drm helper that can do this alignment anyway.

All this makes me think we should settle for the current interface for
now, and if someone feels like refining it, I'm fine with that. After
all, this isn't a strange drm/ttm invention, it's a pre-existing
interface that we reuse.

> I agree with Jason here.
>
> We get the best efficiency when we look at what the GPU driver
> provides and make sure that we handle one GPU page at once, instead of
> looking too much into what the CPU is doing with its page tables.
>
> At least on AMD GPUs the GPU page size can be anything between 4KiB
> and 2GiB, and if we fill in a 2GiB chunk at once this can in theory be
> handled by just two giant page table entries on the CPU side.

Yes, but I fail to see why, with the current code, we can't do this
(save the refcounting bug)?

/Thomas