From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
To: Christian König, Jason Gunthorpe
Cc: David Airlie, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm@kvack.org, Andrew Morton
References: <314fc020-d243-dbf0-acb3-ecfcc9c2443c@shipmail.org> <20210323163715.GJ2356281@nvidia.com> <5824b731-ca6a-92fd-e314-d986b6a7b101@shipmail.org> <20210324122430.GW2356281@nvidia.com> <20210324124127.GY2356281@nvidia.com> <6c9acb90-8e91-d8af-7abd-e762d9a901aa@shipmail.org> <20210324134833.GE2356281@nvidia.com> <0b984f96-00fb-5410-bb16-02e12b2cc024@shipmail.org> <20210324163812.GJ2356281@nvidia.com> <08f19e80-d6cb-8858-0c5d-67d2e2723f72@amd.com>
From: Thomas Hellström (Intel)
Message-ID: <730eb2ff-ba98-2393-6d42-61735e3c6b83@shipmail.org>
Date: Wed, 24 Mar 2021 21:07:53 +0100
In-Reply-To: <08f19e80-d6cb-8858-0c5d-67d2e2723f72@amd.com>

On 3/24/21 7:31 PM, Christian König wrote:
>
> Am 24.03.21 um 17:38 schrieb Jason Gunthorpe:
>> On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellström (Intel)
>> wrote:
>>> On 3/24/21 2:48 PM, Jason Gunthorpe wrote:
>>>> On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström (Intel)
>>>> wrote:
>>>>
>>>>>> In an ideal world the creation/destruction of page table levels
>>>>>> would be dynamic at this point, like THP.
>>>>> Hmm, but I'm not sure what problem we're trying to solve by
>>>>> changing the interface in this way?
>>>> We are trying to make a sensible driver API to deal with huge pages.
>>>>> Currently if the core vm requests a huge pud, we give it one, and
>>>>> if we can't or don't want to (because of dirty-tracking, for
>>>>> example, which is always done on 4K page-level) we just return
>>>>> VM_FAULT_FALLBACK, and the fault is retried at a lower level.
>>>> Well, my thought would be to move the pte related stuff into
>>>> vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.
>>>>
>>>> I don't know if the locking works out, but it feels cleaner that the
>>>> driver tells the vmf how big a page it can stuff in, not the vm
>>>> telling the driver to stuff in a certain size page which it might not
>>>> want to do.
>>>>
>>>> Some devices want to work on an in-between page size like 64k so they
>>>> can't form 2M pages but they can stuff 64k of 4K pages in a batch on
>>>> every fault.
>>> Hmm, yes, but we would in that case be limited anyway to insert ranges
>>> smaller than and equal to the fault size, to avoid extensive and
>>> possibly unnecessary checks for contiguous memory.
>> Why? The insert function is walking the page tables, it just updates
>> things as they are. It learns the arrangement for free while doing the
>> walk.
>>
>> The device has to always provide consistent data; if it overlaps into
>> pages that are already populated that is fine, so long as it isn't
>> changing their addresses.
>>
>>> And then if we can't support the full fault size, we'd need to
>>> either presume a size and alignment of the next level or search for
>>> contiguous memory in both directions around the fault address,
>>> perhaps unnecessarily as well.
>> You don't really need to care about levels; the device should be
>> faulting in the largest memory regions it can within its efficiency.
>>
>> If it works on 4M pages then it should be faulting 4M pages. The page
>> size of the underlying CPU doesn't really matter much, other than some
>> tuning to impact how the device's allocator works.

Yes, but then we'd be adding a lot of complexity into this function that
is already provided by the current interface for DAX, for little or no
gain, at least in the drm/ttm setting. Please think of the following
situation: You get a fault, you do an extensive, time-consuming scan of
your VRAM buffer object into which the fault goes, and determine you can
fault 1GB.
Now you hand it to vmf_insert_range() and, because the
user-space address is misaligned, or already partly populated because of
a previous eviction, you can only fault single pages, and you end up
faulting a full GB of single pages, perhaps for a one-time small update.

On top of this, unless we want to do the walk trying increasingly
smaller sizes of vmf_insert_xxx(), we'd have to use
apply_to_page_range() and teach it about transhuge page table entries,
because pagewalk.c can't be used (it can't populate page tables). That
also means apply_to_page_range() needs to be complicated with page table
locks, since transhuge pages aren't stable and can be zapped and
refaulted under us while we do the walk.

On top of this, the user-space address allocator needs to know how large
GPU pages are aligned in buffer objects to have a reasonable chance of
aligning with CPU huge page boundaries, which is a requirement to be
able to insert a huge CPU page table entry, so the driver would
basically need the drm helper that can do this alignment anyway.

All this makes me think we should settle for the current interface for
now, and if someone feels like refining it, I'm fine with that. After
all, this isn't a strange drm/ttm invention, it's a pre-existing
interface that we reuse.

> I agree with Jason here.
>
> We get the best efficiency when we look at what the GPU driver
> provides and make sure that we handle one GPU page at once, instead of
> looking too much into what the CPU is doing with its page tables.
>
> At least on AMD GPUs the GPU page size can be anything between 4KiB
> and 2GiB, and if we fill in a 2GiB chunk at once this can in theory be
> handled by just two giant page table entries on the CPU side.

Yes, but I fail to see why, with the current code, we can't do this
(save the refcounting bug)?

/Thomas