From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=trbL=IX=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-5.1 required=3.0 tests=BAYES_00,DKIM_INVALID,
	DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,NICE_REPLY_A,
	SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=no autolearn_force=no
	version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 70C25C433E1
	for <linux-mm@archiver.kernel.org>; Thu, 25 Mar 2021 07:48:48 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id A58F961945
	for <linux-mm@archiver.kernel.org>; Thu, 25 Mar 2021 07:48:47 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org A58F961945
Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=shipmail.org
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id ECEE46B0036; Thu, 25 Mar 2021 03:48:46 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id EA5426B006C; Thu, 25 Mar 2021 03:48:46 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id D463A6B006E; Thu, 25 Mar 2021 03:48:46 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0069.hostedemail.com [216.40.44.69])
	by kanga.kvack.org (Postfix) with ESMTP id BDE446B0036
	for <linux-mm@kvack.org>; Thu, 25 Mar 2021 03:48:46 -0400 (EDT)
Received: from smtpin09.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay04.hostedemail.com (Postfix) with ESMTP id 74A25B784
	for <linux-mm@kvack.org>; Thu, 25 Mar 2021 07:48:46 +0000 (UTC)
X-FDA: 77957619852.09.503127F
Received: from ste-pvt-msa2.bahnhof.se (ste-pvt-msa2.bahnhof.se [213.80.101.71])
	by imf23.hostedemail.com (Postfix) with ESMTP id 9E3C7A0009CD
	for <linux-mm@kvack.org>; Thu, 25 Mar 2021 07:48:44 +0000 (UTC)
Received: from localhost (localhost [127.0.0.1])
	by ste-pvt-msa2.bahnhof.se (Postfix) with ESMTP id 4CB603FAC2;
	Thu, 25 Mar 2021 08:48:43 +0100 (CET)
Authentication-Results: ste-pvt-msa2.bahnhof.se;
	dkim=pass (1024-bit key; unprotected) header.d=shipmail.org header.i=@shipmail.org header.b=rjjw6b5f;
	dkim-atps=neutral
X-Virus-Scanned: Debian amavisd-new at bahnhof.se
Authentication-Results: ste-ftg-msa2.bahnhof.se (amavisd-new);
	dkim=pass (1024-bit key) header.d=shipmail.org
Received: from ste-pvt-msa2.bahnhof.se ([127.0.0.1])
	by localhost (ste-ftg-msa2.bahnhof.se [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id pDGIQe96YS27; Thu, 25 Mar 2021 08:48:42 +0100 (CET)
Received: 
	by ste-pvt-msa2.bahnhof.se (Postfix) with ESMTPA id 6B3183F8A2;
	Thu, 25 Mar 2021 08:48:40 +0100 (CET)
Received: from [10.249.254.165] (unknown [192.198.151.44])
	by mail1.shipmail.org (Postfix) with ESMTPSA id 515933600A8;
	Thu, 25 Mar 2021 08:48:39 +0100 (CET)
DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=shipmail.org; s=mail;
	t=1616658520; bh=M6Y7VQAxKQqKu/9EYXFH7+a9T3OZ0ie/KAWLd3bLZts=;
	h=Subject:To:Cc:References:From:Date:In-Reply-To:From;
	b=rjjw6b5fgH1LdcpVtK0ueVUajLTSvJd7xFsQYN7SPJFPF9OT/xBFD4+h4NJ3K+wWn
	 g8Oi44qjXy9oB1y+OJDu+ZOueVkT7h81/iB08c0KDigaUbqUEiV3XQpP5GzPGysnM+
	 o8/A/vJaZHQojYyB6lxob8XeRJAIPCLYpTIR7aH0=
Subject: Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: =?UTF-8?Q?Christian_K=c3=b6nig?= <christian.koenig@amd.com>,
 David Airlie <airlied@linux.ie>, linux-kernel@vger.kernel.org,
 dri-devel@lists.freedesktop.org, linux-mm@kvack.org,
 Andrew Morton <akpm@linux-foundation.org>
References: <YFsM23t2niJwhpM/@phenom.ffwll.local>
 <20210324122430.GW2356281@nvidia.com>
 <e12e2c49-afaf-dbac-b18c-272c93c83e06@shipmail.org>
 <20210324124127.GY2356281@nvidia.com>
 <6c9acb90-8e91-d8af-7abd-e762d9a901aa@shipmail.org>
 <20210324134833.GE2356281@nvidia.com>
 <0b984f96-00fb-5410-bb16-02e12b2cc024@shipmail.org>
 <20210324163812.GJ2356281@nvidia.com>
 <08f19e80-d6cb-8858-0c5d-67d2e2723f72@amd.com>
 <730eb2ff-ba98-2393-6d42-61735e3c6b83@shipmail.org>
 <20210324231419.GR2356281@nvidia.com>
From: =?UTF-8?Q?Thomas_Hellstr=c3=b6m_=28Intel=29?= <thomas_os@shipmail.org>
Message-ID: <607ecbeb-e8a5-66e9-6fe2-9a8d22f12bc2@shipmail.org>
Date: Thu, 25 Mar 2021 08:48:37 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
 Thunderbird/78.8.1
MIME-Version: 1.0
In-Reply-To: <20210324231419.GR2356281@nvidia.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
X-Stat-Signature: c7tsah8dzagwmjpqj6xkkqduhuzofknj
X-Rspamd-Server: rspam05
X-Rspamd-Queue-Id: 9E3C7A0009CD
Received-SPF: none (shipmail.org>: No applicable sender policy available) receiver=imf23; identity=mailfrom; envelope-from="<thomas_os@shipmail.org>"; helo=ste-pvt-msa2.bahnhof.se; client-ip=213.80.101.71
X-HE-DKIM-Result: pass/pass
X-HE-Tag: 1616658524-216641
Content-Transfer-Encoding: quoted-printable
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>


On 3/25/21 12:14 AM, Jason Gunthorpe wrote:
> On Wed, Mar 24, 2021 at 09:07:53PM +0100, Thomas Hellstr=C3=B6m (Intel)=
 wrote:
>> On 3/24/21 7:31 PM, Christian K=C3=B6nig wrote:
>>>
>>> Am 24.03.21 um 17:38 schrieb Jason Gunthorpe:
>>>> On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellstr=C3=B6m (Int=
el)
>>>> wrote:
>>>>> On 3/24/21 2:48 PM, Jason Gunthorpe wrote:
>>>>>> On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellstr=C3=B6m
>>>>>> (Intel) wrote:
>>>>>>
>>>>>>>> In an ideal world the creation/destruction of page
>>>>>>>> table levels would
>>>>>>>> by dynamic at this point, like THP.
>>>>>>> Hmm, but I'm not sure what problem we're trying to solve
>>>>>>> by changing the
>>>>>>> interface in this way?
>>>>>> We are trying to make a sensible driver API to deal with huge page=
s.
>>>>>>> Currently if the core vm requests a huge pud, we give it
>>>>>>> one, and if we
>>>>>>> can't or don't want to (because of dirty-tracking, for
>>>>>>> example, which is
>>>>>>> always done on 4K page-level) we just return
>>>>>>> VM_FAULT_FALLBACK, and the
>>>>>>> fault is retried at a lower level.
>>>>>> Well, my thought would be to move the pte related stuff into
>>>>>> vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.
>>>>>>
>>>>>> I don't know if the locking works out, but it feels cleaner that t=
he
>>>>>> driver tells the vmf how big a page it can stuff in, not the vm
>>>>>> telling the driver to stuff in a certain size page which it might =
not
>>>>>> want to do.
>>>>>>
>>>>>> Some devices want to work on a in-between page size like 64k so th=
ey
>>>>>> can't form 2M pages but they can stuff 64k of 4K pages in a batch =
on
>>>>>> every fault.
>>>>> Hmm, yes, but we would in that case be limited anyway to insert ran=
ges
>>>>> smaller than and equal to the fault size to avoid extensive and
>>>>> possibly
>>>>> unnecessary checks for contigous memory.
>>>> Why? The insert function is walking the page tables, it just updates
>>>> things as they are. It learns the arragement for free while doing th=
e
>>>> walk.
>>>>
>>>> The device has to always provide consistent data, if it overlaps int=
o
>>>> pages that are already populated that is fine so long as it isn't
>>>> changing their addresses.
>>>>
>>>>> And then if we can't support the full fault size, we'd need to
>>>>> either presume a size and alignment of the next level or search for
>>>>> contigous memory in both directions around the fault address,
>>>>> perhaps unnecessarily as well.
>>>> You don't really need to care about levels, the device should be
>>>> faulting in the largest memory regions it can within its efficiency.
>>>>
>>>> If it works on 4M pages then it should be faulting 4M pages. The pag=
e
>>>> size of the underlying CPU doesn't really matter much other than som=
e
>>>> tuning to impact how the device's allocator works.
>> Yes, but then we'd be adding a lot of complexity into this function th=
at is
>> already provided by the current interface for DAX, for little or no ga=
in, at
>> least in the drm/ttm setting. Please think of the following situation:=
 You
>> get a fault, you do an extensive time-consuming scan of your VRAM buff=
er
>> object into which the fault goes and determine you can fault 1GB. Now =
you
>> hand it to vmf_insert_range() and because the user-space address is
>> misaligned, or already partly populated because of a previous eviction=
, you
>> can only fault single pages, and you end up faulting a full GB of sing=
le
>> pages perhaps for a one-time small update.
> Why would "you can only fault single pages" ever be true? If you have
> 1GB of pages then the vmf_insert_range should allocate enough page
> table entries to consume it, regardless of alignment.

Ah yes, What I meant was you can only insert PTE size entries, either=20
because of misalignment or because the page-table is alredy=20
pre-populated with pmd size page directories, which you can't remove=20
with only the read side of the mmap lock held.

>
> And why shouldn't DAX switch to this kind of interface anyhow? It is
> basically exactly the same problem. The underlying filesystem block
> size is *not* necessarily aligned to the CPU page table sizes and DAX
> would benefit from better handling of this mismatch.

First, I think we must sort out what "better handling" means. This is my=20
takeout of the discussion so far:

Claimed Pros: of vmf_insert_range()
* We get an interface that doesn't require knowledge of CPU page table=20
entry level sizes.
* We get the best efficiency when we look at what the GPU driver=20
provides. (I disagree on this one).

Claimed Cons:
* A new implementation that may get complicated particularly if it=20
involves modifying all of the DAX code
* The driver would have to know about those sizes anyway to get=20
alignment right (Applies to DRM, because we mmap buffer objects, not=20
physical address ranges. But not to DAX AFAICT),
* We loose efficiency, because we are prepared to spend an extra effort=20
for alignment- and continuity checks when we know we can insert a huge=20
page table entry, but not if we know we can't
* We loose efficiency because we might unnecessarily prefault a number=20
of PTE size page-table entries (really a special case of the above one).

Now in the context of quickly fixing a critical bug, the choice IMHO=20
becomes easy.

>
>> On top of this, unless we want to do the walk trying increasingly smal=
ler
>> sizes of vmf_insert_xxx(), we'd have to use apply_to_page_range() and =
teach
>> it about transhuge page table entries, because pagewalk.c can't be use=
d (It
>> can't populate page tables). That also means apply_to_page_range() nee=
ds to
>> be complicated with page table locks since transhuge pages aren't stab=
le and
>> can be zapped and refaulted under us while we do the walk.
> I didn't say it would be simple :) But we also need to stop hacking
> around the sides of all this huge page stuff and come up with sensible
> APIs that drivers can actually implement correctly. Exposing drivers
> to specific kinds of page levels really feels like the wrong level of
> abstraction.

I generally agree. But for the last sentence I think the potential gain=20
must be carefully weighed against the efficiency arguments.

>
> Once we start doing this we should do it everywhere, the io_remap_pfn
> stuff should be able to create huge special IO pages as well, for
> instance.

I agree here as well. Here we can be more agressive as the contigous=20
range is already known and we IIRC hold the mmap lock in write mode.

>  =20
>> On top of this, the user-space address allocator needs to know how lar=
ge gpu
>> pages are aligned in buffer objects to have a reasonable chance of ali=
gning
>> with CPU huge page boundaries which is a requirement to be able to ins=
ert a
>> huge CPU page table entry, so the driver would basically need the drm =
helper
>> that can do this alignment anyway.
> Don't you have this problem anyhow?

Yes, but it sort of defeats the simplicity argument of the proposed=20
interface change.

/Thomas