From: Daniel Vetter
Date: Thu, 25 Mar 2021 10:41:43 +0100
Subject: Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
To: Christian König
Cc: Jason Gunthorpe, Thomas Hellström (Intel), David Airlie, Linux MM,
 Andrew Morton, Linux Kernel Mailing List, dri-devel

On Thu, Mar 25, 2021 at 8:50 AM Christian König wrote:
>
> Am 25.03.21 um 00:14 schrieb Jason Gunthorpe:
> > On Wed, Mar 24, 2021 at 09:07:53PM +0100, Thomas Hellström (Intel) wrote:
> >> On 3/24/21 7:31 PM, Christian König wrote:
> >>>
> >>> Am 24.03.21 um 17:38 schrieb Jason Gunthorpe:
> >>>> On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellström (Intel)
> >>>> wrote:
> >>>>> On 3/24/21 2:48 PM, Jason Gunthorpe wrote:
> >>>>>> On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström
> >>>>>> (Intel) wrote:
> >>>>>>
> >>>>>>>> In an ideal world the creation/destruction of page table levels
> >>>>>>>> would be dynamic at this point, like THP.
> >>>>>>> Hmm, but I'm not sure what problem we're trying to solve by
> >>>>>>> changing the interface in this way?
> >>>>>> We are trying to make a sensible driver API to deal with huge pages.
> >>>>>>> Currently if the core vm requests a huge pud, we give it one, and
> >>>>>>> if we can't or don't want to (because of dirty-tracking, for
> >>>>>>> example, which is always done on 4K page-level) we just return
> >>>>>>> VM_FAULT_FALLBACK, and the fault is retried at a lower level.
> >>>>>> Well, my thought would be to move the pte related stuff into
> >>>>>> vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.
> >>>>>>
> >>>>>> I don't know if the locking works out, but it feels cleaner that the
> >>>>>> driver tells the vmf how big a page it can stuff in, not the vm
> >>>>>> telling the driver to stuff in a certain size page which it might
> >>>>>> not want to do.
> >>>>>>
> >>>>>> Some devices want to work on an in-between page size like 64k so
> >>>>>> they can't form 2M pages but they can stuff 64k of 4K pages in a
> >>>>>> batch on every fault.
> >>>>> Hmm, yes, but we would in that case be limited anyway to insert
> >>>>> ranges smaller than and equal to the fault size to avoid extensive
> >>>>> and possibly unnecessary checks for contiguous memory.
> >>>> Why? The insert function is walking the page tables, it just updates
> >>>> things as they are. It learns the arrangement for free while doing
> >>>> the walk.
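To make the fallback dance concrete for readers who don't have the ttm
code open: the flow Thomas describes is the driver's huge_fault hook
deciding per fault whether it can install a huge entry or must punt
back to the core mm. Very rough sketch, glossing over locking and
alignment checks - hand-written here, not the actual ttm code, and
struct my_bo / my_bo_pfn() are made-up stand-ins:

#include <linux/huge_mm.h>
#include <linux/mm.h>
#include <linux/pfn_t.h>

static vm_fault_t my_huge_fault(struct vm_fault *vmf,
                                enum page_entry_size pe_size)
{
        struct my_bo *bo = vmf->vma->vm_private_data;
        unsigned long pfn = my_bo_pfn(bo, vmf->address);
        bool write = vmf->flags & FAULT_FLAG_WRITE;

        /*
         * Dirty tracking is done at 4K granularity, so refuse the
         * huge entry and let the core mm retry the fault one page
         * table level further down.
         */
        if (bo->dirty_tracking)
                return VM_FAULT_FALLBACK;

        switch (pe_size) {
        case PE_SIZE_PMD:
                return vmf_insert_pfn_pmd(vmf,
                                __pfn_to_pfn_t(pfn, PFN_DEV), write);
        case PE_SIZE_PUD:
                return vmf_insert_pfn_pud(vmf,
                                __pfn_to_pfn_t(pfn, PFN_DEV), write);
        default:
                return VM_FAULT_FALLBACK;
        }
}

The vmf_insert_range idea would replace this one-size-per-fault dance
with a single call that takes whatever range the driver can offer.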
> >>>> The device has to always provide consistent data, if it overlaps
> >>>> into pages that are already populated that is fine so long as it
> >>>> isn't changing their addresses.
> >>>>
> >>>>> And then if we can't support the full fault size, we'd need to
> >>>>> either presume a size and alignment of the next level or search for
> >>>>> contiguous memory in both directions around the fault address,
> >>>>> perhaps unnecessarily as well.
> >>>> You don't really need to care about levels, the device should be
> >>>> faulting in the largest memory regions it can within its efficiency.
> >>>>
> >>>> If it works on 4M pages then it should be faulting 4M pages. The
> >>>> page size of the underlying CPU doesn't really matter much other
> >>>> than some tuning to impact how the device's allocator works.
> >> Yes, but then we'd be adding a lot of complexity into this function
> >> that is already provided by the current interface for DAX, for little
> >> or no gain, at least in the drm/ttm setting. Please think of the
> >> following situation: You get a fault, you do an extensive
> >> time-consuming scan of your VRAM buffer object into which the fault
> >> goes and determine you can fault 1GB. Now you hand it to
> >> vmf_insert_range() and because the user-space address is misaligned,
> >> or already partly populated because of a previous eviction, you can
> >> only fault single pages, and you end up faulting a full GB of single
> >> pages perhaps for a one-time small update.
> > Why would "you can only fault single pages" ever be true? If you have
> > 1GB of pages then the vmf_insert_range should allocate enough page
> > table entries to consume it, regardless of alignment.
>
> Completely agree with Jason. Filling in the CPU page tables is
> relatively cheap if you fill in a large contiguous range.
>
> In other words filling in 1GiB of a linear range is *much* less
> overhead than filling in 1<<18 4KiB faults.
>
> I would say that this is always preferable even if the CPU only wants
> to update a single byte.
>
> > And why shouldn't DAX switch to this kind of interface anyhow? It is
> > basically exactly the same problem. The underlying filesystem block
> > size is *not* necessarily aligned to the CPU page table sizes and DAX
> > would benefit from better handling of this mismatch.
> >
> >> On top of this, unless we want to do the walk trying increasingly
> >> smaller sizes of vmf_insert_xxx(), we'd have to use
> >> apply_to_page_range() and teach it about transhuge page table
> >> entries, because pagewalk.c can't be used (it can't populate page
> >> tables). That also means apply_to_page_range() needs to be
> >> complicated with page table locks since transhuge pages aren't
> >> stable and can be zapped and refaulted under us while we do the walk.
> > I didn't say it would be simple :) But we also need to stop hacking
> > around the sides of all this huge page stuff and come up with sensible
> > APIs that drivers can actually implement correctly. Exposing drivers
> > to specific kinds of page levels really feels like the wrong level of
> > abstraction.
> >
> > Once we start doing this we should do it everywhere, the io_remap_pfn
> > stuff should be able to create huge special IO pages as well, for
> > instance.
>
> Oh, yes please!
>
> We easily have 16GiB of VRAM which is linearly mapped into the kernel
> space for each GPU instance.
>
> Doing that with 1GiB mappings instead of 4KiB would be quite a win.

io_remap_pfn is for userspace mmaps.
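For reference, what that path looks like today: the driver's mmap hook
hands the VMA to io_remap_pfn_range(), which installs one special pte
per 4KiB page no matter how contiguous the underlying BAR is. Rough
sketch, not lifted from any particular driver, and my_device_vram_pfn()
is a made-up stand-in for however the driver resolves the BAR pfn:

#include <linux/fs.h>
#include <linux/mm.h>

static int my_mmap(struct file *file, struct vm_area_struct *vma)
{
        unsigned long pfn = my_device_vram_pfn(file, vma->vm_pgoff);
        unsigned long size = vma->vm_end - vma->vm_start;

        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

        /*
         * One 4KiB special pte at a time for the whole range; this
         * is the spot that could grow PMD/PUD support when pfn and
         * address are suitably aligned.
         */
        return io_remap_pfn_range(vma, vma->vm_start, pfn, size,
                                  vma->vm_page_prot);
}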
Kernel mappings should be as big as possible already I think for
everything.
-Daniel

> Regards,
> Christian.
>
> >
> >> On top of this, the user-space address allocator needs to know how
> >> large gpu pages are aligned in buffer objects to have a reasonable
> >> chance of aligning with CPU huge page boundaries, which is a
> >> requirement to be able to insert a huge CPU page table entry, so the
> >> driver would basically need the drm helper that can do this
> >> alignment anyway.
> > Don't you have this problem anyhow?
> >
> > Jason

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch