From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 19 Nov 2020 16:42:52 +0100
From: Daniel Vetter
To: Christian König
Subject: Re: [PATCH 3/3] drm/ttm: make up to 90% of system memory available
Message-ID: <20201119154252.GG401619@phenom.ffwll.local>
References: <20201117140615.255887-1-christian.koenig@amd.com>
 <20201117140615.255887-3-christian.koenig@amd.com>
 <20201117171900.GM401619@phenom.ffwll.local>
 <9d6980c2-9429-f15c-7424-35e59bfcf063@gmail.com>
 <20201118211504.GY401619@phenom.ffwll.local>
Cc: airlied@linux.ie, Felix.Kuehling@amd.com, sroland@vmware.com,
 dri-devel@lists.freedesktop.org, linux-graphics-maintainer@vmware.com

On Thu, Nov 19, 2020 at 04:27:00PM +0100, Christian König wrote:
> Am 18.11.20 um 22:15 schrieb Daniel Vetter:
> > On Wed, Nov 18, 2020 at 02:35:24PM +0100, Christian König wrote:
> > > Am 17.11.20 um 18:19 schrieb Daniel Vetter:
> > > > On Tue, Nov 17, 2020 at 03:06:15PM +0100, Christian König wrote:
> > > > > Increase the amount of system memory drivers can use to about 90% of
> > > > > the total available.
> > > > >
> > > > > Signed-off-by: Christian König
> > > > > ---
> > > > >  drivers/gpu/drm/ttm/ttm_bo.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> > > > > index a958135cb3fe..0a93df93dba4 100644
> > > > > --- a/drivers/gpu/drm/ttm/ttm_bo.c
> > > > > +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> > > > > @@ -1267,7 +1267,7 @@ static int ttm_bo_global_init(void)
> > > > >  	 * the available system memory.
> > > > >  	 */
> > > > >  	num_pages = (u64)si.totalram * si.mem_unit;
> > > > > -	num_pages = (num_pages * 50 / 100) >> PAGE_SHIFT;
> > > > > +	num_pages = (num_pages * 90 / 100) >> PAGE_SHIFT;
> > > > I don't think this is the design we want. As long as it was set at "half
> > > > of system memory" it was clear that a) it's a hack b) precision didn't
> > > > matter.
> > > Yeah, I'm also wondering what to do here exactly. In general I would
> > > completely nuke that limit if possible.
> > >
> > > > But if you go to the limit and still want to keep the "we make sure
> > > > there's no OOM" guarantee, then precision starts to matter:
> > > > - memory hotplug and hotunplug is a thing
> > > > - userspace can mlock, and it's configurable
> > > > - drivers can pin_user_pages for IO and random other stuff. Some of it is
> > > >   accounted as some subsystem-specific mlock (like rdma does iirc), some
> > > >   is just yolo or short term enough (like)
> > > > - none of what we do here takes into consideration any interactions with
> > > >   core mm tracking (like cgroups or numa or anything like that)
> > > OOM is perfectly fine with me, we should just not run into an OOM killer
> > > situation because we call shmem_read_mapping_page_gfp() in the shrinker
> > > callback.
> > >
> > > Any idea how we could guarantee that?
> > Uh ...
> >
> > I just realized I missed something between how ttm works and how i915
> > works. With i915 directly using shmem, all we do in the shrinker is unpin
> > the shmem pages. With ttm we have the page pool in the middle. And it's
> > needed for dma coherent and other things. This is rather fundamental.
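To make the recursion hazard concrete, here is a rough sketch (hypothetical
helper names, not code from any of the series discussed in this thread) of
why the shrinker cannot simply call shmem_read_mapping_page_gfp() with a
full GFP mask: the shrinker already runs inside reclaim, so any
reclaim-capable allocation it does can re-enter reclaim, which is exactly
what the fs_reclaim lockdep annotation is there to catch.

#include <linux/shmem_fs.h>
#include <linux/gfp.h>

/* Hypothetical TT -> SYSTEM step as it would be called from a shrinker. */
static struct page *tt_swap_one_page(struct address_space *mapping,
				     pgoff_t idx)
{
	/*
	 * BAD: GFP_KERNEL includes __GFP_FS, so shmem may enter direct
	 * reclaim and recurse straight back into this (or another)
	 * shrinker. Lockdep reports it as recursion on fs_reclaim.
	 */
	return shmem_read_mapping_page_gfp(mapping, idx, GFP_KERNEL);
}

static struct page *tt_swap_one_page_nofs(struct address_space *mapping,
					  pgoff_t idx)
{
	/*
	 * Less bad: without __GFP_FS reclaim cannot recurse into
	 * filesystem/shrinker paths; forward progress then relies on the
	 * outer reclaim loop retrying while pages trickle out.
	 */
	return shmem_read_mapping_page_gfp(mapping, idx,
					   GFP_NOFS | __GFP_HIGHMEM);
}

The restricted mask trades the recursion for depending on the outer reclaim
loop to retry, which is what the "GFP_NOFS" option further down boils down to.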
>
> Yeah, the WC/UC handling is otherwise way too slow. We could only do this
> for normal WB allocations.
>
> > btw I don't think you'll get OOM, you get a lockdep splat because you're
> > recursing on the fs_reclaim fake lock. We should probably put a
> > might_alloc() into shmem_read_mapping_page_gfp().
>
> Yeah, one reason why I haven't enabled that yet.
>
> > > > If we want to drop the "half of system ram" limit (and yes that makes
> > > > sense) I think the right design is:
> > > >
> > > > - we give up on the "no OOM" guarantee.
> > > >
> > > > - This means if you want real isolation of tasks, we need cgroups, and we
> > > >   need to integrate ttm cgroups with system memory cgroups somehow. Unlike
> > > >   randomly picked hardcoded limits this should work a lot more reliably
> > > >   and be a lot more useful in practical use in the field.
> > > >
> > > > - This also means that drivers start to fail in interesting ways. I think
> > > >   locking headaches are covered with all the lockdep annotations I've
> > > >   pushed, plus some of the things I still have in-flight (I have a
> > > >   might_alloc() annotation somewhere). That leaves validation of error
> > > >   paths for when allocations fail. IME a very effective way we used in
> > > >   i915 is (ab)using EINTR restarting, which per the drmIoctl uapi spec is
> > > >   required. We could put a debug mode into ttm_tt which randomly fails
> > > >   with -EINTR to make sure it's all working correctly (plus anything else
> > > >   that allocates memory), and unlike real out-of-memory injection piglit
> > > >   and any other cts will complete without failure. Which gives us an
> > > >   excellent metric for "does it work". Actual OOM, even an injected one,
> > > >   tends to make stuff blow up in a way that's very hard to track and make
> > > >   sure you've got good coverage, since all your usual tests die pretty
> > > >   quickly.
> > > >
> > > > - ttm_tt needs to play fair with every other system memory user. We need
> > > >   to register a shrinker for each ttm_tt (so usually one per device I
> > > >   guess), which walks the lru (in shrink_count) and uses dma_resv_trylock
> > > >   for actual shrinking. We probably want to move it to SYSTEM first for
> > > >   that shrinker to pick up, so that there's some global fairness across
> > > >   all ttm_tt.
> > > I already have patches for this.
> > >
> > > What's still missing is teaching the OOM killer which task is using the
> > > memory, since memory referenced through file descriptors is currently
> > > not accounted towards any process resources.
> > Yeah I don't think we've fixed that problem for i915 either. It loves to
> > kill random other stuff. In igt we solve this by marking any igt testcase
> > as "pls kill this first". Well, "solve".
>
> Well that is a "creative" solution :)
>
> I will rather be looking into whether we can somehow track which
> files_struct a file belongs to and whether we can somehow use this in the
> OOM killer.
>
> My last try of giving each file something wasn't received well.
>
> The real bummer is that you can really nicely create a denial of service
> using memfd_create() because of this :)

Yeah, proper accounting is the right fix here, I agree.

> > > > - for GFP_DMA32 that means zone aware shrinkers. We've never used those,
> > > >   because thus far i915 didn't have any big need for low memory, so we
> > > >   haven't used this in practice. But it's supposed to be a thing.
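To sketch the per-device shrinker idea from the list above a bit more
concretely (all structure and function names here are hypothetical, not the
actual in-flight patches): count_objects only gives a cheap estimate, and
scan_objects walks the LRU with dma_resv_trylock(), because blocking on a
reservation lock from inside reclaim can deadlock.

#include <linux/kernel.h>
#include <linux/shrinker.h>
#include <linux/dma-resv.h>
#include <linux/list.h>
#include <linux/spinlock.h>

/* Hypothetical per-device shrinker state; just the shape of the idea. */
struct tt_shrinker {
	struct shrinker base;
	spinlock_t lock;
	struct list_head lru;		/* entries that could move to SYSTEM */
	unsigned long nr_pages;		/* backing pages on that lru */
};

struct tt_entry {
	struct list_head lru;
	struct dma_resv *resv;
	unsigned long nr_pages;
};

static unsigned long tt_shrink_count(struct shrinker *s,
				     struct shrink_control *sc)
{
	struct tt_shrinker *ts = container_of(s, struct tt_shrinker, base);

	/* Cheap, lockless estimate; the real work happens in scan_objects. */
	return READ_ONCE(ts->nr_pages) ?: SHRINK_EMPTY;
}

static unsigned long tt_shrink_scan(struct shrinker *s,
				    struct shrink_control *sc)
{
	struct tt_shrinker *ts = container_of(s, struct tt_shrinker, base);
	unsigned long freed = 0, scanned = 0;

	spin_lock(&ts->lock);
	while (!list_empty(&ts->lru) && scanned++ < sc->nr_to_scan) {
		struct tt_entry *e = list_first_entry(&ts->lru,
						      struct tt_entry, lru);

		/* Never block on the reservation lock from inside reclaim. */
		if (!dma_resv_trylock(e->resv)) {
			list_move_tail(&e->lru, &ts->lru);
			continue;
		}

		list_del_init(&e->lru);
		ts->nr_pages -= e->nr_pages;
		spin_unlock(&ts->lock);

		/* The real code would move/unpopulate the backing pages here. */
		freed += e->nr_pages;

		dma_resv_unlock(e->resv);
		spin_lock(&ts->lock);
	}
	spin_unlock(&ts->lock);

	return freed ?: SHRINK_STOP;
}

Registration would then be one register_shrinker() call per device at init
time, so reclaim sees one fair source of freeable pages per ttm_tt user.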
> > > I think we can mostly forget about GFP_DMA32, this should only be used
> > > for AGP and other ancient hardware.
> > Ok, that's good. So having an arbitrary "only half of GFP_DMA32" for these
> > should also be acceptable, since back then gpus didn't really have
> > gigabytes of vram yet. Biggest r500 seems to have topped out at 512MB.
> >
> > How much do we require the dma page pool? Afaiui it's only needed when
> > there are bounce buffers or something nasty like that present. Are there
> > more use cases?
>
> Not that I know of.
>
> > For fixing the TT -> SYSTEM problem of calling shmem_read_mapping_page_gfp
> > from the shrinker I think there are 3 solutions:
> >
> > 1) Call shmem_read_mapping_page_gfp with GFP_IO. This way we should not be
> >    calling back into our shrinker, which is at GFP_FS level. This isn't
> >    great, but since page reclaim is a gigantic loop which repeats as long
> >    as we're making meaningful forward progress, we should be able to
> >    trickle TT pages to SYSTEM pages even under severe memory pressure.
>
> This is most likely the way to go for us.

GFP_NOFS is the flag, I mixed them up.

> > 2) For the normal page pool (i.e. neither GFP_DMA32 nor dma_alloc_coherent
> >    nor write-combine) we stop copying the pages for TT <-> SYSTEM and just
> >    directly pin the pages we get from shmem, like all the shmem-based soc
> >    drivers do. No copying, no need for allocations, no problem in the
> >    shrinkers.
>
> Not really doable, WC is just too widely used and shmem doesn't support
> larger orders.

Hm, I thought you get large pages automatically if you reassemble them.
It's just the interface that sucks.

btw did you look at 3) below? It might break dma-api assumptions a bit too
badly, since we're doing nice page flushing behind everyone's back.
-Daniel

> Regards,
> Christian.
>
> > 3) For the remaining page pools we'd need a ttmfs so that we can control
> >    the address_space_operations and directly swap out from our pages to
> >    swap, bypassing shmem and the copying. Iirc Chris Wilson had a
> >    prototype once somewhere under gemfs for this. Only needed if 1) isn't
> >    fast enough for these because there are too many such cases where we
> >    care.
> >
> > At least on intel hw the direction is very much towards fully coherent,
> > so 1&2 should be good enough.
> > -Daniel
> >
> > > Christian.
> > >
> > > > It's a bit more code than the oneliner above, but I also think it's a lot
> > > > more solid. Plus it would resolve the last big issue where i915 gem is
> > > > fairly fundamentally different compared to ttm. For that question I think
> > > > once Maarten's locking rework for i915 has landed and all the other ttm
> > > > rework from you and Dave is in, we've resolved them all.
> > > >
> > > > >  	/* But for DMA32 we limit ourself to only use 2GiB maximum. */
> > > > >  	num_dma32_pages = (u64)(si.totalram - si.totalhigh) * si.mem_unit;
> > > > > --
> > > > > 2.25.1

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
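For reference, the "directly pin the pages we get from shmem" variant in 2)
is roughly what the existing shmem-backed GEM helpers already provide; a
minimal sketch using them (the wrapper names are made up, and this only
covers cacheable WB memory, which is exactly the objection above):

#include <drm/drm_gem.h>

/*
 * Hypothetical wrappers, not TTM code: what "just pin the shmem pages"
 * looks like with the existing GEM helpers. WC/UC allocations still need
 * the page pool and the copy.
 */
static struct page **pin_backing_pages(struct drm_gem_object *obj)
{
	/* Pins the pages of the object's shmem file; nothing gets copied. */
	return drm_gem_get_pages(obj);
}

static void unpin_backing_pages(struct drm_gem_object *obj,
				struct page **pages)
{
	/* Drop the pins; shmem can then swap the pages out on its own. */
	drm_gem_put_pages(obj, pages, true, true);
}

Because the pages stay in shmem, the shrinker has nothing to copy: dropping
the pin is enough for reclaim to swap them out.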