From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=LoKS=AZ=vger.kernel.org=linux-rdma-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-10.1 required=3.0 tests=BAYES_00,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,
	MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED
	autolearn=unavailable autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 68CE9C433E3
	for <linux-rdma@archiver.kernel.org>; Tue, 14 Jul 2020 10:49:17 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 42EF322203
	for <linux-rdma@archiver.kernel.org>; Tue, 14 Jul 2020 10:49:17 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (1024-bit key) header.d=ffwll.ch header.i=@ffwll.ch header.b="bOeL2MUd"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1725884AbgGNKtP (ORCPT <rfc822;linux-rdma@archiver.kernel.org>);
        Tue, 14 Jul 2020 06:49:15 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48018 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726534AbgGNKtO (ORCPT
        <rfc822;linux-rdma@vger.kernel.org>); Tue, 14 Jul 2020 06:49:14 -0400
Received: from mail-wr1-x444.google.com (mail-wr1-x444.google.com [IPv6:2a00:1450:4864:20::444])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5C6A5C061794
        for <linux-rdma@vger.kernel.org>; Tue, 14 Jul 2020 03:49:14 -0700 (PDT)
Received: by mail-wr1-x444.google.com with SMTP id k6so20747907wrn.3
        for <linux-rdma@vger.kernel.org>; Tue, 14 Jul 2020 03:49:14 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=ffwll.ch; s=google;
        h=date:from:to:cc:subject:message-id:references:mime-version
         :content-disposition:content-transfer-encoding:in-reply-to;
        bh=XFyCObAFNGAkAD33iucRxardqlG+NdnL+BA8ka5PPr4=;
        b=bOeL2MUd2f8AlWSJD3wmjK7kX/u8xZmjsUmMQeJzGK+l98O+GbJDKEx2q+dtP0Ysuw
         ahdh9hOxfLzdYsrsSCIb4+jsBwaVOfOB9g1F7JaVOZ01TKCuMezHAGJnllnKMSUhc5F+
         Oz3Dlt8i1yKPbH5ek0krA47zpw8DZA2RsafPc=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:date:from:to:cc:subject:message-id:references
         :mime-version:content-disposition:content-transfer-encoding
         :in-reply-to;
        bh=XFyCObAFNGAkAD33iucRxardqlG+NdnL+BA8ka5PPr4=;
        b=Af0LTU8ax5yzOs6RZnBuwvY/RrNU8p0jnfakVWZDRkSKABWx9ORWiIDsbaBhLfw/LN
         LbMMiPo4L9BcLSuezaWjGudSaFdLXSV4wYYzeSivmMFjAW+eSUuisZIrh0Pigwa4X75+
         u3+4jK+m1oCiBwnMtywTDDxQ5qX7spmGvAsHx9pMdnCvzaPCYKVMEEVODhCbCIpXGgII
         +OhVohf4Aa/wTIh9VhrF+KSq9ISidxWzVnEpyKrYVla1N2/Cvc8ZivmTB2FyMaCBHAwN
         osjkW0bTeTtOS6ig920hcxVgpCpGEtl4TkscF8rPzAADVjmyp9mJ72fXyLSoNkMNnWET
         Csbg==
X-Gm-Message-State: AOAM533fo3Fjc/AjBswlw3+g3XFOdGIcjjweQJkQTKTSJQNoG1gc4kNd
        ZMLO7XuMT/6sUwLEoecy99BKFg==
X-Google-Smtp-Source: ABdhPJxzCDVzf3ULPsu9gK9KiSaB34ZTvYO1lf6Ch0WaVTCD95UxXPy23Q0JcDyFOFTm8A13sseYqQ==
X-Received: by 2002:adf:c44d:: with SMTP id a13mr4551346wrg.205.1594723752919;
        Tue, 14 Jul 2020 03:49:12 -0700 (PDT)
Received: from phenom.ffwll.local ([2a02:168:57f4:0:efd0:b9e5:5ae6:c2fa])
        by smtp.gmail.com with ESMTPSA id u17sm27474975wrp.70.2020.07.14.03.49.11
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Tue, 14 Jul 2020 03:49:12 -0700 (PDT)
Date:   Tue, 14 Jul 2020 12:49:10 +0200
From:   Daniel Vetter <daniel@ffwll.ch>
To:     DRI Development <dri-devel@lists.freedesktop.org>
Cc:     Intel Graphics Development <intel-gfx@lists.freedesktop.org>,
        linux-rdma@vger.kernel.org, Daniel Vetter <daniel.vetter@ffwll.ch>,
        linux-media@vger.kernel.org, linaro-mm-sig@lists.linaro.org,
        amd-gfx@lists.freedesktop.org,
        Chris Wilson <chris@chris-wilson.co.uk>,
        Maarten Lankhorst <maarten.lankhorst@linux.intel.com>,
        Christian =?iso-8859-1?Q?K=F6nig?= <christian.koenig@amd.com>,
        Daniel Vetter <daniel.vetter@intel.com>
Subject: Re: [PATCH 19/25] drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler
 code
Message-ID: <20200714104910.GC3278063@phenom.ffwll.local>
References: <20200707201229.472834-1-daniel.vetter@ffwll.ch>
 <20200707201229.472834-20-daniel.vetter@ffwll.ch>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20200707201229.472834-20-daniel.vetter@ffwll.ch>
X-Operating-System: Linux phenom 5.6.0-1-amd64 
Sender: linux-rdma-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-rdma.vger.kernel.org>
X-Mailing-List: linux-rdma@vger.kernel.org

On Tue, Jul 07, 2020 at 10:12:23PM +0200, Daniel Vetter wrote:
> My dma-fence lockdep annotations caught an inversion because we
> allocate memory where we really shouldn't:
> 
> 	kmem_cache_alloc+0x2b/0x6d0
> 	amdgpu_fence_emit+0x30/0x330 [amdgpu]
> 	amdgpu_ib_schedule+0x306/0x550 [amdgpu]
> 	amdgpu_job_run+0x10f/0x260 [amdgpu]
> 	drm_sched_main+0x1b9/0x490 [gpu_sched]
> 	kthread+0x12e/0x150
> 
> Trouble right now is that lockdep only validates against GFP_FS, which
> would be good enough for shrinkers. But for mmu_notifiers we actually
> need !GFP_ATOMIC, since they can be called from any page laundering,
> even if GFP_NOFS or GFP_NOIO are set.
> 
> I guess we should improve the lockdep annotations for
> fs_reclaim_acquire/release.
> 
> Ofc real fix is to properly preallocate this fence and stuff it into
> the amdgpu job structure. But GFP_ATOMIC gets the lockdep splat out of
> the way.
> 
> v2: Two more allocations in scheduler paths.
> 
> Frist one:
> 
> 	__kmalloc+0x58/0x720
> 	amdgpu_vmid_grab+0x100/0xca0 [amdgpu]
> 	amdgpu_job_dependency+0xf9/0x120 [amdgpu]
> 	drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
> 	drm_sched_main+0xf9/0x490 [gpu_sched]
> 
> Second one:
> 
> 	kmem_cache_alloc+0x2b/0x6d0
> 	amdgpu_sync_fence+0x7e/0x110 [amdgpu]
> 	amdgpu_vmid_grab+0x86b/0xca0 [amdgpu]
> 	amdgpu_job_dependency+0xf9/0x120 [amdgpu]
> 	drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
> 	drm_sched_main+0xf9/0x490 [gpu_sched]
> 
> Cc: linux-media@vger.kernel.org
> Cc: linaro-mm-sig@lists.linaro.org
> Cc: linux-rdma@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: intel-gfx@lists.freedesktop.org
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>

Has anyone from amd side started looking into how to fix this properly?

I looked a bit into fixing this with mempool, and the big guarantee we
need is that
- there's a hard upper limit on how many allocations we minimally need to
  guarantee forward progress. And the entire vmid allocation and
  amdgpu_sync_fence stuff kinda makes me question that's a valid
  assumption.

- mempool_free must be called without any locks in the way which are held
  while we call mempool_alloc. Otherwise we again have a nice deadlock
  with no forward progress. I tried auditing that, but got lost in amdgpu
  and scheduler code. Some lockdep annotations for mempool.c might help,
  but they're not going to catch everything. Plus it would be again manual
  annotations because this is yet another cross-release issue. So not sure
  that helps at all.

iow, not sure what to do here. Ideas?

Cheers, Daniel

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c   | 2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c  | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 8d84975885cd..a089a827fdfe 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -143,7 +143,7 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f,
>  	uint32_t seq;
>  	int r;
>  
> -	fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL);
> +	fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC);
>  	if (fence == NULL)
>  		return -ENOMEM;
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> index 267fa45ddb66..a333ca2d4ddd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> @@ -208,7 +208,7 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm,
>  	if (ring->vmid_wait && !dma_fence_is_signaled(ring->vmid_wait))
>  		return amdgpu_sync_fence(sync, ring->vmid_wait);
>  
> -	fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_KERNEL);
> +	fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_ATOMIC);
>  	if (!fences)
>  		return -ENOMEM;
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> index 8ea6c49529e7..af22b526cec9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> @@ -160,7 +160,7 @@ int amdgpu_sync_fence(struct amdgpu_sync *sync, struct dma_fence *f)
>  	if (amdgpu_sync_add_later(sync, f))
>  		return 0;
>  
> -	e = kmem_cache_alloc(amdgpu_sync_slab, GFP_KERNEL);
> +	e = kmem_cache_alloc(amdgpu_sync_slab, GFP_ATOMIC);
>  	if (!e)
>  		return -ENOMEM;
>  
> -- 
> 2.27.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=WUjm=AZ=lists.freedesktop.org=dri-devel-bounces@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-9.8 required=3.0 tests=BAYES_00,DKIM_INVALID,
	DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,
	SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=unavailable
	autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 94F83C433E4
	for <dri-devel@archiver.kernel.org>; Tue, 14 Jul 2020 10:49:18 +0000 (UTC)
Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by mail.kernel.org (Postfix) with ESMTPS id 68D4A22203
	for <dri-devel@archiver.kernel.org>; Tue, 14 Jul 2020 10:49:18 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=fail reason="signature verification failed" (1024-bit key) header.d=ffwll.ch header.i=@ffwll.ch header.b="bOeL2MUd"
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 68D4A22203
Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=ffwll.ch
Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=dri-devel-bounces@lists.freedesktop.org
Received: from gabe.freedesktop.org (localhost [127.0.0.1])
	by gabe.freedesktop.org (Postfix) with ESMTP id 31EEA6E949;
	Tue, 14 Jul 2020 10:49:16 +0000 (UTC)
Received: from mail-wr1-x444.google.com (mail-wr1-x444.google.com
 [IPv6:2a00:1450:4864:20::444])
 by gabe.freedesktop.org (Postfix) with ESMTPS id 9DB366E8B4
 for <dri-devel@lists.freedesktop.org>; Tue, 14 Jul 2020 10:49:14 +0000 (UTC)
Received: by mail-wr1-x444.google.com with SMTP id z2so20758646wrp.2
 for <dri-devel@lists.freedesktop.org>; Tue, 14 Jul 2020 03:49:14 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ffwll.ch; s=google;
 h=date:from:to:cc:subject:message-id:references:mime-version
 :content-disposition:content-transfer-encoding:in-reply-to;
 bh=XFyCObAFNGAkAD33iucRxardqlG+NdnL+BA8ka5PPr4=;
 b=bOeL2MUd2f8AlWSJD3wmjK7kX/u8xZmjsUmMQeJzGK+l98O+GbJDKEx2q+dtP0Ysuw
 ahdh9hOxfLzdYsrsSCIb4+jsBwaVOfOB9g1F7JaVOZ01TKCuMezHAGJnllnKMSUhc5F+
 Oz3Dlt8i1yKPbH5ek0krA47zpw8DZA2RsafPc=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:date:from:to:cc:subject:message-id:references
 :mime-version:content-disposition:content-transfer-encoding
 :in-reply-to;
 bh=XFyCObAFNGAkAD33iucRxardqlG+NdnL+BA8ka5PPr4=;
 b=lZX3D4BzkfkBZrV2KGLAkQAZ5Rz3KkpP7xi8YGG1SEXrSygmcl0EfWUwGTu3EudR5W
 rn0AVasNanuH4JAiB2KuvBlVkP5Gm2N4MWs5KQZZmE5JR5fpOALioKxO4n6tLxgkMavi
 E/N/3u/JSiTNXYB8S+/rEqx0Qv29/SwRRb/RoErrvdzDmpN5d287KS2SK1fSP4ikFU8V
 IWpNj7f0Jicyw6SFFgY+LKlBL/zrMnxcQ1VW+nzbCHH10N28gXbSS6B6fMxEtiTQSA8x
 FyopOL/CwY/yCawSysQzrmsQm/DWlY9jeblgB9H6S50e4RTTq8bfE3bd0BpxZgAqOex3
 MtPg==
X-Gm-Message-State: AOAM530s3TjGhVxtO2QD+Pz3XJx57MVBJN3Waf8TFScY67LZhOrCysjs
 4mRGFvnOttv8mYrs0EBXvmrGRmIEWpk=
X-Google-Smtp-Source: ABdhPJxzCDVzf3ULPsu9gK9KiSaB34ZTvYO1lf6Ch0WaVTCD95UxXPy23Q0JcDyFOFTm8A13sseYqQ==
X-Received: by 2002:adf:c44d:: with SMTP id a13mr4551346wrg.205.1594723752919; 
 Tue, 14 Jul 2020 03:49:12 -0700 (PDT)
Received: from phenom.ffwll.local ([2a02:168:57f4:0:efd0:b9e5:5ae6:c2fa])
 by smtp.gmail.com with ESMTPSA id u17sm27474975wrp.70.2020.07.14.03.49.11
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Tue, 14 Jul 2020 03:49:12 -0700 (PDT)
Date: Tue, 14 Jul 2020 12:49:10 +0200
From: Daniel Vetter <daniel@ffwll.ch>
To: DRI Development <dri-devel@lists.freedesktop.org>
Subject: Re: [PATCH 19/25] drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler
 code
Message-ID: <20200714104910.GC3278063@phenom.ffwll.local>
References: <20200707201229.472834-1-daniel.vetter@ffwll.ch>
 <20200707201229.472834-20-daniel.vetter@ffwll.ch>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <20200707201229.472834-20-daniel.vetter@ffwll.ch>
X-Operating-System: Linux phenom 5.6.0-1-amd64 
X-BeenThere: dri-devel@lists.freedesktop.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Direct Rendering Infrastructure - Development
 <dri-devel.lists.freedesktop.org>
List-Unsubscribe: <https://lists.freedesktop.org/mailman/options/dri-devel>,
 <mailto:dri-devel-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <https://lists.freedesktop.org/archives/dri-devel>
List-Post: <mailto:dri-devel@lists.freedesktop.org>
List-Help: <mailto:dri-devel-request@lists.freedesktop.org?subject=help>
List-Subscribe: <https://lists.freedesktop.org/mailman/listinfo/dri-devel>,
 <mailto:dri-devel-request@lists.freedesktop.org?subject=subscribe>
Cc: linux-rdma@vger.kernel.org, Daniel Vetter <daniel.vetter@ffwll.ch>,
 Intel Graphics Development <intel-gfx@lists.freedesktop.org>,
 amd-gfx@lists.freedesktop.org, Chris Wilson <chris@chris-wilson.co.uk>,
 linaro-mm-sig@lists.linaro.org, Daniel Vetter <daniel.vetter@intel.com>,
 Christian =?iso-8859-1?Q?K=F6nig?= <christian.koenig@amd.com>,
 linux-media@vger.kernel.org
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>

On Tue, Jul 07, 2020 at 10:12:23PM +0200, Daniel Vetter wrote:
> My dma-fence lockdep annotations caught an inversion because we
> allocate memory where we really shouldn't:
> =

> 	kmem_cache_alloc+0x2b/0x6d0
> 	amdgpu_fence_emit+0x30/0x330 [amdgpu]
> 	amdgpu_ib_schedule+0x306/0x550 [amdgpu]
> 	amdgpu_job_run+0x10f/0x260 [amdgpu]
> 	drm_sched_main+0x1b9/0x490 [gpu_sched]
> 	kthread+0x12e/0x150
> =

> Trouble right now is that lockdep only validates against GFP_FS, which
> would be good enough for shrinkers. But for mmu_notifiers we actually
> need !GFP_ATOMIC, since they can be called from any page laundering,
> even if GFP_NOFS or GFP_NOIO are set.
> =

> I guess we should improve the lockdep annotations for
> fs_reclaim_acquire/release.
> =

> Ofc real fix is to properly preallocate this fence and stuff it into
> the amdgpu job structure. But GFP_ATOMIC gets the lockdep splat out of
> the way.
> =

> v2: Two more allocations in scheduler paths.
> =

> Frist one:
> =

> 	__kmalloc+0x58/0x720
> 	amdgpu_vmid_grab+0x100/0xca0 [amdgpu]
> 	amdgpu_job_dependency+0xf9/0x120 [amdgpu]
> 	drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
> 	drm_sched_main+0xf9/0x490 [gpu_sched]
> =

> Second one:
> =

> 	kmem_cache_alloc+0x2b/0x6d0
> 	amdgpu_sync_fence+0x7e/0x110 [amdgpu]
> 	amdgpu_vmid_grab+0x86b/0xca0 [amdgpu]
> 	amdgpu_job_dependency+0xf9/0x120 [amdgpu]
> 	drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
> 	drm_sched_main+0xf9/0x490 [gpu_sched]
> =

> Cc: linux-media@vger.kernel.org
> Cc: linaro-mm-sig@lists.linaro.org
> Cc: linux-rdma@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: intel-gfx@lists.freedesktop.org
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Christian K=F6nig <christian.koenig@amd.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>

Has anyone from amd side started looking into how to fix this properly?

I looked a bit into fixing this with mempool, and the big guarantee we
need is that
- there's a hard upper limit on how many allocations we minimally need to
  guarantee forward progress. And the entire vmid allocation and
  amdgpu_sync_fence stuff kinda makes me question that's a valid
  assumption.

- mempool_free must be called without any locks in the way which are held
  while we call mempool_alloc. Otherwise we again have a nice deadlock
  with no forward progress. I tried auditing that, but got lost in amdgpu
  and scheduler code. Some lockdep annotations for mempool.c might help,
  but they're not going to catch everything. Plus it would be again manual
  annotations because this is yet another cross-release issue. So not sure
  that helps at all.

iow, not sure what to do here. Ideas?

Cheers, Daniel

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c   | 2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c  | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
> =

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/=
amd/amdgpu/amdgpu_fence.c
> index 8d84975885cd..a089a827fdfe 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -143,7 +143,7 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struc=
t dma_fence **f,
>  	uint32_t seq;
>  	int r;
>  =

> -	fence =3D kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL);
> +	fence =3D kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC);
>  	if (fence =3D=3D NULL)
>  		return -ENOMEM;
>  =

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/am=
d/amdgpu/amdgpu_ids.c
> index 267fa45ddb66..a333ca2d4ddd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> @@ -208,7 +208,7 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm,
>  	if (ring->vmid_wait && !dma_fence_is_signaled(ring->vmid_wait))
>  		return amdgpu_sync_fence(sync, ring->vmid_wait);
>  =

> -	fences =3D kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_KERNEL);
> +	fences =3D kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_ATOMIC);
>  	if (!fences)
>  		return -ENOMEM;
>  =

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/a=
md/amdgpu/amdgpu_sync.c
> index 8ea6c49529e7..af22b526cec9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> @@ -160,7 +160,7 @@ int amdgpu_sync_fence(struct amdgpu_sync *sync, struc=
t dma_fence *f)
>  	if (amdgpu_sync_add_later(sync, f))
>  		return 0;
>  =

> -	e =3D kmem_cache_alloc(amdgpu_sync_slab, GFP_KERNEL);
> +	e =3D kmem_cache_alloc(amdgpu_sync_slab, GFP_ATOMIC);
>  	if (!e)
>  		return -ENOMEM;
>  =

> -- =

> 2.27.0
> =


-- =

Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=MnHL=AZ=lists.freedesktop.org=intel-gfx-bounces@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-9.8 required=3.0 tests=BAYES_00,DKIM_INVALID,
	DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,
	SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=unavailable
	autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id D80AFC433E2
	for <intel-gfx@archiver.kernel.org>; Tue, 14 Jul 2020 10:49:16 +0000 (UTC)
Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by mail.kernel.org (Postfix) with ESMTPS id AE29822203
	for <intel-gfx@archiver.kernel.org>; Tue, 14 Jul 2020 10:49:16 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=fail reason="signature verification failed" (1024-bit key) header.d=ffwll.ch header.i=@ffwll.ch header.b="bOeL2MUd"
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org AE29822203
Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=ffwll.ch
Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=intel-gfx-bounces@lists.freedesktop.org
Received: from gabe.freedesktop.org (localhost [127.0.0.1])
	by gabe.freedesktop.org (Postfix) with ESMTP id C4BF06E8BE;
	Tue, 14 Jul 2020 10:49:15 +0000 (UTC)
Received: from mail-wr1-x443.google.com (mail-wr1-x443.google.com
 [IPv6:2a00:1450:4864:20::443])
 by gabe.freedesktop.org (Postfix) with ESMTPS id 6B0536E8B4
 for <intel-gfx@lists.freedesktop.org>; Tue, 14 Jul 2020 10:49:14 +0000 (UTC)
Received: by mail-wr1-x443.google.com with SMTP id z15so20738285wrl.8
 for <intel-gfx@lists.freedesktop.org>; Tue, 14 Jul 2020 03:49:14 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ffwll.ch; s=google;
 h=date:from:to:cc:subject:message-id:references:mime-version
 :content-disposition:content-transfer-encoding:in-reply-to;
 bh=XFyCObAFNGAkAD33iucRxardqlG+NdnL+BA8ka5PPr4=;
 b=bOeL2MUd2f8AlWSJD3wmjK7kX/u8xZmjsUmMQeJzGK+l98O+GbJDKEx2q+dtP0Ysuw
 ahdh9hOxfLzdYsrsSCIb4+jsBwaVOfOB9g1F7JaVOZ01TKCuMezHAGJnllnKMSUhc5F+
 Oz3Dlt8i1yKPbH5ek0krA47zpw8DZA2RsafPc=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:date:from:to:cc:subject:message-id:references
 :mime-version:content-disposition:content-transfer-encoding
 :in-reply-to;
 bh=XFyCObAFNGAkAD33iucRxardqlG+NdnL+BA8ka5PPr4=;
 b=UZM6jgGSPNiCck37+J+pP8DoFs5B338BLL0VkfatRH+nbVW3jNFiQN+JnYTE3XWcDW
 I0ElR6Ug7vBBrxcF+MjJYiW01NcSazqv3CARHfQFujovNHoqwTanp9U9QEZaYnX2fgGs
 AI4n0BCMiaH/fQYLRKQJCTzM8YZTO6sUM/KCkEqaQhU2+Pqo7EtItV5R8f72CeL3egVR
 OKo3eTZZw18cAIMz+fl/ILCR8Zv0hBHGpsbF9RSKNSdah0FJt00zcOMNHuVCON9Th9Aj
 kkYWPFmzlmFssY282DkjsuNXkHpG5uhjIzZCGoD9x9QeE9iIW/DLwn0kJ0odquzkRGyx
 s24g==
X-Gm-Message-State: AOAM532g82IdLt0Qcodc+yjile9pU9A+S0eonZHiKRxTFXe1U6RBSKjO
 g86t8FYc5sXjtJLZOtwk70tFGw==
X-Google-Smtp-Source: ABdhPJxzCDVzf3ULPsu9gK9KiSaB34ZTvYO1lf6Ch0WaVTCD95UxXPy23Q0JcDyFOFTm8A13sseYqQ==
X-Received: by 2002:adf:c44d:: with SMTP id a13mr4551346wrg.205.1594723752919; 
 Tue, 14 Jul 2020 03:49:12 -0700 (PDT)
Received: from phenom.ffwll.local ([2a02:168:57f4:0:efd0:b9e5:5ae6:c2fa])
 by smtp.gmail.com with ESMTPSA id u17sm27474975wrp.70.2020.07.14.03.49.11
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Tue, 14 Jul 2020 03:49:12 -0700 (PDT)
Date: Tue, 14 Jul 2020 12:49:10 +0200
From: Daniel Vetter <daniel@ffwll.ch>
To: DRI Development <dri-devel@lists.freedesktop.org>
Message-ID: <20200714104910.GC3278063@phenom.ffwll.local>
References: <20200707201229.472834-1-daniel.vetter@ffwll.ch>
 <20200707201229.472834-20-daniel.vetter@ffwll.ch>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <20200707201229.472834-20-daniel.vetter@ffwll.ch>
X-Operating-System: Linux phenom 5.6.0-1-amd64 
Subject: Re: [Intel-gfx] [PATCH 19/25] drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC
 in scheduler code
X-BeenThere: intel-gfx@lists.freedesktop.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Intel graphics driver community testing & development
 <intel-gfx.lists.freedesktop.org>
List-Unsubscribe: <https://lists.freedesktop.org/mailman/options/intel-gfx>,
 <mailto:intel-gfx-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <https://lists.freedesktop.org/archives/intel-gfx>
List-Post: <mailto:intel-gfx@lists.freedesktop.org>
List-Help: <mailto:intel-gfx-request@lists.freedesktop.org?subject=help>
List-Subscribe: <https://lists.freedesktop.org/mailman/listinfo/intel-gfx>,
 <mailto:intel-gfx-request@lists.freedesktop.org?subject=subscribe>
Cc: linux-rdma@vger.kernel.org, Daniel Vetter <daniel.vetter@ffwll.ch>,
 Intel Graphics Development <intel-gfx@lists.freedesktop.org>,
 amd-gfx@lists.freedesktop.org, Chris Wilson <chris@chris-wilson.co.uk>,
 linaro-mm-sig@lists.linaro.org, Daniel Vetter <daniel.vetter@intel.com>,
 Christian =?iso-8859-1?Q?K=F6nig?= <christian.koenig@amd.com>,
 linux-media@vger.kernel.org
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Errors-To: intel-gfx-bounces@lists.freedesktop.org
Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

On Tue, Jul 07, 2020 at 10:12:23PM +0200, Daniel Vetter wrote:
> My dma-fence lockdep annotations caught an inversion because we
> allocate memory where we really shouldn't:
> =

> 	kmem_cache_alloc+0x2b/0x6d0
> 	amdgpu_fence_emit+0x30/0x330 [amdgpu]
> 	amdgpu_ib_schedule+0x306/0x550 [amdgpu]
> 	amdgpu_job_run+0x10f/0x260 [amdgpu]
> 	drm_sched_main+0x1b9/0x490 [gpu_sched]
> 	kthread+0x12e/0x150
> =

> Trouble right now is that lockdep only validates against GFP_FS, which
> would be good enough for shrinkers. But for mmu_notifiers we actually
> need !GFP_ATOMIC, since they can be called from any page laundering,
> even if GFP_NOFS or GFP_NOIO are set.
> =

> I guess we should improve the lockdep annotations for
> fs_reclaim_acquire/release.
> =

> Ofc real fix is to properly preallocate this fence and stuff it into
> the amdgpu job structure. But GFP_ATOMIC gets the lockdep splat out of
> the way.
> =

> v2: Two more allocations in scheduler paths.
> =

> Frist one:
> =

> 	__kmalloc+0x58/0x720
> 	amdgpu_vmid_grab+0x100/0xca0 [amdgpu]
> 	amdgpu_job_dependency+0xf9/0x120 [amdgpu]
> 	drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
> 	drm_sched_main+0xf9/0x490 [gpu_sched]
> =

> Second one:
> =

> 	kmem_cache_alloc+0x2b/0x6d0
> 	amdgpu_sync_fence+0x7e/0x110 [amdgpu]
> 	amdgpu_vmid_grab+0x86b/0xca0 [amdgpu]
> 	amdgpu_job_dependency+0xf9/0x120 [amdgpu]
> 	drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
> 	drm_sched_main+0xf9/0x490 [gpu_sched]
> =

> Cc: linux-media@vger.kernel.org
> Cc: linaro-mm-sig@lists.linaro.org
> Cc: linux-rdma@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: intel-gfx@lists.freedesktop.org
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Christian K=F6nig <christian.koenig@amd.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>

Has anyone from amd side started looking into how to fix this properly?

I looked a bit into fixing this with mempool, and the big guarantee we
need is that
- there's a hard upper limit on how many allocations we minimally need to
  guarantee forward progress. And the entire vmid allocation and
  amdgpu_sync_fence stuff kinda makes me question that's a valid
  assumption.

- mempool_free must be called without any locks in the way which are held
  while we call mempool_alloc. Otherwise we again have a nice deadlock
  with no forward progress. I tried auditing that, but got lost in amdgpu
  and scheduler code. Some lockdep annotations for mempool.c might help,
  but they're not going to catch everything. Plus it would be again manual
  annotations because this is yet another cross-release issue. So not sure
  that helps at all.

iow, not sure what to do here. Ideas?

Cheers, Daniel

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c   | 2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c  | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
> =

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/=
amd/amdgpu/amdgpu_fence.c
> index 8d84975885cd..a089a827fdfe 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -143,7 +143,7 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struc=
t dma_fence **f,
>  	uint32_t seq;
>  	int r;
>  =

> -	fence =3D kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL);
> +	fence =3D kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC);
>  	if (fence =3D=3D NULL)
>  		return -ENOMEM;
>  =

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/am=
d/amdgpu/amdgpu_ids.c
> index 267fa45ddb66..a333ca2d4ddd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> @@ -208,7 +208,7 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm,
>  	if (ring->vmid_wait && !dma_fence_is_signaled(ring->vmid_wait))
>  		return amdgpu_sync_fence(sync, ring->vmid_wait);
>  =

> -	fences =3D kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_KERNEL);
> +	fences =3D kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_ATOMIC);
>  	if (!fences)
>  		return -ENOMEM;
>  =

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/a=
md/amdgpu/amdgpu_sync.c
> index 8ea6c49529e7..af22b526cec9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> @@ -160,7 +160,7 @@ int amdgpu_sync_fence(struct amdgpu_sync *sync, struc=
t dma_fence *f)
>  	if (amdgpu_sync_add_later(sync, f))
>  		return 0;
>  =

> -	e =3D kmem_cache_alloc(amdgpu_sync_slab, GFP_KERNEL);
> +	e =3D kmem_cache_alloc(amdgpu_sync_slab, GFP_ATOMIC);
>  	if (!e)
>  		return -ENOMEM;
>  =

> -- =

> 2.27.0
> =


-- =

Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=Dt/2=AZ=lists.freedesktop.org=amd-gfx-bounces@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-9.8 required=3.0 tests=BAYES_00,DKIM_INVALID,
	DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,
	SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=unavailable
	autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id AFCCAC433E1
	for <amd-gfx@archiver.kernel.org>; Tue, 14 Jul 2020 10:49:15 +0000 (UTC)
Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by mail.kernel.org (Postfix) with ESMTPS id 80C2422203
	for <amd-gfx@archiver.kernel.org>; Tue, 14 Jul 2020 10:49:15 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=fail reason="signature verification failed" (1024-bit key) header.d=ffwll.ch header.i=@ffwll.ch header.b="bOeL2MUd"
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 80C2422203
Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=ffwll.ch
Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=amd-gfx-bounces@lists.freedesktop.org
Received: from gabe.freedesktop.org (localhost [127.0.0.1])
	by gabe.freedesktop.org (Postfix) with ESMTP id 307D06E8B4;
	Tue, 14 Jul 2020 10:49:15 +0000 (UTC)
Received: from mail-wr1-x441.google.com (mail-wr1-x441.google.com
 [IPv6:2a00:1450:4864:20::441])
 by gabe.freedesktop.org (Postfix) with ESMTPS id 50D0B6E8B2
 for <amd-gfx@lists.freedesktop.org>; Tue, 14 Jul 2020 10:49:14 +0000 (UTC)
Received: by mail-wr1-x441.google.com with SMTP id f18so20769415wrs.0
 for <amd-gfx@lists.freedesktop.org>; Tue, 14 Jul 2020 03:49:14 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ffwll.ch; s=google;
 h=date:from:to:cc:subject:message-id:references:mime-version
 :content-disposition:content-transfer-encoding:in-reply-to;
 bh=XFyCObAFNGAkAD33iucRxardqlG+NdnL+BA8ka5PPr4=;
 b=bOeL2MUd2f8AlWSJD3wmjK7kX/u8xZmjsUmMQeJzGK+l98O+GbJDKEx2q+dtP0Ysuw
 ahdh9hOxfLzdYsrsSCIb4+jsBwaVOfOB9g1F7JaVOZ01TKCuMezHAGJnllnKMSUhc5F+
 Oz3Dlt8i1yKPbH5ek0krA47zpw8DZA2RsafPc=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:date:from:to:cc:subject:message-id:references
 :mime-version:content-disposition:content-transfer-encoding
 :in-reply-to;
 bh=XFyCObAFNGAkAD33iucRxardqlG+NdnL+BA8ka5PPr4=;
 b=s2M5n0lbiQemHpvQPhMfzBiTVRrgaNtogrXpX/UE+sZ9Drke7M8mVlfgGkRuVQ8UUq
 vnq0CZXymfchQp8GhusGr81DuN1ae9Jj1gk5yifnunY7/BAU8lX3zAd10kdqgzUwDOPD
 2c/qKrISxAk2+xByRsTFmUdj56IVhnJMbDYwl/Cs61Sj4zYqz+VLuxkM1xlr4Eupl2OX
 vH5wAWcuY4gwARlvPFSxb5mQsVDl7WOQtGxtdr3jfYO4MI8iKTg3zxsJNXT71OwILqIf
 pRv3twM5OUyMcaLLLVXi3WzSH23pagQA1DIIeC9QKfYVCvZz6rf3NhgShxQbydpFe4E1
 NpvA==
X-Gm-Message-State: AOAM531mzEATdbj6nJl2DbtviQSLg1Z1A9XdSmeMe/dxR8DgjgreEZfV
 oX4v2JWxA654pBxgkNK8UxQuHQ==
X-Google-Smtp-Source: ABdhPJxzCDVzf3ULPsu9gK9KiSaB34ZTvYO1lf6Ch0WaVTCD95UxXPy23Q0JcDyFOFTm8A13sseYqQ==
X-Received: by 2002:adf:c44d:: with SMTP id a13mr4551346wrg.205.1594723752919; 
 Tue, 14 Jul 2020 03:49:12 -0700 (PDT)
Received: from phenom.ffwll.local ([2a02:168:57f4:0:efd0:b9e5:5ae6:c2fa])
 by smtp.gmail.com with ESMTPSA id u17sm27474975wrp.70.2020.07.14.03.49.11
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Tue, 14 Jul 2020 03:49:12 -0700 (PDT)
Date: Tue, 14 Jul 2020 12:49:10 +0200
From: Daniel Vetter <daniel@ffwll.ch>
To: DRI Development <dri-devel@lists.freedesktop.org>
Subject: Re: [PATCH 19/25] drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler
 code
Message-ID: <20200714104910.GC3278063@phenom.ffwll.local>
References: <20200707201229.472834-1-daniel.vetter@ffwll.ch>
 <20200707201229.472834-20-daniel.vetter@ffwll.ch>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <20200707201229.472834-20-daniel.vetter@ffwll.ch>
X-Operating-System: Linux phenom 5.6.0-1-amd64 
X-BeenThere: amd-gfx@lists.freedesktop.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Discussion list for AMD gfx <amd-gfx.lists.freedesktop.org>
List-Unsubscribe: <https://lists.freedesktop.org/mailman/options/amd-gfx>,
 <mailto:amd-gfx-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <https://lists.freedesktop.org/archives/amd-gfx>
List-Post: <mailto:amd-gfx@lists.freedesktop.org>
List-Help: <mailto:amd-gfx-request@lists.freedesktop.org?subject=help>
List-Subscribe: <https://lists.freedesktop.org/mailman/listinfo/amd-gfx>,
 <mailto:amd-gfx-request@lists.freedesktop.org?subject=subscribe>
Cc: linux-rdma@vger.kernel.org, Daniel Vetter <daniel.vetter@ffwll.ch>,
 Intel Graphics Development <intel-gfx@lists.freedesktop.org>,
 Maarten Lankhorst <maarten.lankhorst@linux.intel.com>,
 amd-gfx@lists.freedesktop.org, Chris Wilson <chris@chris-wilson.co.uk>,
 linaro-mm-sig@lists.linaro.org, Daniel Vetter <daniel.vetter@intel.com>,
 Christian =?iso-8859-1?Q?K=F6nig?= <christian.koenig@amd.com>,
 linux-media@vger.kernel.org
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Errors-To: amd-gfx-bounces@lists.freedesktop.org
Sender: "amd-gfx" <amd-gfx-bounces@lists.freedesktop.org>

On Tue, Jul 07, 2020 at 10:12:23PM +0200, Daniel Vetter wrote:
> My dma-fence lockdep annotations caught an inversion because we
> allocate memory where we really shouldn't:
> =

> 	kmem_cache_alloc+0x2b/0x6d0
> 	amdgpu_fence_emit+0x30/0x330 [amdgpu]
> 	amdgpu_ib_schedule+0x306/0x550 [amdgpu]
> 	amdgpu_job_run+0x10f/0x260 [amdgpu]
> 	drm_sched_main+0x1b9/0x490 [gpu_sched]
> 	kthread+0x12e/0x150
> =

> Trouble right now is that lockdep only validates against GFP_FS, which
> would be good enough for shrinkers. But for mmu_notifiers we actually
> need !GFP_ATOMIC, since they can be called from any page laundering,
> even if GFP_NOFS or GFP_NOIO are set.
> =

> I guess we should improve the lockdep annotations for
> fs_reclaim_acquire/release.
> =

> Ofc real fix is to properly preallocate this fence and stuff it into
> the amdgpu job structure. But GFP_ATOMIC gets the lockdep splat out of
> the way.
> =

> v2: Two more allocations in scheduler paths.
> =

> Frist one:
> =

> 	__kmalloc+0x58/0x720
> 	amdgpu_vmid_grab+0x100/0xca0 [amdgpu]
> 	amdgpu_job_dependency+0xf9/0x120 [amdgpu]
> 	drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
> 	drm_sched_main+0xf9/0x490 [gpu_sched]
> =

> Second one:
> =

> 	kmem_cache_alloc+0x2b/0x6d0
> 	amdgpu_sync_fence+0x7e/0x110 [amdgpu]
> 	amdgpu_vmid_grab+0x86b/0xca0 [amdgpu]
> 	amdgpu_job_dependency+0xf9/0x120 [amdgpu]
> 	drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
> 	drm_sched_main+0xf9/0x490 [gpu_sched]
> =

> Cc: linux-media@vger.kernel.org
> Cc: linaro-mm-sig@lists.linaro.org
> Cc: linux-rdma@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: intel-gfx@lists.freedesktop.org
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Christian K=F6nig <christian.koenig@amd.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>

Has anyone from amd side started looking into how to fix this properly?

I looked a bit into fixing this with mempool, and the big guarantee we
need is that
- there's a hard upper limit on how many allocations we minimally need to
  guarantee forward progress. And the entire vmid allocation and
  amdgpu_sync_fence stuff kinda makes me question that's a valid
  assumption.

- mempool_free must be called without any locks in the way which are held
  while we call mempool_alloc. Otherwise we again have a nice deadlock
  with no forward progress. I tried auditing that, but got lost in amdgpu
  and scheduler code. Some lockdep annotations for mempool.c might help,
  but they're not going to catch everything. Plus it would be again manual
  annotations because this is yet another cross-release issue. So not sure
  that helps at all.

iow, not sure what to do here. Ideas?

Cheers, Daniel

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c   | 2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c  | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
> =

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/=
amd/amdgpu/amdgpu_fence.c
> index 8d84975885cd..a089a827fdfe 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -143,7 +143,7 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struc=
t dma_fence **f,
>  	uint32_t seq;
>  	int r;
>  =

> -	fence =3D kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL);
> +	fence =3D kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC);
>  	if (fence =3D=3D NULL)
>  		return -ENOMEM;
>  =

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/am=
d/amdgpu/amdgpu_ids.c
> index 267fa45ddb66..a333ca2d4ddd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> @@ -208,7 +208,7 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm,
>  	if (ring->vmid_wait && !dma_fence_is_signaled(ring->vmid_wait))
>  		return amdgpu_sync_fence(sync, ring->vmid_wait);
>  =

> -	fences =3D kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_KERNEL);
> +	fences =3D kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_ATOMIC);
>  	if (!fences)
>  		return -ENOMEM;
>  =

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/a=
md/amdgpu/amdgpu_sync.c
> index 8ea6c49529e7..af22b526cec9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> @@ -160,7 +160,7 @@ int amdgpu_sync_fence(struct amdgpu_sync *sync, struc=
t dma_fence *f)
>  	if (amdgpu_sync_add_later(sync, f))
>  		return 0;
>  =

> -	e =3D kmem_cache_alloc(amdgpu_sync_slab, GFP_KERNEL);
> +	e =3D kmem_cache_alloc(amdgpu_sync_slab, GFP_ATOMIC);
>  	if (!e)
>  		return -ENOMEM;
>  =

> -- =

> 2.27.0
> =


-- =

Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx