Re: [git pull] drm for 5.14-rc1

From: Linus Torvalds <torvalds@linux-foundation.org>
To: Dave Airlie <airlied@gmail.com>,
	Philip Yang <Philip.Yang@amd.com>,
	Felix Kuehling <Felix.Kuehling@amd.com>,
	Alex Deucher <alexander.deucher@amd.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>,
	dri-devel <dri-devel@lists.freedesktop.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [git pull] drm for 5.14-rc1
Date: Thu, 1 Jul 2021 13:15:07 -0700	[thread overview]
Message-ID: <CAHk-=whgcN6MEyZBgK3UZRw=vwd1CAAK9+rafmZ2vsOiGpsMSA@mail.gmail.com> (raw)
In-Reply-To: <CAPM=9tzR4BqTtamrTy4T_XV7E0fUNyduaVtH5zAi=sqwX_3udg@mail.gmail.com>

On Wed, Jun 30, 2021 at 9:34 PM Dave Airlie <airlied@gmail.com> wrote:
>
> Hi Linus,
>
> This is the main drm pull request for 5.14-rc1.
>
> I've done a test pull into your current tree, and hit two conflicts
> (one in vc4, one in amdgpu), both seem pretty trivial, the amdgpu one
> is recent and sfr sent out a resolution for it today.

Well, the resolutions may be trivial, but the conflict made me look at
the code, and it's buggy.

Commit 04d8d73dbcbe ("drm/amdgpu: add common HMM get pages function")
is broken. It made the code do

        mmap_read_lock(mm);
        vma = find_vma(mm, start);
        mmap_read_unlock(mm);

and then it *uses* that "vma" after it has dropped the lock.

That's a big no-no - once you've dropped the lock, the vma contents
simply aren't reliable any more. That mapping could now be unmapped
and removed at any time.

Now, the conflict actually made one of the uses go away (switching to
vma_lookup() means that the subsequent code no longer needs to look at
"vm_start" to verify we're actually _inside_ the vma), but it still
checks for vma->vm_file afterwards.

So those locking changes in commit 04d8d73dbcbe are completely bogus.

I tried to fix up that bug while handling the conflict, but who knows
what else similar is going on elsewhere.

So I would ask people to

 (a) verify that I didn't make things worse as I fixed things up (note
how I had to change the last argument to amdgpu_hmm_range_get_pages()
from false to true etc).

 (b) go and look at their vma lookup code: you can't just look up a
vma under the lock, and then drop the lock, and then think things stay
stable.

In particular for that (b) case: it is *NOT* enough to look up
vma->vm_file inside the lock and cache that. No - if the test is about
"no backing file before looking up pages", then you have to *keep*
holding the lock until after you've actually looked up the pages!

Because otherwise any test for "vma->vm_file" is entirely pointless,
for the same reason it's buggy to even look at it after dropping the
lock: because once you've dropped the lock, the thing you just tested
for might not be true any more.

So no, it's not valid to do

    bool has_file = vma && vma->vm_file;

and then drop the lock, because you don't use 'vma' any more as a
pointer, and then use 'has_file' outside the lock. Because after
you've dropped the lock, 'has_file' is now meaningless.

So it's not just about "you can't look at vma->vm_file after dropping
the lock". It's more fundamental than that. Any *decision* you make
based on the vma is entirely pointless and moot after the lock is
dropped!

Did I fix it up correctly? Who knows. The code makes more sense to me
now and seems valid. But I really *really* want to stress how locking
is important.

You also can't just unlock in the middle of an operation - even if you
then take the lock *again* later (as amdgpu_hmm_range_get_pages() then
did), the fact that you unlocked in the middle means that all the
earlier tests you did are simply no longer valid when you re-take the
lock.

                 Linus