From: Jason Gunthorpe <jgg@ziepe.ca>
To: Jerome Glisse <jglisse@redhat.com>
Cc: linux-rdma@vger.kernel.org, linux-mm@kvack.org,
	Ralph Campbell <rcampbell@nvidia.com>,
	John Hubbard <jhubbard@nvidia.com>
Subject: Re: [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review
Date: Fri, 24 May 2019 19:09:23 -0300	[thread overview]
Message-ID: <20190524220923.GA8519@ziepe.ca> (raw)
In-Reply-To: <20190524184608.GE3346@redhat.com>

On Fri, May 24, 2019 at 02:46:08PM -0400, Jerome Glisse wrote:
> > Here is the big 3 CPU ladder diagram that shows how 'valid' does not
> > work:
> > 
> >        CPU0                                               CPU1                                          CPU2
> >                                                         DEVICE PAGE FAULT
> >                                                         range = hmm_range_register()
> >
> >   // Overlaps with range
> >   hmm_invalidate_start()
> >     range->valid = false
> >     ops->sync_cpu_device_pagetables()
> >       take_lock(driver->update);
> >        // Wipe out page tables in device, enable faulting
> >       release_lock(driver->update);
> >                                                                                                    // Does not overlap with range
> >                                                                                                    hmm_invalidate_start()
> >                                                                                                    hmm_invalidate_end()
> >                                                                                                        list_for_each
> >                                                                                                            range->valid =  true
> 
>                                                                                                              ^
> No, this cannot happen, because CPU0 still has an invalidate_range in
> progress and thus hmm->notifiers > 0, so hmm_invalidate_range_end() will
> not set range->valid to true.

Oh, Okay, I now see how this all works, thank you

> > And I can make this more complicated (ie overlapping parallel
> > invalidates, etc) and show any 'bool' valid cannot work.
> 
> It does work. 

Well, I meant the bool alone cannot work, but this is really a bool + a
counter.

> If you want i can remove the range->valid = true from the
> hmm_invalidate_range_end() and move it within hmm_range_wait_until_valid()
> ie modifying the hmm_range_wait_until_valid() logic, this might look
> cleaner.

Let me reflect on it for a bit. I have to say I don't like the lack of
clarity here, and I don't like the valid=true loop in invalidate_end;
it is pretty clunky.

I'm thinking a more obvious API for drivers, as something like:

again:
    hmm_range_start();
    [..]
    if (hmm_range_test_retry())
        goto again;

    driver_lock();
    if (hmm_range_end())
        goto again;
    driver_unlock();

Just because it makes it very clear to the driver author what to do
and how this is working, and makes it clear that there is no such
thing as 'valid' - what we *really* have is a locking collision
forcing retry. ie this is really closer to a seq-lock scheme, not a
valid/invalid scheme. Being able to explain the concept does matter
for maintainability...

And I'm thinking the above API design would comfortably support a more
efficient seq-lock-like approach without the loop in invalidate_end..

But I haven't quite thought it all through yet. Next week!

> > I still think the best solution is to move device_lock() into mirror
> > and have hmm manage it for the driver as ODP does. It is certainly the
> > simplest solution to understand.
> 
> It is inefficient and would block forward progress by mm code further
> than needed.

I'm not sure how you get to that, we already have the device_lock()
and it already blocks forward progress by mm code.

Really, the big unfortunate thing here is that valid is manipulated
outside the device_lock, even though logically it is covered by the
device_lock.

Jason


Thread overview: 45+ messages
2019-05-23 15:34 [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review Jason Gunthorpe
2019-05-23 15:34 ` [RFC PATCH 01/11] mm/hmm: Fix use after free with struct hmm in the mmu notifiers Jason Gunthorpe
2019-06-06 23:54   ` Ira Weiny
2019-06-07 14:17     ` Jason Gunthorpe
2019-05-23 15:34 ` [RFC PATCH 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_register_range Jason Gunthorpe
2019-05-23 18:22   ` Christoph Hellwig
2019-05-23 15:34 ` [RFC PATCH 03/11] mm/hmm: Hold a mmgrab from hmm to mm Jason Gunthorpe
2019-05-23 15:34 ` [RFC PATCH 04/11] mm/hmm: Simplify hmm_get_or_create and make it reliable Jason Gunthorpe
2019-05-23 23:38   ` Ralph Campbell
2019-05-24  1:23     ` Jason Gunthorpe
2019-05-24 17:06       ` Ralph Campbell
2019-05-23 15:34 ` [RFC PATCH 05/11] mm/hmm: Improve locking around hmm->dead Jason Gunthorpe
2019-05-24 13:40   ` Jason Gunthorpe
2019-05-23 15:34 ` [RFC PATCH 06/11] mm/hmm: Remove duplicate condition test before wait_event_timeout Jason Gunthorpe
2019-05-23 15:34 ` [RFC PATCH 07/11] mm/hmm: Delete hmm_mirror_mm_is_alive() Jason Gunthorpe
2019-05-23 15:34 ` [RFC PATCH 08/11] mm/hmm: Use lockdep instead of comments Jason Gunthorpe
2019-06-07 19:33   ` Souptick Joarder
2019-06-07 19:39     ` Jason Gunthorpe
2019-06-07 21:02       ` Souptick Joarder
2019-06-08  1:15         ` Jason Gunthorpe
2019-05-23 15:34 ` [RFC PATCH 09/11] mm/hmm: Remove racy protection against double-unregistration Jason Gunthorpe
2019-06-07 19:38   ` Souptick Joarder
2019-06-07 19:37     ` Jason Gunthorpe
2019-06-07 19:55       ` Souptick Joarder
2019-05-23 15:34 ` [RFC PATCH 10/11] mm/hmm: Poison hmm_range during unregister Jason Gunthorpe
2019-06-07 20:13   ` Souptick Joarder
2019-06-07 20:18     ` Jason Gunthorpe
2019-05-23 15:34 ` [RFC PATCH 11/11] mm/hmm: Do not use list*_rcu() for hmm->ranges Jason Gunthorpe
2019-06-07 20:22   ` Souptick Joarder
2019-05-23 19:04 ` [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review John Hubbard
2019-05-23 19:37   ` Jason Gunthorpe
2019-05-23 20:59   ` Jerome Glisse
2019-05-24 13:35 ` Jason Gunthorpe
2019-05-24 14:36 ` Jason Gunthorpe
2019-05-24 16:49   ` Jerome Glisse
2019-05-24 16:59     ` Jason Gunthorpe
2019-05-24 17:01       ` Jerome Glisse
2019-05-24 17:52         ` Jason Gunthorpe
2019-05-24 18:03           ` Jerome Glisse
2019-05-24 18:32             ` Jason Gunthorpe
2019-05-24 18:46               ` Jerome Glisse
2019-05-24 22:09                 ` Jason Gunthorpe [this message]
2019-05-27 19:58                   ` Jason Gunthorpe
2019-05-24 17:47     ` Ralph Campbell
2019-05-24 17:51       ` Jerome Glisse
