Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
From: Jerome Glisse <jglisse@redhat.com>
To: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Jason Gunthorpe" <jgg@ziepe.ca>,
	"Michal Hocko" <mhocko@kernel.org>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	LKML <linux-kernel@vger.kernel.org>,
	"Linux MM" <linux-mm@kvack.org>,
	"DRI Development" <dri-devel@lists.freedesktop.org>,
	"Intel Graphics Development" <intel-gfx@lists.freedesktop.org>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Ingo Molnar" <mingo@redhat.com>,
	"David Rientjes" <rientjes@google.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Masahiro Yamada" <yamada.masahiro@socionext.com>,
	"Wei Wang" <wvw@google.com>,
	"Andy Shevchenko" <andriy.shevchenko@linux.intel.com>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Jann Horn" <jannh@google.com>, "Feng Tang" <feng.tang@intel.com>,
	"Kees Cook" <keescook@chromium.org>,
	"Randy Dunlap" <rdunlap@infradead.org>,
	"Daniel Vetter" <daniel.vetter@intel.com>
Subject: Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
Date: Thu, 15 Aug 2019 13:35:11 -0400
Message-ID: <20190815173511.GG30916@redhat.com> (raw)
In-Reply-To: <CAKMK7uGm_vg_6sqU2X=Owu79zLcC8d5U+Yr2O-oEsCnTjeWzCw@mail.gmail.com>

On Thu, Aug 15, 2019 at 07:21:47PM +0200, Daniel Vetter wrote:
> On Thu, Aug 15, 2019 at 7:16 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Thu, Aug 15, 2019 at 12:32:38PM -0400, Jerome Glisse wrote:
> > > On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
> > > >
> > > > > You have to wait for the gpu to finnish current processing in
> > > > > invalidate_range_start. Otherwise there's no point to any of this
> > > > > really. So the wait_event/dma_fence_wait are unavoidable really.
> > > >
> > > > I don't envy your task :|
> > > >
> > > > But, what you describe sure sounds like a 'registration cache' model,
> > > > not the 'shadow pte' model of coherency.
> > > >
> > > > The key difference is that a regirstationcache is allowed to become
> > > > incoherent with the VMA's because it holds page pins. It is a
> > > > programming bug in userspace to change VA mappings via mmap/munmap/etc
> > > > while the device is working on that VA, but it does not harm system
> > > > integrity because of the page pin.
> > > >
> > > > The cache ensures that each initiated operation sees a DMA setup that
> > > > matches the current VA map when the operation is initiated and allows
> > > > expensive device DMA setups to be re-used.
> > > >
> > > > A 'shadow pte' model (ie hmm) *really* needs device support to
> > > > directly block DMA access - ie trigger 'device page fault'. ie the
> > > > invalidate_start should inform the device to enter a fault mode and
> > > > that is it.  If the device can't do that, then the driver probably
> > > > shouldn't persue this level of coherency. The driver would quickly get
> > > > into the messy locking problems like dma_fence_wait from a notifier.
> > >
> > > I think here we do not agree on the hardware requirement. For GPU
> > > we will always need to be able to wait for some GPU fence from inside
> > > the notifier callback, there is just no way around that for many of
> > > the GPUs today (i do not see any indication of that changing).
> >
> > I didn't say you couldn't wait, I was trying to say that the wait
> > should only be contigent on the HW itself. Ie you can wait on a GPU
> > page table lock, and you can wait on a GPU page table flush completion
> > via IRQ.
> >
> > What is troubling is to wait till some other thread gets a GPU command
> > completion and decr's a kref on the DMA buffer - which kinda looks
> > like what this dma_fence() stuff is all about. A driver like that
> > would have to be super careful to ensure consistent forward progress
> > toward dma ref == 0 when the system is under reclaim.
> >
> > ie by running it's entire IRQ flow under fs_reclaim locking.
> 
> This is correct. At least for i915 it's already a required due to our
> shrinker also having to do the same. I think amdgpu isn't bothering
> with that since they have vram for most of the stuff, and just limit
> system memory usage to half of all and forgo the shrinker. Probably
> not the nicest approach. Anyway, both do the same mmu_notifier dance,
> just want to explain that we've been living with this for longer
> already.
> 
> So yeah writing a gpu driver is not easy.
> 
> > > associated with the mm_struct. In all GPU driver so far it is a short
> > > lived lock and nothing blocking is done while holding it (it is just
> > > about updating page table directory really wether it is filling it or
> > > clearing it).
> >
> > The main blocking I expect in a shadow PTE flow is waiting for the HW
> > to complete invalidations of its PTE cache.
> >
> > > > It is important to identify what model you are going for as defining a
> > > > 'registration cache' coherence expectation allows the driver to skip
> > > > blocking in invalidate_range_start. All it does is invalidate the
> > > > cache so that future operations pick up the new VA mapping.
> > > >
> > > > Intel's HFI RDMA driver uses this model extensively, and I think it is
> > > > well proven, within some limitations of course.
> > > >
> > > > At least, 'registration cache' is the only use model I know of where
> > > > it is acceptable to skip invalidate_range_end.
> > >
> > > Here GPU are not in the registration cache model, i know it might looks
> > > like it because of GUP but GUP was use just because hmm did not exist
> > > at the time.
> >
> > It is not because of GUP, it is because of the lack of
> > invalidate_range_end. A driver cannot correctly implement the SPTE
> > model without invalidate_range_end, even if it holds the page pins via
> > GUP.
> >
> > So, I've been assuming the few drivers without invalidate_range_end
> > are trying to do registration caching, rather than assuming they are
> > broken.
> 
> I915 might just be broken. amdgpu does the full thing, using
> hmm_mirror. But still with dma_fence_wait.

Yeah i915 is broken but it never hurted anyone ;) I posted patch
a long time ago to convert it to hmm but i delayed that to until
i can get through making something of GUPfast that can also be
use for HMM/ODP user.

Cheers,
Jérôme


  reply index

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-14 20:20 [PATCH 0/5] hmm & mmu_notifier debug/lockdep annotations Daniel Vetter
2019-08-14 20:20 ` [PATCH 1/5] mm: Check if mmu notifier callbacks are allowed to fail Daniel Vetter
2019-08-14 22:14   ` Andrew Morton
2019-08-14 23:22     ` Jason Gunthorpe
2019-08-14 23:34     ` Ralph Campbell
2019-08-16 17:19   ` Jason Gunthorpe
2019-08-14 20:20 ` [PATCH 2/5] kernel.h: Add non_block_start/end() Daniel Vetter
2019-08-14 20:45   ` Andrew Morton
2019-08-15  6:52     ` Daniel Vetter
2019-08-15  8:44     ` Michal Hocko
2019-08-15 13:04       ` Jason Gunthorpe
2019-08-15 13:12         ` Daniel Vetter
2019-08-15 14:37           ` Jason Gunthorpe
2019-08-15 14:43             ` Daniel Vetter
2019-08-15 15:10               ` Jason Gunthorpe
2019-08-15 16:25                 ` Daniel Vetter
2019-08-15 17:35                   ` Jason Gunthorpe
2019-08-15 17:39                     ` Jerome Glisse
2019-08-15 18:01                       ` Jason Gunthorpe
2019-08-15 18:27                         ` Jerome Glisse
2019-08-15 18:57                           ` Jason Gunthorpe
2019-08-15 16:32                 ` Jerome Glisse
2019-08-15 17:16                   ` Jason Gunthorpe
2019-08-15 17:21                     ` Daniel Vetter
2019-08-15 17:35                       ` Jerome Glisse [this message]
2019-08-15 13:24         ` Michal Hocko
2019-08-15 22:15       ` Andrew Morton
2019-08-16  8:24         ` Michal Hocko
2019-08-14 23:58   ` Jason Gunthorpe
2019-08-15  6:58     ` Daniel Vetter
2019-08-15 12:23       ` Jason Gunthorpe
2019-08-15 13:21         ` Michal Hocko
2019-08-15 14:12           ` Jason Gunthorpe
2019-08-15 16:00             ` Michal Hocko
2019-08-15 16:56               ` Jason Gunthorpe
2019-08-15 17:11                 ` Jerome Glisse
2019-08-15 17:17                   ` Jason Gunthorpe
2019-08-15 17:42                 ` Michal Hocko
2019-08-15 17:57                   ` Jerome Glisse
2019-08-15 18:24                   ` Jason Gunthorpe
2019-08-15 19:05                     ` Michal Hocko
2019-08-15 19:18                       ` Jason Gunthorpe
2019-08-15 19:35                         ` Michal Hocko
2019-08-15 20:13                           ` Jason Gunthorpe
2019-08-16  8:10                             ` Michal Hocko
2019-08-16 12:19                               ` Jason Gunthorpe
2019-08-16 12:26                                 ` Michal Hocko
2019-08-15 20:16                           ` [Intel-gfx] " Daniel Vetter
2019-08-15 20:27                             ` Jason Gunthorpe
2019-08-15 20:49                               ` Daniel Vetter
2019-08-16  1:00                                 ` Jason Gunthorpe
2019-08-16  6:20                                   ` Daniel Vetter
2019-08-16 12:12                                     ` Jason Gunthorpe
2019-08-16 14:11                                       ` Daniel Vetter
2019-08-16 14:38                                         ` Jason Gunthorpe
2019-08-16 16:36                                           ` Daniel Vetter
2019-08-16 16:54                                             ` Jason Gunthorpe
2019-08-16  8:27                             ` Michal Hocko
2019-08-14 20:20 ` [PATCH 3/5] mm, notifier: Catch sleeping/blocking for !blockable Daniel Vetter
2019-08-15  0:00   ` Jason Gunthorpe
2019-08-15  7:02     ` Daniel Vetter
     [not found]       ` <20190815123556.GB21596@ziepe.ca>
2019-08-17 16:09         ` Daniel Vetter
2019-08-14 20:20 ` [PATCH 4/5] mm, notifier: Add a lockdep map for invalidate_range_start Daniel Vetter
2019-08-15  0:09   ` Jason Gunthorpe
2019-08-15  7:10     ` Daniel Vetter
2019-08-15 12:53       ` Jason Gunthorpe
2019-08-14 20:20 ` [PATCH 5/5] mm/hmm: WARN on illegal ->sync_cpu_device_pagetables errors Daniel Vetter
2019-08-15  0:11   ` Jason Gunthorpe
2019-08-15  7:14     ` Daniel Vetter

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190815173511.GG30916@redhat.com \
    --to=jglisse@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=andriy.shevchenko@linux.intel.com \
    --cc=christian.koenig@amd.com \
    --cc=daniel.vetter@ffwll.ch \
    --cc=daniel.vetter@intel.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=feng.tang@intel.com \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=jannh@google.com \
    --cc=jgg@ziepe.ca \
    --cc=keescook@chromium.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rdunlap@infradead.org \
    --cc=rientjes@google.com \
    --cc=tglx@linutronix.de \
    --cc=wvw@google.com \
    --cc=yamada.masahiro@socionext.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org linux-mm@archiver.kernel.org
	public-inbox-index linux-mm


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/ public-inbox