Hi all (but I guess mostly Jason), Finally gotten around to rebasing the previous version, fixing the rebase fail in there, update the commit message a bit and give this a spin with some tests. Nicely caught a lockdep splat that we're now discussing in i915, and seems to not have misfired anywhere else (including a few oom). Review, comments and everything very much appreciated. Plus I'd really like to land this, there's more hmm_mirror users in-flight, and I've seen some that get things wrong which this patchset should catch. Thanks, Daniel Daniel Vetter (5): mm: Check if mmu notifier callbacks are allowed to fail kernel.h: Add non_block_start/end() mm, notifier: Catch sleeping/blocking for !blockable mm, notifier: Add a lockdep map for invalidate_range_start mm/hmm: WARN on illegal ->sync_cpu_device_pagetables errors include/linux/kernel.h | 10 +++++++++- include/linux/mmu_notifier.h | 6 ++++++ include/linux/sched.h | 4 ++++ kernel/sched/core.c | 19 ++++++++++++++----- mm/hmm.c | 3 +++ mm/mmu_notifier.c | 17 ++++++++++++++++- 6 files changed, 52 insertions(+), 7 deletions(-) -- 2.22.0
Just a bit of paranoia, since if we start pushing this deep into callchains it's hard to spot all places where an mmu notifier implementation might fail when it's not allowed to. Inspired by some confusion we had discussing i915 mmu notifiers and whether we could use the newly-introduced return value to handle some corner cases. Until we realized that these are only for when a task has been killed by the oom reaper. An alternative approach would be to split the callback into two versions, one with the int return value, and the other with void return value like in older kernels. But that's a lot more churn for fairly little gain I think. Summary from the m-l discussion on why we want something at warning level: This allows automated tooling in CI to catch bugs without humans having to look at everything. If we just upgrade the existing pr_info to a pr_warn, then we'll have false positives. And as-is, no one will ever spot the problem since it's lost in the massive amounts of overall dmesg noise. v2: Drop the full WARN_ON backtrace in favour of just a pr_warn for the problematic case (Michal Hocko). v3: Rebase on top of Glisse's arg rework. v4: More rebase on top of Glisse reworking everything. v5: Fixup rebase damage and also catch failures != EAGAIN for !blockable (Jason). Also go back to WARN_ON as requested by Jason, so automatic checkers can easily catch bugs by setting panic_on_warn. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: David Rientjes <rientjes@google.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: linux-mm@kvack.org Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> --- mm/mmu_notifier.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index b5670620aea0..16f1cbc775d0 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -179,6 +179,8 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range) pr_info("%pS callback failed with %d in %sblockable context.\n", mn->ops->invalidate_range_start, _ret, !mmu_notifier_range_blockable(range) ? "non-" : ""); + WARN_ON(mmu_notifier_range_blockable(range) || + ret != -EAGAIN); ret = _ret; } } -- 2.22.0
In some special cases we must not block, but there's not a spinlock, preempt-off, irqs-off or similar critical section already that arms the might_sleep() debug checks. Add a non_block_start/end() pair to annotate these. This will be used in the oom paths of mmu-notifiers, where blocking is not allowed to make sure there's forward progress. Quoting Michal: "The notifier is called from quite a restricted context - oom_reaper - which shouldn't depend on any locks or sleepable conditionals. The code should be swift as well but we mostly do care about it to make a forward progress. Checking for sleepable context is the best thing we could come up with that would describe these demands at least partially." Peter also asked whether we want to catch spinlocks on top, but Michal said those are less of a problem because spinlocks can't have an indirect dependency upon the page allocator and hence close the loop with the oom reaper. Suggested by Michal Hocko. v2: - Improve commit message (Michal) - Also check in schedule, not just might_sleep (Peter) v3: It works better when I actually squash in the fixup I had lying around :-/ Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: linux-mm@kvack.org Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Wei Wang <wvw@google.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jann Horn <jannh@google.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: linux-kernel@vger.kernel.org Acked-by: Christian König <christian.koenig@amd.com> (v1) Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> --- include/linux/kernel.h | 10 +++++++++- include/linux/sched.h | 4 ++++ kernel/sched/core.c | 19 ++++++++++++++----- 3 files changed, 27 insertions(+), 6 deletions(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 4fa360a13c1e..915fd9888afb 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -217,7 +217,9 @@ extern void __cant_sleep(const char *file, int line, int preempt_offset); * might_sleep - annotation for functions that can sleep * * this macro will print a stack trace if it is executed in an atomic - * context (spinlock, irq-handler, ...). + * context (spinlock, irq-handler, ...). Additional sections where blocking is + * not allowed can be annotated with non_block_start() and non_block_end() + * pairs. * * This is a useful debugging help to be able to catch problems early and not * be bitten later when the calling function happens to sleep when it is not @@ -233,6 +235,10 @@ extern void __cant_sleep(const char *file, int line, int preempt_offset); # define cant_sleep() \ do { __cant_sleep(__FILE__, __LINE__, 0); } while (0) # define sched_annotate_sleep() (current->task_state_change = 0) +# define non_block_start() \ + do { current->non_block_count++; } while (0) +# define non_block_end() \ + do { WARN_ON(current->non_block_count-- == 0); } while (0) #else static inline void ___might_sleep(const char *file, int line, int preempt_offset) { } @@ -241,6 +247,8 @@ extern void __cant_sleep(const char *file, int line, int preempt_offset); # define might_sleep() do { might_resched(); } while (0) # define cant_sleep() do { } while (0) # define sched_annotate_sleep() do { } while (0) +# define non_block_start() do { } while (0) +# define non_block_end() do { } while (0) #endif #define might_sleep_if(cond) do { if (cond) might_sleep(); } while (0) diff --git a/include/linux/sched.h b/include/linux/sched.h index 9f51932bd543..c5630f3dca1f 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -974,6 +974,10 @@ struct task_struct { struct mutex_waiter *blocked_on; #endif +#ifdef CONFIG_DEBUG_ATOMIC_SLEEP + int non_block_count; +#endif + #ifdef CONFIG_TRACE_IRQFLAGS unsigned int irq_events; unsigned long hardirq_enable_ip; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2b037f195473..57245770d6cc 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3700,13 +3700,22 @@ static noinline void __schedule_bug(struct task_struct *prev) /* * Various schedule()-time debugging checks and statistics: */ -static inline void schedule_debug(struct task_struct *prev) +static inline void schedule_debug(struct task_struct *prev, bool preempt) { #ifdef CONFIG_SCHED_STACK_END_CHECK if (task_stack_end_corrupted(prev)) panic("corrupted stack end detected inside scheduler\n"); #endif +#ifdef CONFIG_DEBUG_ATOMIC_SLEEP + if (!preempt && prev->state && prev->non_block_count) { + printk(KERN_ERR "BUG: scheduling in a non-blocking section: %s/%d/%i\n", + prev->comm, prev->pid, prev->non_block_count); + dump_stack(); + add_taint(TAINT_WARN, LOCKDEP_STILL_OK); + } +#endif + if (unlikely(in_atomic_preempt_off())) { __schedule_bug(prev); preempt_count_set(PREEMPT_DISABLED); @@ -3813,7 +3822,7 @@ static void __sched notrace __schedule(bool preempt) rq = cpu_rq(cpu); prev = rq->curr; - schedule_debug(prev); + schedule_debug(prev, preempt); if (sched_feat(HRTICK)) hrtick_clear(rq); @@ -6570,7 +6579,7 @@ void ___might_sleep(const char *file, int line, int preempt_offset) rcu_sleep_check(); if ((preempt_count_equals(preempt_offset) && !irqs_disabled() && - !is_idle_task(current)) || + !is_idle_task(current) && !current->non_block_count) || system_state == SYSTEM_BOOTING || system_state > SYSTEM_RUNNING || oops_in_progress) return; @@ -6586,8 +6595,8 @@ void ___might_sleep(const char *file, int line, int preempt_offset) "BUG: sleeping function called from invalid context at %s:%d\n", file, line); printk(KERN_ERR - "in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n", - in_atomic(), irqs_disabled(), + "in_atomic(): %d, irqs_disabled(): %d, non_block: %d, pid: %d, name: %s\n", + in_atomic(), irqs_disabled(), current->non_block_count, current->pid, current->comm); if (task_stack_end_corrupted(current)) -- 2.22.0
We need to make sure implementations don't cheat and don't have a possible schedule/blocking point deeply burried where review can't catch it. I'm not sure whether this is the best way to make sure all the might_sleep() callsites trigger, and it's a bit ugly in the code flow. But it gets the job done. Inspired by an i915 patch series which did exactly that, because the rules haven't been entirely clear to us. v2: Use the shiny new non_block_start/end annotations instead of abusing preempt_disable/enable. v3: Rebase on top of Glisse's arg rework. v4: Rebase on top of more Glisse rework. Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: linux-mm@kvack.org Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> --- mm/mmu_notifier.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 16f1cbc775d0..43a76d030164 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -174,7 +174,13 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range) id = srcu_read_lock(&srcu); hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) { if (mn->ops->invalidate_range_start) { - int _ret = mn->ops->invalidate_range_start(mn, range); + int _ret; + + if (!mmu_notifier_range_blockable(range)) + non_block_start(); + _ret = mn->ops->invalidate_range_start(mn, range); + if (!mmu_notifier_range_blockable(range)) + non_block_end(); if (_ret) { pr_info("%pS callback failed with %d in %sblockable context.\n", mn->ops->invalidate_range_start, _ret, -- 2.22.0
This is a similar idea to the fs_reclaim fake lockdep lock. It's fairly easy to provoke a specific notifier to be run on a specific range: Just prep it, and then munmap() it. A bit harder, but still doable, is to provoke the mmu notifiers for all the various callchains that might lead to them. But both at the same time is really hard to reliable hit, especially when you want to exercise paths like direct reclaim or compaction, where it's not easy to control what exactly will be unmapped. By introducing a lockdep map to tie them all together we allow lockdep to see a lot more dependencies, without having to actually hit them in a single challchain while testing. Aside: Since I typed this to test i915 mmu notifiers I've only rolled this out for the invaliate_range_start callback. If there's interest, we should probably roll this out to all of them. But my undestanding of core mm is seriously lacking, and I'm not clear on whether we need a lockdep map for each callback, or whether some can be shared. v2: Use lock_map_acquire/release() like fs_reclaim, to avoid confusion with this being a real mutex (Chris Wilson). v3: Rebase on top of Glisse's arg rework. Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: linux-mm@kvack.org Reviewed-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> --- include/linux/mmu_notifier.h | 6 ++++++ mm/mmu_notifier.c | 7 +++++++ 2 files changed, 13 insertions(+) diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index b6c004bd9f6a..9dd38c32fc53 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -42,6 +42,10 @@ enum mmu_notifier_event { #ifdef CONFIG_MMU_NOTIFIER +#ifdef CONFIG_LOCKDEP +extern struct lockdep_map __mmu_notifier_invalidate_range_start_map; +#endif + /* * The mmu notifier_mm structure is allocated and installed in * mm->mmu_notifier_mm inside the mm_take_all_locks() protected @@ -310,10 +314,12 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm, static inline void mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range) { + lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); if (mm_has_notifiers(range->mm)) { range->flags |= MMU_NOTIFIER_RANGE_BLOCKABLE; __mmu_notifier_invalidate_range_start(range); } + lock_map_release(&__mmu_notifier_invalidate_range_start_map); } static inline int diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 43a76d030164..331e43ce6f3c 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -21,6 +21,13 @@ /* global SRCU for all MMs */ DEFINE_STATIC_SRCU(srcu); +#ifdef CONFIG_LOCKDEP +struct lockdep_map __mmu_notifier_invalidate_range_start_map = { + .name = "mmu_notifier_invalidate_range_start" +}; +EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start_map); +#endif + /* * This function allows mmu_notifier::release callback to delay a call to * a function that will free appropriate resources. The function must be -- 2.22.0
Similar to the warning in the mmu notifer, warning if an hmm mirror callback gets it's blocking vs. nonblocking handling wrong, or if it fails with anything else than -EAGAIN. Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: linux-mm@kvack.org Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> --- mm/hmm.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/hmm.c b/mm/hmm.c index 16b6731a34db..52ac59384268 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -205,6 +205,9 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn, ret = -EAGAIN; break; } + WARN(ret, "%pS callback failed with %d in %sblockable context\n", + mirror->ops->sync_cpu_device_pagetables, ret, + update.blockable ? "" : "non-"); } up_read(&hmm->mirrors_sem); -- 2.22.0
On Wed, 14 Aug 2019 22:20:24 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> In some special cases we must not block, but there's not a
> spinlock, preempt-off, irqs-off or similar critical section already
> that arms the might_sleep() debug checks. Add a non_block_start/end()
> pair to annotate these.
>
> This will be used in the oom paths of mmu-notifiers, where blocking is
> not allowed to make sure there's forward progress. Quoting Michal:
>
> "The notifier is called from quite a restricted context - oom_reaper -
> which shouldn't depend on any locks or sleepable conditionals. The code
> should be swift as well but we mostly do care about it to make a forward
> progress. Checking for sleepable context is the best thing we could come
> up with that would describe these demands at least partially."
>
> Peter also asked whether we want to catch spinlocks on top, but Michal
> said those are less of a problem because spinlocks can't have an
> indirect dependency upon the page allocator and hence close the loop
> with the oom reaper.
I continue to struggle with this. It introduces a new kernel state
"running preemptibly but must not call schedule()". How does this make
any sense?
Perhaps a much, much more detailed description of the oom_reaper
situation would help out.
On Wed, 14 Aug 2019 22:20:23 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> Just a bit of paranoia, since if we start pushing this deep into
> callchains it's hard to spot all places where an mmu notifier
> implementation might fail when it's not allowed to.
>
> Inspired by some confusion we had discussing i915 mmu notifiers and
> whether we could use the newly-introduced return value to handle some
> corner cases. Until we realized that these are only for when a task
> has been killed by the oom reaper.
>
> An alternative approach would be to split the callback into two
> versions, one with the int return value, and the other with void
> return value like in older kernels. But that's a lot more churn for
> fairly little gain I think.
>
> Summary from the m-l discussion on why we want something at warning
> level: This allows automated tooling in CI to catch bugs without
> humans having to look at everything. If we just upgrade the existing
> pr_info to a pr_warn, then we'll have false positives. And as-is, no
> one will ever spot the problem since it's lost in the massive amounts
> of overall dmesg noise.
>
> ...
>
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -179,6 +179,8 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
> pr_info("%pS callback failed with %d in %sblockable context.\n",
> mn->ops->invalidate_range_start, _ret,
> !mmu_notifier_range_blockable(range) ? "non-" : "");
> + WARN_ON(mmu_notifier_range_blockable(range) ||
> + ret != -EAGAIN);
> ret = _ret;
> }
> }
A problem with WARN_ON(a || b) is that if it triggers, we don't know
whether it was because of a or because of b. Or both. So I'd suggest
WARN_ON(a);
WARN_ON(b);
On Wed, Aug 14, 2019 at 03:14:47PM -0700, Andrew Morton wrote:
> On Wed, 14 Aug 2019 22:20:23 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> > Just a bit of paranoia, since if we start pushing this deep into
> > callchains it's hard to spot all places where an mmu notifier
> > implementation might fail when it's not allowed to.
> >
> > Inspired by some confusion we had discussing i915 mmu notifiers and
> > whether we could use the newly-introduced return value to handle some
> > corner cases. Until we realized that these are only for when a task
> > has been killed by the oom reaper.
> >
> > An alternative approach would be to split the callback into two
> > versions, one with the int return value, and the other with void
> > return value like in older kernels. But that's a lot more churn for
> > fairly little gain I think.
> >
> > Summary from the m-l discussion on why we want something at warning
> > level: This allows automated tooling in CI to catch bugs without
> > humans having to look at everything. If we just upgrade the existing
> > pr_info to a pr_warn, then we'll have false positives. And as-is, no
> > one will ever spot the problem since it's lost in the massive amounts
> > of overall dmesg noise.
> >
> > ...
> >
> > +++ b/mm/mmu_notifier.c
> > @@ -179,6 +179,8 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
> > pr_info("%pS callback failed with %d in %sblockable context.\n",
> > mn->ops->invalidate_range_start, _ret,
> > !mmu_notifier_range_blockable(range) ? "non-" : "");
> > + WARN_ON(mmu_notifier_range_blockable(range) ||
> > + ret != -EAGAIN);
> > ret = _ret;
> > }
> > }
>
> A problem with WARN_ON(a || b) is that if it triggers, we don't know
> whether it was because of a or because of b. Or both. So I'd suggest
>
> WARN_ON(a);
> WARN_ON(b);
>
Well, we did just make a pr_info right above with the value of
blockable, that seems enough to tell the cases apart?
But you are generally right, the full logic:
if (_ret) {
if (WARN_ON(mmu_notifier_range_blockable(range)))
continue;
WARN_ON(_ret != -EAGAIN);
ret = -EAGAIN;
break;
}
would force correct API contract up the call chain once we detect a
broken driver..
But at some point it does feel like a bit much debugging logic to have
in a production code path, as this should never happen and is just to
discourage wrong driver behaviors during driver development.
If we like this version then:
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Also - I have a bunch of other patches to mmu notifiers for hmm.git,
so when everyone agrees I can grab this to avoid conflicts.
Thanks,
Jason
On 8/14/19 3:14 PM, Andrew Morton wrote:
> On Wed, 14 Aug 2019 22:20:23 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
>> Just a bit of paranoia, since if we start pushing this deep into
>> callchains it's hard to spot all places where an mmu notifier
>> implementation might fail when it's not allowed to.
>>
>> Inspired by some confusion we had discussing i915 mmu notifiers and
>> whether we could use the newly-introduced return value to handle some
>> corner cases. Until we realized that these are only for when a task
>> has been killed by the oom reaper.
>>
>> An alternative approach would be to split the callback into two
>> versions, one with the int return value, and the other with void
>> return value like in older kernels. But that's a lot more churn for
>> fairly little gain I think.
>>
>> Summary from the m-l discussion on why we want something at warning
>> level: This allows automated tooling in CI to catch bugs without
>> humans having to look at everything. If we just upgrade the existing
>> pr_info to a pr_warn, then we'll have false positives. And as-is, no
>> one will ever spot the problem since it's lost in the massive amounts
>> of overall dmesg noise.
>>
>> ...
>>
>> --- a/mm/mmu_notifier.c
>> +++ b/mm/mmu_notifier.c
>> @@ -179,6 +179,8 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
>> pr_info("%pS callback failed with %d in %sblockable context.\n",
>> mn->ops->invalidate_range_start, _ret,
>> !mmu_notifier_range_blockable(range) ? "non-" : "");
>> + WARN_ON(mmu_notifier_range_blockable(range) ||
>> + ret != -EAGAIN);
>> ret = _ret;
>> }
>> }
>
> A problem with WARN_ON(a || b) is that if it triggers, we don't know
> whether it was because of a or because of b. Or both. So I'd suggest
>
> WARN_ON(a);
> WARN_ON(b);
>
This won't quite work. It is OK to have
mmu_notifier_range_blockable(range) be true or false.
sync_cpu_device_pagetables() shouldn't return
-EAGAIN unless blockable is true.
On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote: > In some special cases we must not block, but there's not a > spinlock, preempt-off, irqs-off or similar critical section already > that arms the might_sleep() debug checks. Add a non_block_start/end() > pair to annotate these. > > This will be used in the oom paths of mmu-notifiers, where blocking is > not allowed to make sure there's forward progress. Quoting Michal: > > "The notifier is called from quite a restricted context - oom_reaper - > which shouldn't depend on any locks or sleepable conditionals. The code > should be swift as well but we mostly do care about it to make a forward > progress. Checking for sleepable context is the best thing we could come > up with that would describe these demands at least partially." But this describes fs_reclaim_acquire() - is there some reason we are conflating fs_reclaim with non-sleeping? ie is there some fundamental difference between the block stack sleeping during reclaim while it waits for a driver to write out a page and a GPU driver sleeping during OOM while it waits for it's HW to fence DMA on a page? Fundamentally we have invalidate_range_start() vs invalidate_range() as the start() version is able to sleep. If drivers can do their work without sleeping then they should be using invalidare_range() instead. Thus, it doesn't seem to make any sense to ask a driver that requires a sleeping API not to sleep. AFAICT what is really going on here is that drivers care about only a subset of the VA space, and we want to query the driver if it cares about the range proposed to be OOM'd, so we can OOM ranges that are do not have SPTEs. ie if you look pretty much all drivers do exactly as userptr_mn_invalidate_range_start() does, and bail once they detect the VA range is of interest. So, I'm working on a patch to lift the interval tree into the notifier core and then do the VA test OOM needs without bothering the driver. Drivers can retain the blocking API they require and OOM can work on VA's that don't have SPTEs. This approach also solves the critical bug in this path: https://lore.kernel.org/linux-mm/20190807191627.GA3008@ziepe.ca/ And solves a bunch of other bugs in the drivers. > Peter also asked whether we want to catch spinlocks on top, but Michal > said those are less of a problem because spinlocks can't have an > indirect dependency upon the page allocator and hence close the loop > with the oom reaper. Again, this entirely sounds like fs_reclaim - isn't that exactly what it is for? I have had on my list a second and very related possible bug. I ran into commit 35cfa2b0b491 ("mm/mmu_notifier: allocate mmu_notifier in advance") which says that mapping->i_mmap_mutex is under fs_reclaim(). We do hold i_mmap_rwsem while calling invalidate_range_start(): unmap_mapping_pages i_mmap_lock_write(mapping); // ie i_mmap_rwsem unmap_mapping_range_tree unmap_mapping_range_vma zap_page_range_single mmu_notifier_invalidate_range_start So, if it is still true that i_mmap_rwsem is under fs_reclaim then invalidate_range_start is *always* under fs_reclaim anyhow! (this I do not know) Thus we should use lockdep to force this and fix all the drivers. .. and if we force fs_reclaim always, do we care about blockable anymore?? Jason
On Wed, Aug 14, 2019 at 10:20:25PM +0200, Daniel Vetter wrote:
> We need to make sure implementations don't cheat and don't have a
> possible schedule/blocking point deeply burried where review can't
> catch it.
>
> I'm not sure whether this is the best way to make sure all the
> might_sleep() callsites trigger, and it's a bit ugly in the code flow.
> But it gets the job done.
>
> Inspired by an i915 patch series which did exactly that, because the
> rules haven't been entirely clear to us.
I thought lockdep already was able to detect:
spin_lock()
might_sleep();
spin_unlock()
Am I mistaken? If yes, couldn't this patch just inject a dummy lockdep
spinlock?
Jason
On Wed, Aug 14, 2019 at 10:20:26PM +0200, Daniel Vetter wrote: > This is a similar idea to the fs_reclaim fake lockdep lock. It's > fairly easy to provoke a specific notifier to be run on a specific > range: Just prep it, and then munmap() it. > > A bit harder, but still doable, is to provoke the mmu notifiers for > all the various callchains that might lead to them. But both at the > same time is really hard to reliable hit, especially when you want to > exercise paths like direct reclaim or compaction, where it's not > easy to control what exactly will be unmapped. > > By introducing a lockdep map to tie them all together we allow lockdep > to see a lot more dependencies, without having to actually hit them > in a single challchain while testing. > > Aside: Since I typed this to test i915 mmu notifiers I've only rolled > this out for the invaliate_range_start callback. If there's > interest, we should probably roll this out to all of them. But my > undestanding of core mm is seriously lacking, and I'm not clear on > whether we need a lockdep map for each callback, or whether some can > be shared. I was thinking about doing something like this.. IMHO only range_end needs annotation, the other ops are either already non-sleeping or only used by KVM. BTW, I have found it strange that i915 only uses invalidate_range_start. Not really sure how it is able to do that. Would love to know the answer :) > Reviewed-by: Jérôme Glisse <jglisse@redhat.com> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> > include/linux/mmu_notifier.h | 6 ++++++ > mm/mmu_notifier.c | 7 +++++++ > 2 files changed, 13 insertions(+) > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > index b6c004bd9f6a..9dd38c32fc53 100644 > +++ b/include/linux/mmu_notifier.h > @@ -42,6 +42,10 @@ enum mmu_notifier_event { > > #ifdef CONFIG_MMU_NOTIFIER > > +#ifdef CONFIG_LOCKDEP > +extern struct lockdep_map __mmu_notifier_invalidate_range_start_map; > +#endif I wonder what the trade off is having a global map vs a map in each mmu_notifier_mm ? > /* > * The mmu notifier_mm structure is allocated and installed in > * mm->mmu_notifier_mm inside the mm_take_all_locks() protected > @@ -310,10 +314,12 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm, > static inline void > mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range) > { > + lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); > if (mm_has_notifiers(range->mm)) { > range->flags |= MMU_NOTIFIER_RANGE_BLOCKABLE; > __mmu_notifier_invalidate_range_start(range); > } > + lock_map_release(&__mmu_notifier_invalidate_range_start_map); > } Also range_end should have this too - it has all the same constraints. I think it can share the map. So 'range_start_map' is probably not the right name. It may also make some sense to do a dummy acquire/release under the mm_take_all_locks() to forcibly increase map coverage and reduce the scenario complexity required to hit bugs. And if we do decide on the reclaim thing in my other email then the reclaim dependency can be reliably injected by doing: fs_reclaim_acquire(); lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); lock_map_release(&__mmu_notifier_invalidate_range_start_map); fs_reclaim_release(); If I understand lockdep properly.. Jason
On Wed, Aug 14, 2019 at 10:20:27PM +0200, Daniel Vetter wrote:
> Similar to the warning in the mmu notifer, warning if an hmm mirror
> callback gets it's blocking vs. nonblocking handling wrong, or if it
> fails with anything else than -EAGAIN.
>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dan Carpenter <dan.carpenter@oracle.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Balbir Singh <bsingharora@gmail.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: Souptick Joarder <jrdr.linux@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: linux-mm@kvack.org
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> mm/hmm.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 16b6731a34db..52ac59384268 100644
> +++ b/mm/hmm.c
> @@ -205,6 +205,9 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
> ret = -EAGAIN;
> break;
> }
> + WARN(ret, "%pS callback failed with %d in %sblockable context\n",
> + mirror->ops->sync_cpu_device_pagetables, ret,
> + update.blockable ? "" : "non-");
> }
> up_read(&hmm->mirrors_sem);
Didn't I beat you to this?
list_for_each_entry(mirror, &hmm->mirrors, list) {
int rc;
rc = mirror->ops->sync_cpu_device_pagetables(mirror, &update);
if (rc) {
if (WARN_ON(update.blockable || rc != -EAGAIN))
continue;
ret = -EAGAIN;
break;
}
}
Thanks,
Jason
On Wed, Aug 14, 2019 at 01:45:58PM -0700, Andrew Morton wrote: > On Wed, 14 Aug 2019 22:20:24 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote: > > > In some special cases we must not block, but there's not a > > spinlock, preempt-off, irqs-off or similar critical section already > > that arms the might_sleep() debug checks. Add a non_block_start/end() > > pair to annotate these. > > > > This will be used in the oom paths of mmu-notifiers, where blocking is > > not allowed to make sure there's forward progress. Quoting Michal: > > > > "The notifier is called from quite a restricted context - oom_reaper - > > which shouldn't depend on any locks or sleepable conditionals. The code > > should be swift as well but we mostly do care about it to make a forward > > progress. Checking for sleepable context is the best thing we could come > > up with that would describe these demands at least partially." > > > > Peter also asked whether we want to catch spinlocks on top, but Michal > > said those are less of a problem because spinlocks can't have an > > indirect dependency upon the page allocator and hence close the loop > > with the oom reaper. > > I continue to struggle with this. It introduces a new kernel state > "running preemptibly but must not call schedule()". How does this make > any sense? > > Perhaps a much, much more detailed description of the oom_reaper > situation would help out. I agree on both points, but I guess I'm not the expert to explain why we have this. All I'm trying to do is that drivers hold up their side. If you want to have better documentation for why the oom case needs this special new mode, you're looking at the wrong guy for that :-) Of course if you folks all decide that you just don't want to be remembered about that I guess we can drop this one here, but you're just shooting the messenger I think. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote: > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote: > > In some special cases we must not block, but there's not a > > spinlock, preempt-off, irqs-off or similar critical section already > > that arms the might_sleep() debug checks. Add a non_block_start/end() > > pair to annotate these. > > > > This will be used in the oom paths of mmu-notifiers, where blocking is > > not allowed to make sure there's forward progress. Quoting Michal: > > > > "The notifier is called from quite a restricted context - oom_reaper - > > which shouldn't depend on any locks or sleepable conditionals. The code > > should be swift as well but we mostly do care about it to make a forward > > progress. Checking for sleepable context is the best thing we could come > > up with that would describe these demands at least partially." > > But this describes fs_reclaim_acquire() - is there some reason we are > conflating fs_reclaim with non-sleeping? No idea why you tie this into fs_reclaim. We can definitly sleep in there, and for e.g. kswapd (which also wraps everything in fs_reclaim) we're event supposed to I thought. To make sure we can get at the last bit of memory by flushing all the queues and waiting for everything to be cleaned out. > ie is there some fundamental difference between the block stack > sleeping during reclaim while it waits for a driver to write out a > page and a GPU driver sleeping during OOM while it waits for it's HW > to fence DMA on a page? > > Fundamentally we have invalidate_range_start() vs invalidate_range() > as the start() version is able to sleep. If drivers can do their work > without sleeping then they should be using invalidare_range() instead. > > Thus, it doesn't seem to make any sense to ask a driver that requires a > sleeping API not to sleep. > > AFAICT what is really going on here is that drivers care about only a > subset of the VA space, and we want to query the driver if it cares > about the range proposed to be OOM'd, so we can OOM ranges that are > do not have SPTEs. > > ie if you look pretty much all drivers do exactly as > userptr_mn_invalidate_range_start() does, and bail once they detect > the VA range is of interest. > > So, I'm working on a patch to lift the interval tree into the notifier > core and then do the VA test OOM needs without bothering the > driver. Drivers can retain the blocking API they require and OOM can > work on VA's that don't have SPTEs. Hm I figured the point of hmm_mirror is to have that interval tree for everyone (among other things). But yeah lifting to mmu_notifier sounds like a clean solution for this, but I really have not much clue about why we even have this for special mode in the oom case. I'm just trying to increase the odds that drivers hold up their end of the bargain. > This approach also solves the critical bug in this path: > https://lore.kernel.org/linux-mm/20190807191627.GA3008@ziepe.ca/ > > And solves a bunch of other bugs in the drivers. > > > Peter also asked whether we want to catch spinlocks on top, but Michal > > said those are less of a problem because spinlocks can't have an > > indirect dependency upon the page allocator and hence close the loop > > with the oom reaper. > > Again, this entirely sounds like fs_reclaim - isn't that exactly what > it is for? > > I have had on my list a second and very related possible bug. I ran > into commit 35cfa2b0b491 ("mm/mmu_notifier: allocate mmu_notifier in > advance") which says that mapping->i_mmap_mutex is under fs_reclaim(). > > We do hold i_mmap_rwsem while calling invalidate_range_start(): > > unmap_mapping_pages > i_mmap_lock_write(mapping); // ie i_mmap_rwsem > unmap_mapping_range_tree > unmap_mapping_range_vma > zap_page_range_single > mmu_notifier_invalidate_range_start > > So, if it is still true that i_mmap_rwsem is under fs_reclaim then > invalidate_range_start is *always* under fs_reclaim anyhow! (this I do > not know) > > Thus we should use lockdep to force this and fix all the drivers. > > .. and if we force fs_reclaim always, do we care about blockable > anymore?? Still not sure what fs_reclaim matters here. One of the later patches steals the fs_reclaim idea for mmu_notifiers, but that ties together completely different paths. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Wed, Aug 14, 2019 at 09:00:29PM -0300, Jason Gunthorpe wrote: > On Wed, Aug 14, 2019 at 10:20:25PM +0200, Daniel Vetter wrote: > > We need to make sure implementations don't cheat and don't have a > > possible schedule/blocking point deeply burried where review can't > > catch it. > > > > I'm not sure whether this is the best way to make sure all the > > might_sleep() callsites trigger, and it's a bit ugly in the code flow. > > But it gets the job done. > > > > Inspired by an i915 patch series which did exactly that, because the > > rules haven't been entirely clear to us. > > I thought lockdep already was able to detect: > > spin_lock() > might_sleep(); > spin_unlock() > > Am I mistaken? If yes, couldn't this patch just inject a dummy lockdep > spinlock? Hm ... assuming I didn't get lost in the maze I think might_sleep (well ___might_sleep) doesn't do any lockdep checking at all. And we want might_sleep, since that catches a lot more than lockdep. Maybe you mixed it up with the hard/softirq context stuff that lockdep tracks and complains about if you get it wrong? -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Wed, Aug 14, 2019 at 09:09:59PM -0300, Jason Gunthorpe wrote: > On Wed, Aug 14, 2019 at 10:20:26PM +0200, Daniel Vetter wrote: > > This is a similar idea to the fs_reclaim fake lockdep lock. It's > > fairly easy to provoke a specific notifier to be run on a specific > > range: Just prep it, and then munmap() it. > > > > A bit harder, but still doable, is to provoke the mmu notifiers for > > all the various callchains that might lead to them. But both at the > > same time is really hard to reliable hit, especially when you want to > > exercise paths like direct reclaim or compaction, where it's not > > easy to control what exactly will be unmapped. > > > > By introducing a lockdep map to tie them all together we allow lockdep > > to see a lot more dependencies, without having to actually hit them > > in a single challchain while testing. > > > > Aside: Since I typed this to test i915 mmu notifiers I've only rolled > > this out for the invaliate_range_start callback. If there's > > interest, we should probably roll this out to all of them. But my > > undestanding of core mm is seriously lacking, and I'm not clear on > > whether we need a lockdep map for each callback, or whether some can > > be shared. > > I was thinking about doing something like this.. > > IMHO only range_end needs annotation, the other ops are either already > non-sleeping or only used by KVM. This isnt' about sleeping, this is about locking loops. And the biggest risk for that is from driver code, and at least hmm_mirror only has the driver code callback on invalidate_range_start. Once thing I discovered using this (and it would be really hard to spot, it's deeply neested) is that i915 userptr. Even if i915 userptr would use hmm_mirror (to fix the issue you mention below), if we then switch the annotation to invalidate_range_end nothing interesting would ever come from this. Well, the only thing it'd catch is issues in hmm_mirror, but I think core mm review will catch that before it reaches us :-) > BTW, I have found it strange that i915 only uses > invalidate_range_start. Not really sure how it is able to do > that. Would love to know the answer :) I suspect it's broken :-/ Our userptr is ... not the best. Part of the motivation here. > > Reviewed-by: Jérôme Glisse <jglisse@redhat.com> > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> > > include/linux/mmu_notifier.h | 6 ++++++ > > mm/mmu_notifier.c | 7 +++++++ > > 2 files changed, 13 insertions(+) > > > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > > index b6c004bd9f6a..9dd38c32fc53 100644 > > +++ b/include/linux/mmu_notifier.h > > @@ -42,6 +42,10 @@ enum mmu_notifier_event { > > > > #ifdef CONFIG_MMU_NOTIFIER > > > > +#ifdef CONFIG_LOCKDEP > > +extern struct lockdep_map __mmu_notifier_invalidate_range_start_map; > > +#endif > > I wonder what the trade off is having a global map vs a map in each > mmu_notifier_mm ? Less reports, specifically no reports involving multiple different mmu notifiers to build the entire chain. But I'm assuming it's possible to combine them in one mm (kvm+gpu+infiniband in one process sounds like something someone could reasonably do), and it will help to make sure everyone follows the same rules. > > > /* > > * The mmu notifier_mm structure is allocated and installed in > > * mm->mmu_notifier_mm inside the mm_take_all_locks() protected > > @@ -310,10 +314,12 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm, > > static inline void > > mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range) > > { > > + lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); > > if (mm_has_notifiers(range->mm)) { > > range->flags |= MMU_NOTIFIER_RANGE_BLOCKABLE; > > __mmu_notifier_invalidate_range_start(range); > > } > > + lock_map_release(&__mmu_notifier_invalidate_range_start_map); > > } > > Also range_end should have this too - it has all the same > constraints. I think it can share the map. So 'range_start_map' is > probably not the right name. > > It may also make some sense to do a dummy acquire/release under the > mm_take_all_locks() to forcibly increase map coverage and reduce the > scenario complexity required to hit bugs. > > And if we do decide on the reclaim thing in my other email then the > reclaim dependency can be reliably injected by doing: > > fs_reclaim_acquire(); > lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); > lock_map_release(&__mmu_notifier_invalidate_range_start_map); > fs_reclaim_release(); > > If I understand lockdep properly.. Ime fs_reclaim injects the mmu_notifier map here reliably as soon as you've thrown out the first pagecache mmap on any process. That "make sure we inject it quickly" is why the lockdep is _outside_ of the mm_has_notifiers() check. So no further injection needed imo. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Wed, Aug 14, 2019 at 09:11:37PM -0300, Jason Gunthorpe wrote: > On Wed, Aug 14, 2019 at 10:20:27PM +0200, Daniel Vetter wrote: > > Similar to the warning in the mmu notifer, warning if an hmm mirror > > callback gets it's blocking vs. nonblocking handling wrong, or if it > > fails with anything else than -EAGAIN. > > > > Cc: Jason Gunthorpe <jgg@ziepe.ca> > > Cc: Ralph Campbell <rcampbell@nvidia.com> > > Cc: John Hubbard <jhubbard@nvidia.com> > > Cc: Dan Williams <dan.j.williams@intel.com> > > Cc: Dan Carpenter <dan.carpenter@oracle.com> > > Cc: Matthew Wilcox <willy@infradead.org> > > Cc: Arnd Bergmann <arnd@arndb.de> > > Cc: Balbir Singh <bsingharora@gmail.com> > > Cc: Ira Weiny <ira.weiny@intel.com> > > Cc: Souptick Joarder <jrdr.linux@gmail.com> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: "Jérôme Glisse" <jglisse@redhat.com> > > Cc: linux-mm@kvack.org > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> > > mm/hmm.c | 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/mm/hmm.c b/mm/hmm.c > > index 16b6731a34db..52ac59384268 100644 > > +++ b/mm/hmm.c > > @@ -205,6 +205,9 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn, > > ret = -EAGAIN; > > break; > > } > > + WARN(ret, "%pS callback failed with %d in %sblockable context\n", > > + mirror->ops->sync_cpu_device_pagetables, ret, > > + update.blockable ? "" : "non-"); > > } > > up_read(&hmm->mirrors_sem); > > Didn't I beat you to this? Very much possible, I think I didn't rebase this to linux-next before resending ... have an Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> in case you need. Cheers, Daniel > > list_for_each_entry(mirror, &hmm->mirrors, list) { > int rc; > > rc = mirror->ops->sync_cpu_device_pagetables(mirror, &update); > if (rc) { > if (WARN_ON(update.blockable || rc != -EAGAIN)) > continue; > ret = -EAGAIN; > break; > } > } > > Thanks, > Jason -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Wed 14-08-19 13:45:58, Andrew Morton wrote:
> On Wed, 14 Aug 2019 22:20:24 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> > In some special cases we must not block, but there's not a
> > spinlock, preempt-off, irqs-off or similar critical section already
> > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > pair to annotate these.
> >
> > This will be used in the oom paths of mmu-notifiers, where blocking is
> > not allowed to make sure there's forward progress. Quoting Michal:
> >
> > "The notifier is called from quite a restricted context - oom_reaper -
> > which shouldn't depend on any locks or sleepable conditionals. The code
> > should be swift as well but we mostly do care about it to make a forward
> > progress. Checking for sleepable context is the best thing we could come
> > up with that would describe these demands at least partially."
> >
> > Peter also asked whether we want to catch spinlocks on top, but Michal
> > said those are less of a problem because spinlocks can't have an
> > indirect dependency upon the page allocator and hence close the loop
> > with the oom reaper.
>
> I continue to struggle with this. It introduces a new kernel state
> "running preemptibly but must not call schedule()". How does this make
> any sense?
>
> Perhaps a much, much more detailed description of the oom_reaper
> situation would help out.
The primary point here is that there is a demand of non blockable mmu
notifiers to be called when the oom reaper tears down the address space.
As the oom reaper is the primary guarantee of the oom handling forward
progress it cannot be blocked on anything that might depend on blockable
memory allocations. These are not really easy to track because they
might be indirect - e.g. notifier blocks on a lock which other context
holds while allocating memory or waiting for a flusher that needs memory
to perform its work. If such a blocking state happens that we can end up
in a silent hang with an unusable machine.
Now we hope for reasonable implementations of mmu notifiers (strong
words I know ;) and this should be relatively simple and effective catch
all tool to detect something suspicious is going on.
Does that make the situation more clear?
--
Michal Hocko
SUSE Labs
On Thu, Aug 15, 2019 at 08:58:29AM +0200, Daniel Vetter wrote:
> On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote:
> > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote:
> > > In some special cases we must not block, but there's not a
> > > spinlock, preempt-off, irqs-off or similar critical section already
> > > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > > pair to annotate these.
> > >
> > > This will be used in the oom paths of mmu-notifiers, where blocking is
> > > not allowed to make sure there's forward progress. Quoting Michal:
> > >
> > > "The notifier is called from quite a restricted context - oom_reaper -
> > > which shouldn't depend on any locks or sleepable conditionals. The code
> > > should be swift as well but we mostly do care about it to make a forward
> > > progress. Checking for sleepable context is the best thing we could come
> > > up with that would describe these demands at least partially."
> >
> > But this describes fs_reclaim_acquire() - is there some reason we are
> > conflating fs_reclaim with non-sleeping?
>
> No idea why you tie this into fs_reclaim. We can definitly sleep in there,
> and for e.g. kswapd (which also wraps everything in fs_reclaim) we're
> event supposed to I thought. To make sure we can get at the last bit of
> memory by flushing all the queues and waiting for everything to be cleaned
> out.
AFAIK the point of fs_reclaim is to prevent "indirect dependency upon
the page allocator" ie a justification that was given this !blockable
stuff.
For instance:
fs_reclaim_acquire()
kmalloc(GFP_KERNEL) <- lock dep assertion
And further, Michal's concern about indirectness through locks is also
handled by lockdep:
CPU0 CPU1
mutex_lock()
kmalloc(GFP_KERNEL)
mutex_unlock()
fs_reclaim_acquire()
mutex_lock() <- lock dep assertion
In other words, to prevent recursion into the page allocator you use
fs_reclaim_acquire(), and lockdep verfies it in its usual robust way.
I asked Tejun about this once in regards to WQ_MEM_RECLAIM and he
explained that it means you can't call the allocator functions in a
way that would recurse into reclaim (ie instead use instead GFP_ATOMIC, or
tolerate allocation failure, or various other things).
So, the reason I bring it up is half the justifications you posted for
blockable had to do with not recursing into reclaim and deadlocking,
and didn't seem to have much to do with blocking.
I'm asking if *non-blocking* is really the requirement or if this is
just the usual 'do not deadlock on the allocator' thing reclaim paths
alread have?
Jason
On Thu, Aug 15, 2019 at 09:10:14AM +0200, Daniel Vetter wrote: > On Wed, Aug 14, 2019 at 09:09:59PM -0300, Jason Gunthorpe wrote: > > On Wed, Aug 14, 2019 at 10:20:26PM +0200, Daniel Vetter wrote: > > > This is a similar idea to the fs_reclaim fake lockdep lock. It's > > > fairly easy to provoke a specific notifier to be run on a specific > > > range: Just prep it, and then munmap() it. > > > > > > A bit harder, but still doable, is to provoke the mmu notifiers for > > > all the various callchains that might lead to them. But both at the > > > same time is really hard to reliable hit, especially when you want to > > > exercise paths like direct reclaim or compaction, where it's not > > > easy to control what exactly will be unmapped. > > > > > > By introducing a lockdep map to tie them all together we allow lockdep > > > to see a lot more dependencies, without having to actually hit them > > > in a single challchain while testing. > > > > > > Aside: Since I typed this to test i915 mmu notifiers I've only rolled > > > this out for the invaliate_range_start callback. If there's > > > interest, we should probably roll this out to all of them. But my > > > undestanding of core mm is seriously lacking, and I'm not clear on > > > whether we need a lockdep map for each callback, or whether some can > > > be shared. > > > > I was thinking about doing something like this.. > > > > IMHO only range_end needs annotation, the other ops are either already > > non-sleeping or only used by KVM. > > This isnt' about sleeping, this is about locking loops. And the biggest > risk for that is from driver code, and at least hmm_mirror only has the > driver code callback on invalidate_range_start. Once thing I discovered > using this (and it would be really hard to spot, it's deeply neested) is > that i915 userptr. Sorry, that came out wrong, what I ment is that only range_end and range_start really need annotation. The other places are only used by KVM and are called in non-sleeping contexts, so they already can't recurse back onto the popular sleeping locks like mmap_sem or what not, can't do allocations, etc. I don't see alot of return in investing in them. > > BTW, I have found it strange that i915 only uses > > invalidate_range_start. Not really sure how it is able to do > > that. Would love to know the answer :) > > I suspect it's broken :-/ Our userptr is ... not the best. Part of the > motivation here. I was wondering if it is what we call in RDMA a 'registration cache' ie you assume that userspace is well behaved while DMA is ongoing and just use the notifer to invalidate cached dma mappings. The hallmark of this pattern is that it holds the page pin the entire time DMA is active, which is why it isn't a bug, it is just best described as loosely coherent. But, in RDMA the best-practice is to do this in userspace with userfaultfd as it is very expensive to take a syscall on command submission to have the kernel figure it out. > > And if we do decide on the reclaim thing in my other email then the > > reclaim dependency can be reliably injected by doing: > > > > fs_reclaim_acquire(); > > lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); > > lock_map_release(&__mmu_notifier_invalidate_range_start_map); > > fs_reclaim_release(); > > > > If I understand lockdep properly.. > > Ime fs_reclaim injects the mmu_notifier map here reliably as soon as > you've thrown out the first pagecache mmap on any process. That "make sure > we inject it quickly" is why the lockdep is _outside_ of the > mm_has_notifiers() check. So no further injection needed imo. I suspect a lot of our automated testing, like syzkaller in restricted kvms, probably does not reliably trigger a fs_reclaim, so I would very much prefer to inject it 100% of the time directly if we are sure this is a reclaim context because of the i_mmap_rwsem I mentioned before. Jason
On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote:
> As the oom reaper is the primary guarantee of the oom handling forward
> progress it cannot be blocked on anything that might depend on blockable
> memory allocations. These are not really easy to track because they
> might be indirect - e.g. notifier blocks on a lock which other context
> holds while allocating memory or waiting for a flusher that needs memory
> to perform its work.
But lockdep *does* track all this and fs_reclaim_acquire() was created
to solve exactly this problem.
fs_reclaim is a lock and it flows through all the usual lockdep
schemes like any other lock. Any time the page allocator wants to do
something the would deadlock with reclaim it takes the lock.
Failure is expressed by a deadlock cycle in the lockdep map, and
lockdep can handle arbitary complexity through layers of locks, work
queues, threads, etc.
What is missing?
Jason
On Thu, Aug 15, 2019 at 3:04 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote: > > > As the oom reaper is the primary guarantee of the oom handling forward > > progress it cannot be blocked on anything that might depend on blockable > > memory allocations. These are not really easy to track because they > > might be indirect - e.g. notifier blocks on a lock which other context > > holds while allocating memory or waiting for a flusher that needs memory > > to perform its work. > > But lockdep *does* track all this and fs_reclaim_acquire() was created > to solve exactly this problem. > > fs_reclaim is a lock and it flows through all the usual lockdep > schemes like any other lock. Any time the page allocator wants to do > something the would deadlock with reclaim it takes the lock. > > Failure is expressed by a deadlock cycle in the lockdep map, and > lockdep can handle arbitary complexity through layers of locks, work > queues, threads, etc. > > What is missing? Lockdep doens't seen everything by far. E.g. a wait_event will be caught by the annotations here, but not by lockdep. Plus lockdep does not see through the wait_event, and doesn't realize if e.g. that event will never signal because the worker is part of the deadlock loop. cross-release was supposed to fix that, but seems like that will never land. And since we're talking about mmu notifiers here and gpus/dma engines. We have dma_fence_wait, which can wait for any hw/driver in the system that takes part in shared/zero-copy buffer processing. Which at least on the graphics side is everything. This pulls in enormous amounts of deadlock potential that lockdep simply is blind about and will never see. Arming might_sleep catches them all. Cheers, Daniel PS: Don't ask me about why we need these semantics for oom_reaper, like I said just trying to follow the rules. -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Thu 15-08-19 09:23:44, Jason Gunthorpe wrote: > On Thu, Aug 15, 2019 at 08:58:29AM +0200, Daniel Vetter wrote: > > On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote: > > > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote: > > > > In some special cases we must not block, but there's not a > > > > spinlock, preempt-off, irqs-off or similar critical section already > > > > that arms the might_sleep() debug checks. Add a non_block_start/end() > > > > pair to annotate these. > > > > > > > > This will be used in the oom paths of mmu-notifiers, where blocking is > > > > not allowed to make sure there's forward progress. Quoting Michal: > > > > > > > > "The notifier is called from quite a restricted context - oom_reaper - > > > > which shouldn't depend on any locks or sleepable conditionals. The code > > > > should be swift as well but we mostly do care about it to make a forward > > > > progress. Checking for sleepable context is the best thing we could come > > > > up with that would describe these demands at least partially." > > > > > > But this describes fs_reclaim_acquire() - is there some reason we are > > > conflating fs_reclaim with non-sleeping? > > > > No idea why you tie this into fs_reclaim. We can definitly sleep in there, > > and for e.g. kswapd (which also wraps everything in fs_reclaim) we're > > event supposed to I thought. To make sure we can get at the last bit of > > memory by flushing all the queues and waiting for everything to be cleaned > > out. > > AFAIK the point of fs_reclaim is to prevent "indirect dependency upon > the page allocator" ie a justification that was given this !blockable > stuff. > > For instance: > > fs_reclaim_acquire() > kmalloc(GFP_KERNEL) <- lock dep assertion > > And further, Michal's concern about indirectness through locks is also > handled by lockdep: > > CPU0 CPU1 > mutex_lock() > kmalloc(GFP_KERNEL) > mutex_unlock() > fs_reclaim_acquire() > mutex_lock() <- lock dep assertion > > In other words, to prevent recursion into the page allocator you use > fs_reclaim_acquire(), and lockdep verfies it in its usual robust way. fs_reclaim_acquire is about FS/IO recursions IIUC. We are talking about any !GFP_NOWAIT allocation context here and any {in}direct dependency on it. Whether fs_reclaim_acquire can be reused for that I do not know because I am not familiar with the lockdep machinery enough > I asked Tejun about this once in regards to WQ_MEM_RECLAIM and he > explained that it means you can't call the allocator functions in a > way that would recurse into reclaim (ie instead use instead GFP_ATOMIC, or > tolerate allocation failure, or various other things). > > So, the reason I bring it up is half the justifications you posted for > blockable had to do with not recursing into reclaim and deadlocking, > and didn't seem to have much to do with blocking. > > I'm asking if *non-blocking* is really the requirement or if this is > just the usual 'do not deadlock on the allocator' thing reclaim paths > alread have? No, non-blocking is a very coarse approximation of what we really need. But it should give us even a stronger condition. Essentially any sleep other than a preemption shouldn't be allowed in that context. -- Michal Hocko SUSE Labs
On Thu 15-08-19 10:04:15, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote:
>
> > As the oom reaper is the primary guarantee of the oom handling forward
> > progress it cannot be blocked on anything that might depend on blockable
> > memory allocations. These are not really easy to track because they
> > might be indirect - e.g. notifier blocks on a lock which other context
> > holds while allocating memory or waiting for a flusher that needs memory
> > to perform its work.
>
> But lockdep *does* track all this and fs_reclaim_acquire() was created
> to solve exactly this problem.
>
> fs_reclaim is a lock and it flows through all the usual lockdep
> schemes like any other lock. Any time the page allocator wants to do
> something the would deadlock with reclaim it takes the lock.
Our emails have crossed. Even if fs_reclaim can be re-purposed for other
than FS/IO reclaim contexts which I am not sure about I think that
lockdep is too heavy weight for the purpose of this debugging aid in the
first place. The non block mode is really something as simple as "hey
mmu notifier you are only allowed to do a lightweight processing, if you
cannot guarantee that then just back off". The annotation will give us a
warning if the notifier gets out of this promise.
--
Michal Hocko
SUSE Labs
On Thu, Aug 15, 2019 at 03:21:27PM +0200, Michal Hocko wrote: > On Thu 15-08-19 09:23:44, Jason Gunthorpe wrote: > > On Thu, Aug 15, 2019 at 08:58:29AM +0200, Daniel Vetter wrote: > > > On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote: > > > > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote: > > > > > In some special cases we must not block, but there's not a > > > > > spinlock, preempt-off, irqs-off or similar critical section already > > > > > that arms the might_sleep() debug checks. Add a non_block_start/end() > > > > > pair to annotate these. > > > > > > > > > > This will be used in the oom paths of mmu-notifiers, where blocking is > > > > > not allowed to make sure there's forward progress. Quoting Michal: > > > > > > > > > > "The notifier is called from quite a restricted context - oom_reaper - > > > > > which shouldn't depend on any locks or sleepable conditionals. The code > > > > > should be swift as well but we mostly do care about it to make a forward > > > > > progress. Checking for sleepable context is the best thing we could come > > > > > up with that would describe these demands at least partially." > > > > > > > > But this describes fs_reclaim_acquire() - is there some reason we are > > > > conflating fs_reclaim with non-sleeping? > > > > > > No idea why you tie this into fs_reclaim. We can definitly sleep in there, > > > and for e.g. kswapd (which also wraps everything in fs_reclaim) we're > > > event supposed to I thought. To make sure we can get at the last bit of > > > memory by flushing all the queues and waiting for everything to be cleaned > > > out. > > > > AFAIK the point of fs_reclaim is to prevent "indirect dependency upon > > the page allocator" ie a justification that was given this !blockable > > stuff. > > > > For instance: > > > > fs_reclaim_acquire() > > kmalloc(GFP_KERNEL) <- lock dep assertion > > > > And further, Michal's concern about indirectness through locks is also > > handled by lockdep: > > > > CPU0 CPU1 > > mutex_lock() > > kmalloc(GFP_KERNEL) > > mutex_unlock() > > fs_reclaim_acquire() > > mutex_lock() <- lock dep assertion > > > > In other words, to prevent recursion into the page allocator you use > > fs_reclaim_acquire(), and lockdep verfies it in its usual robust way. > > fs_reclaim_acquire is about FS/IO recursions IIUC. We are talking about > any !GFP_NOWAIT allocation context here and any {in}direct dependency on > it. AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and __GFP_DIRECT_RECLAIM.. This matches the existing test in __need_fs_reclaim() - so if you are OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(), allocations during OOM, then I think fs_reclaim already matches what you described? > Whether fs_reclaim_acquire can be reused for that I do not know > because I am not familiar with the lockdep machinery enough Well, if fs_reclaim is not already testing the flags you want, then we could add another lockdep map that does. The basic principle is the same, if you want to detect and prevent recursion into the allocator under certain GFP flags then then AFAIK lockdep is the best tool we have. > No, non-blocking is a very coarse approximation of what we really need. > But it should give us even a stronger condition. Essentially any sleep > other than a preemption shouldn't be allowed in that context. But it is a nonsense API to give the driver invalidate_range_start, the blocking alternative to the non-blocking invalidate_range and then demand it to be non-blocking. Inspecting the code, no drivers are actually able to progress their side in non-blocking mode. The best we got was drivers tested the VA range and returned success if they had no interest. Which is a big win to be sure, but it looks like getting any more is not really posssible. However, we could (probably even should) make the drivers fs_reclaim safe. If that is enough to guarantee progress of OOM, then lets consider something like using current_gfp_context() to force PF_MEMALLOC_NOFS allocation behavior on the driver callback and lockdep to try and keep pushing on the the debugging, and dropping !blocking. Jason
On Thu, Aug 15, 2019 at 03:12:11PM +0200, Daniel Vetter wrote: > On Thu, Aug 15, 2019 at 3:04 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote: > > > > > As the oom reaper is the primary guarantee of the oom handling forward > > > progress it cannot be blocked on anything that might depend on blockable > > > memory allocations. These are not really easy to track because they > > > might be indirect - e.g. notifier blocks on a lock which other context > > > holds while allocating memory or waiting for a flusher that needs memory > > > to perform its work. > > > > But lockdep *does* track all this and fs_reclaim_acquire() was created > > to solve exactly this problem. > > > > fs_reclaim is a lock and it flows through all the usual lockdep > > schemes like any other lock. Any time the page allocator wants to do > > something the would deadlock with reclaim it takes the lock. > > > > Failure is expressed by a deadlock cycle in the lockdep map, and > > lockdep can handle arbitary complexity through layers of locks, work > > queues, threads, etc. > > > > What is missing? > > Lockdep doens't seen everything by far. E.g. a wait_event will be > caught by the annotations here, but not by lockdep. Sure, but the wait_event might be OK if its progress isn't contingent on fs_reclaim, ie triggered from interrupt, so why ban it? > And since we're talking about mmu notifiers here and gpus/dma engines. > We have dma_fence_wait, which can wait for any hw/driver in the system > that takes part in shared/zero-copy buffer processing. Which at least > on the graphics side is everything. This pulls in enormous amounts of > deadlock potential that lockdep simply is blind about and will never > see. It seems very risky to entagle a notifier widely like that. It looks pretty sure that notifiers are fs_reclaim, so at a minimum that wait_event can't be contingent on anything that is doing GFP_KERNEL or it will deadlock - and blockable doesn't make that sleep safe. Avoiding an uncertain wait_event under notifiers would seem to be the only reasonable design here.. Jason
On Thu, Aug 15, 2019 at 4:38 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Thu, Aug 15, 2019 at 03:12:11PM +0200, Daniel Vetter wrote: > > On Thu, Aug 15, 2019 at 3:04 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote: > > > > > > > As the oom reaper is the primary guarantee of the oom handling forward > > > > progress it cannot be blocked on anything that might depend on blockable > > > > memory allocations. These are not really easy to track because they > > > > might be indirect - e.g. notifier blocks on a lock which other context > > > > holds while allocating memory or waiting for a flusher that needs memory > > > > to perform its work. > > > > > > But lockdep *does* track all this and fs_reclaim_acquire() was created > > > to solve exactly this problem. > > > > > > fs_reclaim is a lock and it flows through all the usual lockdep > > > schemes like any other lock. Any time the page allocator wants to do > > > something the would deadlock with reclaim it takes the lock. > > > > > > Failure is expressed by a deadlock cycle in the lockdep map, and > > > lockdep can handle arbitary complexity through layers of locks, work > > > queues, threads, etc. > > > > > > What is missing? > > > > Lockdep doens't seen everything by far. E.g. a wait_event will be > > caught by the annotations here, but not by lockdep. > > Sure, but the wait_event might be OK if its progress isn't contingent > on fs_reclaim, ie triggered from interrupt, so why ban it? For normal notifiers sure (but lockdep won't help you proof that at all). For oom_reaper apparently not, but that's really Michal's thing, I have not much idea about that. > > And since we're talking about mmu notifiers here and gpus/dma engines. > > We have dma_fence_wait, which can wait for any hw/driver in the system > > that takes part in shared/zero-copy buffer processing. Which at least > > on the graphics side is everything. This pulls in enormous amounts of > > deadlock potential that lockdep simply is blind about and will never > > see. > > It seems very risky to entagle a notifier widely like that. That's why I've looked into all possible ways to annotate them with debug infrastructure. > It looks pretty sure that notifiers are fs_reclaim, so at a minimum > that wait_event can't be contingent on anything that is doing > GFP_KERNEL or it will deadlock - and blockable doesn't make that sleep > safe. > > Avoiding an uncertain wait_event under notifiers would seem to be the > only reasonable design here.. You have to wait for the gpu to finnish current processing in invalidate_range_start. Otherwise there's no point to any of this really. So the wait_event/dma_fence_wait are unavoidable really. That's also why I'm throwing in the lockdep annotation on top, and why it would be really nice if we somehow can get the cross-release work landed. But it looks like all the people who could make it happen are busy with smeltdown :-/ -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
> You have to wait for the gpu to finnish current processing in
> invalidate_range_start. Otherwise there's no point to any of this
> really. So the wait_event/dma_fence_wait are unavoidable really.
I don't envy your task :|
But, what you describe sure sounds like a 'registration cache' model,
not the 'shadow pte' model of coherency.
The key difference is that a regirstationcache is allowed to become
incoherent with the VMA's because it holds page pins. It is a
programming bug in userspace to change VA mappings via mmap/munmap/etc
while the device is working on that VA, but it does not harm system
integrity because of the page pin.
The cache ensures that each initiated operation sees a DMA setup that
matches the current VA map when the operation is initiated and allows
expensive device DMA setups to be re-used.
A 'shadow pte' model (ie hmm) *really* needs device support to
directly block DMA access - ie trigger 'device page fault'. ie the
invalidate_start should inform the device to enter a fault mode and
that is it. If the device can't do that, then the driver probably
shouldn't persue this level of coherency. The driver would quickly get
into the messy locking problems like dma_fence_wait from a notifier.
It is important to identify what model you are going for as defining a
'registration cache' coherence expectation allows the driver to skip
blocking in invalidate_range_start. All it does is invalidate the
cache so that future operations pick up the new VA mapping.
Intel's HFI RDMA driver uses this model extensively, and I think it is
well proven, within some limitations of course.
At least, 'registration cache' is the only use model I know of where
it is acceptable to skip invalidate_range_end.
Jason
On Thu 15-08-19 11:12:19, Jason Gunthorpe wrote: > On Thu, Aug 15, 2019 at 03:21:27PM +0200, Michal Hocko wrote: > > On Thu 15-08-19 09:23:44, Jason Gunthorpe wrote: > > > On Thu, Aug 15, 2019 at 08:58:29AM +0200, Daniel Vetter wrote: > > > > On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote: > > > > > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote: > > > > > > In some special cases we must not block, but there's not a > > > > > > spinlock, preempt-off, irqs-off or similar critical section already > > > > > > that arms the might_sleep() debug checks. Add a non_block_start/end() > > > > > > pair to annotate these. > > > > > > > > > > > > This will be used in the oom paths of mmu-notifiers, where blocking is > > > > > > not allowed to make sure there's forward progress. Quoting Michal: > > > > > > > > > > > > "The notifier is called from quite a restricted context - oom_reaper - > > > > > > which shouldn't depend on any locks or sleepable conditionals. The code > > > > > > should be swift as well but we mostly do care about it to make a forward > > > > > > progress. Checking for sleepable context is the best thing we could come > > > > > > up with that would describe these demands at least partially." > > > > > > > > > > But this describes fs_reclaim_acquire() - is there some reason we are > > > > > conflating fs_reclaim with non-sleeping? > > > > > > > > No idea why you tie this into fs_reclaim. We can definitly sleep in there, > > > > and for e.g. kswapd (which also wraps everything in fs_reclaim) we're > > > > event supposed to I thought. To make sure we can get at the last bit of > > > > memory by flushing all the queues and waiting for everything to be cleaned > > > > out. > > > > > > AFAIK the point of fs_reclaim is to prevent "indirect dependency upon > > > the page allocator" ie a justification that was given this !blockable > > > stuff. > > > > > > For instance: > > > > > > fs_reclaim_acquire() > > > kmalloc(GFP_KERNEL) <- lock dep assertion > > > > > > And further, Michal's concern about indirectness through locks is also > > > handled by lockdep: > > > > > > CPU0 CPU1 > > > mutex_lock() > > > kmalloc(GFP_KERNEL) > > > mutex_unlock() > > > fs_reclaim_acquire() > > > mutex_lock() <- lock dep assertion > > > > > > In other words, to prevent recursion into the page allocator you use > > > fs_reclaim_acquire(), and lockdep verfies it in its usual robust way. > > > > fs_reclaim_acquire is about FS/IO recursions IIUC. We are talking about > > any !GFP_NOWAIT allocation context here and any {in}direct dependency on > > it. > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and > __GFP_DIRECT_RECLAIM.. > > This matches the existing test in __need_fs_reclaim() - so if you are > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(), > allocations during OOM, then I think fs_reclaim already matches what > you described? No GFP_NOFS is equally bad. Please read my other email explaining what the oom_reaper actually requires. In short no blocking on direct or indirect dependecy on memory allocation that might sleep. If you can express that in the existing lockdep machinery. All fine. But then consider deployments where lockdep is no-no because of the overhead. > > No, non-blocking is a very coarse approximation of what we really need. > > But it should give us even a stronger condition. Essentially any sleep > > other than a preemption shouldn't be allowed in that context. > > But it is a nonsense API to give the driver invalidate_range_start, > the blocking alternative to the non-blocking invalidate_range and then > demand it to be non-blocking. > > Inspecting the code, no drivers are actually able to progress their > side in non-blocking mode. > > The best we got was drivers tested the VA range and returned success > if they had no interest. Which is a big win to be sure, but it looks > like getting any more is not really posssible. And that is already a great win! Because many notifiers only do care about particular mappings. Please note that backing off unconditioanlly will simply cause that the oom reaper will have to back off not doing any tear down anything. > However, we could (probably even should) make the drivers fs_reclaim > safe. > > If that is enough to guarantee progress of OOM, then lets consider > something like using current_gfp_context() to force PF_MEMALLOC_NOFS > allocation behavior on the driver callback and lockdep to try and keep > pushing on the the debugging, and dropping !blocking. How are you going to enforce indirect dependency? E.g. a lock that is also used in other context which depend on sleepable memory allocation to move forward. Really, this was aimed to give a very simple debugging aid. If it is considered to be so controversial, even though I really do not see why, then let's just drop it on the floor. -- Michal Hocko SUSE Labs
On Thu, Aug 15, 2019 at 5:10 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote: > > > You have to wait for the gpu to finnish current processing in > > invalidate_range_start. Otherwise there's no point to any of this > > really. So the wait_event/dma_fence_wait are unavoidable really. > > I don't envy your task :| > > But, what you describe sure sounds like a 'registration cache' model, > not the 'shadow pte' model of coherency. > > The key difference is that a regirstationcache is allowed to become > incoherent with the VMA's because it holds page pins. It is a > programming bug in userspace to change VA mappings via mmap/munmap/etc > while the device is working on that VA, but it does not harm system > integrity because of the page pin. > > The cache ensures that each initiated operation sees a DMA setup that > matches the current VA map when the operation is initiated and allows > expensive device DMA setups to be re-used. > > A 'shadow pte' model (ie hmm) *really* needs device support to > directly block DMA access - ie trigger 'device page fault'. ie the > invalidate_start should inform the device to enter a fault mode and > that is it. If the device can't do that, then the driver probably > shouldn't persue this level of coherency. The driver would quickly get > into the messy locking problems like dma_fence_wait from a notifier. > > It is important to identify what model you are going for as defining a > 'registration cache' coherence expectation allows the driver to skip > blocking in invalidate_range_start. All it does is invalidate the > cache so that future operations pick up the new VA mapping. > > Intel's HFI RDMA driver uses this model extensively, and I think it is > well proven, within some limitations of course. > > At least, 'registration cache' is the only use model I know of where > it is acceptable to skip invalidate_range_end. I'm not really well versed in the details of our userptr, but both amdgpu and i915 wait for the gpu to complete from invalidate_range_start. Jerome has at least looked a lot at the amdgpu one, so maybe he can explain what exactly it is we're doing ... -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote: > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote: > > > You have to wait for the gpu to finnish current processing in > > invalidate_range_start. Otherwise there's no point to any of this > > really. So the wait_event/dma_fence_wait are unavoidable really. > > I don't envy your task :| > > But, what you describe sure sounds like a 'registration cache' model, > not the 'shadow pte' model of coherency. > > The key difference is that a regirstationcache is allowed to become > incoherent with the VMA's because it holds page pins. It is a > programming bug in userspace to change VA mappings via mmap/munmap/etc > while the device is working on that VA, but it does not harm system > integrity because of the page pin. > > The cache ensures that each initiated operation sees a DMA setup that > matches the current VA map when the operation is initiated and allows > expensive device DMA setups to be re-used. > > A 'shadow pte' model (ie hmm) *really* needs device support to > directly block DMA access - ie trigger 'device page fault'. ie the > invalidate_start should inform the device to enter a fault mode and > that is it. If the device can't do that, then the driver probably > shouldn't persue this level of coherency. The driver would quickly get > into the messy locking problems like dma_fence_wait from a notifier. I think here we do not agree on the hardware requirement. For GPU we will always need to be able to wait for some GPU fence from inside the notifier callback, there is just no way around that for many of the GPUs today (i do not see any indication of that changing). Driver should avoid lock complexity by using wait queue so that the driver notifier callback can wait without having to hold some driver lock. However there will be at least one lock needed to update the internal driver state for the range being invalidated. That lock is just the device driver page table lock for the GPU page table associated with the mm_struct. In all GPU driver so far it is a short lived lock and nothing blocking is done while holding it (it is just about updating page table directory really wether it is filling it or clearing it). > > It is important to identify what model you are going for as defining a > 'registration cache' coherence expectation allows the driver to skip > blocking in invalidate_range_start. All it does is invalidate the > cache so that future operations pick up the new VA mapping. > > Intel's HFI RDMA driver uses this model extensively, and I think it is > well proven, within some limitations of course. > > At least, 'registration cache' is the only use model I know of where > it is acceptable to skip invalidate_range_end. Here GPU are not in the registration cache model, i know it might looks like it because of GUP but GUP was use just because hmm did not exist at the time. Cheers, Jérôme
On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote: > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and > > __GFP_DIRECT_RECLAIM.. > > > > This matches the existing test in __need_fs_reclaim() - so if you are > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(), > > allocations during OOM, then I think fs_reclaim already matches what > > you described? > > No GFP_NOFS is equally bad. Please read my other email explaining what > the oom_reaper actually requires. In short no blocking on direct or > indirect dependecy on memory allocation that might sleep. It is much easier to follow with some hints on code, so the true requirement is that the OOM repear not block on GFP_FS and GFP_IO allocations, great, that constraint is now clear. > If you can express that in the existing lockdep machinery. All > fine. But then consider deployments where lockdep is no-no because > of the overhead. This is all for driver debugging. The point of lockdep is to find all these paths without have to hit them as actual races, using debug kernels. I don't think we need this kind of debugging on production kernels? > > The best we got was drivers tested the VA range and returned success > > if they had no interest. Which is a big win to be sure, but it looks > > like getting any more is not really posssible. > > And that is already a great win! Because many notifiers only do care > about particular mappings. Please note that backing off unconditioanlly > will simply cause that the oom reaper will have to back off not doing > any tear down anything. Well, I'm working to propose that we do the VA range test under core mmu notifier code that cannot block and then we simply remove the idea of blockable from drivers using this new 'range notifier'. I think this pretty much solves the concern? > > However, we could (probably even should) make the drivers fs_reclaim > > safe. > > > > If that is enough to guarantee progress of OOM, then lets consider > > something like using current_gfp_context() to force PF_MEMALLOC_NOFS > > allocation behavior on the driver callback and lockdep to try and keep > > pushing on the the debugging, and dropping !blocking. > > How are you going to enforce indirect dependency? E.g. a lock that is > also used in other context which depend on sleepable memory allocation > to move forward. You mean like this: CPU0 CPU1 mutex_lock() kmalloc(GFP_KERNEL) mutex_unlock() fs_reclaim_acquire() mutex_lock() <- illegal: lock dep assertion ? lockdep handles this - that is what it does, it builds a graph of all lock dependencies and requires the graph to be acyclic. So mutex_lock depends on fs_reclaim_lock on CPU1 while on CPU0, fs_reclaim_lock depends on mutex_lock. This is an ABBA locking cycle and lockdep will reliably trigger. So, if we wanted to do this, I'd probably suggest we retool fs_reclaim's interface be more like prevent_gfp_flags(__GFP_IO | __GFP_FS); [..] restore_gfp_flags(__GFP_IO | __GFP_FS); Which is lockdep based and follows the indirect lock dependencies you talked about. Then OOM and reclaim can specify the different flags they want blocked. We could probably use the same API with WQ_MEM_RECLAIM as I was chatting with Tejun about: https://www.spinics.net/lists/linux-rdma/msg77362.html IMHO this stuff is *super complicated* for those of us outside the mm subsystem, so having some really clear annotations like the above would go a long way to help understand these special constraints. I'm pretty sure there are lots of driver bugs related to using the wrong GFP flags in the kernel. > Really, this was aimed to give a very simple debugging aid. If it is > considered to be so controversial, even though I really do not see why, > then let's just drop it on the floor. My concern is that the requirement was very unclear. I'm trying to understand all the bits of how these notifiers work and the exact semantic of this OOM path have been vauge if it is really some GFP flag restriction or truely related to not sleeping. Jason
On Thu, Aug 15, 2019 at 01:56:31PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
>
> > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > __GFP_DIRECT_RECLAIM..
> > >
> > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > allocations during OOM, then I think fs_reclaim already matches what
> > > you described?
> >
> > No GFP_NOFS is equally bad. Please read my other email explaining what
> > the oom_reaper actually requires. In short no blocking on direct or
> > indirect dependecy on memory allocation that might sleep.
>
> It is much easier to follow with some hints on code, so the true
> requirement is that the OOM repear not block on GFP_FS and GFP_IO
> allocations, great, that constraint is now clear.
>
> > If you can express that in the existing lockdep machinery. All
> > fine. But then consider deployments where lockdep is no-no because
> > of the overhead.
>
> This is all for driver debugging. The point of lockdep is to find all
> these paths without have to hit them as actual races, using debug
> kernels.
>
> I don't think we need this kind of debugging on production kernels?
>
> > > The best we got was drivers tested the VA range and returned success
> > > if they had no interest. Which is a big win to be sure, but it looks
> > > like getting any more is not really posssible.
> >
> > And that is already a great win! Because many notifiers only do care
> > about particular mappings. Please note that backing off unconditioanlly
> > will simply cause that the oom reaper will have to back off not doing
> > any tear down anything.
>
> Well, I'm working to propose that we do the VA range test under core
> mmu notifier code that cannot block and then we simply remove the idea
> of blockable from drivers using this new 'range notifier'.
>
> I think this pretty much solves the concern?
I am not sure i follow what you propose here ? Like i pointed out in
another email for GPU we do need to be able to sleep (we might get
lucky and not need too but this is runtime thing) within notifier
range_start callback. This has been something allow by notifier since
it has been introduced in the kernel.
Cheers,
Jérôme
On Thu, Aug 15, 2019 at 12:32:38PM -0400, Jerome Glisse wrote: > On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote: > > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote: > > > > > You have to wait for the gpu to finnish current processing in > > > invalidate_range_start. Otherwise there's no point to any of this > > > really. So the wait_event/dma_fence_wait are unavoidable really. > > > > I don't envy your task :| > > > > But, what you describe sure sounds like a 'registration cache' model, > > not the 'shadow pte' model of coherency. > > > > The key difference is that a regirstationcache is allowed to become > > incoherent with the VMA's because it holds page pins. It is a > > programming bug in userspace to change VA mappings via mmap/munmap/etc > > while the device is working on that VA, but it does not harm system > > integrity because of the page pin. > > > > The cache ensures that each initiated operation sees a DMA setup that > > matches the current VA map when the operation is initiated and allows > > expensive device DMA setups to be re-used. > > > > A 'shadow pte' model (ie hmm) *really* needs device support to > > directly block DMA access - ie trigger 'device page fault'. ie the > > invalidate_start should inform the device to enter a fault mode and > > that is it. If the device can't do that, then the driver probably > > shouldn't persue this level of coherency. The driver would quickly get > > into the messy locking problems like dma_fence_wait from a notifier. > > I think here we do not agree on the hardware requirement. For GPU > we will always need to be able to wait for some GPU fence from inside > the notifier callback, there is just no way around that for many of > the GPUs today (i do not see any indication of that changing). I didn't say you couldn't wait, I was trying to say that the wait should only be contigent on the HW itself. Ie you can wait on a GPU page table lock, and you can wait on a GPU page table flush completion via IRQ. What is troubling is to wait till some other thread gets a GPU command completion and decr's a kref on the DMA buffer - which kinda looks like what this dma_fence() stuff is all about. A driver like that would have to be super careful to ensure consistent forward progress toward dma ref == 0 when the system is under reclaim. ie by running it's entire IRQ flow under fs_reclaim locking. > associated with the mm_struct. In all GPU driver so far it is a short > lived lock and nothing blocking is done while holding it (it is just > about updating page table directory really wether it is filling it or > clearing it). The main blocking I expect in a shadow PTE flow is waiting for the HW to complete invalidations of its PTE cache. > > It is important to identify what model you are going for as defining a > > 'registration cache' coherence expectation allows the driver to skip > > blocking in invalidate_range_start. All it does is invalidate the > > cache so that future operations pick up the new VA mapping. > > > > Intel's HFI RDMA driver uses this model extensively, and I think it is > > well proven, within some limitations of course. > > > > At least, 'registration cache' is the only use model I know of where > > it is acceptable to skip invalidate_range_end. > > Here GPU are not in the registration cache model, i know it might looks > like it because of GUP but GUP was use just because hmm did not exist > at the time. It is not because of GUP, it is because of the lack of invalidate_range_end. A driver cannot correctly implement the SPTE model without invalidate_range_end, even if it holds the page pins via GUP. So, I've been assuming the few drivers without invalidate_range_end are trying to do registration caching, rather than assuming they are broken. Jason
On Thu, Aug 15, 2019 at 01:11:56PM -0400, Jerome Glisse wrote:
> On Thu, Aug 15, 2019 at 01:56:31PM -0300, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> >
> > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > > __GFP_DIRECT_RECLAIM..
> > > >
> > > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > > allocations during OOM, then I think fs_reclaim already matches what
> > > > you described?
> > >
> > > No GFP_NOFS is equally bad. Please read my other email explaining what
> > > the oom_reaper actually requires. In short no blocking on direct or
> > > indirect dependecy on memory allocation that might sleep.
> >
> > It is much easier to follow with some hints on code, so the true
> > requirement is that the OOM repear not block on GFP_FS and GFP_IO
> > allocations, great, that constraint is now clear.
> >
> > > If you can express that in the existing lockdep machinery. All
> > > fine. But then consider deployments where lockdep is no-no because
> > > of the overhead.
> >
> > This is all for driver debugging. The point of lockdep is to find all
> > these paths without have to hit them as actual races, using debug
> > kernels.
> >
> > I don't think we need this kind of debugging on production kernels?
> >
> > > > The best we got was drivers tested the VA range and returned success
> > > > if they had no interest. Which is a big win to be sure, but it looks
> > > > like getting any more is not really posssible.
> > >
> > > And that is already a great win! Because many notifiers only do care
> > > about particular mappings. Please note that backing off unconditioanlly
> > > will simply cause that the oom reaper will have to back off not doing
> > > any tear down anything.
> >
> > Well, I'm working to propose that we do the VA range test under core
> > mmu notifier code that cannot block and then we simply remove the idea
> > of blockable from drivers using this new 'range notifier'.
> >
> > I think this pretty much solves the concern?
>
> I am not sure i follow what you propose here ? Like i pointed out in
> another email for GPU we do need to be able to sleep (we might get
> lucky and not need too but this is runtime thing) within notifier
> range_start callback. This has been something allow by notifier since
> it has been introduced in the kernel.
Sorry, I mean remove the idea of the blockable flag from the
drivers. Drivers will always be able to block, within the existing
limitation of fs_reclaim
Jason
On Thu, Aug 15, 2019 at 7:16 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Thu, Aug 15, 2019 at 12:32:38PM -0400, Jerome Glisse wrote: > > On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote: > > > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote: > > > > > > > You have to wait for the gpu to finnish current processing in > > > > invalidate_range_start. Otherwise there's no point to any of this > > > > really. So the wait_event/dma_fence_wait are unavoidable really. > > > > > > I don't envy your task :| > > > > > > But, what you describe sure sounds like a 'registration cache' model, > > > not the 'shadow pte' model of coherency. > > > > > > The key difference is that a regirstationcache is allowed to become > > > incoherent with the VMA's because it holds page pins. It is a > > > programming bug in userspace to change VA mappings via mmap/munmap/etc > > > while the device is working on that VA, but it does not harm system > > > integrity because of the page pin. > > > > > > The cache ensures that each initiated operation sees a DMA setup that > > > matches the current VA map when the operation is initiated and allows > > > expensive device DMA setups to be re-used. > > > > > > A 'shadow pte' model (ie hmm) *really* needs device support to > > > directly block DMA access - ie trigger 'device page fault'. ie the > > > invalidate_start should inform the device to enter a fault mode and > > > that is it. If the device can't do that, then the driver probably > > > shouldn't persue this level of coherency. The driver would quickly get > > > into the messy locking problems like dma_fence_wait from a notifier. > > > > I think here we do not agree on the hardware requirement. For GPU > > we will always need to be able to wait for some GPU fence from inside > > the notifier callback, there is just no way around that for many of > > the GPUs today (i do not see any indication of that changing). > > I didn't say you couldn't wait, I was trying to say that the wait > should only be contigent on the HW itself. Ie you can wait on a GPU > page table lock, and you can wait on a GPU page table flush completion > via IRQ. > > What is troubling is to wait till some other thread gets a GPU command > completion and decr's a kref on the DMA buffer - which kinda looks > like what this dma_fence() stuff is all about. A driver like that > would have to be super careful to ensure consistent forward progress > toward dma ref == 0 when the system is under reclaim. > > ie by running it's entire IRQ flow under fs_reclaim locking. This is correct. At least for i915 it's already a required due to our shrinker also having to do the same. I think amdgpu isn't bothering with that since they have vram for most of the stuff, and just limit system memory usage to half of all and forgo the shrinker. Probably not the nicest approach. Anyway, both do the same mmu_notifier dance, just want to explain that we've been living with this for longer already. So yeah writing a gpu driver is not easy. > > associated with the mm_struct. In all GPU driver so far it is a short > > lived lock and nothing blocking is done while holding it (it is just > > about updating page table directory really wether it is filling it or > > clearing it). > > The main blocking I expect in a shadow PTE flow is waiting for the HW > to complete invalidations of its PTE cache. > > > > It is important to identify what model you are going for as defining a > > > 'registration cache' coherence expectation allows the driver to skip > > > blocking in invalidate_range_start. All it does is invalidate the > > > cache so that future operations pick up the new VA mapping. > > > > > > Intel's HFI RDMA driver uses this model extensively, and I think it is > > > well proven, within some limitations of course. > > > > > > At least, 'registration cache' is the only use model I know of where > > > it is acceptable to skip invalidate_range_end. > > > > Here GPU are not in the registration cache model, i know it might looks > > like it because of GUP but GUP was use just because hmm did not exist > > at the time. > > It is not because of GUP, it is because of the lack of > invalidate_range_end. A driver cannot correctly implement the SPTE > model without invalidate_range_end, even if it holds the page pins via > GUP. > > So, I've been assuming the few drivers without invalidate_range_end > are trying to do registration caching, rather than assuming they are > broken. I915 might just be broken. amdgpu does the full thing, using hmm_mirror. But still with dma_fence_wait. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Thu, Aug 15, 2019 at 07:21:47PM +0200, Daniel Vetter wrote:
> On Thu, Aug 15, 2019 at 7:16 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Thu, Aug 15, 2019 at 12:32:38PM -0400, Jerome Glisse wrote:
> > > On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
> > > >
> > > > > You have to wait for the gpu to finnish current processing in
> > > > > invalidate_range_start. Otherwise there's no point to any of this
> > > > > really. So the wait_event/dma_fence_wait are unavoidable really.
> > > >
> > > > I don't envy your task :|
> > > >
> > > > But, what you describe sure sounds like a 'registration cache' model,
> > > > not the 'shadow pte' model of coherency.
> > > >
> > > > The key difference is that a regirstationcache is allowed to become
> > > > incoherent with the VMA's because it holds page pins. It is a
> > > > programming bug in userspace to change VA mappings via mmap/munmap/etc
> > > > while the device is working on that VA, but it does not harm system
> > > > integrity because of the page pin.
> > > >
> > > > The cache ensures that each initiated operation sees a DMA setup that
> > > > matches the current VA map when the operation is initiated and allows
> > > > expensive device DMA setups to be re-used.
> > > >
> > > > A 'shadow pte' model (ie hmm) *really* needs device support to
> > > > directly block DMA access - ie trigger 'device page fault'. ie the
> > > > invalidate_start should inform the device to enter a fault mode and
> > > > that is it. If the device can't do that, then the driver probably
> > > > shouldn't persue this level of coherency. The driver would quickly get
> > > > into the messy locking problems like dma_fence_wait from a notifier.
> > >
> > > I think here we do not agree on the hardware requirement. For GPU
> > > we will always need to be able to wait for some GPU fence from inside
> > > the notifier callback, there is just no way around that for many of
> > > the GPUs today (i do not see any indication of that changing).
> >
> > I didn't say you couldn't wait, I was trying to say that the wait
> > should only be contigent on the HW itself. Ie you can wait on a GPU
> > page table lock, and you can wait on a GPU page table flush completion
> > via IRQ.
> >
> > What is troubling is to wait till some other thread gets a GPU command
> > completion and decr's a kref on the DMA buffer - which kinda looks
> > like what this dma_fence() stuff is all about. A driver like that
> > would have to be super careful to ensure consistent forward progress
> > toward dma ref == 0 when the system is under reclaim.
> >
> > ie by running it's entire IRQ flow under fs_reclaim locking.
>
> This is correct. At least for i915 it's already a required due to our
> shrinker also having to do the same. I think amdgpu isn't bothering
> with that since they have vram for most of the stuff, and just limit
> system memory usage to half of all and forgo the shrinker. Probably
> not the nicest approach. Anyway, both do the same mmu_notifier dance,
> just want to explain that we've been living with this for longer
> already.
>
> So yeah writing a gpu driver is not easy.
>
> > > associated with the mm_struct. In all GPU driver so far it is a short
> > > lived lock and nothing blocking is done while holding it (it is just
> > > about updating page table directory really wether it is filling it or
> > > clearing it).
> >
> > The main blocking I expect in a shadow PTE flow is waiting for the HW
> > to complete invalidations of its PTE cache.
> >
> > > > It is important to identify what model you are going for as defining a
> > > > 'registration cache' coherence expectation allows the driver to skip
> > > > blocking in invalidate_range_start. All it does is invalidate the
> > > > cache so that future operations pick up the new VA mapping.
> > > >
> > > > Intel's HFI RDMA driver uses this model extensively, and I think it is
> > > > well proven, within some limitations of course.
> > > >
> > > > At least, 'registration cache' is the only use model I know of where
> > > > it is acceptable to skip invalidate_range_end.
> > >
> > > Here GPU are not in the registration cache model, i know it might looks
> > > like it because of GUP but GUP was use just because hmm did not exist
> > > at the time.
> >
> > It is not because of GUP, it is because of the lack of
> > invalidate_range_end. A driver cannot correctly implement the SPTE
> > model without invalidate_range_end, even if it holds the page pins via
> > GUP.
> >
> > So, I've been assuming the few drivers without invalidate_range_end
> > are trying to do registration caching, rather than assuming they are
> > broken.
>
> I915 might just be broken. amdgpu does the full thing, using
> hmm_mirror. But still with dma_fence_wait.
Yeah i915 is broken but it never hurted anyone ;) I posted patch
a long time ago to convert it to hmm but i delayed that to until
i can get through making something of GUPfast that can also be
use for HMM/ODP user.
Cheers,
Jérôme
On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote:
> I'm not really well versed in the details of our userptr, but both
> amdgpu and i915 wait for the gpu to complete from
> invalidate_range_start. Jerome has at least looked a lot at the amdgpu
> one, so maybe he can explain what exactly it is we're doing ...
amdgpu is (wrongly) using hmm for something, I can't really tell what
it is trying to do. The calls to dma_fence under the
invalidate_range_start do not give me a good feeling.
However, i915 shows all the signs of trying to follow the registration
cache model, it even has a nice comment in
i915_gem_userptr_get_pages() explaining that the races it has don't
matter because it is a user space bug to change the VA mapping in the
first place. That just screams registration cache to me.
So it is fine to run HW that way, but if you do, there is no reason to
fence inside the invalidate_range end. Just orphan the DMA buffer and
clean it up & release the page pins when all DMA buffer refs go to
zero. The next access to that VA should get a new DMA buffer with the
right mapping.
In other words the invalidation should be very simple without
complicated locking, or wait_event's. Look at hfi1 for example.
Jason
On Thu, Aug 15, 2019 at 02:35:57PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote:
>
> > I'm not really well versed in the details of our userptr, but both
> > amdgpu and i915 wait for the gpu to complete from
> > invalidate_range_start. Jerome has at least looked a lot at the amdgpu
> > one, so maybe he can explain what exactly it is we're doing ...
>
> amdgpu is (wrongly) using hmm for something, I can't really tell what
> it is trying to do. The calls to dma_fence under the
> invalidate_range_start do not give me a good feeling.
>
> However, i915 shows all the signs of trying to follow the registration
> cache model, it even has a nice comment in
> i915_gem_userptr_get_pages() explaining that the races it has don't
> matter because it is a user space bug to change the VA mapping in the
> first place. That just screams registration cache to me.
>
> So it is fine to run HW that way, but if you do, there is no reason to
> fence inside the invalidate_range end. Just orphan the DMA buffer and
> clean it up & release the page pins when all DMA buffer refs go to
> zero. The next access to that VA should get a new DMA buffer with the
> right mapping.
>
> In other words the invalidation should be very simple without
> complicated locking, or wait_event's. Look at hfi1 for example.
This would break the today usage model of uptr and it will
break userspace expectation ie if GPU is writting to that
memory and that memory then the userspace want to make sure
that it will see what the GPU write.
Yes i915 is broken in respect to not having a end notifier
and tracking active invalidation for a range but the GUP
side of thing kind of hide this bug and it shrinks the window
for bad to happen to something so small that i doubt anyone
could ever hit it (still a bug thought).
Cheers,
Jérôme
On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote: > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote: > > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and > > > __GFP_DIRECT_RECLAIM.. > > > > > > This matches the existing test in __need_fs_reclaim() - so if you are > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(), > > > allocations during OOM, then I think fs_reclaim already matches what > > > you described? > > > > No GFP_NOFS is equally bad. Please read my other email explaining what > > the oom_reaper actually requires. In short no blocking on direct or > > indirect dependecy on memory allocation that might sleep. > > It is much easier to follow with some hints on code, so the true > requirement is that the OOM repear not block on GFP_FS and GFP_IO > allocations, great, that constraint is now clear. I still do not get why do you put FS/IO into the picture. This is really about __GFP_DIRECT_RECLAIM. > > > If you can express that in the existing lockdep machinery. All > > fine. But then consider deployments where lockdep is no-no because > > of the overhead. > > This is all for driver debugging. The point of lockdep is to find all > these paths without have to hit them as actual races, using debug > kernels. > > I don't think we need this kind of debugging on production kernels? Again, the primary motivation was a simple debugging aid that could be used without worrying about overhead. So lockdep is very often out of the question. > > > The best we got was drivers tested the VA range and returned success > > > if they had no interest. Which is a big win to be sure, but it looks > > > like getting any more is not really posssible. > > > > And that is already a great win! Because many notifiers only do care > > about particular mappings. Please note that backing off unconditioanlly > > will simply cause that the oom reaper will have to back off not doing > > any tear down anything. > > Well, I'm working to propose that we do the VA range test under core > mmu notifier code that cannot block and then we simply remove the idea > of blockable from drivers using this new 'range notifier'. > > I think this pretty much solves the concern? Well, my idea was that a range check and early bail out was a first step and then each specific notifier would be able to do a more specific check. I was not able to do the second step because that requires a deep understanding of the respective subsystem. Really all I do care about is to reclaim as much memory from the oom_reaper context as possible. And that cannot really be an unbounded process. Quite contrary it should be as swift as possible. From my cursory look some notifiers are able to achieve their task without blocking or depending on memory just fine. So bailing out unconditionally on the range of interest would just put us back. > > > However, we could (probably even should) make the drivers fs_reclaim > > > safe. > > > > > > If that is enough to guarantee progress of OOM, then lets consider > > > something like using current_gfp_context() to force PF_MEMALLOC_NOFS > > > allocation behavior on the driver callback and lockdep to try and keep > > > pushing on the the debugging, and dropping !blocking. > > > > How are you going to enforce indirect dependency? E.g. a lock that is > > also used in other context which depend on sleepable memory allocation > > to move forward. > > You mean like this: > > CPU0 CPU1 > mutex_lock() > kmalloc(GFP_KERNEL) no I mean __GFP_DIRECT_RECLAIM here. > mutex_unlock() > fs_reclaim_acquire() > mutex_lock() <- illegal: lock dep assertion I cannot really comment on how that is achieveable by lockdep. I managed to forget details about FS/IO reclaim recursion tracking already and I do not have time to learn it again. It was quite a hack. Anyway, let me repeat that the primary motivation was a simple aid. Not something as poverful as lockdep. -- Michal Hocko SUSE Labs
On Thu, Aug 15, 2019 at 07:42:07PM +0200, Michal Hocko wrote: > On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote: > > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote: > > > > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and > > > > __GFP_DIRECT_RECLAIM.. > > > > > > > > This matches the existing test in __need_fs_reclaim() - so if you are > > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(), > > > > allocations during OOM, then I think fs_reclaim already matches what > > > > you described? > > > > > > No GFP_NOFS is equally bad. Please read my other email explaining what > > > the oom_reaper actually requires. In short no blocking on direct or > > > indirect dependecy on memory allocation that might sleep. > > > > It is much easier to follow with some hints on code, so the true > > requirement is that the OOM repear not block on GFP_FS and GFP_IO > > allocations, great, that constraint is now clear. > > I still do not get why do you put FS/IO into the picture. This is really > about __GFP_DIRECT_RECLAIM. > > > > > > If you can express that in the existing lockdep machinery. All > > > fine. But then consider deployments where lockdep is no-no because > > > of the overhead. > > > > This is all for driver debugging. The point of lockdep is to find all > > these paths without have to hit them as actual races, using debug > > kernels. > > > > I don't think we need this kind of debugging on production kernels? > > Again, the primary motivation was a simple debugging aid that could be > used without worrying about overhead. So lockdep is very often out of > the question. > > > > > The best we got was drivers tested the VA range and returned success > > > > if they had no interest. Which is a big win to be sure, but it looks > > > > like getting any more is not really posssible. > > > > > > And that is already a great win! Because many notifiers only do care > > > about particular mappings. Please note that backing off unconditioanlly > > > will simply cause that the oom reaper will have to back off not doing > > > any tear down anything. > > > > Well, I'm working to propose that we do the VA range test under core > > mmu notifier code that cannot block and then we simply remove the idea > > of blockable from drivers using this new 'range notifier'. > > > > I think this pretty much solves the concern? > > Well, my idea was that a range check and early bail out was a first step > and then each specific notifier would be able to do a more specific > check. I was not able to do the second step because that requires a deep > understanding of the respective subsystem. > > Really all I do care about is to reclaim as much memory from the > oom_reaper context as possible. And that cannot really be an unbounded > process. Quite contrary it should be as swift as possible. From my > cursory look some notifiers are able to achieve their task without > blocking or depending on memory just fine. So bailing out > unconditionally on the range of interest would just put us back. Agree, OOM just asking the question: can i unmap that page quickly ? so that me (OOM) can swap it out. In many cases the driver need to lookup something to see if at the time the memory is just not in use and can be reclaim right away. So driver should have a path to optimistically update its state to allow quick reclaim. > > > > However, we could (probably even should) make the drivers fs_reclaim > > > > safe. > > > > > > > > If that is enough to guarantee progress of OOM, then lets consider > > > > something like using current_gfp_context() to force PF_MEMALLOC_NOFS > > > > allocation behavior on the driver callback and lockdep to try and keep > > > > pushing on the the debugging, and dropping !blocking. > > > > > > How are you going to enforce indirect dependency? E.g. a lock that is > > > also used in other context which depend on sleepable memory allocation > > > to move forward. > > > > You mean like this: > > > > CPU0 CPU1 > > mutex_lock() > > kmalloc(GFP_KERNEL) > > no I mean __GFP_DIRECT_RECLAIM here. > > > mutex_unlock() > > fs_reclaim_acquire() > > mutex_lock() <- illegal: lock dep assertion > > I cannot really comment on how that is achieveable by lockdep. I managed > to forget details about FS/IO reclaim recursion tracking already and I > do not have time to learn it again. It was quite a hack. Anyway, let me > repeat that the primary motivation was a simple aid. Not something as > poverful as lockdep. I feel that the fs_reclaim_acquire() is just too heavy weight here. I do think that Daniel patches improve the debugging situation without burdening anything so i am in favor or merging that. I do not think we should devote too much time into fs_reclaim(), our time would be better spend in improving the driver shrinker for instance after all OOM is all about trying to free-up memory. Cheers, Jérôme
On Thu, Aug 15, 2019 at 01:39:22PM -0400, Jerome Glisse wrote:
> On Thu, Aug 15, 2019 at 02:35:57PM -0300, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote:
> >
> > > I'm not really well versed in the details of our userptr, but both
> > > amdgpu and i915 wait for the gpu to complete from
> > > invalidate_range_start. Jerome has at least looked a lot at the amdgpu
> > > one, so maybe he can explain what exactly it is we're doing ...
> >
> > amdgpu is (wrongly) using hmm for something, I can't really tell what
> > it is trying to do. The calls to dma_fence under the
> > invalidate_range_start do not give me a good feeling.
> >
> > However, i915 shows all the signs of trying to follow the registration
> > cache model, it even has a nice comment in
> > i915_gem_userptr_get_pages() explaining that the races it has don't
> > matter because it is a user space bug to change the VA mapping in the
> > first place. That just screams registration cache to me.
> >
> > So it is fine to run HW that way, but if you do, there is no reason to
> > fence inside the invalidate_range end. Just orphan the DMA buffer and
> > clean it up & release the page pins when all DMA buffer refs go to
> > zero. The next access to that VA should get a new DMA buffer with the
> > right mapping.
> >
> > In other words the invalidation should be very simple without
> > complicated locking, or wait_event's. Look at hfi1 for example.
>
> This would break the today usage model of uptr and it will
> break userspace expectation ie if GPU is writting to that
> memory and that memory then the userspace want to make sure
> that it will see what the GPU write.
How exactly? This is holding the page pin, so the only way the VA
mapping can be changed is via explicit user action.
ie:
gpu_write_something(va, size)
mmap(.., va, size, MMAP_FIXED);
gpu_wait_done()
This is racy and indeterminate with both models.
Based on the comment in i915 it appears to be going on the model that
changes to the mmap by userspace when the GPU is working on it is a
programming bug. This is reasonable, lots of systems use this kind of
consistency model.
Since the driver seems to rely on a dma_fence to block DMA access, it
looks to me like the kernel has full visibility to the
'gpu_write_something()' and 'gpu_wait_done()' markers.
I think trying to use hmm_range_fault on HW that can't do HW page
faulting and HW 'TLB shootdown' is a very, very bad idea. I fear that
is what amd gpu is trying to do.
I haven't yet seen anything that looks like 'TLB shootdown' in i915??
Jason
On Thu, Aug 15, 2019 at 07:42:07PM +0200, Michal Hocko wrote: > On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote: > > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote: > > > > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and > > > > __GFP_DIRECT_RECLAIM.. > > > > > > > > This matches the existing test in __need_fs_reclaim() - so if you are > > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(), > > > > allocations during OOM, then I think fs_reclaim already matches what > > > > you described? > > > > > > No GFP_NOFS is equally bad. Please read my other email explaining what > > > the oom_reaper actually requires. In short no blocking on direct or > > > indirect dependecy on memory allocation that might sleep. > > > > It is much easier to follow with some hints on code, so the true > > requirement is that the OOM repear not block on GFP_FS and GFP_IO > > allocations, great, that constraint is now clear. > > I still do not get why do you put FS/IO into the picture. This is really > about __GFP_DIRECT_RECLAIM. Like I said this is complicated, translating "no blocking on direct or indirect dependecy on memory allocation that might sleep" into GFP flags is hard for us outside the mm community. So the contraint here is no __GFP_DIRECT_RECLAIM? I bring up FS/IO because that is what Tejun mentioned when I asked him about reclaim restrictions, and is what fs_reclaim_acquire() is already sensitive too. It is pretty confusing if we have places using the word 'reclaim' with different restrictions. :( > > CPU0 CPU1 > > mutex_lock() > > kmalloc(GFP_KERNEL) > > no I mean __GFP_DIRECT_RECLAIM here. > > > mutex_unlock() > > fs_reclaim_acquire() > > mutex_lock() <- illegal: lock dep assertion > > I cannot really comment on how that is achieveable by lockdep. ??? I am trying to explain this is already done and working today. The above example will already fault with lockdep enabled. This is existing debugging we can use and improve upon rather that invent new debugging. Jason
On Thu, Aug 15, 2019 at 03:01:59PM -0300, Jason Gunthorpe wrote: > On Thu, Aug 15, 2019 at 01:39:22PM -0400, Jerome Glisse wrote: > > On Thu, Aug 15, 2019 at 02:35:57PM -0300, Jason Gunthorpe wrote: > > > On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote: > > > > > > > I'm not really well versed in the details of our userptr, but both > > > > amdgpu and i915 wait for the gpu to complete from > > > > invalidate_range_start. Jerome has at least looked a lot at the amdgpu > > > > one, so maybe he can explain what exactly it is we're doing ... > > > > > > amdgpu is (wrongly) using hmm for something, I can't really tell what > > > it is trying to do. The calls to dma_fence under the > > > invalidate_range_start do not give me a good feeling. > > > > > > However, i915 shows all the signs of trying to follow the registration > > > cache model, it even has a nice comment in > > > i915_gem_userptr_get_pages() explaining that the races it has don't > > > matter because it is a user space bug to change the VA mapping in the > > > first place. That just screams registration cache to me. > > > > > > So it is fine to run HW that way, but if you do, there is no reason to > > > fence inside the invalidate_range end. Just orphan the DMA buffer and > > > clean it up & release the page pins when all DMA buffer refs go to > > > zero. The next access to that VA should get a new DMA buffer with the > > > right mapping. > > > > > > In other words the invalidation should be very simple without > > > complicated locking, or wait_event's. Look at hfi1 for example. > > > > This would break the today usage model of uptr and it will > > break userspace expectation ie if GPU is writting to that > > memory and that memory then the userspace want to make sure > > that it will see what the GPU write. > > How exactly? This is holding the page pin, so the only way the VA > mapping can be changed is via explicit user action. > > ie: > > gpu_write_something(va, size) > mmap(.., va, size, MMAP_FIXED); > gpu_wait_done() > > This is racy and indeterminate with both models. > > Based on the comment in i915 it appears to be going on the model that > changes to the mmap by userspace when the GPU is working on it is a > programming bug. This is reasonable, lots of systems use this kind of > consistency model. Well userspace process doing munmap(), mremap(), fork() and things like that are a bug from the i915 kernel and userspace contract point of view. But things like migration or reclaim are not cover under that contract and for those the expectation is that CPU access to the same virtual address should allow to get what was last written to it either by the GPU or the CPU. > > Since the driver seems to rely on a dma_fence to block DMA access, it > looks to me like the kernel has full visibility to the > 'gpu_write_something()' and 'gpu_wait_done()' markers. So let's only consider the case where GPU wants to write to the memory (the read only case is obviously right and not need any notifier in fact) and like above the only thing we care about is reclaim or migration (for instance because of some numa compaction) as the rest is cover by i915 userspace contract. So in the write case we do GUPfast(write=true) which will be "safe" from any concurrent CPU page table update ie if GUPfast get a reference for the page then any racing CPU page table update will not be able to migrate or reclaim the page and thus the virtual address to page association will be preserve. This is only true because of GUPfast(), now if GUPfast() fails it will fallback to the slow GUP case which make the same thing safe by taking the page table lock. Because of the reference on the page the i915 driver can forego the mmu notifier end callback. The thing here is that taking a page reference is pointless if we have better synchronization and tracking of mmu notifier. Hence converting to hmm mirror allows to avoid taking a ref on the page while still keeping the same functionality as of today. > I think trying to use hmm_range_fault on HW that can't do HW page > faulting and HW 'TLB shootdown' is a very, very bad idea. I fear that > is what amd gpu is trying to do. > > I haven't yet seen anything that looks like 'TLB shootdown' in i915?? GPU driver have complex usage pattern the tlb shootdown is implicit once the GEM object associated with the uptr is invalidated it means next time userspace submit command against that GEM object it will have to re-validate it which means re-program the GPU page table to point to the proper address (and re-call GUP). So the whole GPU page table update is all hidden behind GEM function like i915_gem_object_unbind() (or ttm invalidate for amd/radeon). Technicaly those GPU do not support page faulting but because of the way GPU works you know that you have frequent checkpoint where you can unbind things. This is what is happening here. Also not all GPU engines can handle page fault, this is true of all GPU with page fault AFAIK (AMD and NVidia so far). This is why uptr is a limited form of SVM (share virtual memory) that can be use on all GPUs engine including the dma engine. When using the full SVM power (like hmm mirror with nouveau) this is only use in GPU engine that supports page fault (but updating the page table still require locking and waiting on acknowledge). Cheers, Jérôme
On Thu, Aug 15, 2019 at 02:27:19PM -0400, Jerome Glisse wrote: > > How exactly? This is holding the page pin, so the only way the VA > > mapping can be changed is via explicit user action. > > > > ie: > > > > gpu_write_something(va, size) > > mmap(.., va, size, MMAP_FIXED); > > gpu_wait_done() > > > > This is racy and indeterminate with both models. > > > > Based on the comment in i915 it appears to be going on the model that > > changes to the mmap by userspace when the GPU is working on it is a > > programming bug. This is reasonable, lots of systems use this kind of > > consistency model. > > Well userspace process doing munmap(), mremap(), fork() and things like > that are a bug from the i915 kernel and userspace contract point of view. > > But things like migration or reclaim are not cover under that contract > and for those the expectation is that CPU access to the same virtual address > should allow to get what was last written to it either by the GPU or the > CPU. Okay, this is a more reasonable point - I agree the i915 registration cache model precludes using migration and thus DEVICE_PRIVATE. This is a strong motivation to use the hmm approach But we started out this converstation asking if i915 is correct, and I still say a registration cache model is a functionally correct way to use notifiers. > Because of the reference on the page the i915 driver can forego the mmu > notifier end callback. The thing here is that taking a page reference > is pointless if we have better synchronization and tracking of mmu > notifier. Hence converting to hmm mirror allows to avoid taking a ref > on the page while still keeping the same functionality as of today. However, there is a huge trade off here. Drivers like this are going to have a very complicated locking inside invalidate_range_start as they must sleep waiting for dma buffer references to go to zero. > GPU driver have complex usage pattern the tlb shootdown is implicit > once the GEM object associated with the uptr is invalidated it means > next time userspace submit command against that GEM object it will > have to re-validate it which means re-program the GPU page table to > point to the proper address (and re-call GUP). I think it is a mistake to try and cram the very different approaches to notifiers into the same API. SW ref counting of DMA buffers vs HW async page faulting have totally different requirements and locking schemes. This explains why AMDGPU gets away with not using the hmm API properly, it is probably relying on its DMA refcount, not the hmm valid, to keep things in order? I think the API approach in hmm_mirror is reasonable for page faulting HW, but does not serve refcounting HW well at all. Jason
On Thu 15-08-19 15:24:48, Jason Gunthorpe wrote: > On Thu, Aug 15, 2019 at 07:42:07PM +0200, Michal Hocko wrote: > > On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote: > > > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote: > > > > > > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and > > > > > __GFP_DIRECT_RECLAIM.. > > > > > > > > > > This matches the existing test in __need_fs_reclaim() - so if you are > > > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(), > > > > > allocations during OOM, then I think fs_reclaim already matches what > > > > > you described? > > > > > > > > No GFP_NOFS is equally bad. Please read my other email explaining what > > > > the oom_reaper actually requires. In short no blocking on direct or > > > > indirect dependecy on memory allocation that might sleep. > > > > > > It is much easier to follow with some hints on code, so the true > > > requirement is that the OOM repear not block on GFP_FS and GFP_IO > > > allocations, great, that constraint is now clear. > > > > I still do not get why do you put FS/IO into the picture. This is really > > about __GFP_DIRECT_RECLAIM. > > Like I said this is complicated, translating "no blocking on direct or > indirect dependecy on memory allocation that might sleep" into GFP > flags is hard for us outside the mm community. > > So the contraint here is no __GFP_DIRECT_RECLAIM? OK, I am obviously failing to explain that. Sorry about that. You are right that this is not simple. Let me try again. The context we are calling !blockable notifiers from has to finish in a _finite_ amount of time (and swift is hugely appreciated by users of otherwise non-responsive system that is under OOM). We are out of memory so we cannot be blocked waiting for memory. Directly or indirectly (via a lock, waiting for an event that needs memory to finish in general). So you need to track dependency over more complicated contexts than the direct call path (think of workqueue for example). > I bring up FS/IO because that is what Tejun mentioned when I asked him > about reclaim restrictions, and is what fs_reclaim_acquire() is > already sensitive too. It is pretty confusing if we have places using > the word 'reclaim' with different restrictions. :( fs_reclaim has been invented to catch potential deadlocks when a GFP_NO{FS/IO} allocation hits into fs/io reclaim. This is a subset of the reclaim that we have. The oom context is more restricted and it cannot depend on _any_ memory reclaim because by the time we have got to this context all the reclaim has already failed and wait some more will simply not help. > > > CPU0 CPU1 > > > mutex_lock() > > > kmalloc(GFP_KERNEL) > > > > no I mean __GFP_DIRECT_RECLAIM here. > > > > > mutex_unlock() > > > fs_reclaim_acquire() > > > mutex_lock() <- illegal: lock dep assertion > > > > I cannot really comment on how that is achieveable by lockdep. > > ??? I am trying to explain this is already done and working today. The > above example will already fault with lockdep enabled. > > This is existing debugging we can use and improve upon rather that > invent new debugging. This is what you claim and I am saying that fs_reclaim is about a restricted reclaim context and it is an ugly hack. It has proven to report false positives. Maybe it can be extended to a generic reclaim. I haven't tried that. Do not aim to try it. -- Michal Hocko SUSE Labs
On Thu, Aug 15, 2019 at 09:05:25PM +0200, Michal Hocko wrote:
> This is what you claim and I am saying that fs_reclaim is about a
> restricted reclaim context and it is an ugly hack. It has proven to
> report false positives. Maybe it can be extended to a generic reclaim.
> I haven't tried that. Do not aim to try it.
Okay, great, I think this has been very helpful, at least for me,
thanks. I did not know fs_reclaim was so problematic, or the special
cases about OOM 'reclaim'.
On this patch, I have no general objection to enforcing drivers to be
non-blocking, I'd just like to see it done with the existing lockdep
can't sleep detection rather than inventing some new debugging for it.
I understand this means the debugging requires lockdep enabled and
will not run in production, but I'm of the view that is OK and in line
with general kernel practice.
The last detail is I'm still unclear what a GFP flags a blockable
invalidate_range_start() should use. Is GFP_KERNEL OK? Lockdep has
complained on that in past due to fs_reclaim - how do you know if it
is a false positive?
Thanks again,
Jason
On Thu 15-08-19 16:18:10, Jason Gunthorpe wrote: > On Thu, Aug 15, 2019 at 09:05:25PM +0200, Michal Hocko wrote: > > > This is what you claim and I am saying that fs_reclaim is about a > > restricted reclaim context and it is an ugly hack. It has proven to > > report false positives. Maybe it can be extended to a generic reclaim. > > I haven't tried that. Do not aim to try it. > > Okay, great, I think this has been very helpful, at least for me, > thanks. I did not know fs_reclaim was so problematic, or the special > cases about OOM 'reclaim'. I am happy that this is more clear now. > On this patch, I have no general objection to enforcing drivers to be > non-blocking, I'd just like to see it done with the existing lockdep > can't sleep detection rather than inventing some new debugging for it. > > I understand this means the debugging requires lockdep enabled and > will not run in production, but I'm of the view that is OK and in line > with general kernel practice. Yes and I do agree with this in general. > The last detail is I'm still unclear what a GFP flags a blockable > invalidate_range_start() should use. Is GFP_KERNEL OK? I hope I will not make this muddy again ;) invalidate_range_start in the blockable mode can use/depend on any sleepable allocation allowed in the context it is called from. So in other words it is no different from any other function in the kernel that calls into allocator. As the API is missing gfp context then I hope it is not called from any restricted contexts (except from the oom which we have !blockable for). > Lockdep has > complained on that in past due to fs_reclaim - how do you know if it > is a false positive? I would have to see the specific lockdep splat. -- Michal Hocko SUSE Labs
On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote: > > The last detail is I'm still unclear what a GFP flags a blockable > > invalidate_range_start() should use. Is GFP_KERNEL OK? > > I hope I will not make this muddy again ;) > invalidate_range_start in the blockable mode can use/depend on any sleepable > allocation allowed in the context it is called from. 'in the context is is called from' is the magic phrase, as invalidate_range_start is called while holding several different mm related locks. I know at least write mmap_sem and i_mmap_rwsem (write?) Can GFP_KERNEL be called while holding those locks? This is the question of indirect dependency on reclaim via locks you raised earlier. > So in other words it is no different from any other function in the > kernel that calls into allocator. As the API is missing gfp context > then I hope it is not called from any restricted contexts (except > from the oom which we have !blockable for). Yes, the callers are exactly my concern. > > Lockdep has > > complained on that in past due to fs_reclaim - how do you know if it > > is a false positive? > > I would have to see the specific lockdep splat. See below. I found it when trying to understand why the registration of the mmu notififer was so oddly coded. The situation was: down_write(&mm->mmap_sem); mm_take_all_locks(mm); kmalloc(GFP_KERNEL); <--- lockdep warning I understood Daniel said he saw this directly on a recent kernel when working with his lockdep patch? Checking myself, on todays kernel I see a call chain: shrink_all_memory fs_reclaim_acquire(sc.gfp_mask); [..] do_try_to_free_pages shrink_zones shrink_node shrink_node_memcg shrink_list shrink_active_list page_referenced rmap_walk rmap_walk_file i_mmap_lock_read down_read(i_mmap_rwsem) So it is possible that the down_read() above will block on i_mmap_rwsem being held in the caller of invalidate_range_start which is doing kmalloc(GPF_KERNEL). Is this OK? The lockdep annotation says no.. Jason commit 35cfa2b0b491c37e23527822bf365610dbb188e5 Author: Gavin Shan <shangw@linux.vnet.ibm.com> Date: Thu Oct 25 13:38:01 2012 -0700 mm/mmu_notifier: allocate mmu_notifier in advance While allocating mmu_notifier with parameter GFP_KERNEL, swap would start to work in case of tight available memory. Eventually, that would lead to a deadlock while the swap deamon swaps anonymous pages. It was caused by commit e0f3c3f78da29b ("mm/mmu_notifier: init notifier if necessary"). ================================= [ INFO: inconsistent lock state ] 3.7.0-rc1+ #518 Not tainted --------------------------------- inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. kswapd0/35 [HC0[0]:SC0[0]:HE1:SE1] takes: (&mapping->i_mmap_mutex){+.+.?.}, at: page_referenced+0x9c/0x2e0 {RECLAIM_FS-ON-W} state was registered at: mark_held_locks+0x86/0x150 lockdep_trace_alloc+0x67/0xc0 kmem_cache_alloc_trace+0x33/0x230 do_mmu_notifier_register+0x87/0x180 mmu_notifier_register+0x13/0x20 kvm_dev_ioctl+0x428/0x510 do_vfs_ioctl+0x98/0x570 sys_ioctl+0x91/0xb0 system_call_fastpath+0x16/0x1b
On Thu, Aug 15, 2019 at 9:35 PM Michal Hocko <mhocko@kernel.org> wrote: > > On Thu 15-08-19 16:18:10, Jason Gunthorpe wrote: > > On Thu, Aug 15, 2019 at 09:05:25PM +0200, Michal Hocko wrote: > > > > > This is what you claim and I am saying that fs_reclaim is about a > > > restricted reclaim context and it is an ugly hack. It has proven to > > > report false positives. Maybe it can be extended to a generic reclaim. > > > I haven't tried that. Do not aim to try it. > > > > Okay, great, I think this has been very helpful, at least for me, > > thanks. I did not know fs_reclaim was so problematic, or the special > > cases about OOM 'reclaim'. > > I am happy that this is more clear now. > > > On this patch, I have no general objection to enforcing drivers to be > > non-blocking, I'd just like to see it done with the existing lockdep > > can't sleep detection rather than inventing some new debugging for it. > > > > I understand this means the debugging requires lockdep enabled and > > will not run in production, but I'm of the view that is OK and in line > > with general kernel practice. > > Yes and I do agree with this in general. So if someone can explain to me how that works with lockdep I can of course implement it. But afaics that doesn't exist (I tried to explain that somewhere else already), and I'm no really looking forward to hacking also on lockdep for this little series. > > The last detail is I'm still unclear what a GFP flags a blockable > > invalidate_range_start() should use. Is GFP_KERNEL OK? > > I hope I will not make this muddy again ;) > invalidate_range_start in the blockable mode can use/depend on any sleepable > allocation allowed in the context it is called from. So in other words > it is no different from any other function in the kernel that calls into > allocator. As the API is missing gfp context then I hope it is not > called from any restricted contexts (except from the oom which we have > !blockable for). Hm, that's new to me. I thought mmu notifiers very much can be called from direct reclaim paths, so you have to be extremely careful with getting back into that one. At least the lockdep splats I remember also tend to have fs_reclaim in there, that's where all the fun comes from. > > Lockdep has > > complained on that in past due to fs_reclaim - how do you know if it > > is a false positive? > > I would have to see the specific lockdep splat. I guess the lockdep annotation for invalidate_range_start carries the same risks as the fs_reclaim annotation. Still feels like worth it. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote:
> So if someone can explain to me how that works with lockdep I can of
> course implement it. But afaics that doesn't exist (I tried to explain
> that somewhere else already), and I'm no really looking forward to
> hacking also on lockdep for this little series.
Hmm, kind of looks like it is done by calling preempt_disable()
Probably the debug option is CONFIG_DEBUG_PREEMPT, not lockdep?
Jason
On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote: > > So if someone can explain to me how that works with lockdep I can of > > course implement it. But afaics that doesn't exist (I tried to explain > > that somewhere else already), and I'm no really looking forward to > > hacking also on lockdep for this little series. > > Hmm, kind of looks like it is done by calling preempt_disable() Yup. That was v1, then came the suggestion that disabling preemption is maybe not the best thing (the oom reaper could still run for a long time comparatively, if it's cleaning out gigabytes of process memory or what not, hence this dedicated debug infrastructure). > Probably the debug option is CONFIG_DEBUG_PREEMPT, not lockdep? CONFIG_DEBUG_ATOMIC_SLEEP. Like in my patch. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Thu, 15 Aug 2019 10:44:29 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> > I continue to struggle with this. It introduces a new kernel state
> > "running preemptibly but must not call schedule()". How does this make
> > any sense?
> >
> > Perhaps a much, much more detailed description of the oom_reaper
> > situation would help out.
>
> The primary point here is that there is a demand of non blockable mmu
> notifiers to be called when the oom reaper tears down the address space.
> As the oom reaper is the primary guarantee of the oom handling forward
> progress it cannot be blocked on anything that might depend on blockable
> memory allocations. These are not really easy to track because they
> might be indirect - e.g. notifier blocks on a lock which other context
> holds while allocating memory or waiting for a flusher that needs memory
> to perform its work. If such a blocking state happens that we can end up
> in a silent hang with an unusable machine.
> Now we hope for reasonable implementations of mmu notifiers (strong
> words I know ;) and this should be relatively simple and effective catch
> all tool to detect something suspicious is going on.
>
> Does that make the situation more clear?
Yes, thanks, much. Maybe a code comment along the lines of
This is on behalf of the oom reaper, specifically when it is
calling the mmu notifiers. The problem is that if the notifier were
to block on, for example, mutex_lock() and if the process which holds
that mutex were to perform a sleeping memory allocation, the oom
reaper is now blocked on completion of that memory allocation.
btw, do we need task_struct.non_block_count? Perhaps the oom reaper
thread could set a new PF_NONBLOCK (which would be more general than
PF_OOM_REAPER). If we run out of PF_ flags, use (current == oom_reaper_th).
On Thu, Aug 15, 2019 at 10:49:31PM +0200, Daniel Vetter wrote:
> On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote:
> > > So if someone can explain to me how that works with lockdep I can of
> > > course implement it. But afaics that doesn't exist (I tried to explain
> > > that somewhere else already), and I'm no really looking forward to
> > > hacking also on lockdep for this little series.
> >
> > Hmm, kind of looks like it is done by calling preempt_disable()
>
> Yup. That was v1, then came the suggestion that disabling preemption
> is maybe not the best thing (the oom reaper could still run for a long
> time comparatively, if it's cleaning out gigabytes of process memory
> or what not, hence this dedicated debug infrastructure).
Oh, I'm coming in late, sorry
Anyhow, I was thinking since we agreed this can trigger on some
CONFIG_DEBUG flag, something like
/* This is a sleepable region, but use preempt_disable to get debugging
* for calls that are not allowed to block for OOM [.. insert
* Michal's explanation.. ] */
if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP) && !mmu_notifier_range_blockable(range))
preempt_disable();
ops->invalidate_range_start();
And I have also been idly mulling doing something like
if (IS_ENABLED(CONFIG_DEBUG_NOTIFIERS) &&
rand &&
mmu_notifier_range_blockable(range)) {
range->flags = 0
if (!ops->invalidate_range_start(range))
continue
// Failed, try again as blockable
range->flags = MMU_NOTIFIER_RANGE_BLOCKABLE
}
ops->invalidate_range_start(range);
Which would give coverage for this corner case without forcing OOM.
Jason
On Fri, Aug 16, 2019 at 3:00 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > On Thu, Aug 15, 2019 at 10:49:31PM +0200, Daniel Vetter wrote: > > On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote: > > > > So if someone can explain to me how that works with lockdep I can of > > > > course implement it. But afaics that doesn't exist (I tried to explain > > > > that somewhere else already), and I'm no really looking forward to > > > > hacking also on lockdep for this little series. > > > > > > Hmm, kind of looks like it is done by calling preempt_disable() > > > > Yup. That was v1, then came the suggestion that disabling preemption > > is maybe not the best thing (the oom reaper could still run for a long > > time comparatively, if it's cleaning out gigabytes of process memory > > or what not, hence this dedicated debug infrastructure). > > Oh, I'm coming in late, sorry > > Anyhow, I was thinking since we agreed this can trigger on some > CONFIG_DEBUG flag, something like > > /* This is a sleepable region, but use preempt_disable to get debugging > * for calls that are not allowed to block for OOM [.. insert > * Michal's explanation.. ] */ > if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP) && !mmu_notifier_range_blockable(range)) > preempt_disable(); > ops->invalidate_range_start(); I think we also discussed that, and some expressed concerns it would change behaviour/timing too much for testing. Since this does does disable preemption for real, not just for might_sleep. > And I have also been idly mulling doing something like > > if (IS_ENABLED(CONFIG_DEBUG_NOTIFIERS) && > rand && > mmu_notifier_range_blockable(range)) { > range->flags = 0 > if (!ops->invalidate_range_start(range)) > continue > > // Failed, try again as blockable > range->flags = MMU_NOTIFIER_RANGE_BLOCKABLE > } > ops->invalidate_range_start(range); > > Which would give coverage for this corner case without forcing OOM. Hm, this sounds like a neat idea to slap on top. The rand is going to be a bit tricky though, but I guess for this we could stuff another counter into task_struct and just e.g. do this every 1000th or so invalidate (well need to pick a prime so we cycle through notifiers in case there's multiple). I like. Michal, thoughts? -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote: > On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote: > > > > The last detail is I'm still unclear what a GFP flags a blockable > > > invalidate_range_start() should use. Is GFP_KERNEL OK? > > > > I hope I will not make this muddy again ;) > > invalidate_range_start in the blockable mode can use/depend on any sleepable > > allocation allowed in the context it is called from. > > 'in the context is is called from' is the magic phrase, as > invalidate_range_start is called while holding several different mm > related locks. I know at least write mmap_sem and i_mmap_rwsem > (write?) > > Can GFP_KERNEL be called while holding those locks? i_mmap_rwsem would be problematic because it is taken during the reclaim. > This is the question of indirect dependency on reclaim via locks you > raised earlier. > > > So in other words it is no different from any other function in the > > kernel that calls into allocator. As the API is missing gfp context > > then I hope it is not called from any restricted contexts (except > > from the oom which we have !blockable for). > > Yes, the callers are exactly my concern. > > > > Lockdep has > > > complained on that in past due to fs_reclaim - how do you know if it > > > is a false positive? > > > > I would have to see the specific lockdep splat. > > See below. I found it when trying to understand why the registration > of the mmu notififer was so oddly coded. > > The situation was: > > down_write(&mm->mmap_sem); > mm_take_all_locks(mm); > kmalloc(GFP_KERNEL); <--- lockdep warning Ugh. mm_take_all_locks :/ > I understood Daniel said he saw this directly on a recent kernel when > working with his lockdep patch? > > Checking myself, on todays kernel I see a call chain: > > shrink_all_memory > fs_reclaim_acquire(sc.gfp_mask); > [..] > do_try_to_free_pages > shrink_zones > shrink_node > shrink_node_memcg > shrink_list > shrink_active_list > page_referenced > rmap_walk > rmap_walk_file > i_mmap_lock_read > down_read(i_mmap_rwsem) > > So it is possible that the down_read() above will block on > i_mmap_rwsem being held in the caller of invalidate_range_start which > is doing kmalloc(GPF_KERNEL). > > Is this OK? The lockdep annotation says no.. It's not as per the above code patch which is easily possible because mm_take_all_locks will lock all file vmas. -- Michal Hocko SUSE Labs
On Thu 15-08-19 15:15:09, Andrew Morton wrote: > On Thu, 15 Aug 2019 10:44:29 +0200 Michal Hocko <mhocko@kernel.org> wrote: > > > > I continue to struggle with this. It introduces a new kernel state > > > "running preemptibly but must not call schedule()". How does this make > > > any sense? > > > > > > Perhaps a much, much more detailed description of the oom_reaper > > > situation would help out. > > > > The primary point here is that there is a demand of non blockable mmu > > notifiers to be called when the oom reaper tears down the address space. > > As the oom reaper is the primary guarantee of the oom handling forward > > progress it cannot be blocked on anything that might depend on blockable > > memory allocations. These are not really easy to track because they > > might be indirect - e.g. notifier blocks on a lock which other context > > holds while allocating memory or waiting for a flusher that needs memory > > to perform its work. If such a blocking state happens that we can end up > > in a silent hang with an unusable machine. > > Now we hope for reasonable implementations of mmu notifiers (strong > > words I know ;) and this should be relatively simple and effective catch > > all tool to detect something suspicious is going on. > > > > Does that make the situation more clear? > > Yes, thanks, much. Maybe a code comment along the lines of > > This is on behalf of the oom reaper, specifically when it is > calling the mmu notifiers. The problem is that if the notifier were > to block on, for example, mutex_lock() and if the process which holds > that mutex were to perform a sleeping memory allocation, the oom > reaper is now blocked on completion of that memory allocation. reaper is now blocked on completion of that memory allocation and cannot provide the guarantee of the OOM forward progress. OK. > btw, do we need task_struct.non_block_count? Perhaps the oom reaper > thread could set a new PF_NONBLOCK (which would be more general than > PF_OOM_REAPER). If we run out of PF_ flags, use (current == oom_reaper_th). Well, I do not have a strong opinion here. A simple check for the value seems to be trivial. There are quite some holes in task_struct to hide this counter without increasing the size. -- Michal Hocko SUSE Labs
On Thu 15-08-19 22:16:43, Daniel Vetter wrote: > On Thu, Aug 15, 2019 at 9:35 PM Michal Hocko <mhocko@kernel.org> wrote: [...] > > > The last detail is I'm still unclear what a GFP flags a blockable > > > invalidate_range_start() should use. Is GFP_KERNEL OK? > > > > I hope I will not make this muddy again ;) > > invalidate_range_start in the blockable mode can use/depend on any sleepable > > allocation allowed in the context it is called from. So in other words > > it is no different from any other function in the kernel that calls into > > allocator. As the API is missing gfp context then I hope it is not > > called from any restricted contexts (except from the oom which we have > > !blockable for). > > Hm, that's new to me. I thought mmu notifiers very much can be called > from direct reclaim paths, so you have to be extremely careful with > getting back into that one. Correct, I should have added that notifier callbacks ideally do not allocate any memory. They can block and even that is quite a pain to be honest. -- Michal Hocko SUSE Labs
On Fri, Aug 16, 2019 at 08:20:55AM +0200, Daniel Vetter wrote:
> On Fri, Aug 16, 2019 at 3:00 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Thu, Aug 15, 2019 at 10:49:31PM +0200, Daniel Vetter wrote:
> > > On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote:
> > > > > So if someone can explain to me how that works with lockdep I can of
> > > > > course implement it. But afaics that doesn't exist (I tried to explain
> > > > > that somewhere else already), and I'm no really looking forward to
> > > > > hacking also on lockdep for this little series.
> > > >
> > > > Hmm, kind of looks like it is done by calling preempt_disable()
> > >
> > > Yup. That was v1, then came the suggestion that disabling preemption
> > > is maybe not the best thing (the oom reaper could still run for a long
> > > time comparatively, if it's cleaning out gigabytes of process memory
> > > or what not, hence this dedicated debug infrastructure).
> >
> > Oh, I'm coming in late, sorry
> >
> > Anyhow, I was thinking since we agreed this can trigger on some
> > CONFIG_DEBUG flag, something like
> >
> > /* This is a sleepable region, but use preempt_disable to get debugging
> > * for calls that are not allowed to block for OOM [.. insert
> > * Michal's explanation.. ] */
> > if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP) && !mmu_notifier_range_blockable(range))
> > preempt_disable();
> > ops->invalidate_range_start();
>
> I think we also discussed that, and some expressed concerns it would
> change behaviour/timing too much for testing. Since this does does
> disable preemption for real, not just for might_sleep.
I don't follow, this is a debug kernel, it will have widly different
timing.
Further the point of this debugging on atomic_sleep is to be as
timing-independent as possible since functions with rare sleeps should
be guarded by might_sleep() in their common paths.
I guess I don't get the push to have some low overhead debugging for
this? Is there something special you are looking for?
Jason
On Fri, Aug 16, 2019 at 10:10:29AM +0200, Michal Hocko wrote:
> On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote:
> >
> > > > The last detail is I'm still unclear what a GFP flags a blockable
> > > > invalidate_range_start() should use. Is GFP_KERNEL OK?
> > >
> > > I hope I will not make this muddy again ;)
> > > invalidate_range_start in the blockable mode can use/depend on any sleepable
> > > allocation allowed in the context it is called from.
> >
> > 'in the context is is called from' is the magic phrase, as
> > invalidate_range_start is called while holding several different mm
> > related locks. I know at least write mmap_sem and i_mmap_rwsem
> > (write?)
> >
> > Can GFP_KERNEL be called while holding those locks?
>
> i_mmap_rwsem would be problematic because it is taken during the
> reclaim.
Okay.. So the fs_reclaim debugging does catch errors. Do you have any
reference for what a false positive looks like?
I would like to inject it into the notifier path as this is very
difficult for driver authors to discover and know about, but I'm
worried about your false positive remark.
I think I understand we can use only GFP_ATOMIC in the notifiers, but
we need a strategy to handle OOM to guarentee forward progress.
This is just more bugs to fix :(
Jason
On Fri 16-08-19 09:19:06, Jason Gunthorpe wrote: > On Fri, Aug 16, 2019 at 10:10:29AM +0200, Michal Hocko wrote: > > On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote: > > > On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote: > > > > > > > > The last detail is I'm still unclear what a GFP flags a blockable > > > > > invalidate_range_start() should use. Is GFP_KERNEL OK? > > > > > > > > I hope I will not make this muddy again ;) > > > > invalidate_range_start in the blockable mode can use/depend on any sleepable > > > > allocation allowed in the context it is called from. > > > > > > 'in the context is is called from' is the magic phrase, as > > > invalidate_range_start is called while holding several different mm > > > related locks. I know at least write mmap_sem and i_mmap_rwsem > > > (write?) > > > > > > Can GFP_KERNEL be called while holding those locks? > > > > i_mmap_rwsem would be problematic because it is taken during the > > reclaim. > > Okay.. So the fs_reclaim debugging does catch errors. I do not think fs_reclaim is the udnerlying mechanism to catch this deadlock. It is a simple AA deadlock. You take i_mmap_rwsem and then go down the allocation path, direct reclaim and take the lock again. Nothing really surprising. fs_reclaim is really to catch GFP_NOFS context calling into a less restricted (e.g. GFP_KERNEL allocation context). > Do you have any > reference for what a false positive looks like? I believe I have given some examples when introducing __GFP_NOLOCKDEP. > I would like to inject it into the notifier path as this is very > difficult for driver authors to discover and know about, but I'm > worried about your false positive remark. > > I think I understand we can use only GFP_ATOMIC in the notifiers, but > we need a strategy to handle OOM to guarentee forward progress. Your example is from the notifier registration IIUC. Can you pre-allocate before taking locks? Could you point me to some examples when the allocation is necessary in the range notifier callback? -- Michal Hocko SUSE Labs
On Fri, Aug 16, 2019 at 2:12 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Aug 16, 2019 at 08:20:55AM +0200, Daniel Vetter wrote: > > On Fri, Aug 16, 2019 at 3:00 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > On Thu, Aug 15, 2019 at 10:49:31PM +0200, Daniel Vetter wrote: > > > > On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote: > > > > > > So if someone can explain to me how that works with lockdep I can of > > > > > > course implement it. But afaics that doesn't exist (I tried to explain > > > > > > that somewhere else already), and I'm no really looking forward to > > > > > > hacking also on lockdep for this little series. > > > > > > > > > > Hmm, kind of looks like it is done by calling preempt_disable() > > > > > > > > Yup. That was v1, then came the suggestion that disabling preemption > > > > is maybe not the best thing (the oom reaper could still run for a long > > > > time comparatively, if it's cleaning out gigabytes of process memory > > > > or what not, hence this dedicated debug infrastructure). > > > > > > Oh, I'm coming in late, sorry > > > > > > Anyhow, I was thinking since we agreed this can trigger on some > > > CONFIG_DEBUG flag, something like > > > > > > /* This is a sleepable region, but use preempt_disable to get debugging > > > * for calls that are not allowed to block for OOM [.. insert > > > * Michal's explanation.. ] */ > > > if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP) && !mmu_notifier_range_blockable(range)) > > > preempt_disable(); > > > ops->invalidate_range_start(); > > > > I think we also discussed that, and some expressed concerns it would > > change behaviour/timing too much for testing. Since this does does > > disable preemption for real, not just for might_sleep. > > I don't follow, this is a debug kernel, it will have widly different > timing. > > Further the point of this debugging on atomic_sleep is to be as > timing-independent as possible since functions with rare sleeps should > be guarded by might_sleep() in their common paths. > > I guess I don't get the push to have some low overhead debugging for > this? Is there something special you are looking for? Don't ask me, I'm just trying to get _some_ debugging for this in. I don't care one bit how much overhead it has because in our CI our debug build has lockdep and everything and it crawls anyway. I started out with the preempt_disable/enable thing like you suggested, and a few rounds later we're here. We can go back to v1 on this one, but I'd prefer to not do the lap too often. Also, aside from this patch (which is prep for the next) and some simple reordering conflicts they're all independent. So if there's no way to paint this bikeshed here (technicolor perhaps?) then I'd like to get at least the others considered. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Fri, Aug 16, 2019 at 04:11:34PM +0200, Daniel Vetter wrote:
> Also, aside from this patch (which is prep for the next) and some
> simple reordering conflicts they're all independent. So if there's no
> way to paint this bikeshed here (technicolor perhaps?) then I'd like
> to get at least the others considered.
Sure, I think for conflict avoidance reasons I'm probably taking
mmu_notifier stuff via hmm.git, so:
- Andrew had a minor remark on #1, I am ambivalent and would take it
as-is. Your decision if you want to respin.
- #2/#3 is this issue, I would stand by the preempt_disable/etc path
Our situation matches yours, debug tests run lockdep/etc.
- #4 I like a lot, except the map should enclose range_end too,
this can be done after the mm_has_notifiers inside the
__mmu_notifier function
Can you respin?
I will propose preloading the map in another patch
- #5 is already applied in -rc
Jason
On Fri, Aug 16, 2019 at 4:38 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Aug 16, 2019 at 04:11:34PM +0200, Daniel Vetter wrote: > > Also, aside from this patch (which is prep for the next) and some > > simple reordering conflicts they're all independent. So if there's no > > way to paint this bikeshed here (technicolor perhaps?) then I'd like > > to get at least the others considered. > > Sure, I think for conflict avoidance reasons I'm probably taking > mmu_notifier stuff via hmm.git, so: > > - Andrew had a minor remark on #1, I am ambivalent and would take it > as-is. Your decision if you want to respin. I like mine better, see also the reply from Ralph Campbell. > - #2/#3 is this issue, I would stand by the preempt_disable/etc path > Our situation matches yours, debug tests run lockdep/etc. Since Michal requested the current flavour I think we need spin a bit more on these here. I guess I'll just rebase them to the end so they're not holding up the others. > - #4 I like a lot, except the map should enclose range_end too, > this can be done after the mm_has_notifiers inside the > __mmu_notifier function To make sure I get this right: The same lockdep context, but also wrapped around invalidate_range_end? From my understanding of pte zapping that makes sense, but I'm definitely not well-versed enough for that. > Can you respin? Will do. > I will propose preloading the map in another patch > - #5 is already applied in -rc Yup, I'll drop that one. Thanks, Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Fri, Aug 16, 2019 at 06:36:52PM +0200, Daniel Vetter wrote: > On Fri, Aug 16, 2019 at 4:38 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > On Fri, Aug 16, 2019 at 04:11:34PM +0200, Daniel Vetter wrote: > > > Also, aside from this patch (which is prep for the next) and some > > > simple reordering conflicts they're all independent. So if there's no > > > way to paint this bikeshed here (technicolor perhaps?) then I'd like > > > to get at least the others considered. > > > > Sure, I think for conflict avoidance reasons I'm probably taking > > mmu_notifier stuff via hmm.git, so: > > > > - Andrew had a minor remark on #1, I am ambivalent and would take it > > as-is. Your decision if you want to respin. > > I like mine better, see also the reply from Ralph Campbell. Sure > > - #2/#3 is this issue, I would stand by the preempt_disable/etc path > > Our situation matches yours, debug tests run lockdep/etc. > > Since Michal requested the current flavour I think we need spin a bit > more on these here. I guess I'll just rebase them to the end so > they're not holding up the others. > > > - #4 I like a lot, except the map should enclose range_end too, > > this can be done after the mm_has_notifiers inside the > > __mmu_notifier function > > To make sure I get this right: The same lockdep context, but also > wrapped around invalidate_range_end? Yes, the locking context of _range_start and _range_end should be identical, last time I checked callers this was the case. So, just add it to __mmu_notifier_invalidate_range_end() outside the SRCU as there is no reason to burden debug kernel callers twice when mmu notifiers are not enabled Jason
On Wed, Aug 14, 2019 at 10:20:23PM +0200, Daniel Vetter wrote:
> Just a bit of paranoia, since if we start pushing this deep into
> callchains it's hard to spot all places where an mmu notifier
> implementation might fail when it's not allowed to.
>
> Inspired by some confusion we had discussing i915 mmu notifiers and
> whether we could use the newly-introduced return value to handle some
> corner cases. Until we realized that these are only for when a task
> has been killed by the oom reaper.
>
> An alternative approach would be to split the callback into two
> versions, one with the int return value, and the other with void
> return value like in older kernels. But that's a lot more churn for
> fairly little gain I think.
>
> Summary from the m-l discussion on why we want something at warning
> level: This allows automated tooling in CI to catch bugs without
> humans having to look at everything. If we just upgrade the existing
> pr_info to a pr_warn, then we'll have false positives. And as-is, no
> one will ever spot the problem since it's lost in the massive amounts
> of overall dmesg noise.
>
> v2: Drop the full WARN_ON backtrace in favour of just a pr_warn for
> the problematic case (Michal Hocko).
>
> v3: Rebase on top of Glisse's arg rework.
>
> v4: More rebase on top of Glisse reworking everything.
>
> v5: Fixup rebase damage and also catch failures != EAGAIN for
> !blockable (Jason). Also go back to WARN_ON as requested by Jason, so
> automatic checkers can easily catch bugs by setting panic_on_warn.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: linux-mm@kvack.org
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
> mm/mmu_notifier.c | 2 ++
> 1 file changed, 2 insertions(+)
Applied to hmm.git, thanks
Jason
On Sat, Aug 17, 2019 at 5:26 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Thu, Aug 15, 2019 at 09:02:49AM +0200, Daniel Vetter wrote: > > On Wed, Aug 14, 2019 at 09:00:29PM -0300, Jason Gunthorpe wrote: > > > On Wed, Aug 14, 2019 at 10:20:25PM +0200, Daniel Vetter wrote: > > > > We need to make sure implementations don't cheat and don't have a > > > > possible schedule/blocking point deeply burried where review can't > > > > catch it. > > > > > > > > I'm not sure whether this is the best way to make sure all the > > > > might_sleep() callsites trigger, and it's a bit ugly in the code flow. > > > > But it gets the job done. > > > > > > > > Inspired by an i915 patch series which did exactly that, because the > > > > rules haven't been entirely clear to us. > > > > > > I thought lockdep already was able to detect: > > > > > > spin_lock() > > > might_sleep(); > > > spin_unlock() > > > > > > Am I mistaken? If yes, couldn't this patch just inject a dummy lockdep > > > spinlock? > > > > Hm ... assuming I didn't get lost in the maze I think might_sleep (well > > ___might_sleep) doesn't do any lockdep checking at all. And we want > > might_sleep, since that catches a lot more than lockdep. > > Don't know how it works, but it sure looks like it does: > > This: > spin_lock(&file->uobjects_lock); > down_read(&file->hw_destroy_rwsem); > up_read(&file->hw_destroy_rwsem); > spin_unlock(&file->uobjects_lock); > > Causes: > > [ 33.324729] BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:1444 > [ 33.325599] in_atomic(): 1, irqs_disabled(): 0, pid: 247, name: ibv_devinfo > [ 33.326115] 3 locks held by ibv_devinfo/247: > [ 33.326556] #0: 000000009edf8379 (&uverbs_dev->disassociate_srcu){....}, at: ib_uverbs_open+0xff/0x5f0 [ib_uverbs] > [ 33.327657] #1: 000000005e0eddf1 (&uverbs_dev->lists_mutex){+.+.}, at: ib_uverbs_open+0x16c/0x5f0 [ib_uverbs] > [ 33.328682] #2: 00000000505f509e (&(&file->uobjects_lock)->rlock){+.+.}, at: ib_uverbs_open+0x31a/0x5f0 [ib_uverbs] > > And this: > > spin_lock(&file->uobjects_lock); > might_sleep(); > spin_unlock(&file->uobjects_lock); > > Causes: > > [ 16.867211] BUG: sleeping function called from invalid context at drivers/infiniband/core/uverbs_main.c:1095 > [ 16.867776] in_atomic(): 1, irqs_disabled(): 0, pid: 245, name: ibv_devinfo > [ 16.868098] 3 locks held by ibv_devinfo/245: > [ 16.868383] #0: 000000004c5954ff (&uverbs_dev->disassociate_srcu){....}, at: ib_uverbs_open+0xf8/0x600 [ib_uverbs] > [ 16.868938] #1: 0000000020a6fae2 (&uverbs_dev->lists_mutex){+.+.}, at: ib_uverbs_open+0x16c/0x600 [ib_uverbs] > [ 16.869568] #2: 00000000036e6a97 (&(&file->uobjects_lock)->rlock){+.+.}, at: ib_uverbs_open+0x317/0x600 [ib_uverbs] > > I think this is done in some very expensive way, so it probably only > works when lockdep is enabled.. This is the might_sleep debug infrastructure (both of them), not lockdep. Disable CONFIG_PROVE_LOCKING and you should still get these. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch