From: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
To: Michal Hocko <mhocko@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
	akpm@linux-foundation.org, linux-mm@kvack.org
Subject: Re: [PATCH] mm,oom: Teach lockdep about oom_lock.
Date: Tue, 12 Mar 2019 23:06:33 +0900
Message-ID: <d9b49a08-5d5a-ec4a-7cb7-c268999a9906@i-love.sakura.ne.jp>
In-Reply-To: <20190311103012.GB5232@dhcp22.suse.cz>

On 2019/03/11 19:30, Michal Hocko wrote:
> On Sat 09-03-19 15:02:22, Tetsuo Handa wrote:
>> Since a thread which succeeded to hold oom_lock must not involve blocking
>> memory allocations, teach lockdep to consider that blocking memory
>> allocations might wait for oom_lock at as early location as possible, and
>> teach lockdep to consider that oom_lock is held by mutex_lock() than by
>> mutex_trylock().
> 
> This is still really hard to understand, especially the last part of
> the sentence. Lockdep will know that the lock is held even when it is
> taken via trylock. I guess you meant to say that
> 	mutex_lock(oom_lock)
> 	  allocation
> 	    mutex_trylock(oom_lock)
> is not caught by the lockdep, right?

Right.
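
For illustration, a minimal sketch of the sequence lockdep does not catch
today (the GFP_KERNEL allocation site is hypothetical, e.g. something
called with oom_lock held):

	mutex_lock(&oom_lock);
	kmalloc(size, GFP_KERNEL);        /* enters direct reclaim */
	  __alloc_pages_may_oom()
	    mutex_trylock(&oom_lock);     /* always fails; we hold it */
	    /* the allocation retries forever; oom_lock is never released */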

> 
>> Also, since the OOM killer is disabled until the OOM reaper or exit_mmap()
>> sets MMF_OOM_SKIP, teach lockdep to consider that oom_lock is held when
>> __oom_reap_task_mm() is called.
> 
> It would be good to mention that the oom reaper acts as a guarantee of
> forward progress and as such it cannot depend on any memory allocation,
> and that is why this context is marked. This would be easier to
> understand IMHO.

OK. Here is the v3 patch.

From 250bbe28bc3e9946992d960bb90a351a896a543b Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Tue, 12 Mar 2019 22:58:41 +0900
Subject: [PATCH v3] mm,oom: Teach lockdep about oom_lock.

Since a thread that has succeeded in holding oom_lock must not perform
blocking memory allocations, teach lockdep, at the earliest possible
point in the allocation path, that a blocking memory allocation might
wait for oom_lock.

Lockdep cannot detect the possibility of a deadlock when
mutex_trylock(&oom_lock) fails, because it assumes that somebody else is
still able to make forward progress. Thus, teach lockdep to treat
mutex_trylock(&oom_lock) as if it were mutex_lock(&oom_lock).
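
In terms of lockdep annotations, a successful trylock is therefore
re-recorded as if the lock had been taken by mutex_lock(); a sketch of
what the oom_reclaim_*() helpers below boil down to:

	/* after a successful mutex_trylock(&oom_lock) ... */
	mutex_release(&oom_lock.dep_map, 1, _THIS_IP_);
	mutex_acquire(&oom_lock.dep_map, 0, 0, _THIS_IP_);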

Since the OOM killer is disabled while __oom_reap_task_mm() is in
progress, and the OOM reaper acts as a guarantee of forward progress, a
thread calling __oom_reap_task_mm() must not perform blocking memory
allocations. Teach lockdep about that as well.

This patch should not cause lockdep splats unless somebody is doing
dangerous things (e.g. blocking allocations from OOM notifiers or from
the OOM reaper).
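
As an example of such a dangerous thing, a (hypothetical) OOM notifier
like the one below would now be flagged, because OOM notifiers are
invoked from out_of_memory() with oom_lock held:

	static int my_oom_notify(struct notifier_block *nb,
				 unsigned long unused, void *parm)
	{
		/* A blocking allocation while oom_lock is held. */
		kfree(kmalloc(PAGE_SIZE, GFP_KERNEL));
		return NOTIFY_OK;
	}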

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/oom.h | 12 ++++++++++++
 mm/oom_kill.c       | 28 +++++++++++++++++++++++++++-
 mm/page_alloc.c     | 16 ++++++++++++++++
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index d079920..04aa46b 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -56,6 +56,18 @@ struct oom_control {
 
 extern struct mutex oom_lock;
 
+static inline void oom_reclaim_acquire(gfp_t gfp_mask)
+{
+	if (gfp_mask & __GFP_DIRECT_RECLAIM)
+		mutex_acquire(&oom_lock.dep_map, 0, 0, _THIS_IP_);
+}
+
+static inline void oom_reclaim_release(gfp_t gfp_mask)
+{
+	if (gfp_mask & __GFP_DIRECT_RECLAIM)
+		mutex_release(&oom_lock.dep_map, 1, _THIS_IP_);
+}
+
 static inline void set_current_oom_origin(void)
 {
 	current->signal->oom_flag_origin = true;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3a24848..6f53bb6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -513,6 +513,14 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 	 */
 	set_bit(MMF_UNSTABLE, &mm->flags);
 
+	/*
+	 * Since this function acts as a guarantee of forward progress, the
+	 * current thread must not perform (even indirectly via a dependency)
+	 * a __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocation here, because
+	 * such an allocation would have to wait for this function to complete
+	 * once __alloc_pages_may_oom() is called.
+	 */
+	oom_reclaim_acquire(GFP_KERNEL);
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
 		if (!can_madv_dontneed_vma(vma))
 			continue;
@@ -544,6 +552,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 			tlb_finish_mmu(&tlb, range.start, range.end);
 		}
 	}
+	oom_reclaim_release(GFP_KERNEL);
 
 	return ret;
 }
@@ -1120,8 +1129,25 @@ void pagefault_out_of_memory(void)
 	if (mem_cgroup_oom_synchronize(true))
 		return;
 
-	if (!mutex_trylock(&oom_lock))
+	if (!mutex_trylock(&oom_lock)) {
+		/*
+		 * This corresponds to prepare_alloc_pages(). Lockdep will
+		 * complain if e.g. an OOM notifier for a global OOM
+		 * mistakenly triggered the pagefault OOM path.
+		 */
+		oom_reclaim_acquire(GFP_KERNEL);
+		oom_reclaim_release(GFP_KERNEL);
 		return;
+	}
+	/*
+	 * Teach lockdep that the current thread must not perform (even
+	 * indirectly via a dependency) a __GFP_DIRECT_RECLAIM &&
+	 * !__GFP_NORETRY allocation from this function, because such an
+	 * allocation would have to wait for this function to complete
+	 * once __alloc_pages_may_oom() is called.
+	 */
+	oom_reclaim_release(GFP_KERNEL);
+	oom_reclaim_acquire(GFP_KERNEL);
 	out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d0fa5b..c23ae76d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3793,6 +3793,14 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 		schedule_timeout_uninterruptible(1);
 		return NULL;
 	}
+	/*
+	 * Teach lockdep that the current thread must not perform (even
+	 * indirectly via a dependency) a __GFP_DIRECT_RECLAIM &&
+	 * !__GFP_NORETRY allocation from this context, because such an
+	 * allocation would have to wait for this function to complete.
+	 */
+	oom_reclaim_release(gfp_mask);
+	oom_reclaim_acquire(gfp_mask);
 
 	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
@@ -4651,6 +4659,14 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 	fs_reclaim_acquire(gfp_mask);
 	fs_reclaim_release(gfp_mask);
 
+	/*
+	 * Since a __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocation might
+	 * call __alloc_pages_may_oom(), teach lockdep to record that the
+	 * current thread might retry forever until it holds oom_lock.
+	 */
+	oom_reclaim_acquire(gfp_mask);
+	oom_reclaim_release(gfp_mask);
+
 	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
 
 	if (should_fail_alloc_page(gfp_mask, order))
-- 
1.8.3.1

