From: Paul Jackson <pj@sgi.com>
To: akpm@osdl.org
Cc: kurosawa@valinux.co.jp, simon.derr@bull.net,
linux-kernel@vger.kernel.org, st@sw.ru, dev@sw.ru, den@sw.ru,
Paul Jackson <pj@sgi.com>,
djwong@us.ibm.com
Subject: [RFC] cpuset oom lock fix
Date: Wed, 11 Jan 2006 03:01:38 -0800 (PST) [thread overview]
Message-ID: <20060111110138.25784.52441.sendpatchset@jackhammer.engr.sgi.com> (raw)
The following is not adequately tested yet, so it
probably shouldn't go in *-mm.
I'm posting it so that the people who reported this
bug can have the opportunity to test this fix.
The problem, reported in:
http://bugzilla.kernel.org/show_bug.cgi?id=5859
and by various other email messages and lkml posts
is that the cpuset hook in the oom (out of memory)
code can try to take a cpuset semaphore while holding
the tasklist_lock (a spinlock).
One must not sleep while holding a spinlock.
The fix seems easy enough - move the cpuset semaphore
region outside the tasklist_lock region.
This required a few lines of mechanism to implement.
The oom code where the locking needs to be changed
does not have access to the cpuset locks, which are
internal to kernel/cpuset.c only. So I provided a
couple more cpuset interface routines, available too
the rest of the kernel, which simple take and drop
the lock needed here (cpusets callback_sem).
The initial build and boot and poke at it once test
passes ok. But I still need to:
1) build some other arch's (crosstool), and
2) test with the various CONFIG_*DEBUG* and CONFIG_*LOCK*_
options enabled.
It will be about one more day before I complete this
and submit this patch for consideration in *-mm.
If those who have stress tests that hit this could
give this patch a trial, that would be appreciated.
Signed-off-by: Paul Jackson
---
include/linux/cpuset.h | 6 ++++++
kernel/cpuset.c | 33 ++++++++++++++++++++++++++++-----
mm/oom_kill.c | 3 +++
3 files changed, 37 insertions(+), 5 deletions(-)
--- 2.6.15-mm2.orig/include/linux/cpuset.h 2006-01-10 16:00:21.190309415 -0800
+++ 2.6.15-mm2/include/linux/cpuset.h 2006-01-10 23:07:59.648091868 -0800
@@ -48,6 +48,9 @@ extern void __cpuset_memory_pressure_bum
extern struct file_operations proc_cpuset_operations;
extern char *cpuset_task_status_allowed(struct task_struct *task, char *buffer);
+extern void cpuset_lock(void);
+extern void cpuset_unlock(void);
+
#else /* !CONFIG_CPUSETS */
static inline int cpuset_init_early(void) { return 0; }
@@ -93,6 +96,9 @@ static inline char *cpuset_task_status_a
return buffer;
}
+static inline void cpuset_lock(void) {}
+static inline void cpuset_unlock(void) {}
+
#endif /* !CONFIG_CPUSETS */
#endif /* _LINUX_CPUSET_H */
--- 2.6.15-mm2.orig/kernel/cpuset.c 2006-01-10 18:26:40.408120365 -0800
+++ 2.6.15-mm2/kernel/cpuset.c 2006-01-10 23:14:28.355492590 -0800
@@ -2150,6 +2150,33 @@ int __cpuset_zone_allowed(struct zone *z
}
/**
+ * cpuset_lock - lock out any changes to cpuset structures
+ *
+ * The out of memory (oom) code needs to lock down cpusets
+ * from being changed while it scans the tasklist looking for a
+ * task in an overlapping cpuset. Expose callback_sem via this
+ * cpuset_lock() routine, so the oom code can lock it, before
+ * locking the task list. The tasklist_lock is a spinlock, so
+ * must be taken inside callback_sem.
+ */
+
+void cpuset_lock(void)
+{
+ down(&callback_sem);
+}
+
+/**
+ * cpuset_unlock - release lock on cpuset changes
+ *
+ * Undo the lock taken in a previous cpuset_lock() call.
+ */
+
+void cpuset_unlock(void)
+{
+ up(&callback_sem);
+}
+
+/**
* cpuset_excl_nodes_overlap - Do we overlap @p's mem_exclusive ancestors?
* @p: pointer to task_struct of some other task.
*
@@ -2158,7 +2185,7 @@ int __cpuset_zone_allowed(struct zone *z
* determine if task @p's memory usage might impact the memory
* available to the current task.
*
- * Acquires callback_sem - not suitable for calling from a fast path.
+ * Call while holding callback_sem.
**/
int cpuset_excl_nodes_overlap(const struct task_struct *p)
@@ -2166,8 +2193,6 @@ int cpuset_excl_nodes_overlap(const stru
const struct cpuset *cs1, *cs2; /* my and p's cpuset ancestors */
int overlap = 0; /* do cpusets overlap? */
- down(&callback_sem);
-
task_lock(current);
if (current->flags & PF_EXITING) {
task_unlock(current);
@@ -2186,8 +2211,6 @@ int cpuset_excl_nodes_overlap(const stru
overlap = nodes_intersects(cs1->mems_allowed, cs2->mems_allowed);
done:
- up(&callback_sem);
-
return overlap;
}
--- 2.6.15-mm2.orig/mm/oom_kill.c 2006-01-10 19:18:25.588685008 -0800
+++ 2.6.15-mm2/mm/oom_kill.c 2006-01-10 23:16:05.936643833 -0800
@@ -274,6 +274,7 @@ void out_of_memory(gfp_t gfp_mask, int o
show_mem();
}
+ cpuset_lock();
read_lock(&tasklist_lock);
retry:
p = select_bad_process();
@@ -284,6 +285,7 @@ retry:
/* Found nothing?!?! Either we hang forever, or we panic. */
if (!p) {
read_unlock(&tasklist_lock);
+ cpuset_unlock();
panic("Out of memory and no killable processes...\n");
}
@@ -293,6 +295,7 @@ retry:
out:
read_unlock(&tasklist_lock);
+ cpuset_unlock();
if (mm)
mmput(mm);
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
reply other threads:[~2006-01-11 11:01 UTC|newest]
Thread overview: [no followups] expand[flat|nested] mbox.gz Atom feed
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20060111110138.25784.52441.sendpatchset@jackhammer.engr.sgi.com \
--to=pj@sgi.com \
--cc=akpm@osdl.org \
--cc=den@sw.ru \
--cc=dev@sw.ru \
--cc=djwong@us.ibm.com \
--cc=kurosawa@valinux.co.jp \
--cc=linux-kernel@vger.kernel.org \
--cc=simon.derr@bull.net \
--cc=st@sw.ru \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).