linux-kernel.vger.kernel.org archive mirror
* [PATCH] [RESEND] rlimits: Print more information when limits are exceeded
@ 2012-02-23 19:29 Arun Raghavan
  2012-03-30 13:39 ` Thomas Gleixner
  0 siblings, 1 reply; 5+ messages in thread
From: Arun Raghavan @ 2012-02-23 19:29 UTC
  To: linux-kernel; +Cc: Thomas Gleixner, David Henningsson, Arun Raghavan

This dumps some information in logs when a process exceeds its CPU or RT
limits (soft and hard). Makes debugging easier when userspace triggers
these limits.

Signed-off-by: Arun Raghavan <arun.raghavan@collabora.co.uk>
---
 kernel/posix-cpu-timers.c |   11 ++++++++++-
 1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 125cb67..bb0ae71 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -956,6 +956,9 @@ static void check_thread_timers(struct task_struct *tsk,
 			 * At the hard limit, we just die.
 			 * No need to calculate anything else now.
 			 */
+			printk(KERN_INFO
+				"RT Watchdog Timeout (hard): %s[%d]\n",
+				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);
 			return;
 		}
@@ -968,7 +971,7 @@ static void check_thread_timers(struct task_struct *tsk,
 				sig->rlim[RLIMIT_RTTIME].rlim_cur = soft;
 			}
 			printk(KERN_INFO
-				"RT Watchdog Timeout: %s[%d]\n",
+				"RT Watchdog Timeout (soft): %s[%d]\n",
 				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
 		}
@@ -1116,6 +1119,9 @@ static void check_process_timers(struct task_struct *tsk,
 			 * At the hard limit, we just die.
 			 * No need to calculate anything else now.
 			 */
+			printk(KERN_INFO
+				"CPU Watchdog Timeout (hard): %s[%d]\n",
+				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);
 			return;
 		}
@@ -1123,6 +1129,9 @@ static void check_process_timers(struct task_struct *tsk,
 			/*
 			 * At the soft limit, send a SIGXCPU every second.
 			 */
+			printk(KERN_INFO
+				"CPU Watchdog Timeout (soft): %s[%d]\n",
+				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
 			if (soft < hard) {
 				soft++;
-- 
1.7.8.4



* Re: [PATCH] [RESEND] rlimits: Print more information when limits are exceeded
  2012-02-23 19:29 [PATCH] [RESEND] rlimits: Print more information when limits are exceeded Arun Raghavan
@ 2012-03-30 13:39 ` Thomas Gleixner
  2012-03-30 14:12   ` David Henningsson
  0 siblings, 1 reply; 5+ messages in thread
From: Thomas Gleixner @ 2012-03-30 13:39 UTC
  To: Arun Raghavan; +Cc: LKML, David Henningsson, Peter Zijlstra

On Fri, 24 Feb 2012, Arun Raghavan wrote:

> This dumps some information in logs when a process exceeds its CPU or RT
> limits (soft and hard). Makes debugging easier when userspace triggers
> these limits.

Why do we need to spam the logs with such information?

SIGXCPU is only ever sent by this code. If there is a signal handler
in the application it's easy to debug. If not it's even easier, the
thing will simply be killed and you get the reason printed.

For the SIGKILL case there is only a limited number of reasons why a
SIGKILL is sent. So no, I'd rather commit a patch which removes that
ugly printk which is already there instead of adding more of them.

Thanks,

	tglx
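
As a rough illustration of the signal-handler route described above, here
is a minimal userspace sketch. It assumes a Linux/POSIX system; the helper
names arm_cpu_soft_limit() and spin_until_xcpu() are hypothetical and do not
come from the patch or the kernel. It lowers only the soft RLIMIT_CPU and
records the arrival of SIGXCPU in a flag.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/resource.h>

/* Set from the SIGXCPU handler; sig_atomic_t is safe to write there. */
static volatile sig_atomic_t xcpu_seen = 0;

static void on_xcpu(int signo)
{
	(void)signo;
	xcpu_seen = 1;
}

/*
 * Hypothetical helper: install a SIGXCPU handler, then lower the soft
 * RLIMIT_CPU to soft_seconds while leaving the hard limit untouched.
 * Returns 0 on success, -1 on failure.
 */
static int arm_cpu_soft_limit(rlim_t soft_seconds)
{
	struct sigaction sa;
	struct rlimit rl;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_xcpu;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGXCPU, &sa, NULL) != 0)
		return -1;

	if (getrlimit(RLIMIT_CPU, &rl) != 0)
		return -1;
	rl.rlim_cur = soft_seconds;
	return setrlimit(RLIMIT_CPU, &rl);
}

/* Hypothetical helper: burn CPU time until the handler has run. */
static unsigned long spin_until_xcpu(void)
{
	volatile unsigned long n = 0;

	while (!xcpu_seen)
		n++;
	return n;
}
```

Calling arm_cpu_soft_limit(1) followed by spin_until_xcpu() should return
once the process has used about one second of CPU time. Without the handler,
the default action for SIGXCPU terminates the process, and the parent can
read the terminating signal from the wait status.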



* Re: [PATCH] [RESEND] rlimits: Print more information when limits are exceeded
  2012-03-30 13:39 ` Thomas Gleixner
@ 2012-03-30 14:12   ` David Henningsson
  2012-03-30 14:29     ` Thomas Gleixner
  0 siblings, 1 reply; 5+ messages in thread
From: David Henningsson @ 2012-03-30 14:12 UTC
  To: Thomas Gleixner; +Cc: Arun Raghavan, LKML, Peter Zijlstra

On 03/30/2012 03:39 PM, Thomas Gleixner wrote:
> On Fri, 24 Feb 2012, Arun Raghavan wrote:
>
>> This dumps some information in logs when a process exceeds its CPU or RT
>> limits (soft and hard). Makes debugging easier when userspace triggers
>> these limits.
>
> Why do we need to spam the logs with such information?
>
> SIGXCPU is only ever sent by this code. If there is a signal handler
> in the application it's easy to debug. If not it's even easier, the
> thing will simply be killed and you get the reason printed.

I'm not totally sure, but don't we log SIGSEGVs? If so, the same 
reasoning would apply to SIGSEGV.

> For the SIGKILL case there is only a limited number of reasons why a
> SIGKILL is sent. So no, I'd rather commit a patch which removes that
> ugly printk which is already there instead of adding more of them.

The reason I proposed some kind of printk for SIGKILL was to get some
diagnostic information out of the SIGKILL. E.g., if you have two threads
both running on rtprio rlimits in the same process, it would be very
interesting to know which one of them was causing the kernel to send
SIGKILL.

Also, it could be useful to know whether the SIGKILL was actually sent
by the kernel, or by some other process feeling evil (e.g. "kill -9").

-- 
David Henningsson, Canonical Ltd.
http://launchpad.net/~diwic


* Re: [PATCH] [RESEND] rlimits: Print more information when limits are exceeded
  2012-03-30 14:12   ` David Henningsson
@ 2012-03-30 14:29     ` Thomas Gleixner
  2012-03-30 17:19       ` Arun Raghavan
  0 siblings, 1 reply; 5+ messages in thread
From: Thomas Gleixner @ 2012-03-30 14:29 UTC
  To: David Henningsson; +Cc: Arun Raghavan, LKML, Peter Zijlstra

On Fri, 30 Mar 2012, David Henningsson wrote:
> On 03/30/2012 03:39 PM, Thomas Gleixner wrote:
> > On Fri, 24 Feb 2012, Arun Raghavan wrote:
> > 
> > > This dumps some information in logs when a process exceeds its CPU or RT
> > > limits (soft and hard). Makes debugging easier when userspace triggers
> > > these limits.
> > 
> > Why do we need to spam the logs with such information?
> > 
> > SIGXCPU is only ever sent by this code. If there is a signal handler
> > in the application it's easy to debug. If not it's even easier, the
> > thing will simply be killed and you get the reason printed.
> 
> I'm not totally sure, but don't we log SIGSEGVs? If so, the same reasoning
> would apply to SIGSEGV.

I think so. Dunno why this was added in the first place. Core dumps or
proper signal handlers usually tell you more than that single line in
dmesg.
 
> > For the SIGKILL case there is only a limited number of reasons why a
> > SIGKILL is sent. So no, I'd rather commit a patch which removes that
> > ugly printk which is already there instead of adding more of them.
> 
> The reason I proposed some kind of printk for SIGKILL was to get some
> diagnostic information out of the SIGKILL. E.g., if you have two threads both
> running on rtprio rlimits in the same process, it would be very interesting to
> know which one of them was causing the kernel to send SIGKILL.

Usually the one which ignored SIGXCPU for quite a while. There is a
reason why SIGXCPU can be handled by the application.

> Also, it could be useful to know whether the SIGKILL was actually sent by the
> kernel, or by some other process feeling evil (e.g. "kill -9").

Agreed, but instead of adding that printk to the rlimit code I prefer
a generic infrastructure which can be used by all call sites which
issue SIGKILL. Something like: [__]kill_it(flags, task, "Reason");

Thanks,

	tglx
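
For signals that can be caught, userspace can already tell a kernel-sent
signal from one sent with kill(2) by looking at siginfo. The sketch below
is a minimal illustration, assuming Linux, where signals the kernel queues
with SEND_SIG_PRIV are expected to arrive with si_code SI_KERNEL and kill(2)
yields SI_USER. The helper names are hypothetical. It does not cover SIGKILL,
which cannot be caught, so it is no substitute for the generic
infrastructure suggested above.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Filled in by the handler so the main flow can inspect the origin. */
static volatile sig_atomic_t got_signal = 0;
static volatile sig_atomic_t last_si_code = 0;

static void on_xcpu_info(int signo, siginfo_t *info, void *ctx)
{
	(void)signo;
	(void)ctx;
	last_si_code = info->si_code;
	got_signal = 1;
}

/* Hypothetical helper: install an SA_SIGINFO handler for SIGXCPU. */
static int install_xcpu_info_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_xcpu_info;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGXCPU, &sa, NULL);
}

/* Hypothetical helper: map an si_code value to a short description. */
static const char *xcpu_origin(int si_code)
{
	if (si_code == SI_USER)
		return "userspace (kill)";
	if (si_code == SI_KERNEL)
		return "kernel";
	return "other";
}
```

After install_xcpu_info_handler(), a SIGXCPU sent with
kill(getpid(), SIGXCPU) should be classified as "userspace (kill)", while
one raised by the rlimit code is expected to be classified as "kernel".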



* Re: [PATCH] [RESEND] rlimits: Print more information when limits are exceeded
  2012-03-30 14:29     ` Thomas Gleixner
@ 2012-03-30 17:19       ` Arun Raghavan
  0 siblings, 0 replies; 5+ messages in thread
From: Arun Raghavan @ 2012-03-30 17:19 UTC
  To: Thomas Gleixner; +Cc: David Henningsson, LKML, Peter Zijlstra

On Fri, 2012-03-30 at 16:29 +0200, Thomas Gleixner wrote:
> On Fri, 30 Mar 2012, David Henningsson wrote:
> > On 03/30/2012 03:39 PM, Thomas Gleixner wrote:
> > > On Fri, 24 Feb 2012, Arun Raghavan wrote:
> > > 
> > > > This dumps some information in logs when a process exceeds its CPU or RT
> > > > limits (soft and hard). Makes debugging easier when userspace triggers
> > > > these limits.
> > > 
> > > Why do we need to spam the logs with such information?
> > > 
> > > SIGXCPU is only ever sent by this code. If there is a signal handler
> > > in the application it's easy to debug. If not it's even easier, the
> > > thing will simply be killed and you get the reason printed.
> > 
> > I'm not totally sure, but don't we log SIGSEGVs? If so, the same reasoning
> > would apply to SIGSEGV.
> 
> I think so. Dunno why this was added in the first place. Core dumps or
> proper signal handlers usually tell you more than that single line in
> dmesg.
>  
> > > For the SIGKILL case there is only a limited number of reasons why a
> > > SIGKILL is sent. So no, I'd rather commit a patch which removes that
> > > ugly printk which is already there instead of adding more of them.
> > 
> > The reason I proposed some kind of printk for SIGKILL was to get some
> > diagnostic information out of the SIGKILL. E.g., if you have two threads both
> > running on rtprio rlimits in the same process, it would be very interesting to
> > know which one of them was causing the kernel to send SIGKILL.
> 
> Usually the one which ignored SIGXCPU for quite a while. There is a
> reason why SIGXCPU can be handled by the application.

In general I agree -- I'm happy to rewrite the patch to drop the printk
in the SIGXCPU case.

In the situation that I'm currently debugging, there appears to be a
kernel fragment that busy-waits and eventually gets the task killed (I'll
be taking up a fix for this separately). In this case, by the time we get
back control, the hard limit seems to have been hit already. Knowing the
culprit thread in this case does make things simpler for us.

> > Also, it could be useful to know whether the SIGKILL was actually sent by the
> > kernel, or by some other process feeling evil (e.g. "kill -9").
> 
> Agreed, but instead of adding that printk to the rlimit code I prefer
> a generic infrastructure which can be used by all call sites which
> issue SIGKILL. Something like: [__]kill_it(flags, task, "Reason");

The other paths that send SIGKILL seem to be slightly different (going
eventually via do_send_sig_info()). Is this actually functionally the
same? If yes, I'll try to rewrite the patch to consolidate some of these
paths as you suggest.

-- Arun


