All of lore.kernel.org
 help / color / mirror / Atom feed
* [RESEND 2] [PATCH] rlimits: Print more information when limits are exceeded
@ 2017-02-18  8:37 Arun Raghavan
  2017-02-18 15:47 ` Arun Raghavan
  2017-03-01 10:00 ` Thomas Gleixner
  0 siblings, 2 replies; 5+ messages in thread
From: Arun Raghavan @ 2017-02-18  8:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: Thomas Gleixner, Arun Raghavan

This dumps some information in logs when a process exceeds its CPU or RT
limits (soft and hard). Makes debugging easier when userspace triggers
these limits.

Signed-off-by: Arun Raghavan <arun@arunraghavan.net>
---
 kernel/time/posix-cpu-timers.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

Hello,
This has come up a couple of times in the past, but we haven't been able to
resolve whatever issues were pointed out.

In the mean time, we have frustrated users who don't know where they're getting
a SIGKILL from, and I'd really like to have a way for people to not have to go
through this.

The issues that came up the last time were:

 1. SIGXCPU messages shouldn't be needed since they can be caught: it's still
    useful to have the log because it isn't always possible to pin down the
    thread causing the problem in userspace.

 2. SIGKILL logging should be centralised: there seem to be multiple paths that
    trigger a SIGKILL -- and it seemed a bit ugly to try to add a reason
    parameter on all of them for the KILL case. Any other suggestions on how to
    deal with this?

I'm happy to fix this up to actually make it this time, but if there aren't
none, just pushing this out will make our lives a little less painful.

Thanks,
Arun

diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index e9e8c10..6dbcf84 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -860,6 +860,9 @@ static void check_thread_timers(struct task_struct *tsk,
 			 * At the hard limit, we just die.
 			 * No need to calculate anything else now.
 			 */
+			printk(KERN_INFO
+				"CPU Watchdog Timeout (hard): %s[%d]\n",
+				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);
 			return;
 		}
@@ -872,7 +875,7 @@ static void check_thread_timers(struct task_struct *tsk,
 				sig->rlim[RLIMIT_RTTIME].rlim_cur = soft;
 			}
 			printk(KERN_INFO
-				"RT Watchdog Timeout: %s[%d]\n",
+				"RT Watchdog Timeout (soft): %s[%d]\n",
 				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
 		}
@@ -980,6 +983,9 @@ static void check_process_timers(struct task_struct *tsk,
 			 * At the hard limit, we just die.
 			 * No need to calculate anything else now.
 			 */
+			printk(KERN_INFO
+				"RT Watchdog Timeout (hard): %s[%d]\n",
+				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);
 			return;
 		}
@@ -987,6 +993,9 @@ static void check_process_timers(struct task_struct *tsk,
 			/*
 			 * At the soft limit, send a SIGXCPU every second.
 			 */
+			printk(KERN_INFO
+				"CPU Watchdog Timeout (soft): %s[%d]\n",
+				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
 			if (soft < hard) {
 				soft++;
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [RESEND 2] [PATCH] rlimits: Print more information when limits are exceeded
  2017-02-18  8:37 [RESEND 2] [PATCH] rlimits: Print more information when limits are exceeded Arun Raghavan
@ 2017-02-18 15:47 ` Arun Raghavan
  2017-03-01 10:00 ` Thomas Gleixner
  1 sibling, 0 replies; 5+ messages in thread
From: Arun Raghavan @ 2017-02-18 15:47 UTC (permalink / raw)
  To: Arun Raghavan, linux-kernel; +Cc: Thomas Gleixner



On Sat, 18 Feb 2017, at 02:07 PM, Arun Raghavan wrote:
> This dumps some information in logs when a process exceeds its CPU or RT
> limits (soft and hard). Makes debugging easier when userspace triggers
> these limits.
> 
> Signed-off-by: Arun Raghavan <arun@arunraghavan.net>
> ---
>  kernel/time/posix-cpu-timers.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> Hello,
> This has come up a couple of times in the past, but we haven't been able
> to
> resolve whatever issues were pointed out.
> 
> In the mean time, we have frustrated users who don't know where they're
> getting
> a SIGKILL from, and I'd really like to have a way for people to not have
> to go
> through this.
> 
> The issues that came up the last time were:
> 
>  1. SIGXCPU messages shouldn't be needed since they can be caught: it's
>  still
>     useful to have the log because it isn't always possible to pin down
>     the
>     thread causing the problem in userspace.
> 
>  2. SIGKILL logging should be centralised: there seem to be multiple
>  paths that
>     trigger a SIGKILL -- and it seemed a bit ugly to try to add a reason
>     parameter on all of them for the KILL case. Any other suggestions on
>     how to
>     deal with this?
> 
> I'm happy to fix this up to actually make it this time, but if there
> aren't
> none, just pushing this out will make our lives a little less painful.

That was meant to read -- "... if there aren't blocking objections to
this, just pushing this out will make our lives a little less painful."

-- Arun

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RESEND 2] [PATCH] rlimits: Print more information when limits are exceeded
  2017-02-18  8:37 [RESEND 2] [PATCH] rlimits: Print more information when limits are exceeded Arun Raghavan
  2017-02-18 15:47 ` Arun Raghavan
@ 2017-03-01 10:00 ` Thomas Gleixner
  2017-03-01 14:53   ` [PATCH v2] rlimits: Print more information when CPU/RT " Arun Raghavan
  1 sibling, 1 reply; 5+ messages in thread
From: Thomas Gleixner @ 2017-03-01 10:00 UTC (permalink / raw)
  To: Arun Raghavan; +Cc: linux-kernel

On Sat, 18 Feb 2017, Arun Raghavan wrote:

> This dumps some information in logs when a process exceeds its CPU or RT
> limits (soft and hard). Makes debugging easier when userspace triggers
> these limits.

Sigh. This changelog sucks. "dumps some information" is pretty useless and
it does not explain WHY you want to do that. Please structure the changelog
in a way which makes it easy to understand.

1) Problem description

2) Solution

> diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
> index e9e8c10..6dbcf84 100644
> --- a/kernel/time/posix-cpu-timers.c
> +++ b/kernel/time/posix-cpu-timers.c
> @@ -860,6 +860,9 @@ static void check_thread_timers(struct task_struct *tsk,
>  			 * At the hard limit, we just die.
>  			 * No need to calculate anything else now.
>  			 */
> +			printk(KERN_INFO

			pr_info("CPU Watchdog Timeout (hard): %s[%d]\n",

and no artificial line breaks, please.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH v2] rlimits: Print more information when CPU/RT limits are exceeded
  2017-03-01 10:00 ` Thomas Gleixner
@ 2017-03-01 14:53   ` Arun Raghavan
  2017-03-13 20:36     ` [tip:timers/core] " tip-bot for Arun Raghavan
  0 siblings, 1 reply; 5+ messages in thread
From: Arun Raghavan @ 2017-03-01 14:53 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: linux-kernel, Arun Raghavan

When a process is sent a SIGKILL because it exceeded CPU or RT limits,
the cause may not be obvious in userspace -- daemonised processes just
get killed, and even foreground process just see a 'Killed' message. The
lack of any information on why this might be happening in logs can be
confusing to users who are not aware of this mechanism.

We now add messages which dump the process name and tid in logs when a
process exceeds its CPU or RT limits (soft and hard) in order to make it
clearer to people debugging such issues.

v2:
* Clarified changelog to make the problem and solution clearer
* Switched printk() to pr_info()

Signed-off-by: Arun Raghavan <arun@arunraghavan.net>
---
 kernel/time/posix-cpu-timers.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index b4377a5..bd6893d 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -824,6 +824,8 @@ static void check_thread_timers(struct task_struct *tsk,
 			 * At the hard limit, we just die.
 			 * No need to calculate anything else now.
 			 */
+			pr_info("CPU Watchdog Timeout (hard): %s[%d]\n",
+				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);
 			return;
 		}
@@ -835,8 +837,7 @@ static void check_thread_timers(struct task_struct *tsk,
 				soft += USEC_PER_SEC;
 				sig->rlim[RLIMIT_RTTIME].rlim_cur = soft;
 			}
-			printk(KERN_INFO
-				"RT Watchdog Timeout: %s[%d]\n",
+			pr_info("RT Watchdog Timeout (soft): %s[%d]\n",
 				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
 		}
@@ -934,6 +935,8 @@ static void check_process_timers(struct task_struct *tsk,
 			 * At the hard limit, we just die.
 			 * No need to calculate anything else now.
 			 */
+			pr_info("RT Watchdog Timeout (hard): %s[%d]\n",
+				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);
 			return;
 		}
@@ -941,6 +944,8 @@ static void check_process_timers(struct task_struct *tsk,
 			/*
 			 * At the soft limit, send a SIGXCPU every second.
 			 */
+			pr_info("CPU Watchdog Timeout (soft): %s[%d]\n",
+				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
 			if (soft < hard) {
 				soft++;
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [tip:timers/core] rlimits: Print more information when CPU/RT limits are exceeded
  2017-03-01 14:53   ` [PATCH v2] rlimits: Print more information when CPU/RT " Arun Raghavan
@ 2017-03-13 20:36     ` tip-bot for Arun Raghavan
  0 siblings, 0 replies; 5+ messages in thread
From: tip-bot for Arun Raghavan @ 2017-03-13 20:36 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: mingo, linux-kernel, hpa, tglx, arun

Commit-ID:  e7ea7c9806a2681807257ea89085339d33f7fa0b
Gitweb:     http://git.kernel.org/tip/e7ea7c9806a2681807257ea89085339d33f7fa0b
Author:     Arun Raghavan <arun@arunraghavan.net>
AuthorDate: Wed, 1 Mar 2017 20:23:09 +0530
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Mon, 13 Mar 2017 21:32:15 +0100

rlimits: Print more information when CPU/RT limits are exceeded

When a process is sent a SIGKILL because it exceeded CPU or RT limits,
the cause may not be obvious in userspace -- daemonised processes just
get killed, and even foreground process just see a 'Killed' message. The
lack of any information on why this might be happening in logs can be
confusing to users who are not aware of this mechanism.

Add messages which dump the process name and tid in dmesg when a process
exceeds its CPU or RT limits (soft and hard) in order to make it clearer to
people debugging such issues.

Signed-off-by: Arun Raghavan <arun@arunraghavan.net>
Link: http://lkml.kernel.org/r/20170301145309.27214-1-arun@arunraghavan.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/time/posix-cpu-timers.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 4513ad1..76bea3a 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -825,6 +825,8 @@ static void check_thread_timers(struct task_struct *tsk,
 			 * At the hard limit, we just die.
 			 * No need to calculate anything else now.
 			 */
+			pr_info("CPU Watchdog Timeout (hard): %s[%d]\n",
+				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);
 			return;
 		}
@@ -836,8 +838,7 @@ static void check_thread_timers(struct task_struct *tsk,
 				soft += USEC_PER_SEC;
 				sig->rlim[RLIMIT_RTTIME].rlim_cur = soft;
 			}
-			printk(KERN_INFO
-				"RT Watchdog Timeout: %s[%d]\n",
+			pr_info("RT Watchdog Timeout (soft): %s[%d]\n",
 				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
 		}
@@ -935,6 +936,8 @@ static void check_process_timers(struct task_struct *tsk,
 			 * At the hard limit, we just die.
 			 * No need to calculate anything else now.
 			 */
+			pr_info("RT Watchdog Timeout (hard): %s[%d]\n",
+				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);
 			return;
 		}
@@ -942,6 +945,8 @@ static void check_process_timers(struct task_struct *tsk,
 			/*
 			 * At the soft limit, send a SIGXCPU every second.
 			 */
+			pr_info("CPU Watchdog Timeout (soft): %s[%d]\n",
+				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
 			if (soft < hard) {
 				soft++;

^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2017-03-13 20:37 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-18  8:37 [RESEND 2] [PATCH] rlimits: Print more information when limits are exceeded Arun Raghavan
2017-02-18 15:47 ` Arun Raghavan
2017-03-01 10:00 ` Thomas Gleixner
2017-03-01 14:53   ` [PATCH v2] rlimits: Print more information when CPU/RT " Arun Raghavan
2017-03-13 20:36     ` [tip:timers/core] " tip-bot for Arun Raghavan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.