[RESEND 2] [PATCH] rlimits: Print more information when limits are exceeded

* [RESEND 2] [PATCH] rlimits: Print more information when limits are exceeded
@ 2017-02-18  8:37 Arun Raghavan
  2017-02-18 15:47 ` Arun Raghavan
  2017-03-01 10:00 ` Thomas Gleixner
  0 siblings, 2 replies; 5+ messages in thread
From: Arun Raghavan @ 2017-02-18  8:37 UTC
  To: linux-kernel; +Cc: Thomas Gleixner, Arun Raghavan

This logs the task name and PID when a process exceeds its CPU or RT
limits (soft or hard). This makes debugging easier when userspace
triggers these limits.

Signed-off-by: Arun Raghavan <arun@arunraghavan.net>
---
 kernel/time/posix-cpu-timers.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

Hello,
This has come up a couple of times in the past, but we haven't been able to
resolve whatever issues were pointed out.

In the meantime, we have frustrated users who don't know where they're getting
a SIGKILL from, and I'd really like to have a way for people to not have to go
through this.

The issues that came up the last time were:

 1. SIGXCPU messages shouldn't be needed since they can be caught: it's still
    useful to have the log because it isn't always possible to pin down the
    thread causing the problem in userspace.

 2. SIGKILL logging should be centralised: there seem to be multiple paths that
    trigger a SIGKILL -- and it seemed a bit ugly to try to add a reason
    parameter on all of them for the KILL case. Any other suggestions on how to
    deal with this?

I'm happy to fix this up so that it actually makes it in this time, but if
there aren't any, just pushing this out will make our lives a little less
painful.

Thanks,
Arun

diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index e9e8c10..6dbcf84 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -860,6 +860,9 @@ static void check_thread_timers(struct task_struct *tsk,
 			 * At the hard limit, we just die.
 			 * No need to calculate anything else now.
 			 */
+			printk(KERN_INFO
+				"RT Watchdog Timeout (hard): %s[%d]\n",
+				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);
 			return;
 		}
@@ -872,7 +875,7 @@ static void check_thread_timers(struct task_struct *tsk,
 				sig->rlim[RLIMIT_RTTIME].rlim_cur = soft;
 			}
 			printk(KERN_INFO
-				"RT Watchdog Timeout: %s[%d]\n",
+				"RT Watchdog Timeout (soft): %s[%d]\n",
 				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
 		}
@@ -980,6 +983,9 @@ static void check_process_timers(struct task_struct *tsk,
 			 * At the hard limit, we just die.
 			 * No need to calculate anything else now.
 			 */
+			printk(KERN_INFO
+				"CPU Watchdog Timeout (hard): %s[%d]\n",
+				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);
 			return;
 		}
@@ -987,6 +993,9 @@ static void check_process_timers(struct task_struct *tsk,
 			/*
 			 * At the soft limit, send a SIGXCPU every second.
 			 */
+			printk(KERN_INFO
+				"CPU Watchdog Timeout (soft): %s[%d]\n",
+				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
 			if (soft < hard) {
 				soft++;
-- 
2.9.3
