* [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
From: Michael Holzheu @ 2010-09-23 13:48 UTC (permalink / raw)
  To: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens
  Cc: linux-kernel, linux-s390

Currently, tools like "top" gather task information by reading procfs
files. This has several disadvantages:

* It is very CPU intensive, because many system calls (readdir, open,
  read, close) are necessary.
* No real task snapshot can be provided, because the system continues
  running while the procfs files are read.
* The granularity of the procfs CPU times is restricted to jiffies.

In parallel to procfs there exists the taskstats binary interface, which uses
netlink sockets as transport mechanism to deliver task information to
user space. The taskstats command TASKSTATS_CMD_ATTR_PID returns the task
information for a given PID. This command could already be used by tools
like top (a minimal query sketch follows the list below), but it also has
several disadvantages:

* You first have to find out which PIDs exist in the system. Currently
  we have to use procfs again to do this.
* For each task, two system calls have to be issued (first send the command,
  then receive the reply).
* No snapshot mechanism is available.
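
For reference, a per-PID query with the existing netlink interface looks
roughly like the sketch below. Error handling is mostly omitted, and the
helper macros, names and buffer sizes are ours, modelled on the approach of
Documentation/accounting/getdelays.c:

/* ts_query_pid.c - minimal TASKSTATS_CMD_ATTR_PID query over generic netlink */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>
#include <linux/taskstats.h>

/* payload of a generic netlink message starts after both headers */
#define GENLMSG_DATA(msg)	((void *)((char *)NLMSG_DATA(&(msg)->n) + GENL_HDRLEN))
#define NLA_DATA(na)		((void *)((char *)(na) + NLA_HDRLEN))

struct msgtemplate {
	struct nlmsghdr n;
	struct genlmsghdr g;
	char buf[1024];
};

/* send one generic netlink command that carries a single attribute */
static int send_cmd(int sd, __u16 family, __u8 cmd, __u16 attr_type,
		    const void *attr_data, __u16 attr_len)
{
	struct sockaddr_nl sa = { .nl_family = AF_NETLINK };
	struct msgtemplate msg;
	struct nlattr *na;

	memset(&msg, 0, sizeof(msg));
	msg.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
	msg.n.nlmsg_type = family;
	msg.n.nlmsg_flags = NLM_F_REQUEST;
	msg.n.nlmsg_pid = getpid();
	msg.g.cmd = cmd;
	msg.g.version = 0x1;
	na = GENLMSG_DATA(&msg);
	na->nla_type = attr_type;
	na->nla_len = NLA_HDRLEN + attr_len;
	memcpy(NLA_DATA(na), attr_data, attr_len);
	msg.n.nlmsg_len += NLMSG_ALIGN(na->nla_len);
	return sendto(sd, &msg, msg.n.nlmsg_len, 0,
		      (struct sockaddr *)&sa, sizeof(sa));
}

/* resolve the dynamically assigned id of the "TASKSTATS" genetlink family */
static __u16 get_family_id(int sd)
{
	char name[] = TASKSTATS_GENL_NAME;
	struct msgtemplate ans;
	struct nlattr *na;

	send_cmd(sd, GENL_ID_CTRL, CTRL_CMD_GETFAMILY,
		 CTRL_ATTR_FAMILY_NAME, name, sizeof(name));
	if (recv(sd, &ans, sizeof(ans), 0) < 0 ||
	    ans.n.nlmsg_type == NLMSG_ERROR)
		return 0;
	na = GENLMSG_DATA(&ans);		/* CTRL_ATTR_FAMILY_NAME */
	na = (struct nlattr *)((char *)na + NLA_ALIGN(na->nla_len));
	return na->nla_type == CTRL_ATTR_FAMILY_ID ?
		*(__u16 *)NLA_DATA(na) : 0;
}

int main(int argc, char *argv[])
{
	struct sockaddr_nl local = { .nl_family = AF_NETLINK };
	int sd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
	struct msgtemplate ans;
	struct taskstats *ts;
	struct nlattr *na;
	__u32 pid;

	if (argc < 2 || sd < 0)
		return 1;
	pid = (__u32)atoi(argv[1]);
	bind(sd, (struct sockaddr *)&local, sizeof(local));
	send_cmd(sd, get_family_id(sd), TASKSTATS_CMD_GET,	/* syscall 1 */
		 TASKSTATS_CMD_ATTR_PID, &pid, sizeof(pid));
	if (recv(sd, &ans, sizeof(ans), 0) < 0 ||		/* syscall 2 */
	    ans.n.nlmsg_type == NLMSG_ERROR)
		return 1;
	/* reply: TASKSTATS_TYPE_AGGR_PID { TASKSTATS_TYPE_PID, TASKSTATS_TYPE_STATS } */
	na = GENLMSG_DATA(&ans);				/* aggregate container */
	na = NLA_DATA(na);					/* nested: pid */
	na = (struct nlattr *)((char *)na + NLA_ALIGN(na->nla_len));	/* stats */
	ts = NLA_DATA(na);
	printf("pid %u: utime %llu us, stime %llu us\n", ts->ac_pid,
	       (unsigned long long)ts->ac_utime,
	       (unsigned long long)ts->ac_stime);
	close(sd);
	return 0;
}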

GOALS OF THIS PATCH SET
-----------------------
The intention of this patch set is to provide better support for tools like
top. The goals are to:

* provide a task snapshot mechanism where we can get a consistent view of
  all running tasks.
* provide a transport mechanism that does not require many system calls
  and that allows task monitoring with low CPU overhead.
* provide microsecond CPU time granularity.

FIRST RESULTS
-------------
Together with this kernel patch set, user space code for a new top
utility (ptop) is provided that exploits the new kernel infrastructure. See
patch 10 for more details.

TEST1: System with many sleeping tasks

  for ((i=0; i < 1000; i++))
  do
         sleep 1000000 &
  done

  # ptop_new_proc

             VVVV
  pid   user  sys  ste  total  Name
  (#)    (%)  (%)  (%)    (%)  (str)
  541   0.37 2.39 0.10   2.87  top
  3743  0.03 0.05 0.00   0.07  ptop_new_proc
             ^^^^

Compared to the old top command, which has to scan more than 1000 proc
directories, the new ptop consumes much less CPU time (0.05% system time
on my s390 system).

TEST2: Show snapshot consistency on a system that is 100% busy

  System with 3 CPUs:

  for ((i=0; i < $(cat /proc/cpuinfo  | grep "^processor" | wc -l); i++))
  do
       ./loop &
  done

  # ptop_snap_proc

          VVVV  VVV  VVV                        VVVVV
  pid     user  sys  ste cuser csys cste delay  total Elap+ Name
  (#)      (%)  (%)  (%)   (%)  (%)  (%)   (%)    (%)  (hm) (str)
  23891  99.84 0.06 0.09  0.00 0.00 0.00  0.01  99.99  0:00 loop
  23881  99.66 0.06 0.09  0.00 0.00 0.00  0.20  99.81  0:00 loop
  23886  99.65 0.06 0.09  0.00 0.00 0.00  0.20  99.80  0:00 loop
  2413    0.00 0.00 0.00  0.00 0.00 0.00  0.00   0.01  4:17 sshd
  ...
  V:V:S 299.36 0.36 0.27  0.00 0.00 0.00  0.40 300.00  4:22
                                               ^^^^^^

  With the snapshot mechanism the sum of all tasks' CPU times (user + system +
  steal) is exactly 300.00% CPU time for this test case. Using
  ptop_snap_proc (see patch 10) this works fine on s390.

PATCHSET OVERVIEW
-----------------
The code is not final and still has a few TODOs. But it is good enough for a
first round of review. The following kernel patches are provided:

[01] Prepare-0: Use real microsecond granularity for taskstats CPU times.
[02] Prepare-1: Restructure taskstats.c in order to be able to add new commands
     more easily.
[03] Prepare-2: Separate the finding of a task_struct by PID or TGID from
     filling the taskstats.
[04] Add new command "TASKSTATS_CMD_ATTR_PIDS" to get a snapshot of multiple
     tasks.
[05] Add procfs interface for taskstats commands. This allows getting a
     complete and consistent snapshot of all tasks using two system calls
     (ioctl and read). Transferring a snapshot of all running tasks is not
     possible using the existing netlink interface, because there the socket
     buffer size is a restricting factor.
[06] Add TGID to taskstats.
[07] Add steal time per task accounting.
[08] Add cumulative CPU time (user, system and steal) to taskstats.
[09] Fix exit CPU time accounting.

[10] Besides the kernel patches, user space code is provided that exploits
     the new kernel infrastructure. The user space code provides the
     following:
     1. A proposal for a taskstats user space library:
        1.1 Based on netlink (requires libnl-devel-1.1-5)
        1.2 Based on the new /proc/taskstats interface (see [05])
     2. A proposal for a task snapshot library based on the taskstats
        library (1.1)
     3. A new tool "ptop" (precise top) that uses the libraries




* [RFC][PATCH 01/10] taskstats: Use real microsecond granularity for CPU times
From: Michael Holzheu @ 2010-09-23 14:00 UTC (permalink / raw)
  To: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Peter Zijlstra, Suresh Siddha, John stultz, Thomas Gleixner,
	Oleg Nesterov, Balbir Singh, Ingo Molnar, Heiko Carstens,
	Martin Schwidefsky
  Cc: linux-s390, linux-kernel

Subject: [PATCH] taskstats: Use real microsecond granularity for CPU times

From: Michael Holzheu <holzheu@linux.vnet.ibm.com>

The taskstats interface uses microsecond granularity for the user and
system time values. The conversion from cputime to the taskstats values
uses the cputime_to_msecs primitive, which effectively limits the
granularity to milliseconds. Add a cputime_to_usecs primitive for
architectures that have better, more precise CPU time values, and remove
the cputime_to_msecs primitive because no user of it is left.

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
 arch/ia64/include/asm/cputime.h    |    6 +++---
 arch/powerpc/include/asm/cputime.h |   12 ++++++------
 arch/s390/include/asm/cputime.h    |   10 +++++-----
 include/asm-generic/cputime.h      |    6 +++---
 kernel/tsacct.c                    |   10 ++++------
 5 files changed, 21 insertions(+), 23 deletions(-)

--- a/arch/ia64/include/asm/cputime.h
+++ b/arch/ia64/include/asm/cputime.h
@@ -56,10 +56,10 @@ typedef u64 cputime64_t;
 #define jiffies64_to_cputime64(__jif)	((__jif) * (NSEC_PER_SEC / HZ))
 
 /*
- * Convert cputime <-> milliseconds
+ * Convert cputime <-> microseconds
  */
-#define cputime_to_msecs(__ct)		((__ct) / NSEC_PER_MSEC)
-#define msecs_to_cputime(__msecs)	((__msecs) * NSEC_PER_MSEC)
+#define cputime_to_usecs(__ct)		((__ct) / NSEC_PER_USEC)
+#define usecs_to_cputime(__usecs)	((__usecs) * NSEC_PER_USEC)
 
 /*
  * Convert cputime <-> seconds
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -124,23 +124,23 @@ static inline u64 cputime64_to_jiffies64
 }
 
 /*
- * Convert cputime <-> milliseconds
+ * Convert cputime <-> microseconds
  */
 extern u64 __cputime_msec_factor;
 
-static inline unsigned long cputime_to_msecs(const cputime_t ct)
+static inline unsigned long cputime_to_usecs(const cputime_t ct)
 {
-	return mulhdu(ct, __cputime_msec_factor);
+	return mulhdu(ct, __cputime_msec_factor) * USEC_PER_MSEC;
 }
 
-static inline cputime_t msecs_to_cputime(const unsigned long ms)
+static inline cputime_t usecs_to_cputime(const unsigned long us)
 {
 	cputime_t ct;
 	unsigned long sec;
 
 	/* have to be a little careful about overflow */
-	ct = ms % 1000;
-	sec = ms / 1000;
+	ct = us % 1000000;
+	sec = us / 1000000;
 	if (ct) {
 		ct *= tb_ticks_per_sec;
 		do_div(ct, 1000);
--- a/arch/s390/include/asm/cputime.h
+++ b/arch/s390/include/asm/cputime.h
@@ -73,18 +73,18 @@ cputime64_to_jiffies64(cputime64_t cputi
 }
 
 /*
- * Convert cputime to milliseconds and back.
+ * Convert cputime to microseconds and back.
  */
 static inline unsigned int
-cputime_to_msecs(const cputime_t cputime)
+cputime_to_usecs(const cputime_t cputime)
 {
-	return cputime_div(cputime, 4096000);
+	return cputime_div(cputime, 4096);
 }
 
 static inline cputime_t
-msecs_to_cputime(const unsigned int m)
+usecs_to_cputime(const unsigned int m)
 {
-	return (cputime_t) m * 4096000;
+	return (cputime_t) m * 4096;
 }
 
 /*
--- a/include/asm-generic/cputime.h
+++ b/include/asm-generic/cputime.h
@@ -33,10 +33,10 @@ typedef u64 cputime64_t;
 
 
 /*
- * Convert cputime to milliseconds and back.
+ * Convert cputime to microseconds and back.
  */
-#define cputime_to_msecs(__ct)		jiffies_to_msecs(__ct)
-#define msecs_to_cputime(__msecs)	msecs_to_jiffies(__msecs)
+#define cputime_to_usecs(__ct)		jiffies_to_usecs(__ct)
+#define usecs_to_cputime(__usecs)	usecs_to_jiffies(__usecs)
 
 /*
  * Convert cputime to seconds and back.
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -63,12 +63,10 @@ void bacct_add_tsk(struct taskstats *sta
 	stats->ac_ppid	 = pid_alive(tsk) ?
 				rcu_dereference(tsk->real_parent)->tgid : 0;
 	rcu_read_unlock();
-	stats->ac_utime	 = cputime_to_msecs(tsk->utime) * USEC_PER_MSEC;
-	stats->ac_stime	 = cputime_to_msecs(tsk->stime) * USEC_PER_MSEC;
-	stats->ac_utimescaled =
-		cputime_to_msecs(tsk->utimescaled) * USEC_PER_MSEC;
-	stats->ac_stimescaled =
-		cputime_to_msecs(tsk->stimescaled) * USEC_PER_MSEC;
+	stats->ac_utime = cputime_to_usecs(tsk->utime);
+	stats->ac_stime = cputime_to_usecs(tsk->stime);
+	stats->ac_utimescaled = cputime_to_usecs(tsk->utimescaled);
+	stats->ac_stimescaled = cputime_to_usecs(tsk->stimescaled);
 	stats->ac_minflt = tsk->min_flt;
 	stats->ac_majflt = tsk->maj_flt;
 




* [RFC][PATCH 02/10] taskstats: Separate taskstats commands
From: Michael Holzheu @ 2010-09-23 14:01 UTC (permalink / raw)
  To: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens
  Cc: linux-kernel, linux-s390

Subject: [PATCH] taskstats: Separate taskstats commands

From: Michael Holzheu <holzheu@linux.vnet.ibm.com>

This patch moves each taskstats command into its own function. This makes
the code more readable and makes it easier to add new commands.

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
 kernel/taskstats.c |  118 +++++++++++++++++++++++++++++++++++------------------
 1 file changed, 78 insertions(+), 40 deletions(-)

--- a/kernel/taskstats.c
+++ b/kernel/taskstats.c
@@ -424,39 +424,76 @@ err:
 	return rc;
 }
 
-static int taskstats_user_cmd(struct sk_buff *skb, struct genl_info *info)
+static int cmd_attr_register_cpumask(struct genl_info *info)
 {
-	int rc;
-	struct sk_buff *rep_skb;
-	struct taskstats *stats;
-	size_t size;
 	cpumask_var_t mask;
+	int rc;
 
 	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
 		return -ENOMEM;
-
 	rc = parse(info->attrs[TASKSTATS_CMD_ATTR_REGISTER_CPUMASK], mask);
 	if (rc < 0)
-		goto free_return_rc;
-	if (rc == 0) {
-		rc = add_del_listener(info->snd_pid, mask, REGISTER);
-		goto free_return_rc;
-	}
+		goto out;
+	rc = add_del_listener(info->snd_pid, mask, REGISTER);
+out:
+	free_cpumask_var(mask);
+	return rc;
+}
+
+static int cmd_attr_deregister_cpumask(struct genl_info *info)
+{
+	cpumask_var_t mask;
+	int rc;
 
+	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+		return -ENOMEM;
 	rc = parse(info->attrs[TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK], mask);
 	if (rc < 0)
-		goto free_return_rc;
-	if (rc == 0) {
-		rc = add_del_listener(info->snd_pid, mask, DEREGISTER);
-free_return_rc:
-		free_cpumask_var(mask);
-		return rc;
-	}
+		goto out;
+	rc = add_del_listener(info->snd_pid, mask, DEREGISTER);
+out:
 	free_cpumask_var(mask);
+	return rc;
+}
+
+static int cmd_attr_pid(struct genl_info *info)
+{
+	struct taskstats *stats;
+	struct sk_buff *rep_skb;
+	size_t size;
+	u32 pid;
+	int rc;
+
+	size = nla_total_size(sizeof(u32)) +
+		nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
+
+	rc = prepare_reply(info, TASKSTATS_CMD_NEW, &rep_skb, size);
+	if (rc < 0)
+		return rc;
+
+	rc = -EINVAL;
+	pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
+	stats = mk_reply(rep_skb, TASKSTATS_TYPE_PID, pid);
+	if (!stats)
+		goto err;
+
+	rc = fill_pid(pid, NULL, stats);
+	if (rc < 0)
+		goto err;
+	return send_reply(rep_skb, info);
+err:
+	nlmsg_free(rep_skb);
+	return rc;
+}
+
+static int cmd_attr_tgid(struct genl_info *info)
+{
+	struct taskstats *stats;
+	struct sk_buff *rep_skb;
+	size_t size;
+	u32 tgid;
+	int rc;
 
-	/*
-	 * Size includes space for nested attributes
-	 */
 	size = nla_total_size(sizeof(u32)) +
 		nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
 
@@ -465,33 +502,34 @@ free_return_rc:
 		return rc;
 
 	rc = -EINVAL;
-	if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
-		u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
-		stats = mk_reply(rep_skb, TASKSTATS_TYPE_PID, pid);
-		if (!stats)
-			goto err;
-
-		rc = fill_pid(pid, NULL, stats);
-		if (rc < 0)
-			goto err;
-	} else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
-		u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
-		stats = mk_reply(rep_skb, TASKSTATS_TYPE_TGID, tgid);
-		if (!stats)
-			goto err;
-
-		rc = fill_tgid(tgid, NULL, stats);
-		if (rc < 0)
-			goto err;
-	} else
+	tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
+	stats = mk_reply(rep_skb, TASKSTATS_TYPE_TGID, tgid);
+	if (!stats)
 		goto err;
 
+	rc = fill_tgid(tgid, NULL, stats);
+	if (rc < 0)
+		goto err;
 	return send_reply(rep_skb, info);
 err:
 	nlmsg_free(rep_skb);
 	return rc;
 }
 
+static int taskstats_user_cmd(struct sk_buff *skb, struct genl_info *info)
+{
+	if (info->attrs[TASKSTATS_CMD_ATTR_REGISTER_CPUMASK])
+		return cmd_attr_register_cpumask(info);
+	else if (info->attrs[TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK])
+		return cmd_attr_deregister_cpumask(info);
+	else if (info->attrs[TASKSTATS_CMD_ATTR_PID])
+		return cmd_attr_pid(info);
+	else if (info->attrs[TASKSTATS_CMD_ATTR_TGID])
+		return cmd_attr_tgid(info);
+	else
+		return -EINVAL;
+}
+
 static struct taskstats *taskstats_tgid_alloc(struct task_struct *tsk)
 {
 	struct signal_struct *sig = tsk->signal;




* [RFC][PATCH 03/10] taskstats: Split fill_pid function
From: Michael Holzheu @ 2010-09-23 14:01 UTC (permalink / raw)
  To: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Peter Zijlstra, Suresh Siddha, John stultz, Thomas Gleixner,
	Oleg Nesterov, Balbir Singh, Ingo Molnar, Heiko Carstens,
	Martin Schwidefsky
  Cc: linux-s390, linux-kernel

Subject: [PATCH] taskstats: Split fill_pid function

From: Michael Holzheu <holzheu@linux.vnet.ibm.com>

Separate the finding of a task_struct by pid or tgid from filling the taskstats
data. This makes the code more readable.

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
 kernel/taskstats.c |   50 +++++++++++++++++++++-----------------------------
 1 file changed, 21 insertions(+), 29 deletions(-)

--- a/kernel/taskstats.c
+++ b/kernel/taskstats.c
@@ -175,22 +175,8 @@ static void send_cpu_listeners(struct sk
 	up_write(&listeners->sem);
 }
 
-static int fill_pid(pid_t pid, struct task_struct *tsk,
-		struct taskstats *stats)
+static void fill_stats(struct task_struct *tsk, struct taskstats *stats)
 {
-	int rc = 0;
-
-	if (!tsk) {
-		rcu_read_lock();
-		tsk = find_task_by_vpid(pid);
-		if (tsk)
-			get_task_struct(tsk);
-		rcu_read_unlock();
-		if (!tsk)
-			return -ESRCH;
-	} else
-		get_task_struct(tsk);
-
 	memset(stats, 0, sizeof(*stats));
 	/*
 	 * Each accounting subsystem adds calls to its functions to
@@ -209,17 +195,27 @@ static int fill_pid(pid_t pid, struct ta
 
 	/* fill in extended acct fields */
 	xacct_add_tsk(stats, tsk);
+}
 
-	/* Define err: label here if needed */
-	put_task_struct(tsk);
-	return rc;
+static int fill_stats_for_pid(pid_t pid, struct taskstats *stats)
+{
+	struct task_struct *tsk;
 
+	rcu_read_lock();
+	tsk = find_task_by_vpid(pid);
+	if (tsk)
+		get_task_struct(tsk);
+	rcu_read_unlock();
+	if (!tsk)
+		return -ESRCH;
+	fill_stats(tsk, stats);
+	put_task_struct(tsk);
+	return 0;
 }
 
-static int fill_tgid(pid_t tgid, struct task_struct *first,
-		struct taskstats *stats)
+static int fill_stats_for_tgid(pid_t tgid, struct taskstats *stats)
 {
-	struct task_struct *tsk;
+	struct task_struct *tsk, *first;
 	unsigned long flags;
 	int rc = -ESRCH;
 
@@ -228,8 +224,7 @@ static int fill_tgid(pid_t tgid, struct 
 	 * leaders who are already counted with the dead tasks
 	 */
 	rcu_read_lock();
-	if (!first)
-		first = find_task_by_vpid(tgid);
+	first = find_task_by_vpid(tgid);
 
 	if (!first || !lock_task_sighand(first, &flags))
 		goto out;
@@ -268,7 +263,6 @@ out:
 	return rc;
 }
 
-
 static void fill_tgid_exit(struct task_struct *tsk)
 {
 	unsigned long flags;
@@ -477,7 +471,7 @@ static int cmd_attr_pid(struct genl_info
 	if (!stats)
 		goto err;
 
-	rc = fill_pid(pid, NULL, stats);
+	rc = fill_stats_for_pid(pid, stats);
 	if (rc < 0)
 		goto err;
 	return send_reply(rep_skb, info);
@@ -507,7 +501,7 @@ static int cmd_attr_tgid(struct genl_inf
 	if (!stats)
 		goto err;
 
-	rc = fill_tgid(tgid, NULL, stats);
+	rc = fill_stats_for_tgid(tgid, stats);
 	if (rc < 0)
 		goto err;
 	return send_reply(rep_skb, info);
@@ -593,9 +587,7 @@ void taskstats_exit(struct task_struct *
 	if (!stats)
 		goto err;
 
-	rc = fill_pid(-1, tsk, stats);
-	if (rc < 0)
-		goto err;
+	fill_stats(tsk, stats);
 
 	/*
 	 * Doesn't matter if tsk is the leader or the last group member leaving




* [RFC][PATCH 04/10] taskstats: Add new taskstats command TASKSTATS_CMD_ATTR_PIDS
From: Michael Holzheu @ 2010-09-23 14:01 UTC (permalink / raw)
  To: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Peter Zijlstra, Suresh Siddha, John stultz, Thomas Gleixner,
	Oleg Nesterov, Balbir Singh, Ingo Molnar, Heiko Carstens,
	Martin Schwidefsky
  Cc: linux-s390, linux-kernel

Subject: [PATCH] taskstats: Add new taskstats command TASKSTATS_CMD_ATTR_PIDS

From: Michael Holzheu <holzheu@linux.vnet.ibm.com>

The new command is designed to be used by commands like "top" that want
to create a snapshot of the running tasks. The command has the following
arguments:

 * pid:  PID to start searching
 * cnt:  Maximum number of taskstats to be returned
 * time: Timestamp (sched_clock)

The semantics of the command are as follows:

Return at most 'cnt' taskstats structs for tasks with a PID greater than or
equal to 'pid' that have been or still are in state TASK_RUNNING between the
given 'time' and now. 'time' correlates with the new taskstats field time_ns.

If no more taskstats are found, a final zero taskstats struct is returned that
marks the end of the netlink transmission.

The sched_clock() function is used as the clock for 'now' and 'time', and the
patch adds a new field last_depart to the sched_info structure.

Sequence numbers are used to ensure reliable netlink communication with user
space. The first taskstats struct is returned with the same sequence number as
the command; the following taskstats are transmitted with ascending sequence
numbers.
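
On the receiving side this leads to a loop like the following sketch.
recv_one_reply() is a hypothetical helper, not part of this patch set, that
reads the next netlink message and extracts the contained taskstats struct
together with its nlmsg_seq:

#include <string.h>
#include <linux/taskstats.h>

/* hypothetical helper (not part of the patches): read the next netlink
 * reply and return its struct taskstats and nlmsg_seq */
int recv_one_reply(int sd, struct taskstats *ts, __u32 *seq);

/*
 * Collect the replies of one TASKSTATS_CMD_ATTR_PIDS request into 'vec'.
 * Returns the number of taskstats received, or -1 if a message was lost.
 */
static int collect_snapshot(int sd, __u32 req_seq, struct taskstats *vec,
			    int max)
{
	static const struct taskstats zero;
	struct taskstats ts;
	__u32 seq;
	int n = 0;

	while (n < max && recv_one_reply(sd, &ts, &seq) == 0) {
		/* the first reply carries the request's sequence number,
		 * later replies use ascending sequence numbers */
		if (seq != req_seq + n)
			return -1;
		/* an all-zero struct marks the end of the transmission */
		if (memcmp(&ts, &zero, sizeof(ts)) == 0)
			break;
		vec[n++] = ts;
	}
	return n;
}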

The new command can be used by user space as follows (pseudo code):

Initial: Get taskstats for all tasks

   int start_pid = 0, start_time = 0, oldest_time = INT_MAX;
   struct taskstats taskstats_vec[50];

   do {
          cnt = cmd_pids(start_pid, start_time, taskstats_vec, 50);
          for (i = 0; i < cnt; i++)
                  oldest_time = MIN(oldest_time, taskstats_vec[i].time_ns);
          update_database(taskstats_vec, cnt);
          start_pid = taskstats_vec[cnt - 1].ac_pid;
   } while (cnt == 50);

Update: Get all taskstats for tasks that were active after 'oldest_time'

   new_oldest_time = INT_MAX;
   start_pid = 0;
   do {
          cnt = cmd_pids(start_pid, oldest_time, taskstats_vec, 50);
          for (i = 0; i < cnt; i++)
                  new_oldest_time = MIN(new_oldest_time,
                                        taskstats_vec[i].time_ns);
          update_database(taskstats_vec, cnt);
          start_pid = taskstats_vec[cnt - 1].ac_pid;
   } while (cnt == 50);
   oldest_time = new_oldest_time;

The current approach assumes that sched_clock() can't wrap. If this
assumption is not true, things will become more difficult. The taskstats code
has to detect that sched_clock() wrapped since the last query and then return
a notification to user space. User space could then use oldest_time=0 to
resynchronize.

GOALS OF THIS PATCH
-------------------
Compared to the already existing taskstats command TASKSTATS_CMD_ATTR_PID,
the new command has the following advantages for implementing tools like top:
* No scan of procfs is necessary to find the running tasks.
* A consistent snapshot of task data is possible, if all taskstats can be
  transferred to the socket buffer.
* When using the 'time' parameter, only active tasks have to be transferred.
  This could be used for a special 'low CPU overhead' monitoring mode.
* Fewer system calls are necessary, because only one command has to be sent
  to receive multiple taskstats.

OPEN ISSUES
-----------
* Because of the netlink socket buffer size restriction (default 64KB) it
  is not possible to transfer a consistent full taskstats snapshot that
  contains all tasks. See the procfs patch as a proposal to solve this problem.
* Is it a good idea to use the tasklist_lock for the snapshot? To get a
  really consistent snapshot this is probably necessary.
* Force only non-idle CPUs to account.
* Possibly inconsistent data, because we use taskstats_fill_atomic() first and
  taskstats_fill_sleep() afterwards.
* Complete the aggregation of swapper tasks.
* Add more filters besides the 'time' parameter?
* Find a solution for the case that sched_clock() can wrap.

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
 arch/s390/kernel/vtime.c       |   19 ++-
 include/linux/sched.h          |    4 
 include/linux/taskstats.h      |   13 ++
 include/linux/taskstats_kern.h |    9 +
 include/linux/tsacct_kern.h    |    4 
 kernel/Makefile                |    2 
 kernel/sched.c                 |    6 +
 kernel/sched_stats.h           |    1 
 kernel/taskstats.c             |   89 +++++++++++++++++-
 kernel/taskstats_snap.c        |  201 +++++++++++++++++++++++++++++++++++++++++
 kernel/tsacct.c                |   32 ++++--
 11 files changed, 352 insertions(+), 28 deletions(-)

Index: git-linux-2.6/arch/s390/kernel/vtime.c
===================================================================
--- git-linux-2.6.orig/arch/s390/kernel/vtime.c	2010-09-23 14:14:04.000000000 +0200
+++ git-linux-2.6/arch/s390/kernel/vtime.c	2010-09-23 14:16:17.000000000 +0200
@@ -56,31 +56,34 @@
 {
 	struct thread_info *ti = task_thread_info(tsk);
 	__u64 timer, clock, user, system, steal;
+	unsigned char clk[16];
 
 	timer = S390_lowcore.last_update_timer;
 	clock = S390_lowcore.last_update_clock;
 	asm volatile ("  STPT %0\n"    /* Store current cpu timer value */
-		      "  STCK %1"      /* Store current tod clock value */
+		      "  STCKE 0(%2)"  /* Store current tod clock value */
 		      : "=m" (S390_lowcore.last_update_timer),
-		        "=m" (S390_lowcore.last_update_clock) );
+		        "=m" (clk) : "a" (clk));
+	S390_lowcore.last_update_clock = *(__u64 *) &clk[1];
+	tsk->acct_time = ((clock - sched_clock_base_cc) * 125) >> 9;
 	S390_lowcore.system_timer += timer - S390_lowcore.last_update_timer;
 	S390_lowcore.steal_timer += S390_lowcore.last_update_clock - clock;
 
 	user = S390_lowcore.user_timer - ti->user_timer;
-	S390_lowcore.steal_timer -= user;
 	ti->user_timer = S390_lowcore.user_timer;
 	account_user_time(tsk, user, user);
 
 	system = S390_lowcore.system_timer - ti->system_timer;
-	S390_lowcore.steal_timer -= system;
 	ti->system_timer = S390_lowcore.system_timer;
 	account_system_time(tsk, hardirq_offset, system, system);
 
 	steal = S390_lowcore.steal_timer;
-	if ((s64) steal > 0) {
-		S390_lowcore.steal_timer = 0;
-		account_steal_time(steal);
-	}
+	S390_lowcore.steal_timer = 0;
+	if (steal >= user + system)
+		steal -= user + system;
+	else
+		steal = 0;
+	account_steal_time(steal);
 }
 
 void account_vtime(struct task_struct *prev, struct task_struct *next)
Index: git-linux-2.6/include/linux/sched.h
===================================================================
--- git-linux-2.6.orig/include/linux/sched.h	2010-09-23 14:14:04.000000000 +0200
+++ git-linux-2.6/include/linux/sched.h	2010-09-23 14:16:17.000000000 +0200
@@ -708,7 +708,8 @@
 
 	/* timestamps */
 	unsigned long long last_arrival,/* when we last ran on a cpu */
-			   last_queued;	/* when we were last queued to run */
+			   last_queued,	/* when we were last queued to run */
+			   last_depart;	/* when we last departed from a cpu */
 #ifdef CONFIG_SCHEDSTATS
 	/* BKL stats */
 	unsigned int bkl_count;
@@ -1278,6 +1279,7 @@
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
 	cputime_t prev_utime, prev_stime;
 #endif
+	unsigned long long acct_time;		/* Time for last accounting */
 	unsigned long nvcsw, nivcsw; /* context switch counts */
 	struct timespec start_time; 		/* monotonic time */
 	struct timespec real_start_time;	/* boot based time */
Index: git-linux-2.6/include/linux/taskstats.h
===================================================================
--- git-linux-2.6.orig/include/linux/taskstats.h	2010-09-23 14:14:04.000000000 +0200
+++ git-linux-2.6/include/linux/taskstats.h	2010-09-23 14:16:17.000000000 +0200
@@ -33,7 +33,7 @@
  */
 
 
-#define TASKSTATS_VERSION	7
+#define TASKSTATS_VERSION	8
 #define TS_COMM_LEN		32	/* should be >= TASK_COMM_LEN
 					 * in linux/sched.h */
 
@@ -163,6 +163,10 @@
 	/* Delay waiting for memory reclaim */
 	__u64	freepages_count;
 	__u64	freepages_delay_total;
+	/* version 7 ends here */
+
+	/* Timestamp where data has been collected in ns since boot time */
+	__u64	time_ns;
 };
 
 
@@ -199,9 +203,16 @@
 	TASKSTATS_CMD_ATTR_TGID,
 	TASKSTATS_CMD_ATTR_REGISTER_CPUMASK,
 	TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK,
+	TASKSTATS_CMD_ATTR_PIDS,
 	__TASKSTATS_CMD_ATTR_MAX,
 };
 
+struct taskstats_cmd_pids {
+	__u64	time_ns;
+	__u32	pid;
+	__u32	cnt;
+};
+
 #define TASKSTATS_CMD_ATTR_MAX (__TASKSTATS_CMD_ATTR_MAX - 1)
 
 /* NETLINK_GENERIC related info */
Index: git-linux-2.6/include/linux/taskstats_kern.h
===================================================================
--- git-linux-2.6.orig/include/linux/taskstats_kern.h	2010-09-23 14:14:04.000000000 +0200
+++ git-linux-2.6/include/linux/taskstats_kern.h	2010-09-23 14:16:17.000000000 +0200
@@ -23,6 +23,15 @@
 
 extern void taskstats_exit(struct task_struct *, int group_dead);
 extern void taskstats_init_early(void);
+extern void taskstats_fill(struct task_struct *tsk, struct taskstats *stats);
+extern int taskstats_snap(int pid_start, int cnt, u64 time_ns,
+			  struct taskstats *stats_vec);
+extern int taskstats_snap_user(int pid_start, int cnt, u64 time_ns,
+			       struct taskstats *stats_vec);
+extern void taskstats_fill_atomic(struct task_struct *tsk,
+				  struct taskstats *stats);
+extern void taskstats_fill_sleep(struct task_struct *tsk,
+				 struct taskstats *stats);
 #else
 static inline void taskstats_exit(struct task_struct *tsk, int group_dead)
 {}
Index: git-linux-2.6/include/linux/tsacct_kern.h
===================================================================
--- git-linux-2.6.orig/include/linux/tsacct_kern.h	2010-09-23 14:14:04.000000000 +0200
+++ git-linux-2.6/include/linux/tsacct_kern.h	2010-09-23 14:16:17.000000000 +0200
@@ -18,11 +18,15 @@
 
 #ifdef CONFIG_TASK_XACCT
 extern void xacct_add_tsk(struct taskstats *stats, struct task_struct *p);
+extern void xacct_add_tsk_mem(struct taskstats *stats, struct task_struct *p);
 extern void acct_update_integrals(struct task_struct *tsk);
 extern void acct_clear_integrals(struct task_struct *tsk);
 #else
 static inline void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
 {}
+static inline void xacct_add_tsk_mem(struct taskstats *stats,
+				     struct task_struct *p)
+{}
 static inline void acct_update_integrals(struct task_struct *tsk)
 {}
 static inline void acct_clear_integrals(struct task_struct *tsk)
Index: git-linux-2.6/kernel/Makefile
===================================================================
--- git-linux-2.6.orig/kernel/Makefile	2010-09-23 14:14:04.000000000 +0200
+++ git-linux-2.6/kernel/Makefile	2010-09-23 14:16:17.000000000 +0200
@@ -89,7 +89,7 @@
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
-obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
+obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o taskstats_snap.o
 obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
 obj-$(CONFIG_LATENCYTOP) += latencytop.o
 obj-$(CONFIG_BINFMT_ELF) += elfcore.o
Index: git-linux-2.6/kernel/sched.c
===================================================================
--- git-linux-2.6.orig/kernel/sched.c	2010-09-23 14:15:42.000000000 +0200
+++ git-linux-2.6/kernel/sched.c	2010-09-23 14:16:17.000000000 +0200
@@ -9187,4 +9187,10 @@
 }
 EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
 
+struct task_struct *get_swapper(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	return rq->idle;
+}
+
 #endif /* #else #ifndef CONFIG_SMP */
Index: git-linux-2.6/kernel/sched_stats.h
===================================================================
--- git-linux-2.6.orig/kernel/sched_stats.h	2010-09-23 14:14:04.000000000 +0200
+++ git-linux-2.6/kernel/sched_stats.h	2010-09-23 14:16:17.000000000 +0200
@@ -236,6 +236,7 @@
 	unsigned long long delta = task_rq(t)->clock -
 					t->sched_info.last_arrival;
 
+	t->sched_info.last_depart = task_rq(t)->clock;
 	rq_sched_info_depart(task_rq(t), delta);
 
 	if (t->state == TASK_RUNNING)
Index: git-linux-2.6/kernel/taskstats.c
===================================================================
--- git-linux-2.6.orig/kernel/taskstats.c	2010-09-23 14:16:16.000000000 +0200
+++ git-linux-2.6/kernel/taskstats.c	2010-09-23 14:16:17.000000000 +0200
@@ -27,6 +27,7 @@
 #include <linux/cgroup.h>
 #include <linux/fs.h>
 #include <linux/file.h>
+#include <linux/vmalloc.h>
 #include <net/genetlink.h>
 #include <asm/atomic.h>
 
@@ -51,7 +52,8 @@
 	[TASKSTATS_CMD_ATTR_PID]  = { .type = NLA_U32 },
 	[TASKSTATS_CMD_ATTR_TGID] = { .type = NLA_U32 },
 	[TASKSTATS_CMD_ATTR_REGISTER_CPUMASK] = { .type = NLA_STRING },
-	[TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK] = { .type = NLA_STRING },};
+	[TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK] = { .type = NLA_STRING },
+	[TASKSTATS_CMD_ATTR_PIDS] = { .type = NLA_BINARY },};
 
 static const struct nla_policy cgroupstats_cmd_get_policy[CGROUPSTATS_CMD_ATTR_MAX+1] = {
 	[CGROUPSTATS_CMD_ATTR_FD] = { .type = NLA_U32 },
@@ -175,9 +177,12 @@
 	up_write(&listeners->sem);
 }
 
-static void fill_stats(struct task_struct *tsk, struct taskstats *stats)
+void taskstats_fill_atomic(struct task_struct *tsk, struct taskstats *stats)
 {
 	memset(stats, 0, sizeof(*stats));
+	preempt_disable();
+	stats->time_ns = sched_clock();
+	preempt_enable();
 	/*
 	 * Each accounting subsystem adds calls to its functions to
 	 * fill in relevant parts of struct taskstsats as follows
@@ -197,6 +202,17 @@
 	xacct_add_tsk(stats, tsk);
 }
 
+void taskstats_fill_sleep(struct task_struct *tsk, struct taskstats *stats)
+{
+	xacct_add_tsk_mem(stats, tsk);
+}
+
+void taskstats_fill(struct task_struct *tsk, struct taskstats *stats)
+{
+	taskstats_fill_atomic(tsk, stats);
+	taskstats_fill_sleep(tsk, stats);
+}
+
 static int fill_stats_for_pid(pid_t pid, struct taskstats *stats)
 {
 	struct task_struct *tsk;
@@ -208,7 +224,7 @@
 	rcu_read_unlock();
 	if (!tsk)
 		return -ESRCH;
-	fill_stats(tsk, stats);
+	taskstats_fill(tsk, stats);
 	put_task_struct(tsk);
 	return 0;
 }
@@ -418,6 +434,68 @@
 	return rc;
 }
 
+static int cmd_attr_pids(struct genl_info *info)
+{
+	struct taskstats_cmd_pids *cmd_pids;
+	struct taskstats *stats_vec;
+	struct sk_buff *rep_skb;
+	struct taskstats *stats;
+	unsigned int tsk_cnt, i;
+	size_t size;
+	int  rc;
+
+	size = nla_total_size(sizeof(u32)) +
+		nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
+	cmd_pids = nla_data(info->attrs[TASKSTATS_CMD_ATTR_PIDS]);
+
+	if (cmd_pids->cnt > 1000) // XXX socket buffer size check
+		return -EINVAL;
+
+	stats_vec = vmalloc(sizeof(struct taskstats) * cmd_pids->cnt);
+	if (!stats_vec)
+		return -ENOMEM;
+
+	rc = taskstats_snap(cmd_pids->pid, cmd_pids->cnt,
+			    cmd_pids->time_ns, stats_vec);
+	if (rc < 0)
+		goto fail_vfree;
+	tsk_cnt = rc;
+	for (i = 0; i < min(cmd_pids->cnt, tsk_cnt + 1); i++) {
+		rc = prepare_reply(info, TASKSTATS_CMD_NEW, &rep_skb, size);
+		if (rc < 0)
+			goto fail_vfree;
+		if (i < tsk_cnt) {
+			stats = mk_reply(rep_skb, TASKSTATS_TYPE_PID,
+					 stats_vec[i].ac_pid);
+			if (!stats) {
+				rc = -ENOMEM;
+				goto fail_nlmsg_free;
+			}
+			memcpy(stats, &stats_vec[i], sizeof(*stats));
+		} else {
+			/* zero taskstats marks end of transmission */
+			stats = mk_reply(rep_skb, TASKSTATS_TYPE_PID, 0);
+			if (!stats) {
+				rc = -ENOMEM;
+				goto fail_nlmsg_free;
+			}
+			memset(stats, 0, sizeof(*stats));
+		}
+		rc = send_reply(rep_skb, info);
+		if (rc)
+			goto fail_nlmsg_free;
+		info->snd_seq++;
+	}
+	vfree(stats_vec);
+	return 0;
+
+fail_nlmsg_free:
+	nlmsg_free(rep_skb);
+fail_vfree:
+	vfree(stats_vec);
+	return rc;
+}
+
 static int cmd_attr_register_cpumask(struct genl_info *info)
 {
 	cpumask_var_t mask;
@@ -520,6 +598,8 @@
 		return cmd_attr_pid(info);
 	else if (info->attrs[TASKSTATS_CMD_ATTR_TGID])
 		return cmd_attr_tgid(info);
+	else if (info->attrs[TASKSTATS_CMD_ATTR_PIDS])
+		return cmd_attr_pids(info);
 	else
 		return -EINVAL;
 }
@@ -587,7 +667,7 @@
 	if (!stats)
 		goto err;
 
-	fill_stats(tsk, stats);
+	taskstats_fill(tsk, stats);
 
 	/*
 	 * Doesn't matter if tsk is the leader or the last group member leaving
@@ -647,7 +727,6 @@
 	rc = genl_register_ops(&family, &cgroupstats_ops);
 	if (rc < 0)
 		goto err_cgroup_ops;
-
 	family_registered = 1;
 	printk("registered taskstats version %d\n", TASKSTATS_GENL_VERSION);
 	return 0;
Index: git-linux-2.6/kernel/taskstats_snap.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ git-linux-2.6/kernel/taskstats_snap.c	2010-09-23 14:16:17.000000000 +0200
@@ -0,0 +1,201 @@
+/*
+ * taskstats_snap.c - Create exact taskstats snapshot
+ *
+ * Copyright IBM Corp. 2010
+ * Author(s): Michael Holzheu <holzheu@linux.vnet.ibm.com>
+ */
+
+#include <linux/taskstats_kern.h>
+#include <linux/pid_namespace.h>
+#include <linux/kernel_stat.h>
+#include <linux/vmalloc.h>
+#include <asm/uaccess.h>
+
+static DECLARE_WAIT_QUEUE_HEAD(snapshot_wait);
+static atomic_t snapshot_use;
+
+static int wait_snapshot(void)
+{
+	while (atomic_cmpxchg(&snapshot_use, 0, 1) != 0) {
+		if (wait_event_interruptible(snapshot_wait,
+					     atomic_read(&snapshot_use) == 0))
+			return -ERESTARTSYS;
+	}
+	return 0;
+}
+
+static void wake_up_snapshot(void)
+{
+	atomic_set(&snapshot_use, 0);
+	wake_up_interruptible(&snapshot_wait);
+}
+
+static void force_accounting(void *ptr)
+{
+	account_process_tick(current, 1);
+}
+
+/*
+ * TODO Do not force idle CPUs to do accounting
+ */
+static void account_online_cpus(void)
+{
+	smp_call_function(force_accounting, NULL, 1);
+}
+
+extern struct task_struct *get_swapper(int cpu);
+
+/*
+ * TODO Implement complete taskstasts_add() function that aggregates
+ *      all fields.
+ */
+static void taskstats_add(struct taskstats *ts1, struct taskstats *ts2)
+{
+	ts1->ac_utime += ts2->ac_utime;
+	ts1->ac_stime += ts2->ac_stime;
+	ts1->ac_sttime += ts2->ac_sttime;
+}
+
+static struct task_struct *find_next_task(int pid_start, u64 time_ns)
+{
+	struct pid_namespace *ns = current->nsproxy->pid_ns;
+	struct task_struct *tsk;
+	struct pid *pid;
+
+	do {
+		pid = find_ge_pid(pid_start, ns);
+		if (!pid) {
+			tsk = NULL;
+			break;
+		}
+		tsk = pid_task(pid, PIDTYPE_PID);
+		if (tsk && (tsk->state == TASK_RUNNING ||
+		    tsk->sched_info.last_depart > time_ns)) {
+			get_task_struct(tsk);
+			break;
+		}
+		pid_start++;
+	} while (1);
+	return tsk;
+}
+
+int taskstats_snap(int pid_start, int cnt, u64 time_ns,
+		   struct taskstats *stats_vec)
+{
+	int tsk_cnt = 0, rc, i, cpu, first = 1;
+	struct task_struct *tsk, **tsk_vec;
+	u32 pid_curr = pid_start;
+	struct taskstats *ts;
+	u64 task_snap_time;
+
+	rc = wait_snapshot();
+	if (rc)
+		return rc;
+
+	rc = -ENOMEM;
+	tsk_vec = kmalloc(sizeof(struct task_struct *) * cnt, GFP_KERNEL);
+	if (!tsk_vec)
+		goto fail_wake_up_snapshot;
+	ts = kzalloc(sizeof(*ts), GFP_KERNEL);
+	if (!ts)
+		goto fail_free_tsk_vec;
+
+	task_snap_time = sched_clock();
+
+	/*
+	 * Force running CPUs to do accounting
+	 */
+	account_online_cpus();
+
+	read_lock(&tasklist_lock);
+//	rcu_read_lock();
+
+	/*
+	 * Aggregate swapper tasks (pid = 0)
+	 */
+	if (pid_curr == 0) {
+		for_each_online_cpu(cpu) {
+			tsk = get_swapper(cpu);
+			if (tsk->state != TASK_RUNNING &&
+			    tsk->sched_info.last_depart < time_ns)
+				continue;
+			if (first) {
+				tsk_vec[0] = tsk;
+				taskstats_fill_atomic(tsk, &stats_vec[tsk_cnt]);
+				stats_vec[tsk_cnt].ac_pid = 0;
+				first = 0;
+			} else {
+				taskstats_fill_atomic(tsk, ts);
+				taskstats_add(&stats_vec[tsk_cnt], ts);
+			}
+			if (tsk->acct_time < task_snap_time)
+				stats_vec[tsk_cnt].time_ns = task_snap_time;
+			else
+				stats_vec[tsk_cnt].time_ns = tsk->acct_time;
+		}
+		tsk_cnt++;
+		pid_curr++;
+	}
+	/*
+	 * Collect normal tasks (pid >=1)
+	 */
+	do {
+		tsk = find_next_task(pid_curr, time_ns);
+		if (!tsk)
+			break;
+		taskstats_fill_atomic(tsk, &stats_vec[tsk_cnt]);
+		tsk_vec[tsk_cnt] = tsk;
+		if (tsk->acct_time < task_snap_time)
+			stats_vec[tsk_cnt].time_ns = task_snap_time;
+		else
+			stats_vec[tsk_cnt].time_ns = tsk->acct_time;
+		tsk_cnt++;
+		pid_curr = next_pidmap(current->nsproxy->pid_ns, tsk->pid);
+	} while (tsk_cnt != cnt);
+
+//	rcu_read_unlock();
+	read_unlock(&tasklist_lock);
+
+	/*
+	 * Add rest of accounting information that we can't add under lock
+	 */
+	for (i = 0; i < tsk_cnt; i++) {
+		if (tsk_vec[i]->pid == 0)
+			continue;
+		taskstats_fill_sleep(tsk_vec[i], &stats_vec[i]);
+		put_task_struct(tsk_vec[i]);
+	}
+	rc = tsk_cnt;
+
+	kfree(ts);
+fail_free_tsk_vec:
+	kfree(tsk_vec);
+fail_wake_up_snapshot:
+	wake_up_snapshot();
+	return rc;
+}
+
+int taskstats_snap_user(int pid_start, int cnt, u64 time_ns,
+			struct taskstats *stats_vec)
+{
+	struct taskstats *stats_vec_int;
+	int i, tsk_cnt, rc;
+
+	stats_vec_int = vmalloc(sizeof(struct taskstats) * cnt);
+	if (!stats_vec_int)
+		return -ENOMEM;
+
+	tsk_cnt = taskstats_snap(pid_start, cnt, time_ns, stats_vec_int);
+
+	for (i = 0; i < tsk_cnt; i++) {
+		if (copy_to_user(&stats_vec[i], &stats_vec_int[i],
+				 sizeof(struct taskstats))) {
+			rc = -EFAULT;
+			goto out;
+		}
+	}
+	rc = tsk_cnt;
+out:
+	vfree(stats_vec_int);
+	return rc;
+}
Index: git-linux-2.6/kernel/tsacct.c
===================================================================
--- git-linux-2.6.orig/kernel/tsacct.c	2010-09-23 14:16:13.000000000 +0200
+++ git-linux-2.6/kernel/tsacct.c	2010-09-23 14:16:17.000000000 +0200
@@ -83,18 +83,6 @@
  */
 void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
 {
-	struct mm_struct *mm;
-
-	/* convert pages-usec to Mbyte-usec */
-	stats->coremem = p->acct_rss_mem1 * PAGE_SIZE / MB;
-	stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / MB;
-	mm = get_task_mm(p);
-	if (mm) {
-		/* adjust to KB unit */
-		stats->hiwater_rss   = get_mm_hiwater_rss(mm) * PAGE_SIZE / KB;
-		stats->hiwater_vm    = get_mm_hiwater_vm(mm)  * PAGE_SIZE / KB;
-		mmput(mm);
-	}
 	stats->read_char	= p->ioac.rchar;
 	stats->write_char	= p->ioac.wchar;
 	stats->read_syscalls	= p->ioac.syscr;
@@ -109,6 +97,26 @@
 	stats->cancelled_write_bytes = 0;
 #endif
 }
+
+/*
+ * fill in memory data (function can sleep)
+ */
+void xacct_add_tsk_mem(struct taskstats *stats, struct task_struct *p)
+{
+	struct mm_struct *mm;
+
+	/* convert pages-usec to Mbyte-usec */
+	stats->coremem = p->acct_rss_mem1 * PAGE_SIZE / MB;
+	stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / MB;
+	mm = get_task_mm(p);
+	if (mm) {
+		/* adjust to KB unit */
+		stats->hiwater_rss   = get_mm_hiwater_rss(mm) * PAGE_SIZE / KB;
+		stats->hiwater_vm    = get_mm_hiwater_vm(mm)  * PAGE_SIZE / KB;
+		mmput(mm);
+	}
+}
+
 #undef KB
 #undef MB
 




* [RFC][PATCH 05/10] taskstats: Add "/proc/taskstats"
From: Michael Holzheu @ 2010-09-23 14:01 UTC (permalink / raw)
  To: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Peter Zijlstra, Suresh Siddha, John stultz, Thomas Gleixner,
	Oleg Nesterov, Balbir Singh, Ingo Molnar, Heiko Carstens,
	Martin Schwidefsky
  Cc: linux-s390, linux-kernel

Subject: [PATCH] taskstats: Add "/proc/taskstats"

From: Michael Holzheu <holzheu@linux.vnet.ibm.com>

Add a procfs interface for the TASKSTATS_CMD_ATTR_PIDS taskstats command. A new
procfs file "/proc/taskstats" is introduced. With an ioctl the taskstats
command is defined; a subsequent read system call executes the defined command
and transfers its result into the read buffer. This allows getting a complete
and consistent snapshot of all tasks via two system calls (ioctl + read), when
a sufficiently large buffer is provided. This is not possible with the existing
netlink interface, because there the socket buffer size is a restricting
factor.
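
For illustration, a minimal user space sketch of this interface. It assumes
the patched <linux/taskstats.h> from patches 04 and 06, which provides struct
taskstats_cmd_pids and the ac_tgid field; buffer size, error handling and
output format are only illustrative:

/* snapshot.c - one ioctl() to define the command, one read() to execute it */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/taskstats.h>

#define SNAP_MAX 1000		/* matches the sanity limit in patch [04] */

int main(void)
{
	struct taskstats_cmd_pids cmd = {
		.time_ns = 0,	/* 0: all tasks, not only recently active ones */
		.pid	 = 0,	/* start with the aggregated swapper tasks */
		.cnt	 = SNAP_MAX,
	};
	struct taskstats *vec;
	ssize_t len;
	int fd, i;

	/* the read side copies up to cmd.cnt structures straight into the
	 * buffer, so it has to be large enough for all of them */
	vec = calloc(SNAP_MAX, sizeof(*vec));
	fd = open("/proc/taskstats", O_RDONLY);
	if (!vec || fd < 0)
		return 1;
	if (ioctl(fd, TASKSTATS_CMD_ATTR_PIDS, &cmd) < 0)	/* define command */
		return 1;
	len = read(fd, vec, SNAP_MAX * sizeof(*vec));		/* execute it */
	if (len < 0)
		return 1;
	for (i = 0; i < len / (ssize_t)sizeof(*vec); i++)
		printf("%5u %5u %s\n",
		       vec[i].ac_pid, vec[i].ac_tgid, vec[i].ac_comm);
	close(fd);
	free(vec);
	return 0;
}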

GOALS OF THIS PATCH
-------------------
* Allow transfer of a complete and consistent taskstats snapshot to user space.
* Reduce CPU time for data transmission compared to the netlink mechanism,
  because the proc solution is much more lightweight.
* User space code is much easier to write than with the netlink mechanism.

OPEN ISSUES
-----------
Currently only the TASKSTATS_CMD_ATTR_PIDS command is implemented. Implement
the following missing taskstats commands:
* TASKSTATS_CMD_ATTR_PID
* TASKSTATS_CMD_ATTR_TGID

Use "_IOW()" macros for ioctl definition. For example:
* #define TASKSTATS_IOCTL_PIDS _IOW('t', 1, struct taskstats_cmd_pids)
* #define TASKSTATS_IOCTL_PID  _IOW('t', 2, int)

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
 include/linux/taskstats_kern.h |    3 +
 kernel/Makefile                |    3 -
 kernel/taskstats.c             |    1 
 kernel/taskstats_proc.c        |   99 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 105 insertions(+), 1 deletion(-)

Index: git-linux-2.6/include/linux/taskstats_kern.h
===================================================================
--- git-linux-2.6.orig/include/linux/taskstats_kern.h	2010-09-23 14:16:17.000000000 +0200
+++ git-linux-2.6/include/linux/taskstats_kern.h	2010-09-23 14:16:23.000000000 +0200
@@ -32,6 +32,7 @@
 				  struct taskstats *stats);
 extern void taskstats_fill_sleep(struct task_struct *tsk,
 				 struct taskstats *stats);
+extern void taskstats_proc_init(void);
 #else
 static inline void taskstats_exit(struct task_struct *tsk, int group_dead)
 {}
@@ -39,6 +40,8 @@
 {}
 static inline void taskstats_init_early(void)
 {}
+static inline void taskstats_proc_init(void)
+{}
 #endif /* CONFIG_TASKSTATS */
 
 #endif
Index: git-linux-2.6/kernel/Makefile
===================================================================
--- git-linux-2.6.orig/kernel/Makefile	2010-09-23 14:16:17.000000000 +0200
+++ git-linux-2.6/kernel/Makefile	2010-09-23 14:16:23.000000000 +0200
@@ -89,7 +89,8 @@
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
-obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o taskstats_snap.o
+obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o taskstats_snap.o \
+			   taskstats_proc.o
 obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
 obj-$(CONFIG_LATENCYTOP) += latencytop.o
 obj-$(CONFIG_BINFMT_ELF) += elfcore.o
Index: git-linux-2.6/kernel/taskstats.c
===================================================================
--- git-linux-2.6.orig/kernel/taskstats.c	2010-09-23 14:16:17.000000000 +0200
+++ git-linux-2.6/kernel/taskstats.c	2010-09-23 14:16:23.000000000 +0200
@@ -727,6 +727,7 @@
 	rc = genl_register_ops(&family, &cgroupstats_ops);
 	if (rc < 0)
 		goto err_cgroup_ops;
+	taskstats_proc_init();
 	family_registered = 1;
 	printk("registered taskstats version %d\n", TASKSTATS_GENL_VERSION);
 	return 0;
Index: git-linux-2.6/kernel/taskstats_proc.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ git-linux-2.6/kernel/taskstats_proc.c	2010-09-23 14:16:23.000000000 +0200
@@ -0,0 +1,99 @@
+/*
+ * taskstats_proc.c - Export per-task statistics to userland using procfs
+ *
+ * Copyright IBM Corp. 2010
+ * Author(s): Michael Holzheu <holzheu@linux.vnet.ibm.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/taskstats_kern.h>
+#include <linux/proc_fs.h>
+#include <linux/file.h>
+#include <asm/uaccess.h>
+
+static ssize_t cmd_attr_pids_proc(struct taskstats *stats_vec,
+				  struct taskstats_cmd_pids *cmd_pids)
+{
+	int rc;
+
+	rc = taskstats_snap_user(cmd_pids->pid, cmd_pids->cnt,
+				 cmd_pids->time_ns, stats_vec);
+	if (rc < 0)
+		return rc;
+	else
+		return rc * sizeof(struct taskstats);
+}
+
+struct proc_cmd {
+	int	no;
+	union {
+		struct taskstats_cmd_pids cmd_pids;
+		__u32	pid;
+	} d;
+};
+
+static int proc_taskstats_open(struct inode *inode, struct file *file)
+{
+	struct proc_cmd *proc_cmd;
+
+	proc_cmd = kmalloc(sizeof(*proc_cmd), GFP_KERNEL);
+	if (!proc_cmd)
+		return -ENOMEM;
+	proc_cmd->no = -1;
+	file->private_data = proc_cmd;
+	return 0;
+}
+
+static int proc_taskstats_close(struct inode *inode, struct file *file)
+{
+	kfree(file->private_data);
+	return 0;
+}
+
+static long proc_taskstats_ioctl(struct file *file, unsigned int no,
+				 unsigned long data)
+{
+	struct proc_cmd *proc_cmd = file->private_data;
+	int rc = 0;
+
+	switch (no) {
+	case TASKSTATS_CMD_ATTR_PIDS:
+		proc_cmd->no = no;
+		if (copy_from_user(&proc_cmd->d.cmd_pids, (void *) data,
+				   sizeof(proc_cmd->d.cmd_pids)))
+			rc = -EFAULT;
+		break;
+	default:
+		rc = -EINVAL;
+		break;
+	}
+	return rc;
+}
+
+static ssize_t proc_taskstats_read(struct file *file, char __user *buf,
+				   size_t size, loff_t *ppos)
+{
+	struct proc_cmd *proc_cmd = file->private_data;
+
+	switch (proc_cmd->no) {
+	case TASKSTATS_CMD_ATTR_PIDS:
+		if (*ppos != 0)
+			return -EINVAL;
+		return cmd_attr_pids_proc((struct taskstats *) buf,
+					  &proc_cmd->d.cmd_pids);
+	default:
+		return -EINVAL;
+	}
+}
+
+static const struct file_operations proc_taskstats_ops = {
+	.open		= proc_taskstats_open,
+	.release	= proc_taskstats_close,
+	.read		= proc_taskstats_read,
+	.unlocked_ioctl	= proc_taskstats_ioctl,
+};
+
+void __init taskstats_proc_init(void)
+{
+	proc_create("taskstats", 0444, NULL, &proc_taskstats_ops);
+}





* [RFC][PATCH 06/10] taskstats: Add thread group ID to taskstats structure
From: Michael Holzheu @ 2010-09-23 14:01 UTC (permalink / raw)
  To: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Peter Zijlstra, Suresh Siddha, John stultz, Thomas Gleixner,
	Oleg Nesterov, Balbir Singh, Ingo Molnar, Heiko Carstens,
	Martin Schwidefsky
  Cc: linux-s390, linux-kernel

Subject: [PATCH] taskstats: Add thread group ID to taskstats structure

From: Michael Holzheu <holzheu@linux.vnet.ibm.com>

The tgid is important for aggregating threads in user space. Therefore we
add the tgid to the taskstats structure.

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
 include/linux/taskstats.h |    1 +
 kernel/tsacct.c           |    1 +
 2 files changed, 2 insertions(+)

--- a/include/linux/taskstats.h
+++ b/include/linux/taskstats.h
@@ -167,6 +167,7 @@ struct taskstats {
 
 	/* Timestamp where data has been collected in ns since boot time */
 	__u64	time_ns;
+	__u32	ac_tgid;		/* Thread group ID */
 };
 
 
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -56,6 +56,7 @@ void bacct_add_tsk(struct taskstats *sta
 	stats->ac_nice	 = task_nice(tsk);
 	stats->ac_sched	 = tsk->policy;
 	stats->ac_pid	 = tsk->pid;
+	stats->ac_tgid	 = tsk->tgid;
 	rcu_read_lock();
 	tcred = __task_cred(tsk);
 	stats->ac_uid	 = tcred->uid;




* [RFC][PATCH 07/10] taskstats: Add per task steal time accounting
From: Michael Holzheu @ 2010-09-23 14:01 UTC (permalink / raw)
  To: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Peter Zijlstra, Suresh Siddha, John stultz, Thomas Gleixner,
	Oleg Nesterov, Balbir Singh, Ingo Molnar, Heiko Carstens,
	Martin Schwidefsky
  Cc: linux-s390, linux-kernel

Subject: [PATCH] taskstats: Add per task steal time accounting

From: Michael Holzheu <holzheu@linux.vnet.ibm.com>

Currently steal time is only accounted for the whole system. With this
patch we add steal time to the per-task CPU time accounting.
The triplet "user time", "system time" and "steal time" represents
all consumed CPU time on hypervisor-based systems.

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
 arch/s390/kernel/vtime.c    |    2 +-
 fs/proc/array.c             |    6 +++---
 include/linux/kernel_stat.h |    2 +-
 include/linux/sched.h       |   14 ++++++++------
 include/linux/taskstats.h   |    1 +
 kernel/exit.c               |    9 +++++++--
 kernel/fork.c               |    1 +
 kernel/posix-cpu-timers.c   |    3 +++
 kernel/sched.c              |   26 ++++++++++++++++++++------
 kernel/sys.c                |   10 +++++-----
 kernel/tsacct.c             |    1 +
 11 files changed, 51 insertions(+), 24 deletions(-)

Index: git-linux-2.6/arch/s390/kernel/vtime.c
===================================================================
--- git-linux-2.6.orig/arch/s390/kernel/vtime.c	2010-09-23 14:16:17.000000000 +0200
+++ git-linux-2.6/arch/s390/kernel/vtime.c	2010-09-23 14:16:37.000000000 +0200
@@ -83,7 +83,7 @@
 		steal -= user + system;
 	else
 		steal = 0;
-	account_steal_time(steal);
+	account_steal_time(tsk, steal);
 }
 
 void account_vtime(struct task_struct *prev, struct task_struct *next)
Index: git-linux-2.6/fs/proc/array.c
===================================================================
--- git-linux-2.6.orig/fs/proc/array.c	2010-09-23 14:14:03.000000000 +0200
+++ git-linux-2.6/fs/proc/array.c	2010-09-23 14:16:37.000000000 +0200
@@ -375,7 +375,7 @@
 	unsigned long long start_time;
 	unsigned long cmin_flt = 0, cmaj_flt = 0;
 	unsigned long  min_flt = 0,  maj_flt = 0;
-	cputime_t cutime, cstime, utime, stime;
+	cputime_t cutime, cstime, utime, stime, sttime;
 	cputime_t cgtime, gtime;
 	unsigned long rsslim = 0;
 	char tcomm[sizeof(task->comm)];
@@ -432,7 +432,7 @@
 
 			min_flt += sig->min_flt;
 			maj_flt += sig->maj_flt;
-			thread_group_times(task, &utime, &stime);
+			thread_group_times(task, &utime, &stime, &sttime);
 			gtime = cputime_add(gtime, sig->gtime);
 		}
 
@@ -448,7 +448,7 @@
 	if (!whole) {
 		min_flt = task->min_flt;
 		maj_flt = task->maj_flt;
-		task_times(task, &utime, &stime);
+		task_times(task, &utime, &stime, &sttime);
 		gtime = task->gtime;
 	}
 
Index: git-linux-2.6/include/linux/kernel_stat.h
===================================================================
--- git-linux-2.6.orig/include/linux/kernel_stat.h	2010-09-23 14:14:03.000000000 +0200
+++ git-linux-2.6/include/linux/kernel_stat.h	2010-09-23 14:16:37.000000000 +0200
@@ -102,7 +102,7 @@
 
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
 extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
-extern void account_steal_time(cputime_t);
+extern void account_steal_time(struct task_struct *, cputime_t);
 extern void account_idle_time(cputime_t);
 
 extern void account_process_tick(struct task_struct *, int user);
Index: git-linux-2.6/include/linux/sched.h
===================================================================
--- git-linux-2.6.orig/include/linux/sched.h	2010-09-23 14:16:17.000000000 +0200
+++ git-linux-2.6/include/linux/sched.h	2010-09-23 14:16:37.000000000 +0200
@@ -467,6 +467,7 @@
 struct task_cputime {
 	cputime_t utime;
 	cputime_t stime;
+	cputime_t sttime;
 	unsigned long long sum_exec_runtime;
 };
 /* Alternate field names when used to cache expirations. */
@@ -478,6 +479,7 @@
 	(struct task_cputime) {					\
 		.utime = cputime_zero,				\
 		.stime = cputime_zero,				\
+		.sttime = cputime_zero,				\
 		.sum_exec_runtime = 0,				\
 	}
 
@@ -579,11 +581,11 @@
 	 * Live threads maintain their own counters and add to these
 	 * in __exit_signal, except for the group leader.
 	 */
-	cputime_t utime, stime, cutime, cstime;
+	cputime_t utime, stime, sttime, cutime, cstime, csttime;
 	cputime_t gtime;
 	cputime_t cgtime;
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
-	cputime_t prev_utime, prev_stime;
+	cputime_t prev_utime, prev_stime, prev_sttime;
 #endif
 	unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
 	unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
@@ -1274,10 +1276,10 @@
 	int __user *set_child_tid;		/* CLONE_CHILD_SETTID */
 	int __user *clear_child_tid;		/* CLONE_CHILD_CLEARTID */
 
-	cputime_t utime, stime, utimescaled, stimescaled;
+	cputime_t utime, stime, sttime, utimescaled, stimescaled;
 	cputime_t gtime;
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
-	cputime_t prev_utime, prev_stime;
+	cputime_t prev_utime, prev_stime, prev_sttime;
 #endif
 	unsigned long long acct_time;		/* Time for last accounting */
 	unsigned long nvcsw, nivcsw; /* context switch counts */
@@ -1677,8 +1679,8 @@
 		__put_task_struct(t);
 }
 
-extern void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st);
-extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st);
+extern void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st, cputime_t *stt);
+extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st, cputime_t *stt);
 
 /*
  * Per process flags
Index: git-linux-2.6/include/linux/taskstats.h
===================================================================
--- git-linux-2.6.orig/include/linux/taskstats.h	2010-09-23 14:16:26.000000000 +0200
+++ git-linux-2.6/include/linux/taskstats.h	2010-09-23 14:16:37.000000000 +0200
@@ -168,6 +168,7 @@
 	/* Timestamp where data has been collected in ns since boot time */
 	__u64	time_ns;
 	__u32	ac_tgid;		/* Thread group ID */
+	__u64	ac_sttime;		/* Steal CPU time [usec] */
 };
 
 
Index: git-linux-2.6/kernel/exit.c
===================================================================
--- git-linux-2.6.orig/kernel/exit.c	2010-09-23 14:14:03.000000000 +0200
+++ git-linux-2.6/kernel/exit.c	2010-09-23 14:16:37.000000000 +0200
@@ -115,6 +115,7 @@
 		 */
 		sig->utime = cputime_add(sig->utime, tsk->utime);
 		sig->stime = cputime_add(sig->stime, tsk->stime);
+		sig->sttime = cputime_add(sig->sttime, tsk->sttime);
 		sig->gtime = cputime_add(sig->gtime, tsk->gtime);
 		sig->min_flt += tsk->min_flt;
 		sig->maj_flt += tsk->maj_flt;
@@ -1217,7 +1218,7 @@
 		struct signal_struct *psig;
 		struct signal_struct *sig;
 		unsigned long maxrss;
-		cputime_t tgutime, tgstime;
+		cputime_t tgutime, tgstime, tgsttime;
 
 		/*
 		 * The resource counters for the group leader are in its
@@ -1238,7 +1239,7 @@
 		 * group, which consolidates times for all threads in the
 		 * group including the group leader.
 		 */
-		thread_group_times(p, &tgutime, &tgstime);
+		thread_group_times(p, &tgutime, &tgstime, &tgsttime);
 		spin_lock_irq(&p->real_parent->sighand->siglock);
 		psig = p->real_parent->signal;
 		sig = p->signal;
@@ -1250,6 +1251,10 @@
 			cputime_add(psig->cstime,
 			cputime_add(tgstime,
 				    sig->cstime));
+		psig->csttime =
+			cputime_add(psig->csttime,
+			cputime_add(tgsttime,
+				    sig->csttime));
 		psig->cgtime =
 			cputime_add(psig->cgtime,
 			cputime_add(p->gtime,
Index: git-linux-2.6/kernel/fork.c
===================================================================
--- git-linux-2.6.orig/kernel/fork.c	2010-09-23 14:14:03.000000000 +0200
+++ git-linux-2.6/kernel/fork.c	2010-09-23 14:16:37.000000000 +0200
@@ -1056,6 +1056,7 @@
 
 	p->utime = cputime_zero;
 	p->stime = cputime_zero;
+	p->sttime = cputime_zero;
 	p->gtime = cputime_zero;
 	p->utimescaled = cputime_zero;
 	p->stimescaled = cputime_zero;
Index: git-linux-2.6/kernel/posix-cpu-timers.c
===================================================================
--- git-linux-2.6.orig/kernel/posix-cpu-timers.c	2010-09-23 14:14:03.000000000 +0200
+++ git-linux-2.6/kernel/posix-cpu-timers.c	2010-09-23 14:17:15.000000000 +0200
@@ -237,6 +237,7 @@
 
 	times->utime = sig->utime;
 	times->stime = sig->stime;
+	times->sttime = sig->sttime;
 	times->sum_exec_runtime = sig->sum_sched_runtime;
 
 	rcu_read_lock();
@@ -248,6 +249,7 @@
 	do {
 		times->utime = cputime_add(times->utime, t->utime);
 		times->stime = cputime_add(times->stime, t->stime);
+		times->sttime = cputime_add(times->sttime, t->sttime);
 		times->sum_exec_runtime += t->se.sum_exec_runtime;
 	} while_each_thread(tsk, t);
 out:
@@ -1276,6 +1278,7 @@
 		struct task_cputime task_sample = {
 			.utime = tsk->utime,
 			.stime = tsk->stime,
+			.sttime = tsk->sttime,
 			.sum_exec_runtime = tsk->se.sum_exec_runtime
 		};
 
Index: git-linux-2.6/kernel/sched.c
===================================================================
--- git-linux-2.6.orig/kernel/sched.c	2010-09-23 14:16:17.000000000 +0200
+++ git-linux-2.6/kernel/sched.c	2010-09-23 14:16:37.000000000 +0200
@@ -3412,11 +3412,15 @@
  * Account for involuntary wait time.
  * @steal: the cpu time spent in involuntary wait
  */
-void account_steal_time(cputime_t cputime)
+void account_steal_time(struct task_struct *p, cputime_t cputime)
 {
 	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
 	cputime64_t cputime64 = cputime_to_cputime64(cputime);
 
+	/* Add steal time to process. */
+	p->sttime = cputime_add(p->sttime, cputime);
+
+	/* Add steal time to cpustat. */
 	cpustat->steal = cputime64_add(cpustat->steal, cputime64);
 }
 
@@ -3464,7 +3468,7 @@
  */
 void account_steal_ticks(unsigned long ticks)
 {
-	account_steal_time(jiffies_to_cputime(ticks));
+	account_steal_time(current, jiffies_to_cputime(ticks));
 }
 
 /*
@@ -3482,13 +3486,16 @@
  * Use precise platform statistics if available:
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING
-void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st,
+		cputime_t *stt)
 {
 	*ut = p->utime;
 	*st = p->stime;
+	*stt = p->sttime;
 }
 
-void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st,
+			cputime_t *stt)
 {
 	struct task_cputime cputime;
 
@@ -3496,6 +3503,7 @@
 
 	*ut = cputime.utime;
 	*st = cputime.stime;
+	*stt = cputime.sttime;
 }
 #else
 
@@ -3503,7 +3511,8 @@
 # define nsecs_to_cputime(__nsecs)	nsecs_to_jiffies(__nsecs)
 #endif
 
-void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st,
+		cputime_t *stt)
 {
 	cputime_t rtime, utime = p->utime, total = cputime_add(utime, p->stime);
 
@@ -3526,15 +3535,18 @@
 	 */
 	p->prev_utime = max(p->prev_utime, utime);
 	p->prev_stime = max(p->prev_stime, cputime_sub(rtime, p->prev_utime));
+	p->prev_sttime = cputime_zero;
 
 	*ut = p->prev_utime;
 	*st = p->prev_stime;
+	*stt = p->prev_sttime;
 }
 
 /*
  * Must be called with siglock held.
  */
-void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st,
+			cputime_t *stt)
 {
 	struct signal_struct *sig = p->signal;
 	struct task_cputime cputime;
@@ -3557,9 +3569,11 @@
 	sig->prev_utime = max(sig->prev_utime, utime);
 	sig->prev_stime = max(sig->prev_stime,
 			      cputime_sub(rtime, sig->prev_utime));
+	sig->prev_sttime = cputime_zero;
 
 	*ut = sig->prev_utime;
 	*st = sig->prev_stime;
+	*stt = sig->prev_sttime;
 }
 #endif
 
Index: git-linux-2.6/kernel/sys.c
===================================================================
--- git-linux-2.6.orig/kernel/sys.c	2010-09-23 14:14:03.000000000 +0200
+++ git-linux-2.6/kernel/sys.c	2010-09-23 14:16:37.000000000 +0200
@@ -880,10 +880,10 @@
 
 void do_sys_times(struct tms *tms)
 {
-	cputime_t tgutime, tgstime, cutime, cstime;
+	cputime_t tgutime, tgstime, tgsttime, cutime, cstime;
 
 	spin_lock_irq(&current->sighand->siglock);
-	thread_group_times(current, &tgutime, &tgstime);
+	thread_group_times(current, &tgutime, &tgstime, &tgsttime);
 	cutime = current->signal->cutime;
 	cstime = current->signal->cstime;
 	spin_unlock_irq(&current->sighand->siglock);
@@ -1488,14 +1488,14 @@
 {
 	struct task_struct *t;
 	unsigned long flags;
-	cputime_t tgutime, tgstime, utime, stime;
+	cputime_t tgutime, tgstime, tgsttime, utime, stime, sttime;
 	unsigned long maxrss = 0;
 
 	memset((char *) r, 0, sizeof *r);
 	utime = stime = cputime_zero;
 
 	if (who == RUSAGE_THREAD) {
-		task_times(current, &utime, &stime);
+		task_times(current, &utime, &stime, &sttime);
 		accumulate_thread_rusage(p, r);
 		maxrss = p->signal->maxrss;
 		goto out;
@@ -1521,7 +1521,7 @@
 				break;
 
 		case RUSAGE_SELF:
-			thread_group_times(p, &tgutime, &tgstime);
+			thread_group_times(p, &tgutime, &tgstime, &tgsttime);
 			utime = cputime_add(utime, tgutime);
 			stime = cputime_add(stime, tgstime);
 			r->ru_nvcsw += p->signal->nvcsw;
Index: git-linux-2.6/kernel/tsacct.c
===================================================================
--- git-linux-2.6.orig/kernel/tsacct.c	2010-09-23 14:16:26.000000000 +0200
+++ git-linux-2.6/kernel/tsacct.c	2010-09-23 14:16:37.000000000 +0200
@@ -66,6 +66,7 @@
 	rcu_read_unlock();
 	stats->ac_utime = cputime_to_usecs(tsk->utime);
 	stats->ac_stime = cputime_to_usecs(tsk->stime);
+	stats->ac_sttime = cputime_to_usecs(tsk->sttime);
 	stats->ac_utimescaled = cputime_to_usecs(tsk->utimescaled);
 	stats->ac_stimescaled = cputime_to_usecs(tsk->stimescaled);
 	stats->ac_minflt = tsk->min_flt;



^ permalink raw reply	[flat|nested] 58+ messages in thread

* [RFC][PATCH 08/10] taskstats: Add cumulative CPU time (user, system and steal)
  2010-09-23 13:48 [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting Michael Holzheu
                   ` (6 preceding siblings ...)
  2010-09-23 14:01 ` [RFC][PATCH 07/10] taskstats: Add per task steal time accounting Michael Holzheu
@ 2010-09-23 14:02 ` Michael Holzheu
  2010-09-23 14:02 ` [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting Michael Holzheu
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 58+ messages in thread
From: Michael Holzheu @ 2010-09-23 14:02 UTC (permalink / raw)
  To: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Peter Zijlstra, Suresh Siddha, John stultz, Thomas Gleixner,
	Oleg Nesterov, Balbir Singh, Ingo Molnar, Heiko Carstens,
	Martin Schwidefsky
  Cc: linux-s390, linux-kernel

Subject: [PATCH] taskstats: Add cumulative CPU time (user, system and steal)

From: Michael Holzheu <holzheu@linux.vnet.ibm.com>

Add the cumulative CPU time of dead children (user, system and steal) to
taskstats. This allows tools like top to account for 100% of the consumed
CPU time in each sample interval.

The following algorithm can be used:

* Collect snapshot 1 of all running tasks
* Wait interval
* Collect snapshot 2 of all running tasks

All consumed CPU time in the interval can be calculated as follows:

  snapshot 2 minus snapshot 1 of
  utime, stime, sttime, cutime, cstime and csttime CPU time counters
  of all tasks that are in snapshot 2

  minus

  utime, stime, sttime, cutime, cstime and csttime CPU time counters of all
  tasks that are in snapshot 1, but not in snapshot 2 (tasks that have
  exited)

To provide a consistent view, the top tool could show the following fields:
 * user:  task utime per interval
 * sys:   task stime per interval
 * ste:   task sttime per interval
 * cuser: utime of exited children per interval
 * csys:  stime of exited children per interval
 * cste:  sttime of exited children per interval
 * total: Sum of all above fields

If the top command notices that a PID disappeared between snapshot 1
and snapshot 2, it has to find the task's parent and subtract the dead
child's snapshot 1 CPU times from the parent's cumulative times.

Example:
--------
pid     user   sys  ste  cuser  csys cste  total  Name
(#)      (%)   (%)  (%)    (%)   (%)  (%)    (%)  (str)
17944   0.10  0.01 0.00  54.29 14.36 0.22  68.98  make
18006   0.10  0.01 0.00  55.79 12.23 0.12  68.26  make
18041  48.18  1.51 0.29   0.00  0.00 0.00  49.98  cc1
...

The sum of all "total" CPU counters on a system that is 100% busy should
be exactly the number CPUs multiplied by the interval time. A good tescase
for this is to start a loop program for each CPU and then in parallel
starting a kernel build with "-j 5".
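
As an illustration (not part of the patch), a minimal user space sketch of
this per-interval calculation could look as follows. The structure and helper
names are made up for the example; only the set of counters (utime, stime,
sttime, cutime, cstime, csttime) comes from this patch set:

  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical snapshot entry: PID plus the per-task counters in usec. */
  struct snap_entry {
          uint32_t pid;
          uint64_t utime, stime, sttime;    /* own CPU time */
          uint64_t cutime, cstime, csttime; /* time of dead children */
  };

  /* Sum of all CPU time counters of one entry. */
  static uint64_t entry_sum(const struct snap_entry *e)
  {
          return e->utime + e->stime + e->sttime +
                 e->cutime + e->cstime + e->csttime;
  }

  /* Return the entry with the given PID or NULL (linear search). */
  static const struct snap_entry *find_pid(const struct snap_entry *s,
                                           size_t cnt, uint32_t pid)
  {
          size_t i;

          for (i = 0; i < cnt; i++)
                  if (s[i].pid == pid)
                          return &s[i];
          return NULL;
  }

  /*
   * CPU time (usec) consumed between snapshot 1 and snapshot 2:
   * counters of all tasks in snapshot 2 minus their snapshot 1 values,
   * minus the snapshot 1 counters of tasks that exited in between
   * (their time is already included in the parents' cumulative counters).
   */
  static uint64_t interval_cpu_time(const struct snap_entry *s1, size_t n1,
                                    const struct snap_entry *s2, size_t n2)
  {
          uint64_t total = 0;
          size_t i;

          for (i = 0; i < n2; i++) {
                  const struct snap_entry *old = find_pid(s1, n1, s2[i].pid);

                  total += entry_sum(&s2[i]) - (old ? entry_sum(old) : 0);
          }
          for (i = 0; i < n1; i++)
                  if (!find_pid(s2, n2, s1[i].pid))
                          total -= entry_sum(&s1[i]);
          return total;
  }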

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
 include/linux/taskstats.h |    3 +++
 kernel/tsacct.c           |   11 ++++++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

--- a/include/linux/taskstats.h
+++ b/include/linux/taskstats.h
@@ -169,6 +169,9 @@ struct taskstats {
 	__u64	time_ns;
 	__u32	ac_tgid;		/* Thread group ID */
 	__u64	ac_sttime;		/* Steal CPU time [usec] */
+	__u64	ac_cutime;		/* User CPU time of childs [usec] */
+	__u64	ac_cstime;		/* System CPU time of childs [usec] */
+	__u64	ac_csttime;		/* Steal CPU time of childs [usec] */
 };
 
 
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -63,6 +63,15 @@ void bacct_add_tsk(struct taskstats *sta
 	stats->ac_gid	 = tcred->gid;
 	stats->ac_ppid	 = pid_alive(tsk) ?
 				rcu_dereference(tsk->real_parent)->tgid : 0;
+	if (tsk->signal) {
+		stats->ac_cutime = cputime_to_usecs(tsk->signal->cutime);
+		stats->ac_cstime = cputime_to_usecs(tsk->signal->cstime);
+		stats->ac_csttime = cputime_to_usecs(tsk->signal->csttime);
+	} else {
+		stats->ac_cutime = 0;
+		stats->ac_cstime = 0;
+		stats->ac_csttime = 0;
+	}
 	rcu_read_unlock();
 	stats->ac_utime = cputime_to_usecs(tsk->utime);
 	stats->ac_stime = cputime_to_usecs(tsk->stime);



^ permalink raw reply	[flat|nested] 58+ messages in thread

* [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-09-23 13:48 [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting Michael Holzheu
                   ` (7 preceding siblings ...)
  2010-09-23 14:02 ` [RFC][PATCH 08/10] taskstats: Add cumulative CPU time (user, system and steal) Michael Holzheu
@ 2010-09-23 14:02 ` Michael Holzheu
  2010-09-23 17:10   ` Oleg Nesterov
  2010-09-28  8:21   ` Balbir Singh
  2010-09-23 14:04 ` [RFC][PATCH 10/10] taskstats: User space with ptop tool Michael Holzheu
                   ` (3 subsequent siblings)
  12 siblings, 2 replies; 58+ messages in thread
From: Michael Holzheu @ 2010-09-23 14:02 UTC (permalink / raw)
  To: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Peter Zijlstra, Suresh Siddha, John stultz, Thomas Gleixner,
	Oleg Nesterov, Balbir Singh, Ingo Molnar, Heiko Carstens,
	Martin Schwidefsky
  Cc: linux-s390, linux-kernel

Subject: [PATCH] taskstats: Fix exit CPU time accounting

From: Michael Holzheu <holzheu@linux.vnet.ibm.com>

Currently there are code paths (e.g. for kthreads) where the consumed
CPU time is not accounted to the parent's cumulative counters.
With this patch, the CPU time is accounted to the parent if the exit
accounting has not already been done.

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
 include/linux/sched.h |    1 +
 kernel/exit.c         |   35 +++++++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+)

Index: git-linux-2.6/include/linux/sched.h
===================================================================
--- git-linux-2.6.orig/include/linux/sched.h	2010-09-23 14:16:37.000000000 +0200
+++ git-linux-2.6/include/linux/sched.h	2010-09-23 14:17:20.000000000 +0200
@@ -1282,6 +1282,7 @@
 	cputime_t prev_utime, prev_stime, prev_sttime;
 #endif
 	unsigned long long acct_time;		/* Time for last accounting */
+	int exit_accounting_done;
 	unsigned long nvcsw, nivcsw; /* context switch counts */
 	struct timespec start_time; 		/* monotonic time */
 	struct timespec real_start_time;	/* boot based time */
Index: git-linux-2.6/kernel/exit.c
===================================================================
--- git-linux-2.6.orig/kernel/exit.c	2010-09-23 14:16:37.000000000 +0200
+++ git-linux-2.6/kernel/exit.c	2010-09-23 14:17:20.000000000 +0200
@@ -157,11 +157,45 @@
 	put_task_struct(tsk);
 }
 
+static void account_to_parent(struct task_struct *p)
+{
+	struct signal_struct *psig, *sig;
+	struct task_struct *tsk_parent;
+
+	read_lock(&tasklist_lock);
+	tsk_parent = p->real_parent;
+	if (!tsk_parent) {
+		read_unlock(&tasklist_lock);
+		return;
+	}
+	get_task_struct(tsk_parent);
+	read_unlock(&tasklist_lock);
+
+	// printk("XXX Fix accounting: pid=%d ppid=%d\n", p->pid, tsk_parent->pid);
+	spin_lock_irq(&tsk_parent->sighand->siglock);
+	psig = tsk_parent->signal;
+	sig = p->signal;
+	psig->cutime = cputime_add(psig->cutime,
+				   cputime_add(sig->cutime, p->utime));
+	psig->cstime = cputime_add(psig->cstime,
+				   cputime_add(sig->cstime, p->stime));
+	psig->csttime = cputime_add(psig->csttime,
+				    cputime_add(sig->csttime, p->sttime));
+	psig->cgtime = cputime_add(psig->cgtime,
+		       cputime_add(p->gtime,
+		       cputime_add(sig->gtime, sig->cgtime)));
+	p->exit_accounting_done = 1;
+	spin_unlock_irq(&tsk_parent->sighand->siglock);
+	put_task_struct(tsk_parent);
+}
 
 void release_task(struct task_struct * p)
 {
 	struct task_struct *leader;
 	int zap_leader;
+
+	if (!p->exit_accounting_done)
+		account_to_parent(p);
 repeat:
 	tracehook_prepare_release_task(p);
 	/* don't need to get the RCU readlock here - the process is dead and
@@ -1279,6 +1313,7 @@
 			psig->cmaxrss = maxrss;
 		task_io_accounting_add(&psig->ioac, &p->ioac);
 		task_io_accounting_add(&psig->ioac, &sig->ioac);
+		p->exit_accounting_done = 1;
 		spin_unlock_irq(&p->real_parent->sighand->siglock);
 	}
 



^ permalink raw reply	[flat|nested] 58+ messages in thread

* [RFC][PATCH 10/10] taskstats: User space with ptop tool
  2010-09-23 13:48 [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting Michael Holzheu
                   ` (8 preceding siblings ...)
  2010-09-23 14:02 ` [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting Michael Holzheu
@ 2010-09-23 14:04 ` Michael Holzheu
  2010-09-23 20:11 ` [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting Andrew Morton
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 58+ messages in thread
From: Michael Holzheu @ 2010-09-23 14:04 UTC (permalink / raw)
  To: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Peter Zijlstra, Suresh Siddha, John stultz, Thomas Gleixner,
	Oleg Nesterov, Balbir Singh, Ingo Molnar, Heiko Carstens,
	Martin Schwidefsky
  Cc: linux-s390, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 5269 bytes --]

Taskstats user space

The attached tarball "s390-tools-taskstats.tar.bz2" contains user space
code that exploits the taskstats-top kernel patches. This is early
code and a lot of work probably still has to be done here. The
code should build and work on all architectures, not only on s390.

libtaskstats user space library
-------------------------------
include/libtaskstats.h  API definition
libtaskstats_nl         API implementation based on libnl 1.1
libtaskstats_proc       Partial API implementation using the new
                        /proc/taskstats

libtaskstats snapshot user space library
----------------------------------------
include/libtaskstats_snap.h       API definition
libtaskstats_snap/snap_netlink.c  API implementation based on
                                  libtaskstats

Snapshot library test program
-----------------------------
ts_snap_test/ts_snap_test.c   Simple program that uses snapshot library

Precise top user space program (ptop)
-------------------------------------
ptop/dg_libtaskstats.c  Data gatherer using the taskstats interface.
                        To enable steal time calculation for non-s390
                        architectures, modify l_calc_sttime_old() and
                        replace "#if 0" with "#if 1".
ptop/sd_core.c          Code for ctime accounting

HOWTO build:
============
1. Install libnl-1.1-5 and libnl-1.1-5-devel
  If this is not possible, you can still build the proc/taskstats based
  code:
  * Remove libtaskstats_nl from the top level Makefile
  * Remove ptop_old_nl, ptop_new_nl and ptop_snap_nl from the
    "ptop" Makefile
2. Build s390-tools:
  # tar xfv s390-tools.tar.bz2
  # cd s390-tools
  # make

HOWTO use ptop:
===============
Five versions of ptop are built in the ptop subdirectory:

* ptop_old_nl:    ptop using the old TASKSTATS_CMD_ATTR_PID netlink
                  command together with reading procfs to find
                  running tasks
* ptop_new_nl:    ptop using the new TASKSTATS_CMD_ATTR_PIDS
                  netlink command.
                  This tool only shows tasks that consumed CPU time
                  in the last interval.
* ptop_new_proc:  ptop using the new TASKSTATS_CMD_ATTR_PIDS ioctl on
                  /proc/taskstats.
                  This tool only shows tasks that consumed CPU time
                  in the last interval.
* ptop_snap_nl:   ptop using the snapshot library with underlying
                  netlink taskstats library
* ptop_snap_proc: ptop using the snapshot library with underlying
                  taskstats library that uses /proc/taskstats

First results (on s390):
========================

TEST1: System with many sleeping tasks
--------------------------------------

  for ((i=0; i < 1000; i++))
  do
         sleep 1000000 &
  done

             VVVV
  pid   user  sys  ste  total  Name
  (#)    (%)  (%)  (%)    (%)  (str)
  541   0.37 2.39 0.10   2.87  top
  3645  2.13 1.12 0.14   3.39  ptop_old_nl
  3591  2.20 0.59 0.12   2.92  ptop_snap_nl
  3694  2.16 0.26 0.10   2.51  ptop_snap_proc
  3792  0.03 0.06 0.00   0.09  ptop_new_nl
  3743  0.03 0.05 0.00   0.07  ptop_new_proc
             ^^^^

The ptop user space code is not optimized for a large number of tasks,
therefore we should concentrate on the system (sys) time. The update
interval is 2 seconds for all top programs.

* Old top command:
  Because top has to read about 1000 procfs directories, system time
  is very high (2.39%).

* ptop_new_xxx:
  Because only active tasks are transferred, the CPU consumption is very
  low (0.05-0.06% system time).

* ptop_snap_nl/ptop_old_nl:
  The new netlink TASKSTATS_CMD_ATTR_PIDS command only consumes about
  50% of the CPU time (0.59%) compared to using multiple
  TASKSTATS_CMD_ATTR_PID commands (ptop_old_nl / 1.12%) and scanning
  procfs to find the running tasks.

* ptop_snap_proc/ptop_snap_nl:
  Using the proc/taskstats interface (0.26%) consumes much less system
  time than the netlink interface (0.59%).

TEST2: Show snapshot consistency with system that is 100% busy
--------------------------------------------------------------

  System with 3 CPUs:

  for ((i=0; i < $(cat /proc/cpuinfo | grep "^processor" | wc -l); i++))
  do
       ./loop &
  done

  cd linux-2.6.35
  make -j 5

  # ptop_snap_proc
                                            VVVVV
   pid     user  sys  ste cuser  csys cste  total Elap+ Name
   (#)      (%)  (%)  (%)   (%)   (%)  (%)    (%)  (hm) (str)
   8374   75.48 0.41 1.34  0.00  0.00 0.00  77.24  0:01 loop
   8377   73.97 0.27 1.06  0.00  0.00 0.00  75.31  0:01 loop
   8371   70.61 0.38 1.38  0.00  0.00 0.00  72.38  0:01 loop
   10093   0.17 0.30 0.00 25.90 38.19 0.52  65.07  0:00 make
   10548   0.15 0.12 0.00  1.75  4.21 0.06   6.29  0:00 make
   ...
   V:V:S 220.84 2.84 3.86 28.14 43.71 0.60 300.00  0:16
                                           ^^^^^^

  With the snapshot mechanism the sum of all tasks' CPU times (user +
  system + steal + cuser + csystem + csteal) will be exactly 300.00% CPU
  time with this testcase. Using ptop_snap_proc this works fine on
  s390. Unfortunately, on x86 the numbers are not as good as on s390.



[-- Attachment #2: s390-tools-taskstats.tar.bz2 --]
[-- Type: application/x-bzip-compressed-tar, Size: 43162 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-09-23 14:02 ` [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting Michael Holzheu
@ 2010-09-23 17:10   ` Oleg Nesterov
  2010-09-24 12:18     ` Michael Holzheu
  2010-09-28  8:21   ` Balbir Singh
  1 sibling, 1 reply; 58+ messages in thread
From: Oleg Nesterov @ 2010-09-23 17:10 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Peter Zijlstra, Suresh Siddha, John stultz, Thomas Gleixner,
	Balbir Singh, Ingo Molnar, Heiko Carstens, Martin Schwidefsky,
	linux-s390, linux-kernel

Sorry, I didn't look at other patches, but this one looks strange
to me...

On 09/23, Michael Holzheu wrote:
>
> Currently there are code pathes (e.g. for kthreads) where the consumed
> CPU time is not accounted to the parents cumulative counters.

Could you explain more?

> +static void account_to_parent(struct task_struct *p)
> +{
> +	struct signal_struct *psig, *sig;
> +	struct task_struct *tsk_parent;
> +
> +	read_lock(&tasklist_lock);

No need to take tasklist, you can use rcu_read_lock() if you need
get_task_struct(). But this can't help, please see below.

> +	tsk_parent = p->real_parent;
> +	if (!tsk_parent) {
> +		read_unlock(&tasklist_lock);
> +		return;
> +	}
> +	get_task_struct(tsk_parent);
> +	read_unlock(&tasklist_lock);
> +
> +	// printk("XXX Fix accounting: pid=%d ppid=%d\n", p->pid, tsk_parent->pid);
> +	spin_lock_irq(&tsk_parent->sighand->siglock);

This is racy. ->real_parent can exit after we drop tasklist_lock,
->sighand can be NULL.

>  void release_task(struct task_struct * p)
>  {
>  	struct task_struct *leader;
>  	int zap_leader;
> +
> +	if (!p->exit_accounting_done)
> +		account_to_parent(p);
>  repeat:
>  	tracehook_prepare_release_task(p);
>  	/* don't need to get the RCU readlock here - the process is dead and
> @@ -1279,6 +1313,7 @@
>  			psig->cmaxrss = maxrss;
>  		task_io_accounting_add(&psig->ioac, &p->ioac);
>  		task_io_accounting_add(&psig->ioac, &sig->ioac);
> +		p->exit_accounting_done = 1;

Can't understand.

Suppose that a thread T exits and reaps itself (calls release_task).
Now we call account_to_parent() which accounts T->signal->XXX + T->XXX.
After that T calls __exit_signal and does T->signal->XXX += T->XXX.

If another thread exits it does the same and we account the already
exited thread T again?

When the last thread exits, wait_task_zombie() accounts T->signal
once again.

IOW, this looks like the over-accounting to me, no?

Oleg.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 03/10] taskstats: Split fill_pid function
  2010-09-23 14:01 ` [RFC][PATCH 03/10] taskstats: Split fill_pid function Michael Holzheu
@ 2010-09-23 17:33   ` Oleg Nesterov
  2010-09-27  9:33   ` Balbir Singh
  2010-10-11  8:31   ` Balbir Singh
  2 siblings, 0 replies; 58+ messages in thread
From: Oleg Nesterov @ 2010-09-23 17:33 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Peter Zijlstra, Suresh Siddha, John stultz, Thomas Gleixner,
	Balbir Singh, Ingo Molnar, Heiko Carstens, Martin Schwidefsky,
	linux-s390, linux-kernel

On 09/23, Michael Holzheu wrote:
>
> Subject: [PATCH] taskstats: Split fill_pid function
>
> From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
>
> Separate the finding of a task_struct by pid or tgid from filling the taskstats
> data. This makes the code more readable.

I think this is nice cleanup.

Oleg.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
  2010-09-23 13:48 [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting Michael Holzheu
                   ` (9 preceding siblings ...)
  2010-09-23 14:04 ` [RFC][PATCH 10/10] taskstats: User space with ptop tool Michael Holzheu
@ 2010-09-23 20:11 ` Andrew Morton
  2010-09-23 22:11   ` Matt Helsley
  2010-09-24  9:10   ` Michael Holzheu
  2010-09-24  9:16 ` Balbir Singh
  2010-09-30  8:38 ` Andi Kleen
  12 siblings, 2 replies; 58+ messages in thread
From: Andrew Morton @ 2010-09-23 20:11 UTC (permalink / raw)
  To: holzheu
  Cc: Shailabh Nagar, Venkatesh Pallipadi, Suresh Siddha,
	Peter Zijlstra, Ingo Molnar, Oleg Nesterov, John stultz,
	Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, linux-kernel, linux-s390, containers

On Thu, 23 Sep 2010 15:48:01 +0200
Michael Holzheu <holzheu@linux.vnet.ibm.com> wrote:

> Currently tools like "top" gather the task information by reading procfs
> files. This has several disadvantages:
> 
> * It is very CPU intensive, because a lot of system calls (readdir, open,
>   read, close) are necessary.
> * No real task snapshot can be provided, because while the procfs files are
>   read the system continues running.
> * The procfs times granularity is restricted to jiffies.
> 
> In parallel to procfs there exists the taskstats binary interface that uses
> netlink sockets as transport mechanism to deliver task information to
> user space. There exists a taskstats command "TASKSTATS_CMD_ATTR_PID"
> to get task information for a given PID. This command can already be used for
> tools like top, but has also several disadvantages:
> 
> * You first have to find out which PIDs are available in the system. Currently
>   we have to use procfs again to do this.
> * For each task two system calls have to be issued (First send the command and
>   then receive the reply).
> * No snapshot mechanism is available.
> 
> GOALS OF THIS PATCH SET
> -----------------------
> The intention of this patch set is to provide better support for tools like
> top. The goal is to:
> 
> * provide a task snapshot mechanism where we can get a consistent view of
>   all running tasks.
> * provide a transport mechanism that does not require a lot of system calls
>   and that allows implementing low CPU overhead task monitoring.
> * provide microsecond CPU time granularity.

This is a big change!  If this is done right then we're heading in the
direction of deprecating the longstanding way in which userspace
observes the state of Linux processes and we're recommending that the
whole world migrate to taskstats.  I think?

If so, much chin-scratching will be needed, coordination with
util-linux people, etc.

We'd need to think about the implications of taskstats versioning.  It
_is_ a versioned interface, so people can't just go and toss random new
stuff in there at will - it's not like adding a new procfs file, or
adding a new line to an existing one.  I don't know if that's likely to
be a significant problem.

I worry that there's a dependency on CONFIG_NET?  If so then that's a
big problem because in N years time, 99% of the world will be using
taskstats, but a few embedded losers will be stuck using (and having to
support) the old tools.


> FIRST RESULTS
> -------------
> Together with this kernel patch set also user space code for a new top
> utility (ptop) is provided that exploits the new kernel infrastructure. See
> patch 10 for more details.
> 
> TEST1: System with many sleeping tasks
> 
>   for ((i=0; i < 1000; i++))
>   do
>          sleep 1000000 &
>   done
> 
>   # ptop_new_proc
> 
>              VVVV
>   pid   user  sys  ste  total  Name
>   (#)    (%)  (%)  (%)    (%)  (str)
>   541   0.37 2.39 0.10   2.87  top
>   3743  0.03 0.05 0.00   0.07  ptop_new_proc
>              ^^^^
> 
> Compared to the old top command that has to scan more than 1000 proc
> directories the new ptop consumes much less CPU time (0.05% system time
> on my s390 system).

How many CPUs does that system have?

What's the `top' update period?  One second?

So we're saying that a `top -d 1' consumes 2.4% of this
mystery-number-of-CPUs machine?  That's quite a lot.

> PATCHSET OVERVIEW
> -----------------
> The code is not final and still has a few TODOs. But it is good enough for a
> first round of review. The following kernel patches are provided:
> 
> [01] Prepare-0: Use real microsecond granularity for taskstats CPU times.
> [02] Prepare-1: Restructure taskstats.c in order to be able to add new commands
>      more easily.
> [03] Prepare-2: Separate the finding of a task_struct by PID or TGID from
>      filling the taskstats.
> [04] Add new command "TASKSTATS_CMD_ATTR_PIDS" to get a snapshot of multiple
>      tasks.
> [05] Add procfs interface for taskstats commands. This allows to get a complete
>      and consistent snapshot with all tasks using two system calls (ioctl and
>      read). Transferring a snapshot of all running tasks is not possible using
>      the existing netlink interface, because there we have the socket buffer
>      size as restricting factor.

So this is a binary interface which uses an ioctl.  People don't like
ioctls.  Could we have triggered it with a write() instead?

Does this have the potential to save us from the CONFIG_NET=n problem?

> [06] Add TGID to taskstats.
> [07] Add steal time per task accounting.
> [08] Add cumulative CPU time (user, system and steal) to taskstats.

These didn't update the taskstats version number.  Should they have?

> [09] Fix exit CPU time accounting.
> 
> [10] Besides of the kernel patches also user space code is provided that
>      exploits the new kernel infrastructure. The user space code provides the
>      following:
>      1. A proposal for a taskstats user space library:
>         1.1 Based on netlink (requires libnl-devel-1.1-5)
>         2.1 Based on the new /proc/taskstats interface (see [05])
>      2. A proposal for a task snapshot library based on taskstats library (1.1)

ooh, excellent.  A standardised userspace access library.

>      3. A new tool "ptop" (precise top) that uses the libraries

Talk to me about namespaces, please.  A lot of the new code involves
PIDs, but PIDs are not system-wide unique.  A PID is relative to a PID
namespace.  Does everything Just Work?  When userspace sends a PID to
the kernel, that PID is assumed to be within the sending process's PID
namespace?  If so, then please spell it all out in the changelogs.  If
not then that is a problem!

If I can only observe processes in my PID namespace then is that a
problem?  Should I be allowed to observe another PID namespace's
processes?  I assume so, because I might be root.  If so, how is that
to be done?


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
  2010-09-23 20:11 ` [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting Andrew Morton
@ 2010-09-23 22:11   ` Matt Helsley
  2010-09-24 12:39     ` Michael Holzheu
  2010-09-25 18:19     ` Serge E. Hallyn
  2010-09-24  9:10   ` Michael Holzheu
  1 sibling, 2 replies; 58+ messages in thread
From: Matt Helsley @ 2010-09-23 22:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: holzheu, Shailabh Nagar, Peter Zijlstra, Heiko Carstens,
	Venkatesh Pallipadi, John stultz, containers, linux-s390,
	Balbir Singh, Oleg Nesterov, linux-kernel, Martin Schwidefsky,
	Ingo Molnar, Thomas Gleixner, Suresh Siddha

On Thu, Sep 23, 2010 at 01:11:36PM -0700, Andrew Morton wrote:
> On Thu, 23 Sep 2010 15:48:01 +0200
> Michael Holzheu <holzheu@linux.vnet.ibm.com> wrote:
> 
> > Currently tools like "top" gather the task information by reading procfs
> > files. This has several disadvantages:
> > 

<snip>

> >      3. A new tool "ptop" (precise top) that uses the libraries
> 
> Talk to me about namespaces, please.  A lot of the new code involves
> PIDs, but PIDs are not system-wide unique.  A PID is relative to a PID
> namespace.  Does everything Just Work?  When userspace sends a PID to
> the kernel, that PID is assumed to be within the sending process's PID
> namespace?  If so, then please spell it all out in the changelogs.  If
> not then that is a problem!

Good point.

The pid ought to be valid in the _receiving_ task's pid namespace. That
can be difficult or impossible if we're talking about netlink broadcasts.
In this regard process events connector is an example of what not to do.

> If I can only observe processes in my PID namespace then is that a
> problem?  Should I be allowed to observe another PID namespace's
> processes?  I assume so, because I might be root.  If so, how is that
> to be done?

I don't think even "root" can see/use pids outside its namespace (without
Eric's setns patches). If you want to see all the tasks then rely on root
being able to do stuff in the initial pid namespace. If you really want
to use/know pids in the child pid namespaces then setns is also a
nice solution.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
  2010-09-23 20:11 ` [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting Andrew Morton
  2010-09-23 22:11   ` Matt Helsley
@ 2010-09-24  9:10   ` Michael Holzheu
  2010-09-24 18:50     ` Andrew Morton
  2010-09-27 10:49     ` Balbir Singh
  1 sibling, 2 replies; 58+ messages in thread
From: Michael Holzheu @ 2010-09-24  9:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Shailabh Nagar, Venkatesh Pallipadi, Suresh Siddha,
	Peter Zijlstra, Ingo Molnar, Oleg Nesterov, John stultz,
	Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, linux-kernel, linux-s390, containers

Hello Andrew,

On Thu, 2010-09-23 at 13:11 -0700, Andrew Morton wrote:
> > GOALS OF THIS PATCH SET
> > -----------------------
> > The intention of this patch set is to provide better support for tools like
> > top. The goal is to:
> > 
> > * provide a task snapshot mechanism where we can get a consistent view of
> >   all running tasks.
> > * provide a transport mechanism that does not require a lot of system calls
> >   and that allows implementing low CPU overhead task monitoring.
> > * provide microsecond CPU time granularity.
> 
> This is a big change!  If this is done right then we're heading in the
> direction of deprecating the longstanding way in which userspace
> observes the state of Linux processes and we're recommending that the
> whole world migrate to taskstats.  I think?

Or it can be used as an alternative. Since procfs has its drawbacks (e.g.
performance), an alternative could be helpful.

And the taskstats interface with the TASKSTATS_CMD_ATTR_PID command
already exists and can be used. So we already have a second mechanism to
query task accounting information besides procfs.

> 
> If so, much chin-scratching will be needed, coordination with
> util-linux people, etc.

I agree.

> We'd need to think about the implications of taskstats versioning.  It
> _is_ a versioned interface, so people can't just go and toss random new
> stuff in there at will - it's not like adding a new procfs file, or
> adding a new line to an existing one.  I don't know if that's likely to
> be a significant problem.

I already thought about that problem. Another problem is that, depending
on the kernel config options, some taskstats fields may not be
initialized, e.g. with CONFIG_TASK_DELAY_ACCT or CONFIG_TASK_XACCT
disabled. Currently there is no good interface for user space to query
which fields are valid.

Regarding the taskstats versions, I described a possible solution in the
userspace tarball in the README.libtaskstats file:

The "struct taskstats" structure contains accounting information for one
Linux task. This structure is defined in "/usr/include/linux/taskstats.h".
With new kernel versions new fields can be added to that structure.
In that case the kernel taskstats version number defined with the macro
TASKSTATS_VERSION will be increased.

The taskstats library distinguishes between two taskstats versions:
* Kernel taskstats version (KV)
* Program compile taskstats version (CV)

Depending on the taskstats version CV that is used for compiling the program,
these version numbers can differ:
* KV > CV:
  The libtaskstats library only copies the fields defined up to version CV;
  fields that belong to versions > CV are ignored.
* KV < CV:
  The libtaskstats library only copies the fields defined up to version KV;
  fields that belong to versions > KV remain uninitialized.
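
A minimal sketch of this copy rule (not the actual library code) could look
like this:

  #include <string.h>
  #include <linux/taskstats.h>

  /*
   * Copy only the common prefix of the structure, i.e. the fields known
   * to both the kernel (KV) and the program (CV). If KV < CV, the fields
   * beyond version KV remain uninitialized in the caller's structure.
   */
  static void ts_copy(struct taskstats *dst, const void *kernel_data,
                      size_t kernel_len)
  {
          size_t len = kernel_len < sizeof(*dst) ? kernel_len : sizeof(*dst);

          memcpy(dst, kernel_data, len);
  }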

If a program wants to support multiple taskstats versions, this can be done
by using the ts_version() function and processing the fields according to the
returned version number.

Example:

  if (ts_version() < 7) {
         fprintf(stderr, "Error: kernel taskstats version too low\n");
         exit(1);
  }
  if (ts_version() >= 7)
         print_attrs_v7();
  if (ts_version() >= 8)
         print_attrs_v8();

In this example the program has to be compiled with a taskstats.h header file
that has at least version 8.

> I worry that there's a dependency on CONFIG_NET?  If so then that's a
> big problem because in N years time, 99% of the world will be using
> taskstats, but a few embedded losers will be stuck using (and having to
> support) the old tools.

Sure, but if we could add the /proc/taskstats approach, this dependency
would not be there.

> 
> > FIRST RESULTS
> > -------------
> > Together with this kernel patch set also user space code for a new top
> > utility (ptop) is provided that exploits the new kernel infrastructure. See
> > patch 10 for more details.
> > 
> > TEST1: System with many sleeping tasks
> > 
> >   for ((i=0; i < 1000; i++))
> >   do
> >          sleep 1000000 &
> >   done
> > 
> >   # ptop_new_proc
> > 
> >              VVVV
> >   pid   user  sys  ste  total  Name
> >   (#)    (%)  (%)  (%)    (%)  (str)
> >   541   0.37 2.39 0.10   2.87  top
> >   3743  0.03 0.05 0.00   0.07  ptop_new_proc
> >              ^^^^
> > 
> > Compared to the old top command that has to scan more than 1000 proc
> > directories the new ptop consumes much less CPU time (0.05% system time
> > on my s390 system).
> 
> How many CPUs does that system have?

The system is a virtual machine and has three CPUs.

> What's the `top' update period?  One second?

The update period is two seconds.

> So we're saying that a `top -d 1' consumes 2.4% of this
> mystery-number-of-CPUs machine?  That's quite a lot.

When I run that testcase on my laptop, 2 CPUs (Intel Core 2 - 2.33GHz),
I get about 1-2% system time for top.

> > PATCHSET OVERVIEW
> > -----------------
> > The code is not final and still has a few TODOs. But it is good enough for a
> > first round of review. The following kernel patches are provided:
> > 
> > [01] Prepare-0: Use real microsecond granularity for taskstats CPU times.
> > [02] Prepare-1: Restructure taskstats.c in order to be able to add new commands
> >      more easily.
> > [03] Prepare-2: Separate the finding of a task_struct by PID or TGID from
> >      filling the taskstats.
> > [04] Add new command "TASKSTATS_CMD_ATTR_PIDS" to get a snapshot of multiple
> >      tasks.
> > [05] Add procfs interface for taskstats commands. This allows to get a complete
> >      and consistent snapshot with all tasks using two system calls (ioctl and
> >      read). Transferring a snapshot of all running tasks is not possible using
> >      the existing netlink interface, because there we have the socket buffer
> >      size as restricting factor.
> 
> So this is a binary interface which uses an ioctl.  People don't like
> ioctls.  Could we have triggered it with a write() instead?

The current idea is the following:

1. Open /proc/taskstats
2. Set the requested command (e.g. TASKSTATS_CMD_ATTR_PIDS) using
   an ioctl. For the TASKSTATS_CMD_ATTR_PIDS ioctl the following
   structure is sent:

   struct taskstats_cmd_pids {
        __u64   time_ns;
        __u32   pid;
        __u32   cnt;
   };

3. After the command has been defined, a read() executes the command
   and returns the result to the user's read buffer.

We could replace step 2 with a write() that transfers the command.
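
For illustration, a rough user space sketch of this open/ioctl/read sequence
could look like the code below. The ioctl request code and the local copy of
struct taskstats_cmd_pids are placeholders of mine (the real definitions come
with patch 05), and the read buffer is assumed to be filled with struct
taskstats records:

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/taskstats.h>

  /* Local copy of the command structure described above (patch 05). */
  struct taskstats_cmd_pids {
          __u64   time_ns;
          __u32   pid;
          __u32   cnt;
  };

  /* Placeholder request code; the real one is defined by the kernel patch. */
  #define TASKSTATS_IOC_ATTR_PIDS _IOW('T', 1, struct taskstats_cmd_pids)

  int main(void)
  {
          struct taskstats_cmd_pids cmd = {
                  .time_ns = 0,   /* e.g. timestamp of the last snapshot */
                  .pid     = 1,   /* first PID of the requested range */
                  .cnt     = 32,  /* number of tasks to return */
          };
          char buf[32 * sizeof(struct taskstats)];
          ssize_t len;
          int fd;

          fd = open("/proc/taskstats", O_RDONLY);
          if (fd < 0) {
                  perror("open /proc/taskstats");
                  return 1;
          }
          /* Step 2: select the command and its parameters. */
          if (ioctl(fd, TASKSTATS_IOC_ATTR_PIDS, &cmd) < 0) {
                  perror("ioctl");
                  close(fd);
                  return 1;
          }
          /* Step 3: execute the command and fetch the result. */
          len = read(fd, buf, sizeof(buf));
          if (len < 0) {
                  perror("read");
                  close(fd);
                  return 1;
          }
          printf("received %zd bytes of taskstats data\n", len);
          close(fd);
          return 0;
  }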

> Does this have the potential to save us from the CONFIG_NET=n problem?

Yes

> > [06] Add TGID to taskstats.
> > [07] Add steal time per task accounting.
> > [08] Add cumulative CPU time (user, system and steal) to taskstats.
> 
> These didn't update the taskstats version number.  Should they have?

Patch 04/10 updates the taskstats version number from 7 to 8.
I didn't want to update the version number with each patch.

> > [09] Fix exit CPU time accounting.
> > 
> > [10] Besides of the kernel patches also user space code is provided that
> >      exploits the new kernel infrastructure. The user space code provides the
> >      following:
> >      1. A proposal for a taskstats user space library:
> >         1.1 Based on netlink (requires libnl-devel-1.1-5)
> >         2.1 Based on the new /proc/taskstats interface (see [05])
> >      2. A proposal for a task snapshot library based on taskstats library (1.1)
> 
> ooh, excellent.  A standardised userspace access library.

Yes, at least a proposal for that.

> >      3. A new tool "ptop" (precise top) that uses the libraries
> 
> Talk to me about namespaces, please.  A lot of the new code involves
> PIDs, but PIDs are not system-wide unique.  A PID is relative to a PID
> namespace.  Does everything Just Work?  When userspace sends a PID to
> the kernel, that PID is assumed to be within the sending process's PID
> namespace?  If so, then please spell it all out in the changelogs.  If
> not then that is a problem!

To be honest, I have not tested that. I assumed that the current
taskstats code does this correctly. E.g. it uses find_task_by_vpid() for
TASKSTATS_CMD_ATTR_PID and this function uses
"current->nsproxy->pid_ns". So I would assume that we only get tasks
from the caller's namespace. The new TASKSTATS_CMD_ATTR_PIDS command
also uses only functions that work with "current->nsproxy->pid_ns".

> If I can only observe processes in my PID namespace then is that a
> problem?  Should I be allowed to observe another PID namespace's
> processes?  I assume so, because I might be root.  If so, how is that
> to be done?

Good question. I probably have to learn a bit more about the PID
namespace implementation. Are PIDs unique across all namespaces?

Michael


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
  2010-09-23 13:48 [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting Michael Holzheu
                   ` (10 preceding siblings ...)
  2010-09-23 20:11 ` [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting Andrew Morton
@ 2010-09-24  9:16 ` Balbir Singh
  2010-09-30  8:38 ` Andi Kleen
  12 siblings, 0 replies; 58+ messages in thread
From: Balbir Singh @ 2010-09-24  9:16 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Martin Schwidefsky, Heiko Carstens,
	linux-kernel, linux-s390

* Michael Holzheu <holzheu@linux.vnet.ibm.com> [2010-09-23 15:48:01]:

> Currently tools like "top" gather the task information by reading procfs
> files. This has several disadvantages:
> 
> * It is very CPU intensive, because a lot of system calls (readdir, open,
>   read, close) are necessary.
> * No real task snapshot can be provided, because while the procfs files are
>   read the system continues running.
> * The procfs times granularity is restricted to jiffies.
> 
> In parallel to procfs there exists the taskstats binary interface that uses
> netlink sockets as transport mechanism to deliver task information to
> user space. There exists a taskstats command "TASKSTATS_CMD_ATTR_PID"
> to get task information for a given PID. This command can already be used for
> tools like top, but has also several disadvantages:
> 
> * You first have to find out which PIDs are available in the system. Currently
>   we have to use procfs again to do this.
> * For each task two system calls have to be issued (First send the command and
>   then receive the reply).
> * No snapshot mechanism is available.
> 
> GOALS OF THIS PATCH SET
> -----------------------
> The intention of this patch set is to provide better support for tools like
> top. The goal is to:
> 
> * provide a task snapshot mechanism where we can get a consistent view of
>   all running tasks.
> * provide a transport mechanism that does not require a lot of system calls
>   and that allows implementing low CPU overhead task monitoring.
> * provide microsecond CPU time granularity.
>


Looks like a good set of goals
 
> FIRST RESULTS
> -------------
> Together with this kernel patch set also user space code for a new top
> utility (ptop) is provided that exploits the new kernel infrastructure. See
> patch 10 for more details.
> 
> TEST1: System with many sleeping tasks
> 
>   for ((i=0; i < 1000; i++))
>   do
>          sleep 1000000 &
>   done
> 
>   # ptop_new_proc
> 
>              VVVV
>   pid   user  sys  ste  total  Name
>   (#)    (%)  (%)  (%)    (%)  (str)
>   541   0.37 2.39 0.10   2.87  top
>   3743  0.03 0.05 0.00   0.07  ptop_new_proc
>              ^^^^
> 
> Compared to the old top command that has to scan more than 1000 proc
> directories the new ptop consumes much less CPU time (0.05% system time
> on my s390 system).

This is very nice!

> 
> TEST2: Show snapshot consistency with system that is 100% busy
> 
>   System with 3 CPUs:
> 
>   for ((i=0; i < $(cat /proc/cpuinfo  | grep "^processor" | wc -l); i++))
>   do
>        ./loop &
>   done
> 
>   # ptop_snap_proc
> 
>           VVVV  VVV  VVV                        VVVVV
>   pid     user  sys  ste cuser csys cste delay  total Elap+ Name
>   (#)      (%)  (%)  (%)   (%)  (%)  (%)   (%)    (%)  (hm) (str)
>   23891  99.84 0.06 0.09  0.00 0.00 0.00  0.01  99.99  0:00 loop
>   23881  99.66 0.06 0.09  0.00 0.00 0.00  0.20  99.81  0:00 loop
>   23886  99.65 0.06 0.09  0.00 0.00 0.00  0.20  99.80  0:00 loop
>   2413    0.00 0.00 0.00  0.00 0.00 0.00  0.00   0.01  4:17 sshd
>   ...
>   V:V:S 299.36 0.36 0.27  0.00 0.00 0.00  0.40 300.00  4:22
>                                                ^^^^^^
> 
>   With the snapshot mechanism the sum of all tasks CPU times (user + system +
>   steal) will be exactly 300.00% CPU time with this testcase. Using
>   ptop_snap_proc (see patch 10) this works fine on s390.
> 
> PATCHSET OVERVIEW
> -----------------
> The code is not final and still has a few TODOs. But it is good enough for a
> first round of review. The following kernel patches are provided:
> 
> [01] Prepare-0: Use real microsecond granularity for taskstats CPU times.
> [02] Prepare-1: Restructure taskstats.c in order to be able to add new commands
>      more easily.
> [03] Prepare-2: Separate the finding of a task_struct by PID or TGID from
>      filling the taskstats.
> [04] Add new command "TASKSTATS_CMD_ATTR_PIDS" to get a snapshot of multiple
>      tasks.
> [05] Add procfs interface for taskstats commands. This allows to get a complete
>      and consistent snapshot with all tasks using two system calls (ioctl and
>      read). Transferring a snapshot of all running tasks is not possible using
>      the existing netlink interface, because there we have the socket buffer
>      size as restricting factor.
> [06] Add TGID to taskstats.
> [07] Add steal time per task accounting.
> [08] Add cumulative CPU time (user, system and steal) to taskstats.
> [09] Fix exit CPU time accounting.

I'll review the patches in more depth.

> 
> [10] Besides of the kernel patches also user space code is provided that
>      exploits the new kernel infrastructure. The user space code provides the
>      following:
>      1. A proposal for a taskstats user space library:
>         1.1 Based on netlink (requires libnl-devel-1.1-5)
>         2.1 Based on the new /proc/taskstats interface (see [05])

I have some code for libnl based exploitation lying around, not sure
if you've seen the same.

>      2. A proposal for a task snapshot library based on taskstats library (1.1)
>      3. A new tool "ptop" (precise top) that uses the libraries
> 
> 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-09-23 17:10   ` Oleg Nesterov
@ 2010-09-24 12:18     ` Michael Holzheu
  2010-09-26 18:11       ` Oleg Nesterov
  0 siblings, 1 reply; 58+ messages in thread
From: Michael Holzheu @ 2010-09-24 12:18 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Peter Zijlstra, Suresh Siddha, John stultz, Thomas Gleixner,
	Balbir Singh, Ingo Molnar, Heiko Carstens, Martin Schwidefsky,
	linux-s390, linux-kernel

Hello Oleg,

On Thu, 2010-09-23 at 19:10 +0200, Oleg Nesterov wrote:
> Sorry, I didn't look at other patches, but this one looks strange
> to me...
> 
> On 09/23, Michael Holzheu wrote:
> >
> > Currently there are code pathes (e.g. for kthreads) where the consumed
> > CPU time is not accounted to the parents cumulative counters.
> 
> Could you explain more?

I think one place was "khelper" (kmod.c). It is created with
kernel_thread() and it exits without its times having been accounted via
sys_wait() to the parent's ctimes (see the second kernel_thread()
invocation below in kmod.c):

        if (wait == UMH_WAIT_PROC)
                pid = kernel_thread(wait_for_helper, sub_info,
                                    CLONE_FS | CLONE_FILES | SIGCHLD);
        else
                pid = kernel_thread(____call_usermodehelper, sub_info,
                                    CLONE_VFORK | SIGCHLD);

<snip>

> >  void release_task(struct task_struct * p)
> >  {
> >  	struct task_struct *leader;
> >  	int zap_leader;
> > +
> > +	if (!p->exit_accounting_done)
> > +		account_to_parent(p);
> >  repeat:
> >  	tracehook_prepare_release_task(p);
> >  	/* don't need to get the RCU readlock here - the process is dead and
> > @@ -1279,6 +1313,7 @@
> >  			psig->cmaxrss = maxrss;
> >  		task_io_accounting_add(&psig->ioac, &p->ioac);
> >  		task_io_accounting_add(&psig->ioac, &sig->ioac);
> > +		p->exit_accounting_done = 1;
> 
> Can't understand.
> 
> Suppose that a thread T exits and reaps itself (calls release_task).
> Now we call account_to_parent() which accounts T->signal->XXX + T->XXX.
> After that T calls __exit_signal and does T->signal->XXX += T->XXX.
> 
> If another thread exits it does the same and we account the already
> exited thread T again?
> 
> When the last thread exits, wait_task_zombie() accounts T->signal
> once again.
> 
> IOW, this looks like the over-accounting to me, no?

I think you are right and this patch is not correct here.

I had the wrong idea, because I thought that for every exited thread the
parent would do a sys_wait() and the thread's CPU times would be added to
the parent's ctime. This happens only for the thread group leader,
correct? Other threads just exit and add their times to the signal
struct of the process in __exit_signal(). When a sys_wait() is done for
a dead thread group leader, all the times that have been accumulated in
the signal struct are added to the ctime fields of the waiting parent.

I wanted to use the cumulative times (cutime, cstime, csttime) for ptop
in order to show all consumed CPU time in the last interval.

E.g. if a task "X" forked and also exited between ptop snapshots 1 and
2, ptop can't see this task, because it is neither in snapshot 1 nor in
snapshot 2. But ptop can still show X's consumed CPU time in the
interval by subtracting the cumulative times of X's parent between the
two snapshots.

If a task "Y" is in snapshot 1, but not in snapshot 2, we search for the
parent of "Y" and calculate it's ctimes for the last interval as follows
(user time in this example):

parent->cuser_diff =
  parent->snap2->cutime - Y->snap1->utime -                 
  Y->snap1->cutime - parent->snap1->cutime

The result is the CPU time that Y consumed in the last interval.
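
In code, this calculation could look roughly like the sketch below (the
structure and function names are made up for the example and simply follow
the formula above):

  #include <stdint.h>

  /* Hypothetical per-snapshot counters of one task, in usec. */
  struct task_snap {
          uint64_t utime;   /* own user time */
          uint64_t cutime;  /* cumulative user time of dead children */
  };

  /*
   * User time that the vanished task Y (present in snapshot 1 only)
   * consumed in the last interval, taken from the growth of the
   * parent's cumulative user time.
   */
  static uint64_t parent_cuser_diff(const struct task_snap *parent_snap1,
                                    const struct task_snap *parent_snap2,
                                    const struct task_snap *y_snap1)
  {
          return parent_snap2->cutime - y_snap1->utime -
                 y_snap1->cutime - parent_snap1->cutime;
  }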

Example (ptop):

                          VVVVV  VVVV VVVV
   pid     user  sys  ste cuser  csys cste  total Elap+ Name
   (#)      (%)  (%)  (%)   (%)   (%)  (%)    (%)  (hm) (str)
   8374   75.48 0.41 1.34  0.00  0.00 0.00  77.24  0:01 loop
>> 10093   0.17 0.30 0.00 25.90 38.19 0.52  65.07  0:00 make <<

ptop shows cuser, csys and cste for the last interval. In this example
it is the time that dead children of "make" consumed in the last
interval.

Ok, the problem is that I did not consider exiting threads that are not
thread group leaders. When they exit, the ctime of the parent is not
updated; instead, the time is accumulated in the signal struct.

To fix this we could also add the signal_struct times (e.g. tguser,
tgsys and tgste) to taskstats. When a task "Z" exits (is in snapshot 1,
but not in snapshot 2), we first check if the thread group leader is
still in snapshot 2. If this is the case, we do the following
calculation:

tgleader->tguser_diff =
  tgleader->snap2->sig->utime - Z->snap1->utime -
  tgleader->snap1->sig->utime

If we add the CPU time diffs, the cumulated parent diffs and the thread
group CPU time diffs, we should again get 100% of the consumed CPU time
in the last ptop interval.

Does this make sense?

Michael


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
  2010-09-23 22:11   ` Matt Helsley
@ 2010-09-24 12:39     ` Michael Holzheu
  2010-09-25 18:19     ` Serge E. Hallyn
  1 sibling, 0 replies; 58+ messages in thread
From: Michael Holzheu @ 2010-09-24 12:39 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Andrew Morton, Shailabh Nagar, Peter Zijlstra, Heiko Carstens,
	Venkatesh Pallipadi, John stultz, containers, linux-s390,
	Balbir Singh, Oleg Nesterov, linux-kernel, Martin Schwidefsky,
	Ingo Molnar, Thomas Gleixner, Suresh Siddha

Hello Matt,

On Thu, 2010-09-23 at 15:11 -0700, Matt Helsley wrote:
> > Talk to me about namespaces, please.  A lot of the new code involves
> > PIDs, but PIDs are not system-wide unique.  A PID is relative to a PID
> > namespace.  Does everything Just Work?  When userspace sends a PID to
> > the kernel, that PID is assumed to be within the sending process's PID
> > namespace?  If so, then please spell it all out in the changelogs.  If
> > not then that is a problem!
> 
> Good point.
> 
> The pid ought to be valid in the _receiving_ task's pid namespace. That
> can be difficult or impossible if we're talking about netlink broadcasts.
> In this regard process events connector is an example of what not to do.

I think that the netlink taskstats commands are executed in the context
of the calling process (at least my printk shows me that). The command
collects the process data using "current->nsproxy->pid_ns" and creates a
netlink reply. So everything should be fine here. Shouldn't it?
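
For reference, the PID lookup used by the existing commands resolves the
number relative to the caller's namespace; roughly (a simplified sketch of
the lookup helper in kernel/pid.c, not necessarily the literal code):

  struct task_struct *find_task_by_vpid(pid_t vnr)
  {
          /* vnr is interpreted in the pid namespace of the caller */
          return find_task_by_pid_ns(vnr, current->nsproxy->pid_ns);
  }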

Hmmm, but for exit events, this might be broken in taskstats. The code
looks to me as if every exiting task, independent of its namespace, is
reported as an event via taskstats_exit(). Maybe I am missing something...

Michael




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
  2010-09-24  9:10   ` Michael Holzheu
@ 2010-09-24 18:50     ` Andrew Morton
  2010-09-27  9:18       ` Michael Holzheu
  2010-09-27 10:49     ` Balbir Singh
  1 sibling, 1 reply; 58+ messages in thread
From: Andrew Morton @ 2010-09-24 18:50 UTC (permalink / raw)
  To: holzheu
  Cc: Shailabh Nagar, Venkatesh Pallipadi, Suresh Siddha,
	Peter Zijlstra, Ingo Molnar, Oleg Nesterov, John stultz,
	Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, linux-kernel, linux-s390, containers

On Fri, 24 Sep 2010 11:10:15 +0200
Michael Holzheu <holzheu@linux.vnet.ibm.com> wrote:

> Hello Andrew,
> 
> On Thu, 2010-09-23 at 13:11 -0700, Andrew Morton wrote:
> > > GOALS OF THIS PATCH SET
> > > -----------------------
> > > The intention of this patch set is to provide better support for tools like
> > > top. The goal is to:
> > > 
> > > * provide a task snapshot mechanism where we can get a consistent view of
> > >   all running tasks.
> > > * provide a transport mechanism that does not require a lot of system calls
> > >   and that allows implementing low CPU overhead task monitoring.
> > > * provide microsecond CPU time granularity.
> > 
> > This is a big change!  If this is done right then we're heading in the
> > direction of deprecating the longstanding way in which userspace
> > observes the state of Linux processes and we're recommending that the
> > whole world migrate to taskstats.  I think?
> 
> Or it can be used as alternative. Since procfs has its drawbacks (e.g.
> performance) an alternative could be helpful. 

And it can be harmful.  More kernel code to maintain and test, more
userspace code to develop, maintain, etc.  Less user testing than if
there was a single interface.

> 
> > I worry that there's a dependency on CONFIG_NET?  If so then that's a
> > big problem because in N years time, 99% of the world will be using
> > taskstats, but a few embedded losers will be stuck using (and having to
> > support) the old tools.
> 
> Sure, but if we could add the /proc/taskstats approach, this dependency
> would not be there.

So why do we need to present the same info over netlink?

If the info is available via procfs then userspace code should use that
and not netlink, because that userspace code would also be applicable
to CONFIG_NET=n systems.

> 
> > Does this have the potential to save us from the CONFIG_NET=n problem?
> 
> Yes

Let's say that when it's all tested ;)

> Are PIDs over all namespaces unique?

Nope.  The same pid can be present in different namespaces at the same
time.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
  2010-09-23 22:11   ` Matt Helsley
  2010-09-24 12:39     ` Michael Holzheu
@ 2010-09-25 18:19     ` Serge E. Hallyn
  1 sibling, 0 replies; 58+ messages in thread
From: Serge E. Hallyn @ 2010-09-25 18:19 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Andrew Morton, Shailabh Nagar, linux-s390, Peter Zijlstra,
	Venkatesh Pallipadi, John stultz, containers, Heiko Carstens,
	Oleg Nesterov, linux-kernel, Suresh Siddha, Martin Schwidefsky,
	Ingo Molnar, holzheu, Thomas Gleixner, Balbir Singh

Quoting Matt Helsley (matthltc@us.ibm.com):
> I don't think even "root" can see/use pids outside its namespace (without

Just to be clear on this, you're right in what you say, but if a task in a child
pidns still has access to the /proc mount of the parent pidns, then it can see
the pids in there, and get information from them, e.g. /proc/pid/maps.  So
in that sense, some people could misinterpret "see/use pids" and think you
weren't right.

-serge

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-09-24 12:18     ` Michael Holzheu
@ 2010-09-26 18:11       ` Oleg Nesterov
  2010-09-27 13:23         ` Michael Holzheu
  2010-09-27 13:42         ` Martin Schwidefsky
  0 siblings, 2 replies; 58+ messages in thread
From: Oleg Nesterov @ 2010-09-26 18:11 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Peter Zijlstra, Suresh Siddha, John stultz, Thomas Gleixner,
	Balbir Singh, Ingo Molnar, Heiko Carstens, Martin Schwidefsky,
	linux-s390, linux-kernel

Hi,

On 09/24, Michael Holzheu wrote:
>
> On Thu, 2010-09-23 at 19:10 +0200, Oleg Nesterov wrote:
> >
> > On 09/23, Michael Holzheu wrote:
> > >
> > > Currently there are code pathes (e.g. for kthreads) where the consumed
> > > CPU time is not accounted to the parents cumulative counters.
> >
> > Could you explain more?
>
> I think one place was "khelper" (kmod.c). It is created with
> kernel_thread() and it exits without having accounted the times with
> sys_wait() to the parent's ctimes

No. Well yes, it is not accounted, but this is not because it is a
kthread.

To simplify the discussion, let's talk about utime/cutime only,
and let's forget about multithreading.

It is very simple: currently Linux accounts the exiting task's
utime and adds it to the parent's ->cutime _only_ if the parent
does do_wait(). If the parent ignores SIGCHLD, the child reaps
itself and it is not accounted.
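
(Very roughly, the only place where the child's times reach the parent's
cumulative counters is the wait path; a simplified sketch, not the
literal code:)

  /* wait_task_zombie(), simplified; only reached if the parent waits */
  psig = p->real_parent->signal;
  sig = p->signal;
  psig->cutime = cputime_add(psig->cutime,
                             cputime_add(p->utime, sig->cutime));
  psig->cstime = cputime_add(psig->cstime,
                             cputime_add(p->stime, sig->cstime));
  /* if the parent ignores SIGCHLD, release_task() runs without this */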

I do not know why it was done this way, but I'm afraid we can't
change this historical behaviour.

> Ok, the problem is that I did not consider exiting threads that are no
> thread group leaders. When they exit the ctime of the parent is not
> updated. Instead the time is accumulated in the signal struct.

I think I am a bit confused, but see above. With or without threads
the whole process can exit without accounting.

Oleg.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
  2010-09-24 18:50     ` Andrew Morton
@ 2010-09-27  9:18       ` Michael Holzheu
  2010-09-27 20:02         ` Andrew Morton
  0 siblings, 1 reply; 58+ messages in thread
From: Michael Holzheu @ 2010-09-27  9:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Shailabh Nagar, Venkatesh Pallipadi, Suresh Siddha,
	Peter Zijlstra, Ingo Molnar, Oleg Nesterov, John stultz,
	Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, linux-kernel, linux-s390, containers

Hello Andrew,

On Fri, 2010-09-24 at 11:50 -0700, Andrew Morton wrote:
> > > This is a big change!  If this is done right then we're heading in the
> > > direction of deprecating the longstanding way in which userspace
> > > observes the state of Linux processes and we're recommending that the
> > > whole world migrate to taskstats.  I think?
> > 
> > Or it can be used as alternative. Since procfs has its drawbacks (e.g.
> > performance) an alternative could be helpful. 
> 
> And it can be harmful.  More kernel code to maintain and test, more
> userspace code to develop, maintain, etc.  Less user testing than if
> there was a single interface.

Sure, the value has to be big enough to justify the effort.

But as I said, with taskstats and procfs we already have two interfaces
for getting task information. Currently procfs contains information
that you can't find in taskstats. But also the other way round: the
taskstats structure contains very useful information that you can't get
under proc, e.g. the task delay times, IO accounting, etc. So currently
tools have to use both interfaces to get all information, which is not
optimal.

> > 
> > > I worry that there's a dependency on CONFIG_NET?  If so then that's a
> > > big problem because in N years time, 99% of the world will be using
> > > taskstats, but a few embedded losers will be stuck using (and having to
> > > support) the old tools.
> > 
> > Sure, but if we could add the /proc/taskstats approach, this dependency
> > would not be there.
> 
> So why do we need to present the same info over netlink?

Good point. It is not really necessary. I started development using the
netlink code. Therefore I first added the new command in the netlink
code. I also thought it would be a good idea to provide all netlink
commands over the procfs interface to be consistent.

> If the info is available via procfs then userspace code should use that
> and not netlink, because that userspace code would also be applicable
> to CONFIG_NET=n systems.
> 
> > 
> > > Does this have the potential to save us from the CONFIG_NET=n problem?
> > 
> > Yes
> 
> Let's say that when it's all tested ;)

That was more a theoretical statement :-)

I probably still have to ensure that the kernel config option
dependencies are handled correctly.

Michael


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 02/10] taskstats: Separate taskstats commands
  2010-09-23 14:01 ` [RFC][PATCH 02/10] taskstats: Separate taskstats commands Michael Holzheu
@ 2010-09-27  9:32   ` Balbir Singh
  2010-10-11  7:40   ` Balbir Singh
  1 sibling, 0 replies; 58+ messages in thread
From: Balbir Singh @ 2010-09-27  9:32 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Martin Schwidefsky, Heiko Carstens,
	linux-kernel, linux-s390

* Michael Holzheu <holzheu@linux.vnet.ibm.com> [2010-09-23 16:01:02]:

> Subject: [PATCH] taskstats: Separate taskstats commands
> 
> From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> 
> This patch moves each taskstats command into a single function. This
> makes
> the code more readable and makes it easier to add new commands.
> 
> Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> ---
>  kernel/taskstats.c |  118
> +++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 78 insertions(+), 40 deletions(-)
> 
> --- a/kernel/taskstats.c
> +++ b/kernel/taskstats.c
> @@ -424,39 +424,76 @@ err:
>  	return rc;
>  }
> 
> -static int taskstats_user_cmd(struct sk_buff *skb, struct genl_info *info)
> +static int cmd_attr_register_cpumask(struct genl_info *info)
>  {
> -	int rc;
> -	struct sk_buff *rep_skb;
> -	struct taskstats *stats;
> -	size_t size;
>  	cpumask_var_t mask;
> +	int rc;
> 
>  	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
>  		return -ENOMEM;
> -
>  	rc = parse(info->attrs[TASKSTATS_CMD_ATTR_REGISTER_CPUMASK], mask);
>  	if (rc < 0)
> -		goto free_return_rc;
> -	if (rc == 0) {
> -		rc = add_del_listener(info->snd_pid, mask, REGISTER);
> -		goto free_return_rc;
> -	}
> +		goto out;
> +	rc = add_del_listener(info->snd_pid, mask, REGISTER);
> +out:
> +	free_cpumask_var(mask);
> +	return rc;
> +}
> +
> +static int cmd_attr_deregister_cpumask(struct genl_info *info)
> +{
> +	cpumask_var_t mask;
> +	int rc;
> 
> +	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
> +		return -ENOMEM;
>  	rc = parse(info->attrs[TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK], mask);
>  	if (rc < 0)
> -		goto free_return_rc;
> -	if (rc == 0) {
> -		rc = add_del_listener(info->snd_pid, mask, DEREGISTER);
> -free_return_rc:
> -		free_cpumask_var(mask);
> -		return rc;
> -	}
> +		goto out;
> +	rc = add_del_listener(info->snd_pid, mask, DEREGISTER);
> +out:
>  	free_cpumask_var(mask);
> +	return rc;
> +}
> +
> +static int cmd_attr_pid(struct genl_info *info)
> +{
> +	struct taskstats *stats;
> +	struct sk_buff *rep_skb;
> +	size_t size;
> +	u32 pid;
> +	int rc;
> +
> +	size = nla_total_size(sizeof(u32)) +
> +		nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
> +
> +	rc = prepare_reply(info, TASKSTATS_CMD_NEW, &rep_skb, size);
> +	if (rc < 0)
> +		return rc;
> +
> +	rc = -EINVAL;
> +	pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
> +	stats = mk_reply(rep_skb, TASKSTATS_TYPE_PID, pid);
> +	if (!stats)
> +		goto err;
> +
> +	rc = fill_pid(pid, NULL, stats);
> +	if (rc < 0)
> +		goto err;
> +	return send_reply(rep_skb, info);
> +err:
> +	nlmsg_free(rep_skb);
> +	return rc;
> +}
> +
> +static int cmd_attr_tgid(struct genl_info *info)
> +{
> +	struct taskstats *stats;
> +	struct sk_buff *rep_skb;
> +	size_t size;
> +	u32 tgid;
> +	int rc;
> 
> -	/*
> -	 * Size includes space for nested attributes
> -	 */
>  	size = nla_total_size(sizeof(u32)) +
>  		nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
> 
> @@ -465,33 +502,34 @@ free_return_rc:
>  		return rc;
> 
>  	rc = -EINVAL;
> -	if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
> -		u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
> -		stats = mk_reply(rep_skb, TASKSTATS_TYPE_PID, pid);
> -		if (!stats)
> -			goto err;
> -
> -		rc = fill_pid(pid, NULL, stats);
> -		if (rc < 0)
> -			goto err;
> -	} else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
> -		u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
> -		stats = mk_reply(rep_skb, TASKSTATS_TYPE_TGID, tgid);
> -		if (!stats)
> -			goto err;
> -
> -		rc = fill_tgid(tgid, NULL, stats);
> -		if (rc < 0)
> -			goto err;
> -	} else
> +	tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
> +	stats = mk_reply(rep_skb, TASKSTATS_TYPE_TGID, tgid);
> +	if (!stats)
>  		goto err;
> 
> +	rc = fill_tgid(tgid, NULL, stats);
> +	if (rc < 0)
> +		goto err;
>  	return send_reply(rep_skb, info);
>  err:
>  	nlmsg_free(rep_skb);
>  	return rc;
>  }
> 
> +static int taskstats_user_cmd(struct sk_buff *skb, struct genl_info *info)
> +{
> +	if (info->attrs[TASKSTATS_CMD_ATTR_REGISTER_CPUMASK])
> +		return cmd_attr_register_cpumask(info);
> +	else if (info->attrs[TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK])
> +		return cmd_attr_deregister_cpumask(info);
> +	else if (info->attrs[TASKSTATS_CMD_ATTR_PID])
> +		return cmd_attr_pid(info);
> +	else if (info->attrs[TASKSTATS_CMD_ATTR_TGID])
> +		return cmd_attr_tgid(info);
> +	else
> +		return -EINVAL;
> +}
> +
>  static struct taskstats *taskstats_tgid_alloc(struct task_struct *tsk)
>  {
>  	struct signal_struct *sig = tsk->signal;
> 
> 


Looks like a good clean-up

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 03/10] taskstats: Split fill_pid function
  2010-09-23 14:01 ` [RFC][PATCH 03/10] taskstats: Split fill_pid function Michael Holzheu
  2010-09-23 17:33   ` Oleg Nesterov
@ 2010-09-27  9:33   ` Balbir Singh
  2010-10-11  8:31   ` Balbir Singh
  2 siblings, 0 replies; 58+ messages in thread
From: Balbir Singh @ 2010-09-27  9:33 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Peter Zijlstra, Suresh Siddha, John stultz, Thomas Gleixner,
	Oleg Nesterov, Ingo Molnar, Heiko Carstens, Martin Schwidefsky,
	linux-s390, linux-kernel

* Michael Holzheu <holzheu@linux.vnet.ibm.com> [2010-09-23 16:01:07]:

> Subject: [PATCH] taskstats: Split fill_pid function
> 
> From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> 
> Separate the finding of a task_struct by pid or tgid from filling the taskstats
> data. This makes the code more readable.
> 
> Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> ---
>  kernel/taskstats.c |   50 +++++++++++++++++++++-----------------------------
>  1 file changed, 21 insertions(+), 29 deletions(-)
> 
> --- a/kernel/taskstats.c
> +++ b/kernel/taskstats.c
> @@ -175,22 +175,8 @@ static void send_cpu_listeners(struct sk
>  	up_write(&listeners->sem);
>  }
> 
> -static int fill_pid(pid_t pid, struct task_struct *tsk,
> -		struct taskstats *stats)
> +static void fill_stats(struct task_struct *tsk, struct taskstats *stats)
>  {
> -	int rc = 0;
> -
> -	if (!tsk) {
> -		rcu_read_lock();
> -		tsk = find_task_by_vpid(pid);
> -		if (tsk)
> -			get_task_struct(tsk);
> -		rcu_read_unlock();
> -		if (!tsk)
> -			return -ESRCH;
> -	} else
> -		get_task_struct(tsk);
> -
>  	memset(stats, 0, sizeof(*stats));
>  	/*
>  	 * Each accounting subsystem adds calls to its functions to
> @@ -209,17 +195,27 @@ static int fill_pid(pid_t pid, struct ta
> 
>  	/* fill in extended acct fields */
>  	xacct_add_tsk(stats, tsk);
> +}
> 
> -	/* Define err: label here if needed */
> -	put_task_struct(tsk);
> -	return rc;
> +static int fill_stats_for_pid(pid_t pid, struct taskstats *stats)
> +{
> +	struct task_struct *tsk;
> 
> +	rcu_read_lock();
> +	tsk = find_task_by_vpid(pid);
> +	if (tsk)
> +		get_task_struct(tsk);
> +	rcu_read_unlock();
> +	if (!tsk)
> +		return -ESRCH;
> +	fill_stats(tsk, stats);
> +	put_task_struct(tsk);
> +	return 0;
>  }
> 
> -static int fill_tgid(pid_t tgid, struct task_struct *first,
> -		struct taskstats *stats)
> +static int fill_stats_for_tgid(pid_t tgid, struct taskstats *stats)
>  {
> -	struct task_struct *tsk;
> +	struct task_struct *tsk, *first;
>  	unsigned long flags;
>  	int rc = -ESRCH;
> 
> @@ -228,8 +224,7 @@ static int fill_tgid(pid_t tgid, struct 
>  	 * leaders who are already counted with the dead tasks
>  	 */
>  	rcu_read_lock();
> -	if (!first)
> -		first = find_task_by_vpid(tgid);
> +	first = find_task_by_vpid(tgid);
> 
>  	if (!first || !lock_task_sighand(first, &flags))
>  		goto out;
> @@ -268,7 +263,6 @@ out:
>  	return rc;
>  }
> 
> -
>  static void fill_tgid_exit(struct task_struct *tsk)
>  {
>  	unsigned long flags;
> @@ -477,7 +471,7 @@ static int cmd_attr_pid(struct genl_info
>  	if (!stats)
>  		goto err;
> 
> -	rc = fill_pid(pid, NULL, stats);
> +	rc = fill_stats_for_pid(pid, stats);
>  	if (rc < 0)
>  		goto err;
>  	return send_reply(rep_skb, info);
> @@ -507,7 +501,7 @@ static int cmd_attr_tgid(struct genl_inf
>  	if (!stats)
>  		goto err;
> 
> -	rc = fill_tgid(tgid, NULL, stats);
> +	rc = fill_stats_for_tgid(tgid, stats);
>  	if (rc < 0)
>  		goto err;
>  	return send_reply(rep_skb, info);
> @@ -593,9 +587,7 @@ void taskstats_exit(struct task_struct *
>  	if (!stats)
>  		goto err;
> 
> -	rc = fill_pid(-1, tsk, stats);
> -	if (rc < 0)
> -		goto err;
> +	fill_stats(tsk, stats);
> 
>  	/*
>  	 * Doesn't matter if tsk is the leader or the last group member leaving
> 
>

 
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
  2010-09-24  9:10   ` Michael Holzheu
  2010-09-24 18:50     ` Andrew Morton
@ 2010-09-27 10:49     ` Balbir Singh
  1 sibling, 0 replies; 58+ messages in thread
From: Balbir Singh @ 2010-09-27 10:49 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Andrew Morton, Shailabh Nagar, Venkatesh Pallipadi,
	Suresh Siddha, Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Martin Schwidefsky, Heiko Carstens,
	linux-kernel, linux-s390, containers

* Michael Holzheu <holzheu@linux.vnet.ibm.com> [2010-09-24 11:10:15]:

> Hello Andrew,
> 
> On Thu, 2010-09-23 at 13:11 -0700, Andrew Morton wrote:
> > > GOALS OF THIS PATCH SET
> > > -----------------------
> > > The intention of this patch set is to provide better support for tools like
> > > top. The goal is to:
> > > 
> > > * provide a task snapshot mechanism where we can get a consistent view of
> > >   all running tasks.
> > > * provide a transport mechanism that does not require a lot of system calls
> > >   and that allows implementing low CPU overhead task monitoring.
> > > * provide microsecond CPU time granularity.
> > 
> > This is a big change!  If this is done right then we're heading in the
> > direction of deprecating the longstanding way in which userspace
> > observes the state of Linux processes and we're recommending that the
> > whole world migrate to taskstats.  I think?
>

Wouldn't I love that :)
 
> Or it can be used as alternative. Since procfs has its drawbacks (e.g.
> performance) an alternative could be helpful. 
> 
> And the taskstats interface with the TASKSTATS_CMD_ATTR_PID command
> already exists and can be used. So we already have a second mechanism to
> query tasks accounting information besides of procfs.
> 

Yes, an alternative for simple data extraction without having to write
network code to extract it.

> > 
> > If so, much chin-scratching will be needed, coordination with
> > util-linux people, etc.
> 
> I agree.
> 
> > We'd need to think about the implications of taskstats versioning.  It
> > _is_ a versioned interface, so people can't just go and toss random new
> > stuff in there at will - it's not like adding a new procfs file, or
> > adding a new line to an existing one.  I don't know if that's likely to
> > be a significant problem.
> 
> I already thought about that problem. Another problem is that depending
> on the kernel config options, some taskstats fields may be not
> initialized. E.g. CONFIG_TASK_DELAY_ACCT or CONFIG_TASK_XACCT. Currently
> there does not exist a good interface to userspace to query which fields
> are valid.
> 
> Regarding the taskstats versions  I described a possible solution in the
> userspace tarball in the README.libtaskstats file:
> 
> The "struct taskstats" structure contains accounting information for one
> Linux task. This structure is defined in "/usr/include/linux/taskstats.h".
> With new kernel versions new fields can be added to that structure.
> In that case the kernel taskstats version number defined with the macro
> TASKSTATS_VERSION will be increased.
>
> The taskstats library distinguishes between two taskstats versions:
> * Kernel taskstats version (KV)
> * Program compile taskstats version (CV)
> 
> Depending on the taskstats version CV that is used for compiling the program,
> this version numbers can be different:
> * KV > CV:
>   The libtaskstats library only copies the CV taskstats fields and the fields
>   that belong to version > CV will be ignored.
> * KV < CV:
>   The libtaskstats library only copies the version KV fields and the fields
>   that belong to version > KV remain uninitialized.
> 
> If a program wants to support multiple taskstats versions, this can be done
> using the ts_version() function and process fields according to that version
> number.
> 
> Example:
> 
>   if (ts_version() < 7) {
>          fprintf(stderr, "Error: kernel taskstats version too low\n");
>          exit(1);
>   }
>   if (ts_version() >= 7)
>          print_attrs_v7();
>   if (ts_version() >= 8)
>          print_attrs_v8();
> 
> In this example the program has to be compiled with a taskstats.h header file
> that has at least version 8.

Fair enough

> 
> > I worry that there's a dependency on CONFIG_NET?  If so then that's a
> > big problem because in N years time, 99% of the world will be using
> > taskstats, but a few embedded losers will be stuck using (and having to
> > support) the old tools.
> 
> Sure, but if we could add the /proc/taskstats approach, this dependency
> would not be there.
> 
> > 
> > > FIRST RESULTS
> > > -------------
> > > Together with this kernel patch set also user space code for a new top
> > > utility (ptop) is provided that exploits the new kernel infrastructure. See
> > > patch 10 for more details.
> > > 
> > > TEST1: System with many sleeping tasks
> > > 
> > >   for ((i=0; i < 1000; i++))
> > >   do
> > >          sleep 1000000 &
> > >   done
> > > 
> > >   # ptop_new_proc
> > > 
> > >              VVVV
> > >   pid   user  sys  ste  total  Name
> > >   (#)    (%)  (%)  (%)    (%)  (str)
> > >   541   0.37 2.39 0.10   2.87  top
> > >   3743  0.03 0.05 0.00   0.07  ptop_new_proc
> > >              ^^^^
> > > 
> > > Compared to the old top command that has to scan more than 1000 proc
> > > directories the new ptop consumes much less CPU time (0.05% system time
> > > on my s390 system).
> > 
> > How many CPUs does that system have?
> 
> The system is a virtual machine and has three CPUs.
> 
> > What's the `top' update period?  One second?
> 
> The update period is two seconds.
> 
> > So we're saying that a `top -d 1' consumes 2.4% of this
> > mystery-number-of-CPUs machine?  That's quite a lot.
> 
> When I run that testcase on my laptop, 2 CPUs (Intel Core 2 - 2.33GHz),
> I get about 1-2% system time for top.
> 
> > > PATCHSET OVERVIEW
> > > -----------------
> > > The code is not final and still has a few TODOs. But it is good enough for a
> > > first round of review. The following kernel patches are provided:
> > > 
> > > [01] Prepare-0: Use real microsecond granularity for taskstats CPU times.
> > > [02] Prepare-1: Restructure taskstats.c in order to be able to add new commands
> > >      more easily.
> > > [03] Prepare-2: Separate the finding of a task_struct by PID or TGID from
> > >      filling the taskstats.
> > > [04] Add new command "TASKSTATS_CMD_ATTR_PIDS" to get a snapshot of multiple
> > >      tasks.
> > > [05] Add procfs interface for taskstats commands. This allows to get a complete
> > >      and consistent snapshot with all tasks using two system calls (ioctl and
> > >      read). Transferring a snapshot of all running tasks is not possible using
> > >      the existing netlink interface, because there we have the socket buffer
> > >      size as restricting factor.
> > 
> > So this is a binary interface which uses an ioctl.  People don't like
> > ioctls.  Could we have triggered it with a write() instead?
> 
> The current idea is the following:
> 
> 1. Open /proc/taskstats
> 2. Set the requested command (e.g. TASKSTATS_CMD_ATTR_PIDS) using
>    an ioctl. For the TASKSTATS_CMD_ATTR_PIDS ioctl the following
>    structure is sent:
> 
>    struct taskstats_cmd_pids {
>         __u64   time_ns;
>         __u32   pid;
>         __u32   cnt;
>    };
> 
> 3. After the command is defined, with a read() the command is executed
>    and the result is returned to the user's read buffer.
> 
> We could replace step 2 with a write, that transfers the command.
>

I don't like ioctls either, write sounds interesting.
 
> > Does this have the potential to save us from the CONFIG_NET=n problem?
> 
> Yes
> 
> > > [06] Add TGID to taskstats.
> > > [07] Add steal time per task accounting.
> > > [08] Add cumulative CPU time (user, system and steal) to taskstats.
> > 
> > These didn't update the taskstats version number.  Should they have?
> 
> Patch 04/10 updates the taskstats version number from 7 to 8.
> I didn't want to update the version number with each patch.
> 
> > > [09] Fix exit CPU time accounting.
> > > 
> > > [10] Besides of the kernel patches also user space code is provided that
> > >      exploits the new kernel infrastructure. The user space code provides the
> > >      following:
> > >      1. A proposal for a taskstats user space library:
> > >         1.1 Based on netlink (requires libnl-devel-1.1-5)
> > >         2.1 Based on the new /proc/taskstats interface (see [05])
> > >      2. A proposal for a task snapshot library based on taskstats library (1.1)
> > 
> > ooh, excellent.  A standardised userspace access library.
> 
> Yes, at least a proposal for that.
> 
> > >      3. A new tool "ptop" (precise top) that uses the libraries
> > 
> > Talk to me about namespaces, please.  A lot of the new code involves
> > PIDs, but PIDs are not system-wide unique.  A PID is relative to a PID
> > namespace.  Does everything Just Work?  When userspace sends a PID to
> > the kernel, that PID is assumed to be within the sending process's PID
> > namespace?  If so, then please spell it all out in the changelogs.  If
> > not then that is a problem!
> 
> To be honest, I have not tested that. I assumed that the current
> taskstats code does this correctly. E.g. it uses find_task_by_vpid() for
> TASKSTATS_CMD_ATTR_PID and this function uses
> "current->nsproxy->pid_ns". So I would assume that we get only tasks
> from the caller's namespace. The new TASKSTATS_CMD_ATTR_PIDS command
> also uses also only functions with "current->nsproxy->pid_ns".
> 
> > If I can only observe processes in my PID namespace then is that a
> > problem?  Should I be allowed to observe another PID namespace's
> > processes?  I assume so, because I might be root.  If so, how is that
> > to be done?
> 
> Good question. Probably I have to learn a bit more about the PID
> namespace implementation. Are PIDs over all namespaces unique?
> 
>

I think the namespaces are OK; we might peep into namespaces nested
within the current one, but that is legal today.

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-09-26 18:11       ` Oleg Nesterov
@ 2010-09-27 13:23         ` Michael Holzheu
  2010-09-27 13:42         ` Martin Schwidefsky
  1 sibling, 0 replies; 58+ messages in thread
From: Michael Holzheu @ 2010-09-27 13:23 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Peter Zijlstra, Suresh Siddha, John stultz, Thomas Gleixner,
	Balbir Singh, Ingo Molnar, Heiko Carstens, Martin Schwidefsky,
	linux-s390, linux-kernel

Hello Oleg,

On Sun, 2010-09-26 at 20:11 +0200, Oleg Nesterov wrote:
> > I think one place was "khelper" (kmod.c). It is created with
> > kernel_thread() and it exits without having accounted the times with
> > sys_wait() to the parent's ctimes
> 
> No. Well yes, it is not accounted, but this is not because it is
> kthread.
> 
> To simplify the discussion, lets talk about utime/cutime only,
> and lets forget about the multithreading.
> 
> It is very simple, currently linux accounts the exiting task's
> utime and adds its to ->cutime _only_ if parent does do_wait().
> If parent ignores SIGCHLD, the child reaps itself and it is not
> accounted.
> 
> I do not know why it was done this way, but I'm afraid we can't
> change this historical behaviour.

Ok thanks, I didn't know this. So time can disappear if the parent
ignores SIGCHLD.

I am a bit disappointed, because I thought that by looking at all tasks
of a system it should be possible to account for all CPU time consumed
from now back to the time when the system was booted. That would have
been a nice thing.

> > Ok, the problem is that I did not consider exiting threads that are no
> > thread group leaders. When they exit the ctime of the parent is not
> > updated. Instead the time is accumulated in the signal struct.
> 
> I think I am a bit confused, but see above. With or without threads
> the whole process can exit without accounting.

Sorry that I couldn't explain my thoughts clearly enough, believe me, I
tried my best :-)

Michael




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-09-26 18:11       ` Oleg Nesterov
  2010-09-27 13:23         ` Michael Holzheu
@ 2010-09-27 13:42         ` Martin Schwidefsky
  2010-09-27 16:51           ` Oleg Nesterov
  2010-09-28  8:36           ` Balbir Singh
  1 sibling, 2 replies; 58+ messages in thread
From: Martin Schwidefsky @ 2010-09-27 13:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Peter Zijlstra, Suresh Siddha, John stultz,
	Thomas Gleixner, Balbir Singh, Ingo Molnar, Heiko Carstens,
	linux-s390, linux-kernel

On Sun, 26 Sep 2010 20:11:27 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

> Hi,
> 
> On 09/24, Michael Holzheu wrote:
> >
> > On Thu, 2010-09-23 at 19:10 +0200, Oleg Nesterov wrote:
> > >
> > > On 09/23, Michael Holzheu wrote:
> > > >
> > > > Currently there are code pathes (e.g. for kthreads) where the consumed
> > > > CPU time is not accounted to the parents cumulative counters.
> > >
> > > Could you explain more?
> >
> > I think one place was "khelper" (kmod.c). It is created with
> > kernel_thread() and it exits without having accounted the times with
> > sys_wait() to the parent's ctimes
> 
> No. Well yes, it is not accounted, but this is not because it is
> kthread.

We noticed that behavior with kernel threads but as you point out
the problem is bigger than that.
 
> To simplify the discussion, lets talk about utime/cutime only,
> and lets forget about the multithreading.
> 
> It is very simple, currently linux accounts the exiting task's
> utime and adds its to ->cutime _only_ if parent does do_wait().
> If parent ignores SIGCHLD, the child reaps itself and it is not
> accounted.
> 
> I do not know why it was done this way, but I'm afraid we can't
> change this historical behaviour.

Why? I would consider it to be a BUG() that the time is not accounted.
Independent of whether a parent wants to see the SIGCHLD and
the exit status of its child, the process time of the child should be
accounted, no? And I'm not a particular fan of the "this has always
been that way" reasoning.

> > Ok, the problem is that I did not consider exiting threads that are no
> > thread group leaders. When they exit the ctime of the parent is not
> > updated. Instead the time is accumulated in the signal struct.
> 
> I think I am a bit confused, but see above. With or without threads
> the whole process can exit without accounting.

Got the part about self-reaping processes. But there is another issue:
consider an exiting thread where the group leader is still active.
The time for the thread will be added to the utime/stime fields in
the signal structure. Taskstats will happily ignore that time while
the group leader is still running.
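
(Roughly, this is where that time ends up when a non-leader thread dies;
a simplified sketch of __exit_signal(), not the literal code:)

  /* the dying thread folds its times into the shared signal_struct */
  sig->utime = cputime_add(sig->utime, tsk->utime);
  sig->stime = cputime_add(sig->stime, tsk->stime);
  /* nothing in taskstats reads these sums while the leader is alive */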

Please keep in mind that we want to get to a point where it is
possible to get 100% coverage of the CPU cycles in the last snapshot
interval through the taskstats interface. Otherwise the precise top
would not be very precise ...

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-09-27 13:42         ` Martin Schwidefsky
@ 2010-09-27 16:51           ` Oleg Nesterov
  2010-09-28  7:09             ` Martin Schwidefsky
  2010-09-29 19:19             ` Roland McGrath
  2010-09-28  8:36           ` Balbir Singh
  1 sibling, 2 replies; 58+ messages in thread
From: Oleg Nesterov @ 2010-09-27 16:51 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Peter Zijlstra, Suresh Siddha, John stultz,
	Thomas Gleixner, Balbir Singh, Ingo Molnar, Heiko Carstens,
	linux-s390, linux-kernel, Roland McGrath

On 09/27, Martin Schwidefsky wrote:
>
> On Sun, 26 Sep 2010 20:11:27 +0200
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > No. Well yes, it is not accounted, but this is not because it is
> > kthread.
>
> We noticed that behavior with kernel threads but as you point out
> the problem is bigger than that.
>
> > It is very simple, currently linux accounts the exiting task's
> > utime and adds its to ->cutime _only_ if parent does do_wait().
> > If parent ignores SIGCHLD, the child reaps itself and it is not
> > accounted.
> >
> > I do not know why it was done this way, but I'm afraid we can't
> > change this historical behaviour.
>
> Why?

Please don't ask me ;) I was equally surprised when I studied this
code in the past.

> I would consider it to be a BUG() that the time is not accounted.
> Independent of the fact that a parent wants to see the SIGCHLD and
> the exit status of its child the process time of the child should be
> accounted, no?

I do not know. It doesn't look like a BUG(), I mean it looks as if
the code was intentionally written this way.

> And I'm not a particular fan of the "this has always
> been that way" reasoning.

Me too, but unfortunately it often happens that we can't improve
things just because we should not break existing programs.

Once again, don't get me wrong. Personally I agree, to me it makes
sense to move the "update parent's cxxx" code from wait_task_zombie()
to __exit_signal(), and account children unconditionally.

But I do not know who can approve this very user-visible change.
Perhaps Roland.

> Got the part about self-reaping processes. But there is another issue:
> consider an exiting thread where the group leader is still active.
> The time for the thread will be added to the utime/stime fields in
> the signal structure.

(to clarify, s/group leader/last thread/)

> Taskstats will happily ignore that time while
> the group leader is still running.

Sorry. I didn't read the whole series and I forgot everything I knew
about taskstats. But I don't understand this "ignore" above: every
exiting thread calls taskstats_exit()->fill_pid()->bacct_add_tsk()
and reports ->ac_utime?

Never mind. You seem to want to update, say, cutime when a sub-thread
exits, before the whole process exits, right? Again, trivial to
implement but this is another user-visible change.

Oleg.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
  2010-09-27  9:18       ` Michael Holzheu
@ 2010-09-27 20:02         ` Andrew Morton
  2010-09-28  8:17           ` Balbir Singh
  0 siblings, 1 reply; 58+ messages in thread
From: Andrew Morton @ 2010-09-27 20:02 UTC (permalink / raw)
  To: holzheu
  Cc: Shailabh Nagar, Venkatesh Pallipadi, Suresh Siddha,
	Peter Zijlstra, Ingo Molnar, Oleg Nesterov, John stultz,
	Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, linux-kernel, linux-s390, containers

On Mon, 27 Sep 2010 11:18:47 +0200
Michael Holzheu <holzheu@linux.vnet.ibm.com> wrote:

> Hello Andrew,
> 
> On Fri, 2010-09-24 at 11:50 -0700, Andrew Morton wrote:
> > > > This is a big change!  If this is done right then we're heading in the
> > > > direction of deprecating the longstanding way in which userspace
> > > > observes the state of Linux processes and we're recommending that the
> > > > whole world migrate to taskstats.  I think?
> > > 
> > > Or it can be used as alternative. Since procfs has its drawbacks (e.g.
> > > performance) an alternative could be helpful. 
> > 
> > And it can be harmful.  More kernel code to maintain and test, more
> > userspace code to develop, maintain, etc.  Less user testing than if
> > there was a single interface.
> 
> Sure, the value has to be big enough to justify the effort.
> 
> But as I said, with taskstats and procfs we already have two interfaces
> for getting task information.

That doesn't mean it was the right thing to do!  For the reasons I
outline above, it can be the wrong thing to do and strengthening one of
the alternatives worsens the problem.

> Currently in procfs there is information
> than you can't find in taskstats. But also the other way round in the
> taskstats structure there is very useful information that you can't get
> under proc. E.g. the task delay times, IO accounting, etc.

Sounds like a big screwup ;)

Look at it this way: if you were going to sit down and start to design
a new operating system from scratch, would you design the task status
reporting system as it currently stands in Linux?  Don't think so!

> So currently
> tools have to use both interfaces to get all information, which is not
> optimal.
> 
> > > 
> > > > I worry that there's a dependency on CONFIG_NET?  If so then that's a
> > > > big problem because in N years time, 99% of the world will be using
> > > > taskstats, but a few embedded losers will be stuck using (and having to
> > > > support) the old tools.
> > > 
> > > Sure, but if we could add the /proc/taskstats approach, this dependency
> > > would not be there.
> > 
> > So why do we need to present the same info over netlink?
> 
> Good point. It is not really necessary. I started development using the
> netlink code. Therefore I first added the new command in the netlink
> code. I also thought, it would be a good idea to provide all netlink
> commands over the procfs interface to be consistent.

Maybe we should have delivered taskstats over procfs from day one.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-09-27 16:51           ` Oleg Nesterov
@ 2010-09-28  7:09             ` Martin Schwidefsky
  2010-09-29 19:19             ` Roland McGrath
  1 sibling, 0 replies; 58+ messages in thread
From: Martin Schwidefsky @ 2010-09-28  7:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Peter Zijlstra, Suresh Siddha, John stultz,
	Thomas Gleixner, Balbir Singh, Ingo Molnar, Heiko Carstens,
	linux-s390, linux-kernel, Roland McGrath

On Mon, 27 Sep 2010 18:51:33 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

> On 09/27, Martin Schwidefsky wrote:
> >
> > On Sun, 26 Sep 2010 20:11:27 +0200
> > Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > > No. Well yes, it is not accounted, but this is not because it is
> > > kthread.
> >
> > We noticed that behavior with kernel threads but as you point out
> > the problem is bigger than that.
> >
> > > It is very simple, currently linux accounts the exiting task's
> > > utime and adds its to ->cutime _only_ if parent does do_wait().
> > > If parent ignores SIGCHLD, the child reaps itself and it is not
> > > accounted.
> > >
> > > I do not know why it was done this way, but I'm afraid we can't
> > > change this historical behaviour.
> >
> > Why?
> 
> Please don't ask me ;) I was equally surprised when I studied this
> code in the past.

But I do ask >:-)
 
> > I would consider it to be a BUG() that the time is not accounted.
> > Independent of the fact that a parent wants to see the SIGCHLD and
> > the exit status of its child the process time of the child should be
> > accounted, no?
> 
> I do not know. It doesn't look like a BUG(), I mean it looks as if
> the code was intentionally written this way.

Well, one thing to consider is the fact that the exiting process can
not do the process accounting by itself. While a process is still running
it uses CPU time which has to be accounted to the cstime of the parent.
So logically some other process has to do the cutime/cstime update.

> > And I'm not a particular fan of the "this has always
> > been that way" reasoning.
> 
> Me too, but unfortunately it often happens, sometimes we can't improve
> things just because we should not break existing programs.
> 
> Once again, don't get me wrong. Personally I agree, to me it makes
> sense to move the "update parent's cxxx" code from wait_task_zombie()
> to __exit_signal(), and account children unconditionally.
> 
> But I do not know who can approve this very much user-visible change.
> Perhaps Roland.

I wonder which user space tool would break.

> > Got the part about self-reaping processes. But there is another issue:
> > consider an exiting thread where the group leader is still active.
> > The time for the thread will be added to the utime/stime fields in
> > the signal structure.
> 
> (to clarify, s/group leader/last thread/)
> 
> > Taskstats will happily ignore that time while
> > the group leader is still running.
> 
> Sorry. I didn't read the whole series and I forgot everything I knew
> about taskstats. But I don't understand this "ignore" above, every
> exiting thread calls taskstats_exit()->fill_pid()->bacct_add_tsk()
> and reports ->ac_utime?
> 
> Never mind. You seem to want to update, say, cutime when a sub-thread
> exits, before the whole process exits, right? Again, trivial to
> implement but this is another user-visible change.

No, not necessarily. But to get to 100% coverage of the cpu time we
need to be able to "find" the time. Currently the time of already
exited threads of a thread group is invisible via the taskstats
interface. New taskstats fields would do for this particular problem.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
  2010-09-27 20:02         ` Andrew Morton
@ 2010-09-28  8:17           ` Balbir Singh
  0 siblings, 0 replies; 58+ messages in thread
From: Balbir Singh @ 2010-09-28  8:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: holzheu, Shailabh Nagar, Venkatesh Pallipadi, Suresh Siddha,
	Peter Zijlstra, Ingo Molnar, Oleg Nesterov, John stultz,
	Thomas Gleixner, Martin Schwidefsky, Heiko Carstens,
	linux-kernel, linux-s390, containers

* Andrew Morton <akpm@linux-foundation.org> [2010-09-27 13:02:56]:

> > Good point. It is not really necessary. I started development using the
> > netlink code. Therefore I first added the new command in the netlink
> > code. I also thought, it would be a good idea to provide all netlink
> > commands over the procfs interface to be consistent.
> 
> Maybe we should have delivered taskstats over procfs from day one.
>

The intention was to provide taskstats over a scalable backend to deal
with a large amount of data, including exit notifications. We provided
some information, like blkio delay data, in proc, but not the whole
structure.

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-09-23 14:02 ` [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting Michael Holzheu
  2010-09-23 17:10   ` Oleg Nesterov
@ 2010-09-28  8:21   ` Balbir Singh
  2010-09-28 16:50     ` Michael Holzheu
  1 sibling, 1 reply; 58+ messages in thread
From: Balbir Singh @ 2010-09-28  8:21 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Peter Zijlstra, Suresh Siddha, John stultz, Thomas Gleixner,
	Oleg Nesterov, Ingo Molnar, Heiko Carstens, Martin Schwidefsky,
	linux-s390, linux-kernel

* Michael Holzheu <holzheu@linux.vnet.ibm.com> [2010-09-23 16:02:21]:

> Subject: [PATCH] taskstats: Fix exit CPU time accounting
> 
> From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> 
> Currently there are code pathes (e.g. for kthreads) where the consumed
> CPU time is not accounted to the parents cumulative counters.
> Now CPU time is accounted to the parent, if the exit accounting has not
> been done correctly.
>

Does this impact accounting for the init process? Why do we care about
accounting the time to the parent? In the case of tgid, accumulating all
threads' data makes sense. What is the benefit or gap we are trying to
address in terms of lost data or accountability?
 
> Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> ---
>  include/linux/sched.h |    1 +
>  kernel/exit.c         |   35 +++++++++++++++++++++++++++++++++++
>  2 files changed, 36 insertions(+)
> 
> Index: git-linux-2.6/include/linux/sched.h
> ===================================================================
> --- git-linux-2.6.orig/include/linux/sched.h	2010-09-23 14:16:37.000000000 +0200
> +++ git-linux-2.6/include/linux/sched.h	2010-09-23 14:17:20.000000000 +0200
> @@ -1282,6 +1282,7 @@
>  	cputime_t prev_utime, prev_stime, prev_sttime;
>  #endif
>  	unsigned long long acct_time;		/* Time for last accounting */
> +	int exit_accounting_done;
>  	unsigned long nvcsw, nivcsw; /* context switch counts */
>  	struct timespec start_time; 		/* monotonic time */
>  	struct timespec real_start_time;	/* boot based time */
> Index: git-linux-2.6/kernel/exit.c
> ===================================================================
> --- git-linux-2.6.orig/kernel/exit.c	2010-09-23 14:16:37.000000000 +0200
> +++ git-linux-2.6/kernel/exit.c	2010-09-23 14:17:20.000000000 +0200
> @@ -157,11 +157,45 @@
>  	put_task_struct(tsk);
>  }
> 
> +static void account_to_parent(struct task_struct *p)
> +{
> +	struct signal_struct *psig, *sig;
> +	struct task_struct *tsk_parent;
> +
> +	read_lock(&tasklist_lock);
> +	tsk_parent = p->real_parent;
> +	if (!tsk_parent) {
> +		read_unlock(&tasklist_lock);
> +		return;
> +	}
> +	get_task_struct(tsk_parent);
> +	read_unlock(&tasklist_lock);
> +
> +	// printk("XXX Fix accounting: pid=%d ppid=%d\n", p->pid, tsk_parent->pid);
> +	spin_lock_irq(&tsk_parent->sighand->siglock);
> +	psig = tsk_parent->signal;
> +	sig = p->signal;
> +	psig->cutime = cputime_add(psig->cutime,
> +				   cputime_add(sig->cutime, p->utime));
> +	psig->cstime = cputime_add(psig->cstime,
> +				   cputime_add(sig->cstime, p->stime));
> +	psig->csttime = cputime_add(psig->csttime,
> +				    cputime_add(sig->csttime, p->sttime));
> +	psig->cgtime = cputime_add(psig->cgtime,
> +		       cputime_add(p->gtime,
> +		       cputime_add(sig->gtime, sig->cgtime)));
> +	p->exit_accounting_done = 1;
> +	spin_unlock_irq(&tsk_parent->sighand->siglock);
> +	put_task_struct(tsk_parent);
> +}
> 
>  void release_task(struct task_struct * p)
>  {
>  	struct task_struct *leader;
>  	int zap_leader;
> +
> +	if (!p->exit_accounting_done)
> +		account_to_parent(p);
>  repeat:
>  	tracehook_prepare_release_task(p);
>  	/* don't need to get the RCU readlock here - the process is dead and
> @@ -1279,6 +1313,7 @@
>  			psig->cmaxrss = maxrss;
>  		task_io_accounting_add(&psig->ioac, &p->ioac);
>  		task_io_accounting_add(&psig->ioac, &sig->ioac);
> +		p->exit_accounting_done = 1;
>  		spin_unlock_irq(&p->real_parent->sighand->siglock);
>  	}
> 
> 
> 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-09-27 13:42         ` Martin Schwidefsky
  2010-09-27 16:51           ` Oleg Nesterov
@ 2010-09-28  8:36           ` Balbir Singh
  2010-09-28  9:08             ` Martin Schwidefsky
  1 sibling, 1 reply; 58+ messages in thread
From: Balbir Singh @ 2010-09-28  8:36 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Oleg Nesterov, Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Peter Zijlstra, Suresh Siddha, John stultz,
	Thomas Gleixner, Ingo Molnar, Heiko Carstens, linux-s390,
	linux-kernel

* Martin Schwidefsky <schwidefsky@de.ibm.com> [2010-09-27 15:42:57]:

> On Sun, 26 Sep 2010 20:11:27 +0200
> Oleg Nesterov <oleg@redhat.com> wrote:
> 
> > Hi,
> > 
> > On 09/24, Michael Holzheu wrote:
> > >
> > > On Thu, 2010-09-23 at 19:10 +0200, Oleg Nesterov wrote:
> > > >
> > > > On 09/23, Michael Holzheu wrote:
> > > > >
> > > > > Currently there are code pathes (e.g. for kthreads) where the consumed
> > > > > CPU time is not accounted to the parents cumulative counters.
> > > >
> > > > Could you explain more?
> > >
> > > I think one place was "khelper" (kmod.c). It is created with
> > > kernel_thread() and it exits without having accounted the times with
> > > sys_wait() to the parent's ctimes
> > 
> > No. Well yes, it is not accounted, but this is not because it is
> > kthread.
> 
> We noticed that behavior with kernel threads but as you point out
> the problem is bigger than that.
> 
> > To simplify the discussion, lets talk about utime/cutime only,
> > and lets forget about the multithreading.
> > 
> > It is very simple, currently linux accounts the exiting task's
> > utime and adds its to ->cutime _only_ if parent does do_wait().
> > If parent ignores SIGCHLD, the child reaps itself and it is not
> > accounted.
> > 
> > I do not know why it was done this way, but I'm afraid we can't
> > change this historical behaviour.
> 
> Why? I would consider it to be a BUG() that the time is not accounted.
> Independent of the fact that a parent wants to see the SIGCHLD and
> the exit status of its child the process time of the child should be
> accounted, no? And I'm not a particular fan of the "this has always
> been that way" reasoning.
> 
> > > Ok, the problem is that I did not consider exiting threads that are no
> > > thread group leaders. When they exit the ctime of the parent is not
> > > updated. Instead the time is accumulated in the signal struct.
> > 
> > I think I am a bit confused, but see above. With or without threads
> > the whole process can exit without accounting.
> 
> Got the part about self-reaping processes. But there is another issue:
> consider an exiting thread where the group leader is still active.
> The time for the thread will be added to the utime/stime fields in
> the signal structure. Taskstats will happily ignore that time while
> the group leader is still running.
>

Why do you say that? I am not sure your comment is very clear. In
fill_tgid, we do:

1. Accumulate signal stats (contains stats for dead threads)
2. Accumulate stats for current threads

fill_tgid_exit does something similar
 
> Please keep in mind that we want to get to a point where it is
> possible to get a 100% coverage of cpu cycles in the last snapshot
> cycle through the taskstats interface. Otherwise the precise top
> would not be very precise ..

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-09-28  8:36           ` Balbir Singh
@ 2010-09-28  9:08             ` Martin Schwidefsky
  2010-09-28  9:23               ` Balbir Singh
  0 siblings, 1 reply; 58+ messages in thread
From: Martin Schwidefsky @ 2010-09-28  9:08 UTC (permalink / raw)
  To: balbir
  Cc: Oleg Nesterov, Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Peter Zijlstra, Suresh Siddha, John stultz,
	Thomas Gleixner, Ingo Molnar, Heiko Carstens, linux-s390,
	linux-kernel

On Tue, 28 Sep 2010 14:06:02 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * Martin Schwidefsky <schwidefsky@de.ibm.com> [2010-09-27 15:42:57]:
> > Got the part about self-reaping processes. But there is another issue:
> > consider an exiting thread where the group leader is still active.
> > The time for the thread will be added to the utime/stime fields in
> > the signal structure. Taskstats will happily ignore that time while
> > the group leader is still running.
> >
> 
> Why do you say that? Not sure your comment is very clean, in
> fill_tgid, we do
> 
> 1. Accumulate signal stats (contains stats for dead threads)
> 2. Accumulate stats for current threads
> 
> fill_tgid_exit does something similar

Hmm, I can't find anything in the code where tsk->signal->{utime,stime}
gets transferred to the taskstats record. There is a loop in fill_tgid over
the threads of the process, but all it does is call delayacct_add_tsk.
And that function does nothing with the CPU time of dead threads which is
stored in the signal structure. In addition, which taskstats field is
supposed to contain the CPU time of the dead thread, ac_utime/ac_stime?
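
(For reference, a simplified sketch of the fill_tgid loop being discussed,
not the literal code:)

  tsk = first;
  do {
          if (tsk->exit_state)
                  continue;
          delayacct_add_tsk(stats, tsk);
          stats->nvcsw += tsk->nvcsw;
          stats->nivcsw += tsk->nivcsw;
  } while_each_thread(first, tsk);
  /* tsk->signal->utime/stime of already exited threads is never read */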

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-09-28  9:08             ` Martin Schwidefsky
@ 2010-09-28  9:23               ` Balbir Singh
  2010-09-28 10:36                 ` Martin Schwidefsky
  0 siblings, 1 reply; 58+ messages in thread
From: Balbir Singh @ 2010-09-28  9:23 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Oleg Nesterov, Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Peter Zijlstra, Suresh Siddha, John stultz,
	Thomas Gleixner, Ingo Molnar, Heiko Carstens, linux-s390,
	linux-kernel

* Martin Schwidefsky <schwidefsky@de.ibm.com> [2010-09-28 11:08:28]:

> On Tue, 28 Sep 2010 14:06:02 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * Martin Schwidefsky <schwidefsky@de.ibm.com> [2010-09-27 15:42:57]:
> > > Got the part about self-reaping processes. But there is another issue:
> > > consider an exiting thread where the group leader is still active.
> > > The time for the thread will be added to the utime/stime fields in
> > > the signal structure. Taskstats will happily ignore that time while
> > > the group leader is still running.
> > >
> > 
> > Why do you say that? Not sure your comment is very clean, in
> > fill_tgid, we do
> > 
> > 1. Accumulate signal stats (contains stats for dead threads)
> > 2. Accumulate stats for current threads
> > 
> > fill_tgid_exit does something similar
> 
> Hmm, I can't find anything in the code where the tsk->signal->{utime,stime}

That is left to the actual subsystem (I should have been clearer in
stating that the limitation is not with the taskstats infrastructure
itself). Yes, your observation is indeed correct.

The taskstats code is expected to contain the callbacks for the subsystems
it supports. delayacct already does the right thing today, AFAICS

> gets transferred to the taskstats record. There is a loop in fill_tgid over
> the threads of the process but all it does is to call delayacct_add_tsk.
> And that function does nothing with the cpu time of dead threads which is
> stored in the signal structure. In addition which taskstats field is
> supposed to contain the cpu time of the dead thread, ac_utime/ac_stime?
>

I've not focused much on ac_*. The change that needs to happen is that we
need to get the tsacct.c callbacks into taskstats.

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-09-28  9:23               ` Balbir Singh
@ 2010-09-28 10:36                 ` Martin Schwidefsky
  2010-09-28 10:39                   ` Balbir Singh
  0 siblings, 1 reply; 58+ messages in thread
From: Martin Schwidefsky @ 2010-09-28 10:36 UTC (permalink / raw)
  To: balbir
  Cc: Oleg Nesterov, Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Peter Zijlstra, Suresh Siddha, John stultz,
	Thomas Gleixner, Ingo Molnar, Heiko Carstens, linux-s390,
	linux-kernel

On Tue, 28 Sep 2010 14:53:55 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * Martin Schwidefsky <schwidefsky@de.ibm.com> [2010-09-28 11:08:28]:
> 
> > On Tue, 28 Sep 2010 14:06:02 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > * Martin Schwidefsky <schwidefsky@de.ibm.com> [2010-09-27 15:42:57]:
> > > > Got the part about self-reaping processes. But there is another issue:
> > > > consider an exiting thread where the group leader is still active.
> > > > The time for the thread will be added to the utime/stime fields in
> > > > the signal structure. Taskstats will happily ignore that time while
> > > > the group leader is still running.
> > > >
> > > 
> > > Why do you say that? Not sure your comment is very clean, in
> > > fill_tgid, we do
> > > 
> > > 1. Accumulate signal stats (contains stats for dead threads)
> > > 2. Accumulate stats for current threads
> > > 
> > > fill_tgid_exit does something similar
> > 
> > Hmm, I can't find anything in the code where the tsk->signal->{utime,stime}
> 
> That is left to the actual subsystem (I should have been clearer in
> stating that the limitation is not with the taskstats infrastructure
> itself). Yes, your observation is indeed correct.
> 
> taskstats code is expected to contain the callback for the subsystems
> it supports. delayacct() already does the right thing today, AFAICS
> 
> > gets transferred to the taskstats record. There is a loop in fill_tgid over
> > the threads of the process but all it does is to call delayacct_add_tsk.
> > And that function does nothing with the cpu time of dead threads which is
> > stored in the signal structure. In addition which taskstats field is
> > supposed to contain the cpu time of the dead thread, ac_utime/ac_stime?
> >
> 
> I've not focused much on ac_*, The changes that need to occur are we
> need to get the tsacct.c callbacks into taskstats.

Ok, so tsacct.c is the right place to implement it. We will need new fields
in struct taskstats to contain the tsk->signal->{utime,stime} and some code
in bacct_add_tsk to do the copy. Agreed ?

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-09-28 10:36                 ` Martin Schwidefsky
@ 2010-09-28 10:39                   ` Balbir Singh
  0 siblings, 0 replies; 58+ messages in thread
From: Balbir Singh @ 2010-09-28 10:39 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Oleg Nesterov, Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Peter Zijlstra, Suresh Siddha, John stultz,
	Thomas Gleixner, Ingo Molnar, Heiko Carstens, linux-s390,
	linux-kernel

* Martin Schwidefsky <schwidefsky@de.ibm.com> [2010-09-28 12:36:25]:

> On Tue, 28 Sep 2010 14:53:55 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * Martin Schwidefsky <schwidefsky@de.ibm.com> [2010-09-28 11:08:28]:
> > 
> > > On Tue, 28 Sep 2010 14:06:02 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > 
> > > > * Martin Schwidefsky <schwidefsky@de.ibm.com> [2010-09-27 15:42:57]:
> > > > > Got the part about self-reaping processes. But there is another issue:
> > > > > consider an exiting thread where the group leader is still active.
> > > > > The time for the thread will be added to the utime/stime fields in
> > > > > the signal structure. Taskstats will happily ignore that time while
> > > > > the group leader is still running.
> > > > >
> > > > 
> > > > Why do you say that? Not sure your comment is very clean, in
> > > > fill_tgid, we do
> > > > 
> > > > 1. Accumulate signal stats (contains stats for dead threads)
> > > > 2. Accumulate stats for current threads
> > > > 
> > > > fill_tgid_exit does something similar
> > > 
> > > Hmm, I can't find anything in the code where the tsk->signal->{utime,stime}
> > 
> > That is left to the actual subsystem (I should have been clearer in
> > stating that the limitation is not with the taskstats infrastructure
> > itself). Yes, your observation is indeed correct.
> > 
> > taskstats code is expected to contain the callback for the subsystems
> > it supports. delayacct() already does the right thing today, AFAICS
> > 
> > > gets transferred to the taskstats record. There is a loop in fill_tgid over
> > > the threads of the process but all it does is to call delayacct_add_tsk.
> > > And that function does nothing with the cpu time of dead threads which is
> > > stored in the signal structure. In addition which taskstats field is
> > > supposed to contain the cpu time of the dead thread, ac_utime/ac_stime?
> > >
> > 
> > I've not focused much on ac_*, The changes that need to occur are we
> > need to get the tsacct.c callbacks into taskstats.
> 
> Ok, so tsacct.c is the right place to implement it. We will need new fields
> in struct taskstats to contain the tsk->signal->{utime,stime} and some code
> in bacct_add_tsk to do the copy. Agreed ?
>

Yes, the caveat though is that tsacct.c/xacct was not designed to
provide notifications for tgid exit or to provide tgid data. I don't
recollect why it was designed that way or whether there are some
limitations that affect it.

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-09-28  8:21   ` Balbir Singh
@ 2010-09-28 16:50     ` Michael Holzheu
  0 siblings, 0 replies; 58+ messages in thread
From: Michael Holzheu @ 2010-09-28 16:50 UTC (permalink / raw)
  To: balbir
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Peter Zijlstra, Suresh Siddha, John stultz, Thomas Gleixner,
	Oleg Nesterov, Ingo Molnar, Heiko Carstens, Martin Schwidefsky,
	linux-s390, linux-kernel

Hello Balbir,

On Tue, 2010-09-28 at 13:51 +0530, Balbir Singh wrote:
> * Michael Holzheu <holzheu@linux.vnet.ibm.com> [2010-09-23 16:02:21]:
> 
> > Subject: [PATCH] taskstats: Fix exit CPU time accounting
> > 
> > From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> > 
> > Currently there are code pathes (e.g. for kthreads) where the consumed
> > CPU time is not accounted to the parents cumulative counters.
> > Now CPU time is accounted to the parent, if the exit accounting has not
> > been done correctly.
> >
> 
> Does this impact account of the init process? Why do we care about
> accounting the time to the parent? In the case of tgid, all threads
> data makes sense. What is the benefit or gap we are trying to address
> in terms of lost data or accountability?

We care about the cumulative times because we wanted to write
a top command that can get 100% of all consumed CPU time in an
interval without using exit events.

I tried to write the idea down. Hopefully it is clear enough...

HOWTO calculate 100% consumed CPU time between two taskstats snapshots
======================================================================

The following describes the idea of getting 100% of the consumed CPU time
between two taskstats snapshots without using exit events. For simplicity we
use "CPU time" as a synonym for user time, system time and steal time.

In order to show the consumed CPU time in an interval a top tool has to:

* Collect snapshot 1 of all running tasks
* Wait interval
* Collect snapshot 2 of all running tasks

A snapshot contains the following data for each task:

 * time-task:    CPU time that has been consumed by task itself:
                 task->(u/s/st-time)
 * time-child:   CPU time that has been consumed by dead children of task:
                 task->signal->(cu/cs/cst-time)
 * time-thread:  CPU time that has been consumed by dead threads of
                 thread group of thread group leader:
                 task->signal->(u/s/st-time)

All consumed CPU time in the interval can be calculated as follows:
 
  For all tasks that are in snapshot 1 AND in snapshot 2:

    (time-task[2] - time-task[1]) +
    (time-child[2] - time-child[1]) +
    (time-thread[2] - time-thread[1] {for thread group leader})

  minus

  For all tasks that are in snapshot 1 but NOT in snapshot 2 (tasks that have
  exited):

    time-task[1] +
    time-child[1] +
    time-thread[1] (if thread group has exited)

    We have to subtract those CPU times in order to get only the CPU time
    that the exited tasks consumed in the last interval.

To provide a consistent view, the top tool could show the following fields:
 * user:  task utime per interval
 * sys:   task stime per interval
 * ste:   task sttime per interval
 * cuser: utime of exited children per interval
 * csys:  stime of exited children per interval
 * cste:  sttime of exited children per interval
 * tuser: utime of exited threads per interval (only for thread group leader)
 * tsys:  stime of exited threads per interval (only for thread group leader)
 * tste:  sttime of exited threads per interval (only for thread group leader)
 * total: Sum of all above fields

If the top command notices that a PID disappeared between snapshot 1
and snapshot 2, it has to do the following (see also the sketch below):

If the task is not the thread group leader (pid != tgid):
  Find its thread group leader and subtract the CPU times from snapshot 1
  of the dead task from the thread group leader's time-thread interval
  difference.
else
  Find its parent and subtract the CPU times from snapshot 1 of the dead child
  from the parent's time-child interval difference.
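
For illustration, here is a minimal user-space sketch in C of this
calculation. The snapshot structure, its field names and the find_by_pid()
helper are hypothetical stand-ins for data a real tool would fill in from
taskstats replies; tasks that appear only in snapshot 2 (forked during the
interval) are treated as having zero counters in snapshot 1.

#include <stddef.h>
#include <sys/types.h>

struct task_snap {
	pid_t pid, tgid;
	unsigned long long time_task;	/* u/s/st time of the task itself    */
	unsigned long long time_child;	/* time of dead, waited-for children */
	unsigned long long time_thread;	/* dead-thread time (group leader)   */
};

static struct task_snap *find_by_pid(struct task_snap *s, int n, pid_t pid)
{
	int i;

	for (i = 0; i < n; i++)
		if (s[i].pid == pid)
			return &s[i];
	return NULL;
}

/* Total CPU time consumed between snapshot s1 and snapshot s2 */
static unsigned long long interval_time(struct task_snap *s1, int n1,
					struct task_snap *s2, int n2)
{
	unsigned long long total = 0;
	int i;

	for (i = 0; i < n1; i++) {
		struct task_snap *t1 = &s1[i];
		struct task_snap *t2 = find_by_pid(s2, n2, t1->pid);

		if (t2) {
			/* task is in both snapshots: add the deltas */
			total += t2->time_task - t1->time_task;
			total += t2->time_child - t1->time_child;
			if (t1->pid == t1->tgid)
				total += t2->time_thread - t1->time_thread;
		} else {
			/*
			 * Task exited: its lifetime totals now show up in the
			 * cumulative counters of its parent or thread group
			 * leader, so subtract what it had already consumed
			 * before snapshot 1.
			 */
			total -= t1->time_task + t1->time_child;
			if (t1->pid == t1->tgid)
				total -= t1->time_thread;
		}
	}

	/* Tasks forked after snapshot 1 contribute their full counters */
	for (i = 0; i < n2; i++) {
		struct task_snap *t2 = &s2[i];

		if (find_by_pid(s1, n1, t2->pid))
			continue;
		total += t2->time_task + t2->time_child;
		if (t2->pid == t2->tgid)
			total += t2->time_thread;
	}
	return total;
}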

Example output:
---------------
pid     user   sys  ste  cuser  csys cste tuser tsys tste total  Name
(#)      (%)   (%)  (%)    (%)   (%)  (%)   (%)  (%)  (%)   (%)  (str)
17944   0.10  0.01 0.00  54.29 14.36 0.22  0.00 0.00 0.00 68.98  make
18006   0.10  0.01 0.00  55.79 12.23 0.12  0.00 0.00 0.00 68.26  make
18041  48.18  1.51 0.29   0.00  0.00 0.00  0.00 0.00 0.00 49.98  cc1
...

The sum of all "total" CPU counters on a system that is 100% busy should
be exactly the number of CPUs multiplied by the interval time. A good testcase
for this is to start a loop program for each CPU and, in parallel, start a
kernel build with "-j 5".

OPEN ISSUE:

A current problem with the Linux kernel is that CPU time can disappear
if a child dies whose parent ignores SIGCHLD.

Michael


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-09-27 16:51           ` Oleg Nesterov
  2010-09-28  7:09             ` Martin Schwidefsky
@ 2010-09-29 19:19             ` Roland McGrath
  2010-09-30 13:47               ` Michael Holzheu
  1 sibling, 1 reply; 58+ messages in thread
From: Roland McGrath @ 2010-09-29 19:19 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Martin Schwidefsky, Michael Holzheu, Shailabh Nagar,
	Andrew Morton, Venkatesh Pallipadi, Peter Zijlstra,
	Suresh Siddha, John stultz, Thomas Gleixner, Balbir Singh,
	Ingo Molnar, Heiko Carstens, linux-s390, linux-kernel

> > I would consider it to be a BUG() that the time is not accounted.
> > Independent of the fact that a parent wants to see the SIGCHLD and
> > the exit status of its child the process time of the child should be
> > accounted, no?
> 
> I do not know. It doesn't look like a BUG(), I mean it looks as if
> the code was intentionally written this way.

POSIX specifies this behavior: "If the child is never waited for (for
example, if the parent has SA_NOCLDWAIT set or sets SIGCHLD to SIG_IGN),
the resource information for the child process is discarded and not
included in the resource information provided by getrusage()."


Thanks,
Roland

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
  2010-09-23 13:48 [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting Michael Holzheu
                   ` (11 preceding siblings ...)
  2010-09-24  9:16 ` Balbir Singh
@ 2010-09-30  8:38 ` Andi Kleen
  2010-09-30 13:56   ` Michael Holzheu
  12 siblings, 1 reply; 58+ messages in thread
From: Andi Kleen @ 2010-09-30  8:38 UTC (permalink / raw)
  To: holzheu; +Cc: linux-kernel

Michael Holzheu <holzheu@linux.vnet.ibm.com> writes:
>
> Compared to the old top command that has to scan more than 1000 proc
> directories the new ptop consumes much less CPU time (0.05% system time
> on my s390 system).

Sounds like a nice advantage (and in a way goes back to old kmem ps) 

Are there plans to make the standard top command use this new interface
or will it be only something for specialized utilities?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-09-29 19:19             ` Roland McGrath
@ 2010-09-30 13:47               ` Michael Holzheu
  2010-10-05  8:57                 ` Roland McGrath
  0 siblings, 1 reply; 58+ messages in thread
From: Michael Holzheu @ 2010-09-30 13:47 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Oleg Nesterov, Martin Schwidefsky, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Peter Zijlstra, Suresh Siddha, John stultz,
	Thomas Gleixner, Balbir Singh, Ingo Molnar, Heiko Carstens,
	linux-s390, linux-kernel

Hello Roland,

On Wed, 2010-09-29 at 12:19 -0700, Roland McGrath wrote: 
> > > I would consider it to be a BUG() that the time is not accounted.
> > > Independent of the fact that a parent wants to see the SIGCHLD and
> > > the exit status of its child the process time of the child should be
> > > accounted, no?
> > 
> > I do not know. It doesn't look like a BUG(), I mean it looks as if
> > the code was intentionally written this way.
> 
> POSIX specifies this behavior: "If the child is never waited for (for
> example, if the parent has SA_NOCLDWAIT set or sets SIGCHLD to SIG_IGN),
> the resource information for the child process is discarded and not
> included in the resource information provided by getrusage()."

Thanks! That information was missing! Still, it does not seem to me to be
a good decision to do it that way. Because of that, it is currently not
possible to account for all consumed CPU time by looking at the current
processes. Time can simply disappear.

What about adding a new set of CPU time fields (e.g. cr-times) for the
cumulative "autoreap" children times to the signal struct and exporting
them via taskstats?

Then the following set of CPU times would give a complete picture (I also
added steal time (st), which is currently not accounted per task in Linux);
a rough sketch of possible taskstats fields follows the list:

* task->(u/s/st-time):
  Time that has been consumed by task itself

* task->signal->(c-u/s/st-time):
  Time that has been consumed by dead children of process where parent
  has done a sys_wait()

* task->signal->(u/s/st-time):
  Time that has been consumed by dead threads of thread group of process
  - NEW: Has to be exported via taskstats

* task->signal->(cr-u/s/st-time):
  Time that has been consumed by dead children that reaped 
  themselves, because parent ignored SIGCHLD or has set SA_NOCLDWAIT
  - NEW: Fields have to be added to signal struct
  - NEW: Has to be exported via taskstats
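
A rough sketch of how the exported counters could look; the struct and the
field names below are purely illustrative assumptions, not part of any
existing interface:

/* Illustrative only: counters that would be appended to struct taskstats */
typedef unsigned long long __u64;	/* stand-in for <linux/types.h> */

struct taskstats_time_ext {
	__u64 ac_tutime, ac_tstime, ac_tsttime;		/* dead threads of the thread group  */
	__u64 ac_crutime, ac_crstime, ac_crsttime;	/* self-reaped ("autoreap") children */
};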

Michael


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
  2010-09-30  8:38 ` Andi Kleen
@ 2010-09-30 13:56   ` Michael Holzheu
  0 siblings, 0 replies; 58+ messages in thread
From: Michael Holzheu @ 2010-09-30 13:56 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

Hello Andi,

On Thu, 2010-09-30 at 10:38 +0200, Andi Kleen wrote:
> Michael Holzheu <holzheu@linux.vnet.ibm.com> writes:
> >
> > Compared to the old top command that has to scan more than 1000 proc
> > directories the new ptop consumes much less CPU time (0.05% system time
> > on my s390 system).
> 
> Sounds like a nice advantage (and in a way goes back to old kmem ps) 
> 
> Are there plans to make the standard top command use this new interface
> or will it be only something for specialized utilities?

Currently there are no plans. We first have to discuss whether this should
be done at all. But if the new interface is there, the standard top command
could of course also use it, especially if we have an easy-to-use user
space library.

One problem when converting the standard top command to the new interface
is that currently not all information that is in procfs is also available
in taskstats (and vice versa).

Michael


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-09-30 13:47               ` Michael Holzheu
@ 2010-10-05  8:57                 ` Roland McGrath
  2010-10-06  9:29                   ` Michael Holzheu
  0 siblings, 1 reply; 58+ messages in thread
From: Roland McGrath @ 2010-10-05  8:57 UTC (permalink / raw)
  To: holzheu
  Cc: Oleg Nesterov, Martin Schwidefsky, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Peter Zijlstra, Suresh Siddha, John stultz,
	Thomas Gleixner, Balbir Singh, Ingo Molnar, Heiko Carstens,
	linux-s390, linux-kernel

> Thanks! That information was missing! Although still for me it not seems
> to be a good decision to do it that way. Because of that it currently is
> not possible to evaluate all consumed CPU time by looking at the current
> processes. Time can simply disappear.

I agree that it seems dubious.  I don't know why that decision was made in
POSIX, but that's how it is.  Anyway, POSIX only constrains what we report
in the POSIX calls, i.e. getrusage, times, waitid, SIGCHLD siginfo_t.
Nothing says we can't track more information and make it accessible in
other ways on Linux.

> What about adding a new set of CPU time fields (e.g. cr-times) for the
> cumulative "autoreap" children times to the signal struct and export
> them via taskstats?

I don't have a particular opinion about the details of how you export the
information.  Something generally along those lines certainly sounds
reasonable to me.

> Then the following set of CPU times will give a complete picture (I also
> added steal time (st) that is currently not accounted in Linux per
> task):
> 
> * task->(u/s/st-time):
>   Time that has been consumed by task itself

There is also "gtime", "guest time" when the task is a kvm vcpu.

There is also "sched time" (task->se.sum_exec_runtime), which is
all states of task time, tracked by a different method than [usg]time.

> * task->signal->(c-u/s/st-time):
>   Time that has been consumed by dead children of process where parent
>   has done a sys_wait()

Also cgtime (to gtime as cutime is to utime).

> * task->signal->(u/s/st-time):
>   Time that has been consumed by dead threads of thread group of process
>   - NEW: Has to be exported via taskstats

Also gtime here.  These are reported as part of the aggregate process times
that include both live and dead threads, but not distinguished.

> * task->signal->(cr-u/s/st-time):
>   Time that has been consumed by dead children that reaped 
>   themselves, because parent ignored SIGCHLD or has set SA_NOCLDWAIT
>   - NEW: Fields have to be added to signal struct
>   - NEW: Has to be exported via taskstats

Note that there are other stats aside from times that are treated the same
way (c{min,maj}_flt, cn{v,iv}csw, c{in,ou}block, cmaxrss, and io accounting).

What probably makes sense is to move all those cfoo fields from
signal_struct into foo fields in a new struct, and then signal_struct can
have "struct child_stats reaped_children, ignored_children" or whatnot.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-10-05  8:57                 ` Roland McGrath
@ 2010-10-06  9:29                   ` Michael Holzheu
  2010-10-06 15:26                     ` Oleg Nesterov
  0 siblings, 1 reply; 58+ messages in thread
From: Michael Holzheu @ 2010-10-06  9:29 UTC (permalink / raw)
  To: Roland McGrath, Oleg Nesterov
  Cc: Martin Schwidefsky, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Peter Zijlstra, Suresh Siddha, John stultz,
	Thomas Gleixner, Balbir Singh, Ingo Molnar, Heiko Carstens,
	linux-s390, linux-kernel

Hello Roland and Oleg,

On Tue, 2010-10-05 at 01:57 -0700, Roland McGrath wrote: 
> > Thanks! That information was missing! Although still for me it not seems
> > to be a good decision to do it that way. Because of that it currently is
> > not possible to evaluate all consumed CPU time by looking at the current
> > processes. Time can simply disappear.
> 
> I agree that it seems dubious.  I don't know why that decision was made in
> POSIX, but that's how it is.  Anyway, POSIX only constrains what we report
> in the POSIX calls, i.e. getrusage, times, waitid, SIGCHLD siginfo_t.
> Nothing says we can't track more information and make it accessible in
> other ways on Linux.

Yes, I think process time accounting would benefit if we did that.

> > * task->signal->(cr-u/s/st-time):
> >   Time that has been consumed by dead children that reaped 
> >   themselves, because parent ignored SIGCHLD or has set SA_NOCLDWAIT
> >   - NEW: Fields have to be added to signal struct
> >   - NEW: Has to be exported via taskstats
> 
> Note that there are other stats aside from times that are treated the same
> way (c{min,maj}_flt, cn{v,iv}csw, c{in,ou}block, cmaxrss, and io accounting).
> 
> What probably makes sense is to move all those cfoo fields from
> signal_struct into foo fields in a new struct, and then signal_struct can
> have "struct child_stats reaped_children, ignored_children" or whatnot.

I created an experimental patch for that. There I defined a new
structure "cdata" and added two instances of it (cdata_wait and
cdata_acct) to the signal_struct. The cdata_acct member accounts for all
CPU time.

The patch also addresses another ugly Unix behavior regarding process
accounting. If a parent process dies before its children, the children
get the reaper process (init) as new parent. If we want to determine the
CPU usage of a process tree via cumulative time, this is very
suboptimal. To fix this I added a new process relationship tree for
accounting.

The following patch applies to git head on top of patch:
https://patchwork.kernel.org/patch/202022/

Michael
-------
Subject: [PATCH] taskstats: Improve cumulative resource accounting

From: Michael Holzheu <holzheu@linux.vnet.ibm.com>

Currently the cumulative time accounting in Linux has two major drawbacks:

* Due to POSIX.1-2001, the CPU time of processes is not accounted
  to the cumulative time of the parent, if the parent ignores SIGCHLD
  or has set SA_NOCLDWAIT. This behaviour has the major drawback that
  it is not possible to calculate all consumed CPU time of a system by
  looking at the current tasks. CPU time can be lost.

* When a parent process dies, its children get the init process as
  new parent. For accounting this is suboptimal, because then init
  gets the CPU time of the tasks. For accounting it would be much better
  if the CPU time were passed along the relationship tree using the
  cumulative time counters, as would have happened if the child had died
  before the parent. E.g. it would then be possible to look at the login
  shell's cumulative times to get all CPU time that has been consumed
  by its children, grandchildren, etc. This would allow accounting without
  the need for exit events for all dead processes.

This patch adds a new set of cumulative time counters. We then have two
cumulative counter sets:

* cdata_wait: Traditional cumulative time used e.g. by getrusage.
* cdata_acct: Cumulative time that also includes dead processes with
              parents that ignore SIGCHLD or have set SA_NOCLDWAIT.
              cdata_acct will be exported by taskstats.

Besides that, the patch adds an "acct_parent" pointer next to the parent
pointer and a "children_acct" list next to the children list to the
task_struct in order to remember the correct accounting task relationship.

With this patch and the following time fields it is now possible to
calculate all the consumed CPU time of a system by looking at the current
tasks:

* task->(u/s/st/g-time):
  Time that has been consumed by task itself

* task->signal->cdata_acct.(c-u/s/st/g-time):
  All time that has been consumed by dead children of process. Includes
  also time from processes that reaped themselves, because the parent
  ignored SIGCHLD or has set SA_NOCLDWAIT

* task->signal->(u/s/st/g-time):
  Time that has been consumed by dead threads of thread group of process

Having this is prerequisite for the following use cases:

I. A top command that shows exactly 100% of all consumed CPU time between
two task snapshots without using task exit events. Exit events are not
necessary, because if tasks die between the two snapshots all time can be
found in the cumulative counters of the parent processes or thread group
leaders.

II. Do accounting by registering an exit event for each login shell. When
the shell exits, we get all CPU time of the shell's children by looking at
the cumulative data. No exit events for all tasks are required. To
implement that we also have to add a new taskstats feature to filter exit
events by PID.

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
 fs/binfmt_elf.c           |    4 -
 fs/proc/array.c           |   10 +-
 fs/proc/base.c            |    3 
 include/linux/init_task.h |    2 
 include/linux/sched.h     |   39 +++++++---
 include/linux/taskstats.h |    4 +
 kernel/exit.c             |  169 +++++++++++++++++++++++++++++-----------------
 kernel/fork.c             |    6 +
 kernel/sys.c              |   24 +++---
 kernel/tsacct.c           |   13 +++
 10 files changed, 183 insertions(+), 91 deletions(-)

--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1296,8 +1296,8 @@ static void fill_prstatus(struct elf_prs
 		cputime_to_timeval(p->utime, &prstatus->pr_utime);
 		cputime_to_timeval(p->stime, &prstatus->pr_stime);
 	}
-	cputime_to_timeval(p->signal->cutime, &prstatus->pr_cutime);
-	cputime_to_timeval(p->signal->cstime, &prstatus->pr_cstime);
+	cputime_to_timeval(p->signal->cdata_wait.cutime, &prstatus->pr_cutime);
+	cputime_to_timeval(p->signal->cdata_wait.cstime, &prstatus->pr_cstime);
 }
 
 static int fill_psinfo(struct elf_prpsinfo *psinfo, struct task_struct *p,
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -413,11 +413,11 @@ static int do_task_stat(struct seq_file 
 		num_threads = get_nr_threads(task);
 		collect_sigign_sigcatch(task, &sigign, &sigcatch);
 
-		cmin_flt = sig->cmin_flt;
-		cmaj_flt = sig->cmaj_flt;
-		cutime = sig->cutime;
-		cstime = sig->cstime;
-		cgtime = sig->cgtime;
+		cmin_flt = sig->cdata_wait.cmin_flt;
+		cmaj_flt = sig->cdata_wait.cmaj_flt;
+		cutime = sig->cdata_wait.cutime;
+		cstime = sig->cdata_wait.cstime;
+		cgtime = sig->cdata_wait.cgtime;
 		rsslim = ACCESS_ONCE(sig->rlim[RLIMIT_RSS].rlim_cur);
 
 		/* add up live thread stats at the group level */
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2617,7 +2617,8 @@ static int do_io_accounting(struct task_
 	if (whole && lock_task_sighand(task, &flags)) {
 		struct task_struct *t = task;
 
-		task_io_accounting_add(&acct, &task->signal->ioac);
+		task_io_accounting_add(&acct,
+				       &task->signal->cdata_wait.ioac);
 		while_each_thread(task, t)
 			task_io_accounting_add(&acct, &t->ioac);
 
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -135,7 +135,9 @@ extern struct cred init_cred;
 	.real_parent	= &tsk,						\
 	.parent		= &tsk,						\
 	.children	= LIST_HEAD_INIT(tsk.children),			\
+	.children_acct	= LIST_HEAD_INIT(tsk.children_acct),		\
 	.sibling	= LIST_HEAD_INIT(tsk.sibling),			\
+	.sibling_acct	= LIST_HEAD_INIT(tsk.sibling_acct),		\
 	.group_leader	= &tsk,						\
 	.real_cred	= &init_cred,					\
 	.cred		= &init_cred,					\
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -507,6 +507,20 @@ struct thread_group_cputimer {
 };
 
 /*
+ * Cumulative resource counters for reaped dead child processes.
+ * Live threads maintain their own counters and add to these
+ * in __exit_signal, except for the group leader.
+ */
+struct cdata {
+	cputime_t cutime, cstime, cgtime;
+	unsigned long cnvcsw, cnivcsw;
+	unsigned long cmin_flt, cmaj_flt;
+	unsigned long cinblock, coublock;
+	unsigned long cmaxrss;
+	struct task_io_accounting ioac;
+};
+
+/*
  * NOTE! "signal_struct" does not have it's own
  * locking, because a shared signal_struct always
  * implies a shared sighand_struct, so locking
@@ -573,22 +587,19 @@ struct signal_struct {
 
 	struct tty_struct *tty; /* NULL if no tty */
 
-	/*
-	 * Cumulative resource counters for dead threads in the group,
-	 * and for reaped dead child processes forked by this group.
-	 * Live threads maintain their own counters and add to these
-	 * in __exit_signal, except for the group leader.
-	 */
-	cputime_t utime, stime, cutime, cstime;
+	/* Cumulative resource counters for all dead child processes */
+	struct cdata cdata_wait; /* parents have done sys_wait() */
+	struct cdata cdata_acct; /* complete cumulative data from acct tree */
+
+	cputime_t utime, stime;
 	cputime_t gtime;
-	cputime_t cgtime;
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
 	cputime_t prev_utime, prev_stime;
 #endif
-	unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
-	unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
-	unsigned long inblock, oublock, cinblock, coublock;
-	unsigned long maxrss, cmaxrss;
+	unsigned long nvcsw, nivcsw;
+	unsigned long min_flt, maj_flt;
+	unsigned long inblock, oublock;
+	unsigned long maxrss;
 	struct task_io_accounting ioac;
 
 	/*
@@ -1248,6 +1259,7 @@ struct task_struct {
 	 * older sibling, respectively.  (p->father can be replaced with 
 	 * p->real_parent->pid)
 	 */
+	struct task_struct *acct_parent; /* accounting parent process */
 	struct task_struct *real_parent; /* real parent process */
 	struct task_struct *parent; /* recipient of SIGCHLD, wait4() reports */
 	/*
@@ -1255,6 +1267,8 @@ struct task_struct {
 	 */
 	struct list_head children;	/* list of my children */
 	struct list_head sibling;	/* linkage in my parent's children list */
+	struct list_head children_acct;	/* list of my accounting children */
+	struct list_head sibling_acct;	/* linkage in my parent's accounting children list */
 	struct task_struct *group_leader;	/* threadgroup leader */
 
 	/*
@@ -1273,6 +1287,7 @@ struct task_struct {
 	int __user *set_child_tid;		/* CLONE_CHILD_SETTID */
 	int __user *clear_child_tid;		/* CLONE_CHILD_CLEARTID */
 
+	int exit_accounting_done;
 	cputime_t utime, stime, utimescaled, stimescaled;
 	cputime_t gtime;
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
--- a/include/linux/taskstats.h
+++ b/include/linux/taskstats.h
@@ -163,6 +163,10 @@ struct taskstats {
 	/* Delay waiting for memory reclaim */
 	__u64	freepages_count;
 	__u64	freepages_delay_total;
+	__u64   ac_cutime;		/* User CPU time of children [usec] */
+	__u64   ac_cstime;		/* System CPU time of children [usec] */
+	__u64   ac_tutime;		/* User CPU time of threads [usec] */
+	__u64   ac_tstime;		/* System CPU time of threads [usec] */
 };
 
 
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -50,6 +50,7 @@
 #include <linux/perf_event.h>
 #include <trace/events/sched.h>
 #include <linux/hw_breakpoint.h>
+#include <linux/kernel_stat.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -68,11 +69,71 @@ static void __unhash_process(struct task
 
 		list_del_rcu(&p->tasks);
 		list_del_init(&p->sibling);
+		list_del_init(&p->sibling_acct);
 		__get_cpu_var(process_counts)--;
 	}
 	list_del_rcu(&p->thread_group);
 }
 
+static void __account_ctime(struct task_struct *p, struct cdata *pcd,
+			    struct cdata *ccd)
+{
+	struct signal_struct *sig = p->signal;
+	cputime_t tgutime, tgstime;
+	unsigned long maxrss;
+
+	thread_group_times(p, &tgutime, &tgstime);
+
+	pcd->cutime = cputime_add(pcd->cutime,
+				  cputime_add(tgutime, ccd->cutime));
+	pcd->cstime = cputime_add(pcd->cstime,
+				  cputime_add(tgstime, ccd->cstime));
+	pcd->cgtime = cputime_add(pcd->cgtime, cputime_add(p->gtime,
+			   cputime_add(sig->gtime, ccd->cgtime)));
+
+	pcd->cmin_flt += p->min_flt + sig->min_flt + ccd->cmin_flt;
+	pcd->cmaj_flt += p->maj_flt + sig->maj_flt + ccd->cmaj_flt;
+	pcd->cnvcsw += p->nvcsw + sig->nvcsw + ccd->cnvcsw;
+	pcd->cnivcsw += p->nivcsw + sig->nivcsw + ccd->cnivcsw;
+	pcd->cinblock += task_io_get_inblock(p) + sig->inblock + ccd->cinblock;
+	pcd->coublock += task_io_get_oublock(p) + sig->oublock + ccd->coublock;
+	maxrss = max(sig->maxrss, ccd->cmaxrss);
+	if (pcd->cmaxrss < maxrss)
+		pcd->cmaxrss = maxrss;
+
+	task_io_accounting_add(&pcd->ioac, &p->ioac);
+	task_io_accounting_add(&pcd->ioac, &ccd->ioac);
+}
+
+static void __account_to_parent(struct task_struct *p, int wait)
+{
+	/*
+	 * The resource counters for the group leader are in its
+	 * own task_struct.  Those for dead threads in the group
+	 * are in its signal_struct, as are those for the child
+	 * processes it has previously reaped.  All these
+	 * accumulate in the parent's signal_struct c* fields.
+	 *
+	 * We don't bother to take a lock here to protect these
+	 * p->signal fields, because they are only touched by
+	 * __exit_signal, which runs with tasklist_lock
+	 * write-locked anyway, and so is excluded here.  We do
+	 * need to protect the access to parent->signal fields,
+	 * as other threads in the parent group can be right
+	 * here reaping other children at the same time.
+	 *
+	 * We use thread_group_times() to get times for the thread
+	 * group, which consolidates times for all threads in the
+	 * group including the group leader.
+	 */
+	if (wait)
+		__account_ctime(p, &p->real_parent->signal->cdata_wait,
+				&p->signal->cdata_wait);
+	__account_ctime(p, &p->acct_parent->signal->cdata_acct,
+			&p->signal->cdata_acct);
+	p->exit_accounting_done = 1;
+}
+
 /*
  * This function expects the tasklist_lock write-locked.
  */
@@ -90,6 +156,24 @@ static void __exit_signal(struct task_st
 
 	posix_cpu_timers_exit(tsk);
 	if (group_dead) {
+		if (!tsk->exit_accounting_done) {
+#ifdef __s390x__
+		/*
+		 * FIXME: On s390 we can call account_process_tick to update
+		 * CPU time information. This is probably not valid on other
+		 * architectures.
+		 */
+			if (current == tsk)
+				account_process_tick(current, 1);
+#endif
+			/*
+			 * FIXME: This somehow has to be moved to
+			 * finish_task_switch(), because otherwise
+			 * if the process accounts itself, the CPU time
+			 * that is used for this code will be lost.
+			 */
+			__account_to_parent(tsk, 0);
+		}
 		posix_cpu_timers_exit_group(tsk);
 		tty = sig->tty;
 		sig->tty = NULL;
@@ -103,6 +187,15 @@ static void __exit_signal(struct task_st
 
 		if (tsk == sig->curr_target)
 			sig->curr_target = next_thread(tsk);
+#ifdef __s390x__
+		/*
+		 * FIXME: On s390 we can call account_process_tick to update
+		 * CPU time information. This is probably not valid on other
+		 * architectures.
+		 */
+		if (current == tsk)
+			account_process_tick(current, 1);
+#endif
 		/*
 		 * Accumulate here the counters for all threads but the
 		 * group leader as they die, so they can be added into
@@ -122,7 +215,8 @@ static void __exit_signal(struct task_st
 		sig->nivcsw += tsk->nivcsw;
 		sig->inblock += task_io_get_inblock(tsk);
 		sig->oublock += task_io_get_oublock(tsk);
-		task_io_accounting_add(&sig->ioac, &tsk->ioac);
+		task_io_accounting_add(&sig->cdata_wait.ioac,
+				       &tsk->ioac);
 		sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
 	}
 
@@ -334,7 +428,10 @@ static void reparent_to_kthreadd(void)
 	ptrace_unlink(current);
 	/* Reparent to init */
 	current->real_parent = current->parent = kthreadd_task;
+	current->acct_parent = current->acct_parent->acct_parent;
 	list_move_tail(&current->sibling, &current->real_parent->children);
+	list_move_tail(&current->sibling_acct,
+		       &current->acct_parent->children_acct);
 
 	/* Set the exit signal to SIGCHLD so we signal init on exit */
 	current->exit_signal = SIGCHLD;
@@ -772,6 +869,15 @@ static void forget_original_parent(struc
 	LIST_HEAD(dead_children);
 
 	write_lock_irq(&tasklist_lock);
+	list_for_each_entry_safe(p, n, &father->children_acct, sibling_acct) {
+		struct task_struct *t = p;
+		do {
+			t->acct_parent = t->acct_parent->acct_parent;
+		} while_each_thread(p, t);
+		list_move_tail(&p->sibling_acct,
+			       &p->acct_parent->children_acct);
+	}
+
 	/*
 	 * Note that exit_ptrace() and find_new_reaper() might
 	 * drop tasklist_lock and reacquire it.
@@ -799,6 +905,7 @@ static void forget_original_parent(struc
 
 	list_for_each_entry_safe(p, n, &dead_children, sibling) {
 		list_del_init(&p->sibling);
+		list_del_init(&p->sibling_acct);
 		release_task(p);
 	}
 }
@@ -1214,66 +1321,8 @@ static int wait_task_zombie(struct wait_
 	 * !task_detached() to filter out sub-threads.
 	 */
 	if (likely(!traced) && likely(!task_detached(p))) {
-		struct signal_struct *psig;
-		struct signal_struct *sig;
-		unsigned long maxrss;
-		cputime_t tgutime, tgstime;
-
-		/*
-		 * The resource counters for the group leader are in its
-		 * own task_struct.  Those for dead threads in the group
-		 * are in its signal_struct, as are those for the child
-		 * processes it has previously reaped.  All these
-		 * accumulate in the parent's signal_struct c* fields.
-		 *
-		 * We don't bother to take a lock here to protect these
-		 * p->signal fields, because they are only touched by
-		 * __exit_signal, which runs with tasklist_lock
-		 * write-locked anyway, and so is excluded here.  We do
-		 * need to protect the access to parent->signal fields,
-		 * as other threads in the parent group can be right
-		 * here reaping other children at the same time.
-		 *
-		 * We use thread_group_times() to get times for the thread
-		 * group, which consolidates times for all threads in the
-		 * group including the group leader.
-		 */
-		thread_group_times(p, &tgutime, &tgstime);
 		spin_lock_irq(&p->real_parent->sighand->siglock);
-		psig = p->real_parent->signal;
-		sig = p->signal;
-		psig->cutime =
-			cputime_add(psig->cutime,
-			cputime_add(tgutime,
-				    sig->cutime));
-		psig->cstime =
-			cputime_add(psig->cstime,
-			cputime_add(tgstime,
-				    sig->cstime));
-		psig->cgtime =
-			cputime_add(psig->cgtime,
-			cputime_add(p->gtime,
-			cputime_add(sig->gtime,
-				    sig->cgtime)));
-		psig->cmin_flt +=
-			p->min_flt + sig->min_flt + sig->cmin_flt;
-		psig->cmaj_flt +=
-			p->maj_flt + sig->maj_flt + sig->cmaj_flt;
-		psig->cnvcsw +=
-			p->nvcsw + sig->nvcsw + sig->cnvcsw;
-		psig->cnivcsw +=
-			p->nivcsw + sig->nivcsw + sig->cnivcsw;
-		psig->cinblock +=
-			task_io_get_inblock(p) +
-			sig->inblock + sig->cinblock;
-		psig->coublock +=
-			task_io_get_oublock(p) +
-			sig->oublock + sig->coublock;
-		maxrss = max(sig->maxrss, sig->cmaxrss);
-		if (psig->cmaxrss < maxrss)
-			psig->cmaxrss = maxrss;
-		task_io_accounting_add(&psig->ioac, &p->ioac);
-		task_io_accounting_add(&psig->ioac, &sig->ioac);
+		__account_to_parent(p, 1);
 		spin_unlock_irq(&p->real_parent->sighand->siglock);
 	}
 
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1047,7 +1047,9 @@ static struct task_struct *copy_process(
 	delayacct_tsk_init(p);	/* Must remain after dup_task_struct() */
 	copy_flags(clone_flags, p);
 	INIT_LIST_HEAD(&p->children);
+	INIT_LIST_HEAD(&p->children_acct);
 	INIT_LIST_HEAD(&p->sibling);
+	INIT_LIST_HEAD(&p->sibling_acct);
 	rcu_copy_process(p);
 	p->vfork_done = NULL;
 	spin_lock_init(&p->alloc_lock);
@@ -1231,8 +1233,10 @@ static struct task_struct *copy_process(
 	/* CLONE_PARENT re-uses the old parent */
 	if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {
 		p->real_parent = current->real_parent;
+		p->acct_parent = current->acct_parent;
 		p->parent_exec_id = current->parent_exec_id;
 	} else {
+		p->acct_parent = current;
 		p->real_parent = current;
 		p->parent_exec_id = current->self_exec_id;
 	}
@@ -1275,6 +1279,8 @@ static struct task_struct *copy_process(
 			attach_pid(p, PIDTYPE_PGID, task_pgrp(current));
 			attach_pid(p, PIDTYPE_SID, task_session(current));
 			list_add_tail(&p->sibling, &p->real_parent->children);
+			list_add_tail(&p->sibling_acct,
+				      &p->acct_parent->children_acct);
 			list_add_tail_rcu(&p->tasks, &init_task.tasks);
 			__get_cpu_var(process_counts)++;
 		}
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -884,8 +884,8 @@ void do_sys_times(struct tms *tms)
 
 	spin_lock_irq(&current->sighand->siglock);
 	thread_group_times(current, &tgutime, &tgstime);
-	cutime = current->signal->cutime;
-	cstime = current->signal->cstime;
+	cutime = current->signal->cdata_wait.cutime;
+	cstime = current->signal->cdata_wait.cstime;
 	spin_unlock_irq(&current->sighand->siglock);
 	tms->tms_utime = cputime_to_clock_t(tgutime);
 	tms->tms_stime = cputime_to_clock_t(tgstime);
@@ -1490,6 +1490,7 @@ static void k_getrusage(struct task_stru
 	unsigned long flags;
 	cputime_t tgutime, tgstime, utime, stime;
 	unsigned long maxrss = 0;
+	struct cdata *cd;
 
 	memset((char *) r, 0, sizeof *r);
 	utime = stime = cputime_zero;
@@ -1507,15 +1508,16 @@ static void k_getrusage(struct task_stru
 	switch (who) {
 		case RUSAGE_BOTH:
 		case RUSAGE_CHILDREN:
-			utime = p->signal->cutime;
-			stime = p->signal->cstime;
-			r->ru_nvcsw = p->signal->cnvcsw;
-			r->ru_nivcsw = p->signal->cnivcsw;
-			r->ru_minflt = p->signal->cmin_flt;
-			r->ru_majflt = p->signal->cmaj_flt;
-			r->ru_inblock = p->signal->cinblock;
-			r->ru_oublock = p->signal->coublock;
-			maxrss = p->signal->cmaxrss;
+			cd = &p->signal->cdata_wait;
+			utime = cd->cutime;
+			stime = cd->cstime;
+			r->ru_nvcsw = cd->cnvcsw;
+			r->ru_nivcsw = cd->cnivcsw;
+			r->ru_minflt = cd->cmin_flt;
+			r->ru_majflt = cd->cmaj_flt;
+			r->ru_inblock = cd->cinblock;
+			r->ru_oublock = cd->coublock;
+			maxrss = cd->cmaxrss;
 
 			if (who == RUSAGE_CHILDREN)
 				break;
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -62,6 +62,19 @@ void bacct_add_tsk(struct taskstats *sta
 	stats->ac_gid	 = tcred->gid;
 	stats->ac_ppid	 = pid_alive(tsk) ?
 				rcu_dereference(tsk->real_parent)->tgid : 0;
+	if (tsk->signal && tsk->tgid == tsk->pid) {
+		struct cdata *cd = &tsk->signal->cdata_acct;
+
+		stats->ac_cutime = cputime_to_usecs(cd->cutime);
+		stats->ac_cstime = cputime_to_usecs(cd->cstime);
+		stats->ac_tutime = cputime_to_usecs(tsk->signal->utime);
+		stats->ac_tstime = cputime_to_usecs(tsk->signal->stime);
+	} else {
+		stats->ac_cutime = 0;
+		stats->ac_cstime = 0;
+		stats->ac_tutime = 0;
+		stats->ac_tstime = 0;
+	}
 	rcu_read_unlock();
 	stats->ac_utime = cputime_to_usecs(tsk->utime);
 	stats->ac_stime = cputime_to_usecs(tsk->stime);



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-10-06  9:29                   ` Michael Holzheu
@ 2010-10-06 15:26                     ` Oleg Nesterov
  2010-10-07 15:06                       ` Michael Holzheu
  0 siblings, 1 reply; 58+ messages in thread
From: Oleg Nesterov @ 2010-10-06 15:26 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Roland McGrath, Martin Schwidefsky, Shailabh Nagar,
	Andrew Morton, Venkatesh Pallipadi, Peter Zijlstra,
	Suresh Siddha, John stultz, Thomas Gleixner, Balbir Singh,
	Ingo Molnar, Heiko Carstens, linux-s390, linux-kernel

I didn't read the whole patch, but some parts don't look right,

On 10/06, Michael Holzheu wrote:
>
> The patch also approaches another ugly Unix behavior regarding process
> accounting. If a parent process dies before his children, the children
> get the reaper process (init) as new parent. If we want to determine the
> CPU usage of a process tree with cumulative time, this is very
> suboptimal. To fix this I added a new process relationship tree for
> accounting.

Well, I must admit, I can't say I like the complications this change adds ;)
In any case, imho this change needs a separate patch/discussion.

> Besides of that the patch adds an "acct_parent" pointer next to the parent
> pointer and a "children_acct" list next to the children list to the
> task_struct in order to remember the correct accounting task relationship.

I am not sure I understand the "correct accounting" above. ->acct_parent
adds the "parallel" hierarchy. In the simplest case, suppose that some
process P forks the child C and exits. Then C->acct_parent == P->real_parent
(P->acct_parent in general). I am not sure this is always good.

Anyway,

> @@ -90,6 +156,24 @@ static void __exit_signal(struct task_st
>  
>  	posix_cpu_timers_exit(tsk);
>  	if (group_dead) {
> +		if (!tsk->exit_accounting_done) {
> +#ifdef __s390x__
> +		/*
> +		 * FIXME: On s390 we can call account_process_tick to update
> +		 * CPU time information. This is probably not valid on other
> +		 * architectures.
> +		 */
> +			if (current == tsk)
> +				account_process_tick(current, 1);
> +#endif
> +			/*
> +			 * FIXME: This somehow has to be moved to
> +			 * finish_task_switch(), because otherwise
> +			 * if the process accounts itself, the CPU time
> +			 * that is used for this code will be lost.
> +			 */
> +			__account_to_parent(tsk, 0);

We hold the wrong ->siglock here.

Also, the logic behind ->exit_accounting_done looks wrong (and unneeded)
but I am not sure...

> @@ -772,6 +869,15 @@ static void forget_original_parent(struc
>  	LIST_HEAD(dead_children);
>  
>  	write_lock_irq(&tasklist_lock);
> +	list_for_each_entry_safe(p, n, &father->children_acct, sibling_acct) {
> +		struct task_struct *t = p;
> +		do {
> +			t->acct_parent = t->acct_parent->acct_parent;
> +		} while_each_thread(p, t);
> +		list_move_tail(&p->sibling_acct,
> +			       &p->acct_parent->children_acct);

This is certainly wrong if there are other live threads in father's
thread-group.

Also, you need to change de_thread() if it changes the leader.

>  	list_for_each_entry_safe(p, n, &dead_children, sibling) {
>  		list_del_init(&p->sibling);
> +		list_del_init(&p->sibling_acct);

This list_del() can race with ->acct_parent if it in turn exits and
does forget_original_parent() -> list_move_tail(sibling_acct).

Oleg.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 01/10] taskstats: Use real microsecond granularity for CPU times
  2010-09-23 14:00 ` [RFC][PATCH 01/10] taskstats: Use real microsecond granularity for CPU times Michael Holzheu
@ 2010-10-07  5:08   ` Balbir Singh
  2010-10-08 15:08     ` Michael Holzheu
  0 siblings, 1 reply; 58+ messages in thread
From: Balbir Singh @ 2010-10-07  5:08 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Peter Zijlstra, Suresh Siddha, John stultz, Thomas Gleixner,
	Oleg Nesterov, Ingo Molnar, Heiko Carstens, Martin Schwidefsky,
	linux-s390, linux-kernel

* Michael Holzheu <holzheu@linux.vnet.ibm.com> [2010-09-23 16:00:52]:

> Subject: [PATCH] taskstats: Use real microsecond granularity for CPU times
> 
> From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> 
> The taskstats interface uses microsecond granularity for the user and
> system time values. The conversion from cputime to the taskstats values
> uses the cputime_to_msecs primitive which effectively limits the 
> granularity to milliseconds. Add the cputime_to_usecs primitive for
> architectures that have better, more precise CPU time values. Remove
> cputime_to_msecs primitive because there is no more user left.
> 
> Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>

Looks good to me.

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-10-06 15:26                     ` Oleg Nesterov
@ 2010-10-07 15:06                       ` Michael Holzheu
  2010-10-11 12:37                         ` Oleg Nesterov
  0 siblings, 1 reply; 58+ messages in thread
From: Michael Holzheu @ 2010-10-07 15:06 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Roland McGrath, Martin Schwidefsky, Shailabh Nagar,
	Andrew Morton, Venkatesh Pallipadi, Peter Zijlstra,
	Suresh Siddha, John stultz, Thomas Gleixner, Balbir Singh,
	Ingo Molnar, Heiko Carstens, linux-s390, linux-kernel

Hello Oleg,

On Wed, 2010-10-06 at 17:26 +0200, Oleg Nesterov wrote:
> > The patch also approaches another ugly Unix behavior regarding process
> > accounting. If a parent process dies before his children, the children
> > get the reaper process (init) as new parent. If we want to determine the
> > CPU usage of a process tree with cumulative time, this is very
> > suboptimal. To fix this I added a new process relationship tree for
> > accounting.
> 
> Well, I must admit, I can't say I like the complications this change adds ;)
> In any case, imho this change needs a separate patch/discussion.

Well, to be honest, I did not expect that people would love this
patch. At least not immediately :-)

I just wanted to show one idea of how we could solve my problem with the
lost time.

> > Besides of that the patch adds an "acct_parent" pointer next to the parent
> > pointer and a "children_acct" list next to the children list to the
> > task_struct in order to remember the correct accounting task relationship.
> 
> I am not sure I understand the "correct accounting" above. ->acct_parent
> adds the "parallel" hierarchy.

Correct. The patch adds a "parallel" hierarchy that IMHO is better
suited for accounting purposes. For me the correct accounting hierarchy
means that cumulative time is passed along the real relationship tree.
That means that if you have two task snapshots, it is clear which tasks
inherited the time of the dead children. Example:

P1 is parent of P2 is parent of P3

Snapshot 1: P1 -> P2 -> P3
Snapshot 2: P1

We know that P2 and P3 died in the last interval. With the current
cumulative time accounting we can't say whether P1 got the CPU time of P3
(if P3 died before P2) or whether init got the time (if P2 died before P3).
With my patch we know that P1 got all the CPU time of P2 and P3.

Without the patch a top command would somehow have to receive the
"reparent to init" events, or it would have to receive all task exit
events. The latter is probably very CPU intensive for workloads that
create many short-running processes.

> In the simplest case, suppose that some
> process P forks the child C and exits. Then C->acct_parent == P->real_parent
> (P->acct_parent in general). I am not sure this is always good.

In most cases both trees are identical; they only become different when a
reparent to init happens. Maybe the code could be optimized to store the
extra information only when the trees differ.

> Anyway,
> 
> > @@ -90,6 +156,24 @@ static void __exit_signal(struct task_st
> >  
> >  	posix_cpu_timers_exit(tsk);
> >  	if (group_dead) {
> > +		if (!tsk->exit_accounting_done) {
> > +#ifdef __s390x__
> > +		/*
> > +		 * FIXME: On s390 we can call account_process_tick to update
> > +		 * CPU time information. This is probably not valid on other
> > +		 * architectures.
> > +		 */
> > +			if (current == tsk)
> > +				account_process_tick(current, 1);
> > +#endif
> > +			/*
> > +			 * FIXME: This somehow has to be moved to
> > +			 * finish_task_switch(), because otherwise
> > +			 * if the process accounts itself, the CPU time
> > +			 * that is used for this code will be lost.
> > +			 */
> > +			__account_to_parent(tsk, 0);
> 
> We hold the wrong ->siglock here.

Right, we hold the siglock of the exiting task, but we should hold the
lock of the parent, correct?

> Also, the logic behind ->exit_accounting_done looks wrong (and unneeded)
> but I am not sure...

I think the logic is correct, but probably a better implementation is
possible. The member "exit_accounting_done" is used for the following
two cases:

1. Process becomes a zombie and parent waits for it:
   * wait_task_zombie() calls __account_to_parent(p, 1) that sets 
     exit_accounting_done = 1
   * Then release_task()/__exit_signal() is called that does no
     accounting, because exit_accounting_done is already set to 1
2. Process reaps itself:
   * release_task()/__exit_signal() is called, exit_accounting_done is
     still 0, therefore __account_to_parent(tsk, 0) is called

> > @@ -772,6 +869,15 @@ static void forget_original_parent(struc
> >  	LIST_HEAD(dead_children);
> >  
> >  	write_lock_irq(&tasklist_lock);
> > +	list_for_each_entry_safe(p, n, &father->children_acct, sibling_acct) {
> > +		struct task_struct *t = p;
> > +		do {
> > +			t->acct_parent = t->acct_parent->acct_parent;
> > +		} while_each_thread(p, t);
> > +		list_move_tail(&p->sibling_acct,
> > +			       &p->acct_parent->children_acct);
> 
> This is certainly wrong if there are other live threads in father's
> thread-group.

Sorry for my ignorance. I have probably not understood what happens if a
thread group leader dies. My assumption was that then the whole thread
group dies. I also assumed that a parent can only be a thread group
leader.

So what you are saying is that when a thread group leader dies and there
are other live threads in its thread group, its children will NOT get
init as new parent? Instead they get the new thread group leader as
parent?

> Also, you need to change de_thread() if it changes the leader.

Something like the following?

--- a/fs/exec.c
+++ b/fs/exec.c
@@ -850,6 +850,7 @@ static int de_thread(struct task_struct 

                list_replace_rcu(&leader->tasks, &tsk->tasks);
                list_replace_init(&leader->sibling, &tsk->sibling);
+               list_replace_init(&leader->sibling_acct, &tsk->sibling_acct);

                tsk->group_leader = tsk;
                leader->group_leader = tsk;


> 
> >  	list_for_each_entry_safe(p, n, &dead_children, sibling) {
> >  		list_del_init(&p->sibling);
> > +		list_del_init(&p->sibling_acct);
> 
> This list_del() can race with ->acct_parent if it in turn exits and
> does forget_original_parent() -> list_move_tail(sibling_acct).

Ok, to fix this I could move the list_del_init(&p->sibling_acct) to
reparent_leader():

@@ -761,6 +868,7 @@ static void reparent_leader(struct task_
                if (task_detached(p)) {
                        p->exit_state = EXIT_DEAD;
                        list_move_tail(&p->sibling, dead);
+                       list_del_init(&p->sibling_acct);
                }
        }


Michael


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC][PATCH 01/10] taskstats: Use real microsecond granularity for CPU times
  2010-10-07  5:08   ` Balbir Singh
@ 2010-10-08 15:08     ` Michael Holzheu
  2010-10-08 16:39       ` Balbir Singh
  0 siblings, 1 reply; 58+ messages in thread
From: Michael Holzheu @ 2010-10-08 15:08 UTC (permalink / raw)
  To: balbir
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Peter Zijlstra, Suresh Siddha, John stultz, Thomas Gleixner,
	Oleg Nesterov, Ingo Molnar, Heiko Carstens, Martin Schwidefsky,
	linux-s390, linux-kernel

On Thu, 2010-10-07 at 10:38 +0530, Balbir Singh wrote:
> * Michael Holzheu <holzheu@linux.vnet.ibm.com> [2010-09-23 16:00:52]:
> 
> > Subject: [PATCH] taskstats: Use real microsecond granularity for CPU times
> > 
> > From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> > 
> > The taskstats interface uses microsecond granularity for the user and
> > system time values. The conversion from cputime to the taskstats values
> > uses the cputime_to_msecs primitive which effectively limits the 
> > granularity to milliseconds. Add the cputime_to_usecs primitive for
> > architectures that have better, more precise CPU time values. Remove the
> > cputime_to_msecs primitive because there are no users left.
> > 
> > Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> 
> Looks good to me.

Can I take that as an Acked-by?

Michael




* Re: [RFC][PATCH 01/10] taskstats: Use real microsecond granularity for CPU times
  2010-10-08 15:08     ` Michael Holzheu
@ 2010-10-08 16:39       ` Balbir Singh
  0 siblings, 0 replies; 58+ messages in thread
From: Balbir Singh @ 2010-10-08 16:39 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Peter Zijlstra, Suresh Siddha, John stultz, Thomas Gleixner,
	Oleg Nesterov, Ingo Molnar, Heiko Carstens, Martin Schwidefsky,
	linux-s390, linux-kernel

* Michael Holzheu <holzheu@linux.vnet.ibm.com> [2010-10-08 17:08:12]:

> On Thu, 2010-10-07 at 10:38 +0530, Balbir Singh wrote:
> > * Michael Holzheu <holzheu@linux.vnet.ibm.com> [2010-09-23 16:00:52]:
> > 
> > > Subject: [PATCH] taskstats: Use real microsecond granularity for CPU times
> > > 
> > > From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> > > 
> > > The taskstats interface uses microsecond granularity for the user and
> > > system time values. The conversion from cputime to the taskstats values
> > > uses the cputime_to_msecs primitive which effectively limits the 
> > > granularity to milliseconds. Add the cputime_to_usecs primitive for
> > > architectures that have better, more precise CPU time values. Remove the
> > > cputime_to_msecs primitive because there are no users left.
> > > 
> > > Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> > 
> > Looks good to me.
> 
> Can I take that as an Acked-by?
>

Certainly

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 

-- 
	Three Cheers,
	Balbir


* Re: [RFC][PATCH 02/10] taskstats: Separate taskstats commands
  2010-09-23 14:01 ` [RFC][PATCH 02/10] taskstats: Separate taskstats commands Michael Holzheu
  2010-09-27  9:32   ` Balbir Singh
@ 2010-10-11  7:40   ` Balbir Singh
  1 sibling, 0 replies; 58+ messages in thread
From: Balbir Singh @ 2010-10-11  7:40 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Martin Schwidefsky, Heiko Carstens,
	linux-kernel, linux-s390

* Michael Holzheu <holzheu@linux.vnet.ibm.com> [2010-09-23 16:01:02]:

> Subject: [PATCH] taskstats: Separate taskstats commands
> 
> From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> 
> This patch moves each taskstats command into a single function. This makes
> the code more readable and makes it easier to add new commands.
> 
> Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> ---
>  kernel/taskstats.c |  118 +++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 78 insertions(+), 40 deletions(-)
> 
> --- a/kernel/taskstats.c
> +++ b/kernel/taskstats.c
> @@ -424,39 +424,76 @@ err:
>  	return rc;
>  }
> 
> -static int taskstats_user_cmd(struct sk_buff *skb, struct genl_info *info)
> +static int cmd_attr_register_cpumask(struct genl_info *info)
>  {
> -	int rc;
> -	struct sk_buff *rep_skb;
> -	struct taskstats *stats;
> -	size_t size;
>  	cpumask_var_t mask;
> +	int rc;
> 
>  	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
>  		return -ENOMEM;
> -
>  	rc = parse(info->attrs[TASKSTATS_CMD_ATTR_REGISTER_CPUMASK], mask);
>  	if (rc < 0)
> -		goto free_return_rc;
> -	if (rc == 0) {
> -		rc = add_del_listener(info->snd_pid, mask, REGISTER);
> -		goto free_return_rc;
> -	}
> +		goto out;
> +	rc = add_del_listener(info->snd_pid, mask, REGISTER);
> +out:
> +	free_cpumask_var(mask);
> +	return rc;
> +}
> +
> +static int cmd_attr_deregister_cpumask(struct genl_info *info)
> +{
> +	cpumask_var_t mask;
> +	int rc;
> 
> +	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
> +		return -ENOMEM;
>  	rc = parse(info->attrs[TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK], mask);
>  	if (rc < 0)
> -		goto free_return_rc;
> -	if (rc == 0) {
> -		rc = add_del_listener(info->snd_pid, mask, DEREGISTER);
> -free_return_rc:
> -		free_cpumask_var(mask);
> -		return rc;
> -	}
> +		goto out;
> +	rc = add_del_listener(info->snd_pid, mask, DEREGISTER);
> +out:
>  	free_cpumask_var(mask);
> +	return rc;
> +}
> +static int cmd_attr_pid(struct genl_info *info)
> +{
> +	struct taskstats *stats;
> +	struct sk_buff *rep_skb;
> +	size_t size;
> +	u32 pid;
> +	int rc;
> +
> +	size = nla_total_size(sizeof(u32)) +
> +		nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
> +
> +	rc = prepare_reply(info, TASKSTATS_CMD_NEW, &rep_skb, size);
> +	if (rc < 0)
> +		return rc;
> +
> +	rc = -EINVAL;
> +	pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
> +	stats = mk_reply(rep_skb, TASKSTATS_TYPE_PID, pid);
> +	if (!stats)
> +		goto err;
> +
> +	rc = fill_pid(pid, NULL, stats);
> +	if (rc < 0)
> +		goto err;
> +	return send_reply(rep_skb, info);
> +err:
> +	nlmsg_free(rep_skb);
> +	return rc;
> +}
> +
> +static int cmd_attr_tgid(struct genl_info *info)
> +{
> +	struct taskstats *stats;
> +	struct sk_buff *rep_skb;
> +	size_t size;
> +	u32 tgid;
> +	int rc;
> 
> -	/*
> -	 * Size includes space for nested attributes
> -	 */
>  	size = nla_total_size(sizeof(u32)) +
>  		nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
> 
> @@ -465,33 +502,34 @@ free_return_rc:
>  		return rc;
> 
>  	rc = -EINVAL;
> -	if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
> -		u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
> -		stats = mk_reply(rep_skb, TASKSTATS_TYPE_PID, pid);
> -		if (!stats)
> -			goto err;
> -
> -		rc = fill_pid(pid, NULL, stats);
> -		if (rc < 0)
> -			goto err;
> -	} else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
> -		u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
> -		stats = mk_reply(rep_skb, TASKSTATS_TYPE_TGID, tgid);
> -		if (!stats)
> -			goto err;
> -
> -		rc = fill_tgid(tgid, NULL, stats);
> -		if (rc < 0)
> -			goto err;
> -	} else
> +	tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
> +	stats = mk_reply(rep_skb, TASKSTATS_TYPE_TGID, tgid);
> +	if (!stats)
>  		goto err;
> 
> +	rc = fill_tgid(tgid, NULL, stats);
> +	if (rc < 0)
> +		goto err;
>  	return send_reply(rep_skb, info);
>  err:
>  	nlmsg_free(rep_skb);
>  	return rc;
>  }
> 
> +static int taskstats_user_cmd(struct sk_buff *skb, struct genl_info *info)
> +{
> +	if (info->attrs[TASKSTATS_CMD_ATTR_REGISTER_CPUMASK])
> +		return cmd_attr_register_cpumask(info);
> +	else if (info->attrs[TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK])
> +		return cmd_attr_deregister_cpumask(info);
> +	else if (info->attrs[TASKSTATS_CMD_ATTR_PID])
> +		return cmd_attr_pid(info);
> +	else if (info->attrs[TASKSTATS_CMD_ATTR_TGID])
> +		return cmd_attr_tgid(info);
> +	else
> +		return -EINVAL;
> +}
> +
>  static struct taskstats *taskstats_tgid_alloc(struct task_struct *tsk)
>  {
>  	struct signal_struct *sig = tsk->signal;
> 
>

Looks good (sorry for the delay in reviewing, I expect to be slow for
another two weeks or so)

 
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 

-- 
	Three Cheers,
	Balbir


* Re: [RFC][PATCH 03/10] taskstats: Split fill_pid function
  2010-09-23 14:01 ` [RFC][PATCH 03/10] taskstats: Split fill_pid function Michael Holzheu
  2010-09-23 17:33   ` Oleg Nesterov
  2010-09-27  9:33   ` Balbir Singh
@ 2010-10-11  8:31   ` Balbir Singh
  2 siblings, 0 replies; 58+ messages in thread
From: Balbir Singh @ 2010-10-11  8:31 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Peter Zijlstra, Suresh Siddha, John stultz, Thomas Gleixner,
	Oleg Nesterov, Ingo Molnar, Heiko Carstens, Martin Schwidefsky,
	linux-s390, linux-kernel

* Michael Holzheu <holzheu@linux.vnet.ibm.com> [2010-09-23 16:01:07]:

> Subject: [PATCH] taskstats: Split fill_pid function
> 
> From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> 
> Separate the finding of a task_struct by pid or tgid from filling the taskstats
> data. This makes the code more readable.
> 
> Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
 
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 

-- 
	Three Cheers,
	Balbir


* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-10-07 15:06                       ` Michael Holzheu
@ 2010-10-11 12:37                         ` Oleg Nesterov
  2010-10-12 13:10                           ` Michael Holzheu
  0 siblings, 1 reply; 58+ messages in thread
From: Oleg Nesterov @ 2010-10-11 12:37 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Roland McGrath, Martin Schwidefsky, Shailabh Nagar,
	Andrew Morton, Venkatesh Pallipadi, Peter Zijlstra,
	Suresh Siddha, John stultz, Thomas Gleixner, Balbir Singh,
	Ingo Molnar, Heiko Carstens, linux-s390, linux-kernel

On 10/07, Michael Holzheu wrote:
>
> On Wed, 2010-10-06 at 17:26 +0200, Oleg Nesterov wrote:
> >
> > I am not sure I understand the "correct accounting" above. ->acct_parent
> > adds the "parallel" hierarchy.
>
> Correct. The patch adds a "parallel" hierarchy that IMHO is better
> suited for accounting purposes. For me the correct accounting hierarchy
> means that cumulative time is passed along the real relationship tree.
> That means if you have two task snapshots it is clear which tasks
> inherited the time of the dead children. Example:
>
> P1 is parent of P2 is parent of P3
>
> Snapshot 1: P1 -> P2 -> P3
> Snapshot 2: P1
>
> We know that P2 and P3 died in the last interval. With the current
> cumulative time accounting we can't say, if P1 got the CPU time of P3
> (if P3 died before P2) or if init got the time (if P2 died before P3).
> With my patch we know that P1 got all the CPU time of P2 and P3.

Yes, I see what the patch does.

(but "P1 got all the CPU time of P2 and P3" doesn't look 100% right,
 see below).

Still I am not sure. First of all, again, this complicates the core
kernel code for top. And note that this parallel hierarchy is not
visible to userspace (it can only see the "side effects" in cdata_acct).

But more importantly, I do not understand why this is always better.
Say, if the task does daemonize(), it wants to become the child
of init, and imho in this case it should be accounted accordingly.

> > > @@ -90,6 +156,24 @@ static void __exit_signal(struct task_st
> > >
> > >  	posix_cpu_timers_exit(tsk);
> > >  	if (group_dead) {
> > > +		if (!tsk->exit_accounting_done) {
> > > +#ifdef __s390x__
> > > +		/*
> > > +		 * FIXME: On s390 we can call account_process_tick to update
> > > +		 * CPU time information. This is probably not valid on other
> > > +		 * architectures.
> > > +		 */
> > > +			if (current == tsk)
> > > +				account_process_tick(current, 1);
> > > +#endif
> > > +			/*
> > > +			 * FIXME: This somehow has to be moved to
> > > +			 * finish_task_switch(), because otherwise
> > > +			 * if the process accounts itself, the CPU time
> > > +			 * that is used for this code will be lost.
> > > +			 */
> > > +			__account_to_parent(tsk, 0);
> >
> > We hold the wrong ->siglock here.
>
> Right, we hold the siglock of the exiting task, but we should hold the
> lock of the parent, correct?

Yes. __account_to_parent() should be called later, after spin_unlock(siglock);
__exit_signal() has another "if (group_dead)" check below.

This also means __account_to_parent() should take ->siglock itself.

Probably this is a matter of taste, but I do not understand why
__account_to_parent() takes the boolean "wait" argument. The caller
can just pass the correct task_struct* which is either ->real_parent
or ->acct_parent.
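
IOW, something like this (only a sketch of the signature change,
untested):

static void __account_to_parent(struct task_struct *p,
                                struct task_struct *parent);

/* wait_task_zombie() */
__account_to_parent(p, p->real_parent);

/* __exit_signal() */
__account_to_parent(tsk, tsk->acct_parent);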

> > Also, the logic behind ->exit_accounting_done looks wrong (and unneeded)
> > but I am not sure...
>
> I think the logic is correct,

OK, I misread the patch as if we always account the exited task in
parent's cdata_acct,

	+       struct cdata cdata_wait; /* parents have done sys_wait() */
	+       struct cdata cdata_acct; /* complete cumulative data from acct tree */

while in fact the "complete" data is cdata_wait + cdata_acct.

Hmm. Let's return to your example above,

	> Snapshot 1: P1 -> P2 -> P3
	> Snapshot 2: P1
	> ...
	> P1 got all the CPU time of P2 and P3

Suppose that P2 dies before P3. Then P3 dies, /sbin/init does wait and
accounts this task. This means it is not accounted in P1->signal->cdata_acct,
no?

> > > @@ -772,6 +869,15 @@ static void forget_original_parent(struc
> > >  	LIST_HEAD(dead_children);
> > >
> > >  	write_lock_irq(&tasklist_lock);
> > > +	list_for_each_entry_safe(p, n, &father->children_acct, sibling_acct) {
> > > +		struct task_struct *t = p;
> > > +		do {
> > > +			t->acct_parent = t->acct_parent->acct_parent;
> > > +		} while_each_thread(p, t);
> > > +		list_move_tail(&p->sibling_acct,
> > > +			       &p->acct_parent->children_acct);
> >
> > This is certainly wrong if there are other live threads in father's
> > thread-group.
>
> Sorry for my ignorance. Probably I have not understood what happens, if
> a thread group leader dies. My assumption was that then the whole thread
> group dies.

No. A thread group dies when the last thread dies. If a leader exits
it becomes a zombie until all other sub-threads exit.

> Also I assumed that a parent can only be a thread group
> leader.

No. If a thread T does fork(), the child's ->real_parent is T, not
T->group_leader. If T exits, we do not reparent its children to init
(unless it is the last thread, of course), we pick another live
thread in this thread group for reparenting.

> > Also, you need to change de_thread() if it changes the leader.
>
> Something like the following?
>
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -850,6 +850,7 @@ static int de_thread(struct task_struct
>
>                 list_replace_rcu(&leader->tasks, &tsk->tasks);
>                 list_replace_init(&leader->sibling, &tsk->sibling);
> +               list_replace_init(&leader->sibling_acct, &tsk->sibling_acct);

Yes.

> > > +		list_del_init(&p->sibling_acct);
> >
> > This list_del() can race with ->acct_parent if it in turn exits and
> > does forget_original_parent() -> list_move_tail(sibling_acct).
>
> Ok, to fix this I could move the list_del_init(&p->sibling_acct) to
> reparent_leader():
>
> @@ -761,6 +868,7 @@ static void reparent_leader(struct task_
>                 if (task_detached(p)) {
>                         p->exit_state = EXIT_DEAD;
>                         list_move_tail(&p->sibling, dead);
> +                       list_del_init(&p->sibling_acct);

Probably yes... Well, currently I do not really understand how this
all looks with this patch applied, but perhaps this list_del() is
not needed at all? We are going to call release_task(p), and it
should remove p from ->children_acct and ->children anyway?

Oleg.



* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-10-11 12:37                         ` Oleg Nesterov
@ 2010-10-12 13:10                           ` Michael Holzheu
  2010-10-14 13:47                             ` Oleg Nesterov
  0 siblings, 1 reply; 58+ messages in thread
From: Michael Holzheu @ 2010-10-12 13:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Roland McGrath, Martin Schwidefsky, Shailabh Nagar,
	Andrew Morton, Venkatesh Pallipadi, Peter Zijlstra,
	Suresh Siddha, John stultz, Thomas Gleixner, Balbir Singh,
	Ingo Molnar, Heiko Carstens, linux-s390, linux-kernel

Hello Oleg,

First of all many thanks for all your time that you have spent for
reviewing the patches and giving us useful feedback!

On Mon, 2010-10-11 at 14:37 +0200, Oleg Nesterov wrote: 
> On 10/07, Michael Holzheu wrote:
> >
> > On Wed, 2010-10-06 at 17:26 +0200, Oleg Nesterov wrote:
> > >
> > > I am not sure I understand the "correct accounting" above. ->acct_parent
> > > adds the "parallel" hierarchy.
> >
> > Correct. The patch adds a "parallel" hierarchy that IMHO is better
> > suited for accounting purposes. For me the correct accounting hierarchy
> > means that cumulative time is passed along the real relationship tree.
> > That means if you have two task snapshots it is clear which tasks
> > inherited the time of the dead children. Example:
> >
> > P1 is parent of P2 is parent of P3
> >
> > Snapshot 1: P1 -> P2 -> P3
> > Snapshot 2: P1
> >
> > We know that P2 and P3 died in the last interval. With the current
> > cumulative time accounting we can't say, if P1 got the CPU time of P3
> > (if P3 died before P2) or if init got the time (if P2 died before P3).
> > With my patch we know that P1 got all the CPU time of P2 and P3.
> 
> Yes, I see what the patch does.
> 
> (but "P1 got all the CPU time of P2 and P3" doesn't look 100% right,
>  see below).
> 
> Still I am not sure. First of all, again, this complicates the core
> kernel code for top. And note that this parallel hierarchy is not
> visible to userspace (it can only see the "side effects" in cdata_acct).

In order to make everything work with my top command, we have to make
the new hierarchy visible to userspace. We would have to include
acct_parent->tgid in taskstats. Maybe one more reason for not doing
it ...

> But more importantly, I do not understand why this is always better.
> Say, if the task does daemonize(), it wants to become the child
> of init, and imho in this case it should be accounted accordingly.

Hmmm, sure... You can say that if a daemon detaches itself, it explicitly
wants to be accounted to init. My argument for the parallel tree is that
the tasks (and their older relatives) that started other tasks are
responsible for taking the CPU time of their children and grandchildren.
That might not be what is wanted in the case of daemonize().

The main advantage of the new hierarchy compared to the old one is that
if you have two snapshots, you can always clearly say which relative has
gotten the CPU time of dead tasks. As stated earlier, we could also
achieve that by capturing the reparent events in userspace. Maybe I
should make a patch for that. Do you think that could be an acceptable
alternative?

If that also is not acceptable, we have to capture all task exit events
between two snapshots. But this can be a lot of overhead for accounting.

> This also means __account_to_parent() should take ->siglock itself.
> 
> Probably this is a matter of taste, but I do not understand why
> __account_to_parent() takes the boolean "wait" argument. The caller
> can just pass the correct task_struct* which is either ->real_parent
> or ->acct_parent.
> 
> > > Also, the logic behind ->exit_accounting_done looks wrong (and unneeded)
> > > but I am not sure...
> >
> > I think the logic is correct,
> 
> OK, I misread the patch as if we always account the exited task in
> parent's cdata_acct,
> 
> 	+       struct cdata cdata_wait; /* parents have done sys_wait() */
> 	+       struct cdata cdata_acct; /* complete cumulative data from acct tree */
> 
> while in fact the "complete" data is cdata_wait + cdata_acct.

No. The complete data is in cdata_acct. It contains both the times of
tasks that have been reaped via sys_wait() and the times of tasks that
have reaped themselves.

> Hmm. Let's return to your example above,
> 
> 	> Snapshot 1: P1 -> P2 -> P3
> 	> Snapshot 2: P1
> 	> ...
> 	> P1 got all the CPU time of P2 and P3
> 
> Suppose that P2 dies before P3. Then P3 dies, /sbin/init does wait and
> accounts this task. This means it is not accounted in P1->signal->cdata_acct,
> no?

No. __account_to_parent() with wait=1 is called when init waits for P3.
Then both sets, cdata_acct and cdata_wait, are updated:

+static void __account_to_parent(struct task_struct *p, int wait)
+{
+       if (wait)
+               __account_ctime(p, &p->real_parent->signal->cdata_wait,
+                               &p->signal->cdata_wait);
+       __account_ctime(p, &p->acct_parent->signal->cdata_acct,
+                       &p->signal->cdata_acct);
+       p->exit_accounting_done = 1;

If a task reaps itself, only cdata_acct is updated.

> > > > @@ -772,6 +869,15 @@ static void forget_original_parent(struc
> > > >  	LIST_HEAD(dead_children);
> > > >
> > > >  	write_lock_irq(&tasklist_lock);
> > > > +	list_for_each_entry_safe(p, n, &father->children_acct, sibling_acct) {
> > > > +		struct task_struct *t = p;
> > > > +		do {
> > > > +			t->acct_parent = t->acct_parent->acct_parent;
> > > > +		} while_each_thread(p, t);
> > > > +		list_move_tail(&p->sibling_acct,
> > > > +			       &p->acct_parent->children_acct);
> > >
> > > This is certainly wrong if there are other live threads in father's
> > > thread-group.
> >
> > Sorry for my ignorance. Probably I have not understood what happens, if
> > a thread group leader dies. My assumption was that then the whole thread
> > group dies.
> 
> No. A thread group dies when the last thread dies. If a leader exits
> it becomes a zombie until all other sub-threads exit.

That brought me to another question: Does this mean that the thread
group leader never changes and is always alive (at least as zombie) as
long as the thread group lives?

> > Also I assumed that a parent can only be a thread group
> > leader.
> 
> No. If a thread T does fork(), the child's ->real_parent is T, not
> T->group_leader. If T exits, we do not reparent its children to init
> (unless it is the last thread, of course), we pick another live
> thread in this thread group for reparenting.

Ok, I hope that I understand now. So we could either set the acct_parent
to the thread group leader in fork(), or, when a thread exits, use the
new parent in the thread group if there are live threads left.

Something like the following:

 static void forget_original_parent(struct task_struct *father)
 {
+       struct pid_namespace *pid_ns = task_active_pid_ns(father);
        struct task_struct *p, *n, *reaper;
        LIST_HEAD(dead_children);

        exit_ptrace(father);

        reaper = find_new_reaper(father);

+       list_for_each_entry_safe(p, n, &father->children_acct, sibling_acct) {
+               struct task_struct *t = p;
+               do {
+                       if (pid_ns->child_reaper == reaper)
+                               t->acct_parent = t->acct_parent->acct_parent;
+                       else
+                               t->acct_parent = reaper;
+               } while_each_thread(p, t);
+               list_move_tail(&p->sibling_acct,
+                              &p->acct_parent->children_acct);
+       }
+

Michael



* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-10-12 13:10                           ` Michael Holzheu
@ 2010-10-14 13:47                             ` Oleg Nesterov
  2010-10-15 14:34                               ` Michael Holzheu
  0 siblings, 1 reply; 58+ messages in thread
From: Oleg Nesterov @ 2010-10-14 13:47 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Roland McGrath, Martin Schwidefsky, Shailabh Nagar,
	Andrew Morton, Venkatesh Pallipadi, Peter Zijlstra,
	Suresh Siddha, John stultz, Thomas Gleixner, Balbir Singh,
	Ingo Molnar, Heiko Carstens, linux-s390, linux-kernel

Michael, sorry for delay...

On 10/12, Michael Holzheu wrote:
>
> On Mon, 2010-10-11 at 14:37 +0200, Oleg Nesterov wrote:
> >
> > > > Also, the logic behind ->exit_accounting_done looks wrong (and unneeded)
> > > > but I am not sure...
> > >
> > > I think the logic is correct,
> >
> > OK, I misread the patch as if we always account the exited task in
> > parent's cdata_acct,
> >
> > 	+       struct cdata cdata_wait; /* parents have done sys_wait() */
> > 	+       struct cdata cdata_acct; /* complete cumulative data from acct tree */
> >
> > while in fact the "complete" data is cdata_wait + cdata_acct.
>
> No. The complete data is in cdata_acct. It contains both, the task times
> where sys_wait() has been done and the task times, where the tasks have
> reaped themselves.

Hmm. This means my first understanding was correct. But now I am
confused again, see below.

> > Hmm. Let's return to your example above,
> >
> > 	> Snapshot 1: P1 -> P2 -> P3
> > 	> Snapshot 2: P1
> > 	> ...
> > 	> P1 got all the CPU time of P2 and P3
> >
> > Suppose that P2 dies before P3. Then P3 dies, /sbin/init does wait and
> > accounts this task. This means it is not accounted in P1->signal->cdata_acct,
> > no?
>
> No. __account_to_parent() with wait=1 is called when init waits for P3.
> Then both sets are updated cdata_acct and cdata_wait:
>
> +static void __account_to_parent(struct task_struct *p, int wait)
> +{
> +       if (wait)
> +               __account_ctime(p, &p->real_parent->signal->cdata_wait,
> +                               &p->signal->cdata_wait);
> +       __account_ctime(p, &p->acct_parent->signal->cdata_acct,
> +                       &p->signal->cdata_acct);
> +       p->exit_accounting_done = 1;
>
> If a task reaps itself, only cdata_acct is updated.

Yes. But __account_to_parent() always sets p->exit_accounting_done = 1.
And __exit_signal() calls __account_to_parent() only if it is not set.

This means that we update either cdata_wait (if the child was reaped
by parent) or cdata_acct (the process auto-reaps itself).

That is why I thought that ->exit_accounting_done should die, and
__exit_signal() should always call __account_to_parent() to update
cdata_acct.

Or I missed something? Confused ;)

> > > Sorry for my ignorance. Probably I have not understood what happens, if
> > > a thread group leader dies. My assumption was that then the whole thread
> > > group dies.
> >
> > No. A thread group dies when the last thread dies. If a leader exits
> > it becomes a zombie until all other sub-threads exit.
>
> That brought me to another question: Does this mean that the thread
> group leader never changes and is always alive (at least as zombie) as
> long as the thread group lives?

Yes. Except de_thread() can change the leader. The new leader is the
thread which calls exec.

> > > Also I assumed that a parent can only be a thread group
> > > leader.
> >
> > No. If a thread T does fork(), the child's ->real_parent is T, not
> > T->group_leader. If T exits, we do not reparent its children to init
> > (unless it is the last thread, of course), we pick another live
> > thread in this thread group for reparenting.
>
> Ok, I hope that I understand now. So either we could set the acct_parent
> to the thread group leader in fork(), or we use the new parent in the
> thread group if there are live threads left, when a thread exits.
>
> Something like the following:
>
>  static void forget_original_parent(struct task_struct *father)
>  {
> +       struct pid_namespace *pid_ns = task_active_pid_ns(father);
>         struct task_struct *p, *n, *reaper;
>         LIST_HEAD(dead_children);
>
>         exit_ptrace(father);
>
>         reaper = find_new_reaper(father);
>
> +       list_for_each_entry_safe(p, n, &father->children_acct, sibling_acct) {
> +               struct task_struct *t = p;
> +               do {
> +                       if (pid_ns->child_reaper == reaper)
> +                               t->acct_parent = t->acct_parent->acct_parent;
> +                       else
> +                               t->acct_parent = reaper;
> +               } while_each_thread(p, t);
> +               list_move_tail(&p->sibling_acct,
> +                              &p->acct_parent->children_acct);
> +       }
> +

I think you can simplify this, but I am not sure right now.

First of all, ->acct_parent should be moved from task_struct to
signal_struct. No need to initialize t->acct_parent unless t is
the group leader (this means we can avoid do/while_each_thread
loop during re-parenting, but de_thread needs another trivial
change).


No need to change forget_original_parent() at all, instead we
can add the single line

	p->signal->acct_parent = father->signal->acct_parent;

to reparent_leader(), after the "if (same_thread_group())" check.
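
IOW, the hunk could look roughly like this (sketch only, untested):

 	if (same_thread_group(p->real_parent, father))
 		return;
 
+	/* keep the accounting parent in sync with the real reparenting */
+	p->signal->acct_parent = father->signal->acct_parent;
+
 	/* We don't want people slaying init.  */
 	p->exit_signal = SIGCHLD;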

What do you think?

Oleg.



* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-10-14 13:47                             ` Oleg Nesterov
@ 2010-10-15 14:34                               ` Michael Holzheu
  2010-10-19 14:17                                 ` Oleg Nesterov
  0 siblings, 1 reply; 58+ messages in thread
From: Michael Holzheu @ 2010-10-15 14:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Roland McGrath, Martin Schwidefsky, Shailabh Nagar,
	Andrew Morton, Venkatesh Pallipadi, Peter Zijlstra,
	Suresh Siddha, John stultz, Thomas Gleixner, Balbir Singh,
	Ingo Molnar, Heiko Carstens, linux-s390, linux-kernel

Hello Oleg,

On Thu, 2010-10-14 at 15:47 +0200, Oleg Nesterov wrote:

[snip]

> > +static void __account_to_parent(struct task_struct *p, int wait)
> > +{
> > +       if (wait)
> > +               __account_ctime(p, &p->real_parent->signal->cdata_wait,
> > +                               &p->signal->cdata_wait);
> > +       __account_ctime(p, &p->acct_parent->signal->cdata_acct,
> > +                       &p->signal->cdata_acct);
> > +       p->exit_accounting_done = 1;
> >
> > If a task reaps itself, only cdata_acct is updated.
> 
> Yes. But __account_to_parent() always sets p->exit_accounting_done = 1.
> And __exit_signal() calls __account_to_parent() only if it is not set.
> 
> This means that we update either cdata_wait (if the child was reaped
> by parent) or cdata_acct (the process auto-reaps itself).

No. The accounting of cdata_acct is done unconditionally in
__account_to_parent(). It is done for both cases wait=0 and wait=1,
therefore no CPU time gets lost. Accounting of cdata_wait is done only
on the sys_wait() path, where "wait" is "1".

> That is why I thought that ->exit_accounting_done should die, and
> __exit_signal() should always call __account_to_parent() to update
> cdata_acct.

"exit_accounting_done" is used to find out, if cumulative accounting has
already been done in the sys_wait() path. If it has not done and the
process reaps itself, __account_to_parent() is called in signal_exit().

I think it works as it currently is. But as already said, this probably
could be done better. At least your confusion seems to prove that :-)

>  > > Sorry for my ignorance. Probably I have not understood what happens, if
> > > > a thread group leader dies. My assumption was that then the whole thread
> > > > group dies.
> > >
> > > No. A thread group dies when the last thread dies. If a leader exits
> > > it becomes a zombie until all other sub-threads exit.
> >
> > That brought me to another question: Does this mean that the thread
> > group leader never changes and is always alive (at least as zombie) as
> > long as the thread group lives?
> 
> Yes. Except de_thread() can change the leader. The new leader is the
> thread which calls exec.

de_thread() is also a very interesting spot for accounting. The thread
that calls exec() gets a bit of the identity of the old thread group
leader e.g. PID and start time, but it keeps the old CPU times. This
looks strange to me.

Wouldn't it be better to either exchange the accounting data between old
and new leader or add the current accounting data of the new leader to
the signal struct and initialize them with zero again? Regarding the
implementation of my top command I would prefer the first solution.
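
Just to illustrate the first variant, something like the following in
de_thread() (hypothetical helper name, field list incomplete, completely
untested):

/* swap the per-task accounting data of the old and the new leader */
static void swap_task_acct(struct task_struct *a, struct task_struct *b)
{
	swap(a->utime, b->utime);
	swap(a->stime, b->stime);
	swap(a->gtime, b->gtime);
	swap(a->min_flt, b->min_flt);
	swap(a->maj_flt, b->maj_flt);
	swap(a->nvcsw, b->nvcsw);
	swap(a->nivcsw, b->nivcsw);
}

/* in de_thread(), next to the existing PID / start_time handover */
swap_task_acct(tsk, leader);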

What do you think?

> > > > Also I assumed that a parent can only be a thread group
> > > > leader.
> > >
> > > No. If a thread T does fork(), the child's ->real_parent is T, not
> > > T->group_leader. If T exits, we do not reparent its children to init
> > > (unless it is the last thread, of course), we pick another live
> > > thread in this thread group for reparenting.
> >
> > Ok, I hope that I understand now. So either we could set the acct_parent
> > to the thread group leader in fork(), or we use the new parent in the
> > thread group if there are live threads left, when a thread exits.
> >
> > Something like the following:
> >
> >  static void forget_original_parent(struct task_struct *father)
> >  {
> > +       struct pid_namespace *pid_ns = task_active_pid_ns(father);
> >         struct task_struct *p, *n, *reaper;
> >         LIST_HEAD(dead_children);
> >
> >         exit_ptrace(father);
> >
> >         reaper = find_new_reaper(father);
> >
> > +       list_for_each_entry_safe(p, n, &father->children_acct, sibling_acct) {
> > +               struct task_struct *t = p;
> > +               do {
> > +                       if (pid_ns->child_reaper == reaper)
> > +                               t->acct_parent = t->acct_parent->acct_parent;
> > +                       else
> > +                               t->acct_parent = reaper;
> > +               } while_each_thread(p, t);
> > +               list_move_tail(&p->sibling_acct,
> > +                              &p->acct_parent->children_acct);
> > +       }
> > +
> 
> I think you can simplify this, but  I am not sure right now.
> 
> First of all, ->acct_parent should be moved from task_struct to
> signal_struct. No need to initialize t->acct_parent unless t is
> the group leader (this means we can avoid do/while_each_thread
> loop during re-parenting, but de_thread needs another trivial
> change).
> No need to change forget_original_parent() at all, instead we
> can add the single line
> 
> 	p->signal->acct_parent = father->signal->acct_parent;
> 
> to reparent_leader(), after the "if (same_thread_group())" check.
> 
> What do you think?

I think it is not that easy because we still have to maintain the
children_acct list. This list is used to reparent all the accounting
children to the new accounting parent.

But in principle you are right that acct_parent could be moved to the
signal_struct because we only have to change it when a thread group
leader dies. Not sure if the code gets easier by this change because we
might have to do a lot of signal_struct locking. Let me check...

Michael



* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-10-15 14:34                               ` Michael Holzheu
@ 2010-10-19 14:17                                 ` Oleg Nesterov
  2010-10-22 16:53                                   ` Michael Holzheu
  0 siblings, 1 reply; 58+ messages in thread
From: Oleg Nesterov @ 2010-10-19 14:17 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Roland McGrath, Martin Schwidefsky, Shailabh Nagar,
	Andrew Morton, Venkatesh Pallipadi, Peter Zijlstra,
	Suresh Siddha, John stultz, Thomas Gleixner, Balbir Singh,
	Ingo Molnar, Heiko Carstens, linux-s390, linux-kernel

On 10/15, Michael Holzheu wrote:
>
> On Thu, 2010-10-14 at 15:47 +0200, Oleg Nesterov wrote:
>
> > Yes. But __account_to_parent() always sets p->exit_accounting_done = 1.
> > And __exit_signal() calls __account_to_parent() only if it is not set.
> >
> > This means that we update either cdata_wait (if the child was reaped
> > by parent) or cdata_acct (the process auto-reaps itself).
>
> No. The accounting of cdata_acct is done unconditionally in
> __account_to_parent(). It is done for both cases wait=0 and wait=1,
> therefore no CPU time gets lost. Accounting of cdata_wait is done only
> on the sys_wait() path, where "wait" is "1".

Ah, got it, I didn't notice this detail.

Thanks.

> I think it works as it currently is. But as already said, this probably
> could be done better. At least your confusion seems to prove that :-)

Perhaps ;)

To me, it would be cleaner and simpler if you kill ->exit_accounting_done.
Both wait_task_zombie() and __exit_signal() could just call
__account_to_parent(parent_for_accounting) unconditionally passing
either real_parent or acct_parent as an argument. This also saves a
word in task_struct.
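
IOW, the group_dead path in __exit_signal() would simply do (sketch):

	if (group_dead)
		__account_to_parent(tsk, tsk->acct_parent);

and wait_task_zombie() keeps its call with ->real_parent.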

> de_thread() is also a very interesting spot for accounting. The thread
> that calls exec() gets a bit of the identity of the old thread group
> leader e.g. PID and start time, but it keeps the old CPU times. This
> looks strange to me.

Well, the main thread represents the whole process for ps/etc, that
is why we update ->start_time.

But,

> Wouldn't it be better to either exchange the accounting data between old
> and new leader

I dunno. The exiting old leader will update sig->utime/etc, so we do not
lose this info from the "whole process" pov. But yes, if user-space
looks at the single thread with that TGID it can notice that, say, utime
goes backward.

> or add the current accounting data of the new leader to
> the signal struct and initialize them with zero again?

Sorry, I don't understand this "initialize them with zero". What
is "them" ?

> > I think you can simplify this, but  I am not sure right now.
> >
> > First of all, ->acct_parent should be moved from task_struct to
> > signal_struct. No need to initialize t->acct_parent unless t is
> > the group leader (this means we can avoid do/while_each_thread
> > loop during re-parenting, but de_thread needs another trivial
> > change).
> > No need to change forget_original_parent() at all, instead we
> > can add the single line
> >
> > 	p->signal->acct_parent = father->signal->acct_parent;
> >
> > to reparent_leader(), after the "if (same_thread_group())" check.
> >
> > What do you think?
>
> I think it is not that easy because we still have to maintain the
> children_acct list. This list is used to reparent all the accounting
> children to the new accounting parent.

Yes, sure, reparent_leader() should also do list_move_tail(acct_sibling);
I forgot to mention this.

I guess you already understand this, but just in case: please look at
the sibling/children relationship. We do not add the sub-threads to the
->children list, only the main thread.

However, every thread has its own ->parent and ->children; this is
because we have __WNOTHREAD. But acct-parenting doesn't have this
problem: only the main thread needs a properly initialized
->acct_parent, and it is never needed until the whole process dies.

> But in principle you are right that acct_parent could be moved to the
> signal_struct because we only have to change it, when a thread group
> leader dies.

Yes. And if we move it into signal_struct, then we shouldn't worry
about updating it in de_thread().

However, de_thread() should do list_replace_init(leader->acct_sibling)
to add the new leader to acct_children.

I am not sure this really makes sense, but in fact you can move
->acct_sibling and ->acct_children from task_struct to signal_struct
as well, note that you can trivially find the group leader looking
at signal->leader_pid. (actually, ->group_leader should be moved
to signal_struct, but this is another story). In this case de_thread()
needs no changes, and we save space in task_struct.

Oleg.



* Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting
  2010-10-19 14:17                                 ` Oleg Nesterov
@ 2010-10-22 16:53                                   ` Michael Holzheu
  0 siblings, 0 replies; 58+ messages in thread
From: Michael Holzheu @ 2010-10-22 16:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Roland McGrath, Martin Schwidefsky, Shailabh Nagar,
	Andrew Morton, Venkatesh Pallipadi, Peter Zijlstra,
	Suresh Siddha, John stultz, Thomas Gleixner, Balbir Singh,
	Ingo Molnar, Heiko Carstens, linux-s390, linux-kernel

Hello Oleg,

On Tue, 2010-10-19 at 16:17 +0200, Oleg Nesterov wrote:
> On 10/15, Michael Holzheu wrote:
> >
> > On Thu, 2010-10-14 at 15:47 +0200, Oleg Nesterov wrote:
> >
> > > Yes. But __account_to_parent() always sets p->exit_accounting_done = 1.
> > > And __exit_signal() calls __account_to_parent() only if it is not set.
> > >
> > > This means that we update either cdata_wait (if the child was reaped
> > > by parent) or cdata_acct (the process auto-reaps itself).
> >
> > No. The accounting of cdata_acct is done unconditionally in
> > __account_to_parent(). It is done for both cases wait=0 and wait=1,
> > therefore no CPU time gets lost. Accounting of cdata_wait is done only
> > on the sys_wait() path, where "wait" is "1".
> 
> Ah, got it, I didn't notice this detail.
> 
> Thanks.
> 
> > I think it works as it currently is. But as already said, this probably
> > could be done better. At least your confusion seems to prove that :-)
> 
> Perhaps ;)
> 
> To me, it would be cleaner and simpler if you kill ->exit_accounting_done.
> Both wait_task_zombie() and __exit_signal() could just call
> __account_to_parent(parent_for_accounting) unconditionally passing
> either real_parent or acct_parent as an argument. This also saves a
> word in task_struct.

Yes, this works and is better than using the exit_accounting_done flag.
I changed my patch accordingly, see below!

> > de_thread() is also a very interesting spot for accounting. The thread
> > that calls exec() gets a bit of the identity of the old thread group
> > leader e.g. PID and start time, but it keeps the old CPU times. This
> > looks strange to me.
> 
> Well, the main thread represents the whole process for ps/etc, that
> is why we update ->start_time.
> 
> But,
> 
> > Wouldn't it be better to either exchange the accounting data between old
> > and new leader
> 
> I dunno. The exiting old leader will update sig->utime/etc, so we do not
> lose this info from the "whole process" pov. But yes, if user-space
> looks at the single thread with that TGID it can notice that, say, utime
> goes backward.

I think it would be an improvement if we exchange the accounting data
between the old and the new leader. After that, accounting will look
consistent again for user space. I will try to make a patch as a
proposal for that.

> > or add the current accounting data of the new leader to
> > the signal struct and initialize them with zero again?
> 
> Sorry, I don't understand this "initialize them with zero". What
> is "them" ?

Sorry, I meant that the accounting fields of the new leader could be set
to zero. The semantics would then be that with exec() of the new binary,
task accounting also starts from the beginning.

[snip]

> I am not sure this really makes sense, but in fact you can move
> ->acct_sibling and ->acct_children from task_struct to signal_struct
> as well, note that you can trivially find the group leader looking
> at signal->leader_pid.

And I use pid_task() to get the leader task_struct via leader_pid?

> (actually, ->group_leader should be moved
> to signal_struct, but this is another story).

Then we would not need pid_task()...

> In this case de_thread()
> needs no changes, and we save the space in task_struct.

Below I attached a patch that implements your proposal. I moved
everything to the signal struct and instead of creating a task_struct
accounting tree, I now create a signal_struct tree which represents the
processes (in contrast to threads).

Probably some locking is missing in the patch, but it should show the
principle.

Michael
---
 fs/binfmt_elf.c           |    4 -
 fs/proc/array.c           |   10 +--
 fs/proc/base.c            |    3 
 include/linux/init_task.h |    2 
 include/linux/sched.h     |   35 ++++++++--
 kernel/exit.c             |  150 +++++++++++++++++++++++++++-------------------
 kernel/fork.c             |    7 ++
 kernel/sys.c              |   24 +++----
 8 files changed, 149 insertions(+), 86 deletions(-)

--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1296,8 +1296,8 @@ static void fill_prstatus(struct elf_prs
 		cputime_to_timeval(p->utime, &prstatus->pr_utime);
 		cputime_to_timeval(p->stime, &prstatus->pr_stime);
 	}
-	cputime_to_timeval(p->signal->cutime, &prstatus->pr_cutime);
-	cputime_to_timeval(p->signal->cstime, &prstatus->pr_cstime);
+	cputime_to_timeval(p->signal->cdata_wait.cutime, &prstatus->pr_cutime);
+	cputime_to_timeval(p->signal->cdata_wait.cstime, &prstatus->pr_cstime);
 }
 
 static int fill_psinfo(struct elf_prpsinfo *psinfo, struct task_struct *p,
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -413,11 +413,11 @@ static int do_task_stat(struct seq_file
 		num_threads = get_nr_threads(task);
 		collect_sigign_sigcatch(task, &sigign, &sigcatch);
 
-		cmin_flt = sig->cmin_flt;
-		cmaj_flt = sig->cmaj_flt;
-		cutime = sig->cutime;
-		cstime = sig->cstime;
-		cgtime = sig->cgtime;
+		cmin_flt = sig->cdata_wait.cmin_flt;
+		cmaj_flt = sig->cdata_wait.cmaj_flt;
+		cutime = sig->cdata_wait.cutime;
+		cstime = sig->cdata_wait.cstime;
+		cgtime = sig->cdata_wait.cgtime;
 		rsslim = ACCESS_ONCE(sig->rlim[RLIMIT_RSS].rlim_cur);
 
 		/* add up live thread stats at the group level */
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2617,7 +2617,8 @@ static int do_io_accounting(struct task_
 	if (whole && lock_task_sighand(task, &flags)) {
 		struct task_struct *t = task;
 
-		task_io_accounting_add(&acct, &task->signal->ioac);
+		task_io_accounting_add(&acct,
+				       &task->signal->cdata_wait.ioac);
 		while_each_thread(task, t)
 			task_io_accounting_add(&acct, &t->ioac);
 
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -29,6 +29,8 @@ extern struct fs_struct init_fs;
 		.running = 0,						\
 		.lock = __SPIN_LOCK_UNLOCKED(sig.cputimer.lock),	\
 	},								\
+	.acct_sibling	= LIST_HEAD_INIT(sig.acct_sibling),		\
+	.acct_children	= LIST_HEAD_INIT(sig.acct_children),		\
 }
 
 extern struct nsproxy init_nsproxy;
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -507,6 +507,20 @@ struct thread_group_cputimer {
 };
 
 /*
+ * Cumulative resource counters for reaped dead child processes.
+ * Live threads maintain their own counters and add to these
+ * in __exit_signal, except for the group leader.
+ */
+struct cdata {
+	cputime_t cutime, cstime, csttime, cgtime;
+	unsigned long cnvcsw, cnivcsw;
+	unsigned long cmin_flt, cmaj_flt;
+	unsigned long cinblock, coublock;
+	unsigned long cmaxrss;
+	struct task_io_accounting ioac;
+};
+
+/*
  * NOTE! "signal_struct" does not have it's own
  * locking, because a shared signal_struct always
  * implies a shared sighand_struct, so locking
@@ -579,17 +593,24 @@ struct signal_struct {
 	 * Live threads maintain their own counters and add to these
 	 * in __exit_signal, except for the group leader.
 	 */
-	cputime_t utime, stime, cutime, cstime;
+	cputime_t utime, stime;
 	cputime_t gtime;
-	cputime_t cgtime;
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
 	cputime_t prev_utime, prev_stime;
 #endif
-	unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
-	unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
-	unsigned long inblock, oublock, cinblock, coublock;
-	unsigned long maxrss, cmaxrss;
-	struct task_io_accounting ioac;
+	unsigned long nvcsw, nivcsw;
+	unsigned long min_flt, maj_flt;
+	unsigned long inblock, oublock;
+	unsigned long maxrss;
+
+	/* Cumulative resource counters for all dead child processes */
+	struct cdata cdata_wait; /* parents have done sys_wait() */
+	struct cdata cdata_acct; /* complete cumulative data from acct tree */
+
+	/* Parallel accounting tree */
+	struct signal_struct *acct_parent; /* accounting parent process */
+	struct list_head acct_children;    /* list of my accounting children */
+	struct list_head acct_sibling;     /* linkage in my parent's list */
 
 	/*
 	 * Cumulative ns of schedule CPU time fo dead threads in the
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -50,6 +50,7 @@
 #include <linux/perf_event.h>
 #include <trace/events/sched.h>
 #include <linux/hw_breakpoint.h>
+#include <linux/kernel_stat.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -73,6 +74,60 @@ static void __unhash_process(struct task
 	list_del_rcu(&p->thread_group);
 }
 
+static void __account_ctime(struct task_struct *p, struct cdata *pcd,
+			    struct cdata *ccd)
+{
+	struct signal_struct *sig = p->signal;
+	cputime_t tgutime, tgstime;
+	unsigned long maxrss;
+
+	/*
+	 * The resource counters for the group leader are in its
+	 * own task_struct.  Those for dead threads in the group
+	 * are in its signal_struct, as are those for the child
+	 * processes it has previously reaped.  All these
+	 * accumulate in the parent's signal_struct c* fields.
+	 *
+	 * We don't bother to take a lock here to protect these
+	 * p->signal fields, because they are only touched by
+	 * __exit_signal, which runs with tasklist_lock
+	 * write-locked anyway, and so is excluded here.  We do
+	 * need to protect the access to parent->signal fields,
+	 * as other threads in the parent group can be right
+	 * here reaping other children at the same time.
+	 *
+	 * We use thread_group_times() to get times for the thread
+	 * group, which consolidates times for all threads in the
+	 * group including the group leader.
+	 */
+	thread_group_times(p, &tgutime, &tgstime);
+
+	pcd->cutime = cputime_add(pcd->cutime,
+				  cputime_add(tgutime, ccd->cutime));
+	pcd->cstime = cputime_add(pcd->cstime,
+				  cputime_add(tgstime, ccd->cstime));
+	pcd->cgtime = cputime_add(pcd->cgtime, cputime_add(p->gtime,
+			   cputime_add(sig->gtime, ccd->cgtime)));
+
+	pcd->cmin_flt += p->min_flt + sig->min_flt + ccd->cmin_flt;
+	pcd->cmaj_flt += p->maj_flt + sig->maj_flt + ccd->cmaj_flt;
+	pcd->cnvcsw += p->nvcsw + sig->nvcsw + ccd->cnvcsw;
+	pcd->cnivcsw += p->nivcsw + sig->nivcsw + ccd->cnivcsw;
+	pcd->cinblock += task_io_get_inblock(p) + sig->inblock + ccd->cinblock;
+	pcd->coublock += task_io_get_oublock(p) + sig->oublock + ccd->coublock;
+	maxrss = max(sig->maxrss, ccd->cmaxrss);
+	if (pcd->cmaxrss < maxrss)
+		pcd->cmaxrss = maxrss;
+
+	task_io_accounting_add(&pcd->ioac, &p->ioac);
+	task_io_accounting_add(&pcd->ioac, &ccd->ioac);
+}
+
 /*
  * This function expects the tasklist_lock write-locked.
  */
@@ -83,6 +138,21 @@ static void __exit_signal(struct task_st
 	struct sighand_struct *sighand;
 	struct tty_struct *uninitialized_var(tty);
 
+	if (group_dead) {
+		struct task_struct *acct_parent =
+			pid_task(sig->acct_parent->leader_pid, PIDTYPE_PID);
+		/*
+		 * FIXME: This somehow has to be moved to
+		 * finish_task_switch(), because otherwise
+		 * if the process accounts itself, the CPU time
+		 * that is used for this code will be lost.
+		 */
+		spin_lock(&acct_parent->sighand->siglock);
+		__account_ctime(tsk, &sig->acct_parent->cdata_acct,
+				&sig->cdata_acct);
+		spin_unlock(&acct_parent->sighand->siglock);
+		list_del_init(&sig->acct_sibling);
+	}
 	sighand = rcu_dereference_check(tsk->sighand,
 					rcu_read_lock_held() ||
 					lockdep_tasklist_lock_is_held());
@@ -122,7 +192,8 @@ static void __exit_signal(struct task_st
 		sig->nivcsw += tsk->nivcsw;
 		sig->inblock += task_io_get_inblock(tsk);
 		sig->oublock += task_io_get_oublock(tsk);
-		task_io_accounting_add(&sig->ioac, &tsk->ioac);
+		task_io_accounting_add(&sig->cdata_wait.ioac,
+				       &tsk->ioac);
 		sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
 	}
 
@@ -731,6 +802,17 @@ static struct task_struct *find_new_reap
 	return pid_ns->child_reaper;
 }
 
+static void reparent_acct(struct signal_struct *father)
+{
+	struct signal_struct *c, *n, *new_parent;
+
+	new_parent = father->acct_parent;
+	list_for_each_entry_safe(c, n, &father->acct_children, acct_sibling) {
+		c->acct_parent = new_parent;
+		list_move_tail(&c->acct_sibling, &new_parent->acct_children);
+	}
+}
+
 /*
 * Any that need to be release_task'd are put on the @dead list.
  */
@@ -748,6 +830,11 @@ static void reparent_leader(struct task_
 	if (same_thread_group(p->real_parent, father))
 		return;
 
+	/*
+	 * Father is thread group leader
+	 */
+	reparent_acct(father->signal);
+
 	/* We don't want people slaying init.  */
 	p->exit_signal = SIGCHLD;
 
@@ -1212,66 +1299,9 @@ static int wait_task_zombie(struct wait_
 	 * !task_detached() to filter out sub-threads.
 	 */
 	if (likely(!traced) && likely(!task_detached(p))) {
-		struct signal_struct *psig;
-		struct signal_struct *sig;
-		unsigned long maxrss;
-		cputime_t tgutime, tgstime;
-
-		/*
-		 * The resource counters for the group leader are in its
-		 * own task_struct.  Those for dead threads in the group
-		 * are in its signal_struct, as are those for the child
-		 * processes it has previously reaped.  All these
-		 * accumulate in the parent's signal_struct c* fields.
-		 *
-		 * We don't bother to take a lock here to protect these
-		 * p->signal fields, because they are only touched by
-		 * __exit_signal, which runs with tasklist_lock
-		 * write-locked anyway, and so is excluded here.  We do
-		 * need to protect the access to parent->signal fields,
-		 * as other threads in the parent group can be right
-		 * here reaping other children at the same time.
-		 *
-		 * We use thread_group_times() to get times for the thread
-		 * group, which consolidates times for all threads in the
-		 * group including the group leader.
-		 */
-		thread_group_times(p, &tgutime, &tgstime);
 		spin_lock_irq(&p->real_parent->sighand->siglock);
-		psig = p->real_parent->signal;
-		sig = p->signal;
-		psig->cutime =
-			cputime_add(psig->cutime,
-			cputime_add(tgutime,
-				    sig->cutime));
-		psig->cstime =
-			cputime_add(psig->cstime,
-			cputime_add(tgstime,
-				    sig->cstime));
-		psig->cgtime =
-			cputime_add(psig->cgtime,
-			cputime_add(p->gtime,
-			cputime_add(sig->gtime,
-				    sig->cgtime)));
-		psig->cmin_flt +=
-			p->min_flt + sig->min_flt + sig->cmin_flt;
-		psig->cmaj_flt +=
-			p->maj_flt + sig->maj_flt + sig->cmaj_flt;
-		psig->cnvcsw +=
-			p->nvcsw + sig->nvcsw + sig->cnvcsw;
-		psig->cnivcsw +=
-			p->nivcsw + sig->nivcsw + sig->cnivcsw;
-		psig->cinblock +=
-			task_io_get_inblock(p) +
-			sig->inblock + sig->cinblock;
-		psig->coublock +=
-			task_io_get_oublock(p) +
-			sig->oublock + sig->coublock;
-		maxrss = max(sig->maxrss, sig->cmaxrss);
-		if (psig->cmaxrss < maxrss)
-			psig->cmaxrss = maxrss;
-		task_io_accounting_add(&psig->ioac, &p->ioac);
-		task_io_accounting_add(&psig->ioac, &sig->ioac);
+		__account_ctime(p, &p->real_parent->signal->cdata_wait,
+				&p->signal->cdata_wait);
 		spin_unlock_irq(&p->real_parent->sighand->siglock);
 	}
 
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1282,6 +1282,13 @@ static struct task_struct *copy_process(
 		nr_threads++;
 	}
 
+	if (!(clone_flags & CLONE_THREAD)) {
+		p->signal->acct_parent = current->signal;
+		INIT_LIST_HEAD(&p->signal->acct_children);
+		INIT_LIST_HEAD(&p->signal->acct_sibling);
+		list_add_tail(&p->signal->acct_sibling,
+			      &current->signal->acct_children);
+	}
 	total_forks++;
 	spin_unlock(&current->sighand->siglock);
 	write_unlock_irq(&tasklist_lock);
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -884,8 +884,8 @@ void do_sys_times(struct tms *tms)
 
 	spin_lock_irq(&current->sighand->siglock);
 	thread_group_times(current, &tgutime, &tgstime);
-	cutime = current->signal->cutime;
-	cstime = current->signal->cstime;
+	cutime = current->signal->cdata_wait.cutime;
+	cstime = current->signal->cdata_wait.cstime;
 	spin_unlock_irq(&current->sighand->siglock);
 	tms->tms_utime = cputime_to_clock_t(tgutime);
 	tms->tms_stime = cputime_to_clock_t(tgstime);
@@ -1490,6 +1490,7 @@ static void k_getrusage(struct task_stru
 	unsigned long flags;
 	cputime_t tgutime, tgstime, utime, stime;
 	unsigned long maxrss = 0;
+	struct cdata *cd;
 
 	memset((char *) r, 0, sizeof *r);
 	utime = stime = cputime_zero;
@@ -1507,15 +1508,16 @@ static void k_getrusage(struct task_stru
 	switch (who) {
 		case RUSAGE_BOTH:
 		case RUSAGE_CHILDREN:
-			utime = p->signal->cutime;
-			stime = p->signal->cstime;
-			r->ru_nvcsw = p->signal->cnvcsw;
-			r->ru_nivcsw = p->signal->cnivcsw;
-			r->ru_minflt = p->signal->cmin_flt;
-			r->ru_majflt = p->signal->cmaj_flt;
-			r->ru_inblock = p->signal->cinblock;
-			r->ru_oublock = p->signal->coublock;
-			maxrss = p->signal->cmaxrss;
+			cd = &p->signal->cdata_wait;
+			utime = cd->cutime;
+			stime = cd->cstime;
+			r->ru_nvcsw = cd->cnvcsw;
+			r->ru_nivcsw = cd->cnivcsw;
+			r->ru_minflt = cd->cmin_flt;
+			r->ru_majflt = cd->cmaj_flt;
+			r->ru_inblock = cd->cinblock;
+			r->ru_oublock = cd->coublock;
+			maxrss = cd->cmaxrss;
 
 			if (who == RUSAGE_CHILDREN)
 				break;




end of thread, other threads:[~2010-10-22 16:53 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --