* [RFC][PATCH v2 0/7] taskstats: Enhancements for precise process accounting (version 2)
From: Michael Holzheu @ 2010-11-11 17:03 UTC
  To: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath
  Cc: linux-kernel, linux-s390

CHANGE HISTORY OF THIS PATCH SET
---------------------------------
Version 2
---------
* The following patches from version 1 have been accepted upstream for 2.6.37:
  + taskstats: use real microsecond granularity for CPU times
    git commit: d57af9b2142f31a39dcfdeb30776baadfc802827
  + taskstats: separate taskstats commands
    git commit: 9323312592cca636d7c2580dc85fa4846efa86a2
  + taskstats: split fill_pid function
    git commit: 3d9e0cf1fe007b88db55d43dfdb6839e1a029ca5

* Comment from Andrew Morton:
  I replaced the /proc/taskstats ioctls with a write interface (see patch [2])

* Based on discussions with Oleg Nesterov and Roland McGrath:
  We have identified the following problems of the current approach for getting
  100% CPU time between two taskstats snapshots without using exit events:

  1) Due to POSIX.1-2001, the CPU time of processes is not accounted
     to the cumulative time of the parents if the parents ignore SIGCHLD
     or have set SA_NOCLDWAIT. This behaviour has the major drawback that
     it is not possible to calculate all consumed CPU time of a system by
     looking at the current tasks: CPU time can be lost. See the demo
     after this list.

  2) When a parent process dies, its children get the init process as
     their new parent. For accounting this is suboptimal, because init
     then gets the CPU time of the tasks. For accounting it would be
     better if the CPU time were passed along the relationship tree using
     the cumulative time counters, as would have happened if the child
     had died before the parent. The main benefit of this approach is
     that between two task snapshots it is always clear which parent got
     the CPU time of dead tasks. Otherwise there are situations where we
     can't determine whether init or an older relative of a dead task has
     inherited the CPU time.

  3) If a non-leader thread calls exec(), it takes over parts of the old
     thread group leader's identity in the de_thread() function, e.g. the
     PID and the start time, but it keeps the old CPU times. This can
     confuse user space, because the CPU time of the thread group leader
     can appear to go backwards.
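
  The following small user space program (an illustration of issue (1),
  not part of the patch set) shows the effect: with SA_NOCLDWAIT set,
  the child reaps itself on exit and its CPU time never shows up in the
  parent's cumulative counters:

    #include <signal.h>
    #include <stdio.h>
    #include <sys/times.h>
    #include <unistd.h>

    int main(void)
    {
            struct sigaction sa = { .sa_handler = SIG_IGN,
                                    .sa_flags = SA_NOCLDWAIT };
            volatile long i;
            struct tms t;

            sigaction(SIGCHLD, &sa, NULL);
            if (fork() == 0) {              /* child burns some CPU ... */
                    for (i = 0; i < 100000000L; i++)
                            ;
                    _exit(0);
            }
            sleep(10);     /* ... child exits and is reaped automatically */
            times(&t);
            /* both stay 0 - the child's CPU time is lost for accounting */
            printf("cutime=%ld cstime=%ld\n",
                   (long) t.tms_cutime, (long) t.tms_cstime);
            return 0;
    }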

  As a result of this discussion I have developed a new patch [5] (see below):
  * It adds a second set of accounting data to the signal_struct.
    This set is used to save all accounting data and resolves issue (1).
  * In addition to that, the patch adds a new accounting hierarchy for
    processes in the signal_struct to resolve issue (2).

  There are three possible alternatives to this approach:
  * Introduce "reparent" events for tasks that get init as new parent.
  * Get all task exit events (very expensive).
  * Live with the fact that accounting without exit events cannot be done
    with 100% accuracy on UNIX/Linux and provide a solution that works
    with the cumulative time accounting as it is currently available
    under Linux.

  I have developed an additional patch [6] to resolve issue (3).
  It swaps the accounting data between the exec()ing thread and the thread
  group leader in the de_thread() function. But perhaps another solution
  should be found for that problem, because the swapped values might
  confuse the scheduler.

DESCRIPTION OF PATCH SET
------------------------
Currently tools like "top" gather the task information by reading procfs
files. This has several disadvantages:

* It is very CPU intensive, because a lot of system calls (readdir, open,
  read, close) are necessary.
* No real task snapshot can be provided, because while the procfs files are
  read the system continues running.
* It is not possible to identify 100% of all consumed CPU time between two
  snapshots.
* The granularity of the procfs times is restricted to jiffies.

In parallel to procfs there is the taskstats binary interface that uses
netlink sockets as transport mechanism to deliver task information to
user space. There is a taskstats command "TASKSTATS_CMD_ATTR_PID"
to get task information for a given PID. This command can already be used
by tools like top, but it also has several disadvantages:

* You first have to find out which PIDs are available in the system.
  Currently we have to use procfs again to do this.
* For each task two system calls have to be issued (first one to send the
  command and then one to receive the reply; see the sketch below).
* No snapshot mechanism is available.
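
To make the per-task cost concrete, here is a minimal sketch of such a
query, modeled on Documentation/accounting/getdelays.c. It assumes that
'fd' is an already connected NETLINK_GENERIC socket and that 'family_id'
has been resolved via CTRL_CMD_GETFAMILY; parsing of the reply and error
handling are omitted:

  #include <linux/genetlink.h>
  #include <linux/netlink.h>
  #include <linux/taskstats.h>
  #include <string.h>
  #include <sys/socket.h>

  /* helper macros as in Documentation/accounting/getdelays.c */
  #define GENLMSG_DATA(glh)  ((void *)(NLMSG_DATA(glh) + GENL_HDRLEN))
  #define NLA_DATA(na)       ((void *)((char *)(na) + NLA_HDRLEN))

  struct msgtemplate {
          struct nlmsghdr n;
          struct genlmsghdr g;
          char buf[1024];
  };

  static int query_pid(int fd, __u16 family_id, __u32 pid)
  {
          struct msgtemplate msg;
          struct nlattr *na;

          memset(&msg, 0, sizeof(msg));
          msg.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
          msg.n.nlmsg_type = family_id;
          msg.n.nlmsg_flags = NLM_F_REQUEST;
          msg.g.cmd = TASKSTATS_CMD_GET;
          msg.g.version = TASKSTATS_GENL_VERSION;

          /* attach the TASKSTATS_CMD_ATTR_PID attribute */
          na = (struct nlattr *) GENLMSG_DATA(&msg);
          na->nla_type = TASKSTATS_CMD_ATTR_PID;
          na->nla_len = NLA_HDRLEN + sizeof(pid);
          memcpy(NLA_DATA(na), &pid, sizeof(pid));
          msg.n.nlmsg_len += NLMSG_ALIGN(na->nla_len);

          /* system call 1: send the command for this single task ... */
          if (send(fd, &msg, msg.n.nlmsg_len, 0) < 0)
                  return -1;
          /* ... system call 2: receive the reply with the taskstats */
          return recv(fd, &msg, sizeof(msg), 0);
  }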

GOALS OF THIS PATCH SET
-----------------------
The intention of this patch set is to provide better support for tools like
top. The goal is to:

* provide a task snapshot mechanism where we can get a consistent view of
  all running tasks.
* identify 100% of the consumed CPU time between two snapshots without
  using task exit events.
* provide a transport mechanism that does not require a lot of system calls
  and that allows implementing low CPU overhead task monitoring.
* provide microsecond CPU time granularity.

FIRST RESULTS
-------------
Together with this kernel patch set, user space code for a new top
utility (ptop) is provided that exploits the new kernel infrastructure.
See patch [7] for more details.

TEST1: System with many sleeping tasks

  for ((i=0; i < 1000; i++))
  do
         sleep 1000000 &
  done

  # ptop_new_proc

             VVVV
  pid   user  sys  ste  total  Name
  (#)    (%)  (%)  (%)    (%)  (str)
  541   0.37 2.39 0.10   2.87  top
  3743  0.03 0.05 0.00   0.07  ptop_new_proc
             ^^^^

Compared to the old top command, which has to scan more than 1000 proc
directories, the new ptop consumes much less CPU time (0.05% system time
on my s390 system).

TEST2: Show snapshot consistency with system that is 100% busy

  System with 2 CPUs:

  for ((i=0; i < $(cat /proc/cpuinfo  | grep "^processor" | wc -l); i++))
  do
       ./loop &
  done
  cd linux-2.6
  make -j 5

  # ptop_snap_proc
  pid     user  sys stea cuser  csys cstea xuser xsys xstea  total Name
  (#)      (%)  (%)  (%)   (%)   (%)   (%)   (%)  (%)   (%)    (%) (str)
  2802   43.22 0.35 0.22  0.00  0.00  0.00  0.00 0.00  0.00  43.79 loop
  2799   35.96 0.33 0.21  0.00  0.00  0.00  0.00 0.00  0.00  36.50 loop
  2811    0.04 0.05 0.00 23.22 12.97  0.19  0.00 0.00  0.00  36.46 make
  2796   35.80 0.32 0.19  0.00  0.00  0.00  0.00 0.00  0.00  36.30 loop
  2987   15.93 2.14 0.07  8.23  3.12  0.06  0.00 0.00  0.00  29.53 make
  3044   11.56 1.72 0.22  0.00  0.00  0.00  0.00 0.00  0.00  13.50 make
  3053    1.92 0.73 0.01  0.00  0.00  0.00  0.00 0.00  0.00   2.65 make
  ....
  V:V:S 144.76 6.24 1.22 31.44 16.09  0.25  0.00 0.00  0.00 200.00
                                                            ^^^^^^

  With the snapshot mechanism the sum of all tasks' CPU times adds up to
  exactly 200.00% CPU time in this testcase. The following CPU times are
  used:
  * user/sys/stea:    Time that has been consumed by the task itself
  * cuser/csys/cstea: Time that has been consumed by dead children of the
                      task
  * xuser/xsys/xstea: Time that has been consumed by dead threads of the
                      task's thread group

  Note that the CPU times on x86 are not as precise as on s390.

PATCHSET OVERVIEW
-----------------
Patches apply on:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

The code is not final and still has several TODOs. The following kernel patches
are provided:

[1] Add new command "TASKSTATS_CMD_ATTR_PIDS" to get a snapshot of multiple
    tasks.
[2] Add procfs interface for taskstats commands. This makes it possible to
    get a complete and consistent snapshot of all tasks using two system
    calls (write and read). Transferring a snapshot of all running tasks is
    not possible using the existing netlink interface, because there the
    socket buffer size is the restricting factor.
[3] Add TGID to taskstats.
[4] Add steal time per task accounting.
[5] Improve cumulative CPU time accounting.
[6] Fix accounting for non-leader thread exec.

[7] Besides the kernel patches, user space code is provided that
    exploits the new kernel infrastructure. The user space code provides the
    following:
    1. A proposal for a taskstats user space library:
       1.1 Based on netlink (requires libnl-devel-1.1-5)
       1.2 Based on the new /proc/taskstats interface (see patch [2])
    2. A proposal for a task snapshot library based on taskstats library (1.1)
    3. A new tool "ptop" (precise top) that uses the libraries

* Especially patch [1] "Add new command "TASKSTATS_CMD_ATTR_PIDS" to
  get a snapshot of multiple tasks" needs more review, because it provides
  the main functionality for getting consistent task snapshots. So far
  I have not received any feedback on version 1 of this patch.

* The user space part with the libraries [7] also needs review.


* [RFC][PATCH v2 1/7] taskstats: Add new taskstats command TASKSTATS_CMD_ATTR_PIDS
From: Michael Holzheu @ 2010-11-11 17:03 UTC
  To: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath
  Cc: linux-kernel, linux-s390

[-- Attachment #1: 01-taskstats-top-cmd-pids.patch --]
[-- Type: text/plain, Size: 18971 bytes --]

From: Michael Holzheu <holzheu@linux.vnet.ibm.com>

The new command is designed to be used by commands like "top" that want
to create a snapshot of the running tasks. The command has the following
arguments:

 * pid:  PID to start searching
 * cnt:  Maximum number of taskstats to be returned
 * time: Timestamp (sched_clock)

The semantics of the command are as follows:

Return taskstats for at most 'cnt' tasks with PIDs greater than or equal
to 'pid' that have been or still are in state TASK_RUNNING between the
given 'time' and now. 'time' correlates to the new taskstats field time_ns.

If no more taskstats are found, a final zero taskstats struct is returned that
marks the end of the netlink transmission.

The sched_clock() function is used as clock for 'now' and 'time', and the
patch adds a new field last_depart to the sched_info structure.

Sequence numbers are used to ensure reliable netlink communication with
user space. The first taskstats struct is returned with the same sequence
number as the command; subsequent taskstats are transmitted with ascending
sequence numbers.
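
On the receiving side, the cmd_pids() helper used in the pseudo code
below could unpack this reply stream roughly as follows (my own sketch,
reusing the getdelays.c-style GENLMSG_DATA/NLA_DATA helper macros from
the cover letter and assuming the usual TASKSTATS_TYPE_AGGR_PID reply
layout; the end-of-stream test via time_ns == 0 is an assumption, since
the swapper aggregate also has ac_pid == 0):

  static int recv_stats_vec(int fd, __u32 seq, struct taskstats *vec,
                            int max)
  {
          char buf[4096];
          int cnt = 0;

          while (cnt < max) {
                  struct nlmsghdr *nlh = (struct nlmsghdr *) buf;
                  struct nlattr *na;
                  struct taskstats *ts;

                  if (recv(fd, buf, sizeof(buf), 0) < 0)
                          return -1;
                  if (nlh->nlmsg_seq != seq + cnt)
                          return -1;              /* lost a message */
                  /* layout per message: AGGR_PID nest -> PID + STATS */
                  na = (struct nlattr *) GENLMSG_DATA(nlh);
                  na = (struct nlattr *) NLA_DATA(na);    /* PID attr */
                  na = (struct nlattr *) ((char *) na +
                                          NLA_ALIGN(na->nla_len));
                  ts = (struct taskstats *) NLA_DATA(na); /* STATS attr */
                  if (ts->ac_pid == 0 && ts->time_ns == 0)
                          break;  /* zero taskstats: end of transmission */
                  vec[cnt++] = *ts;
          }
          return cnt;
  }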

The new command can be used by user space as follows (pseudo code):

Initial: Get taskstats for all tasks

   int start_pid = 0, start_time = 0, oldest_time = INT_MAX;
   struct taskstats taskstats_vec[50];

   do {
          cnt = cmd_pids(start_pid, start_time, taskstats_vec, 50);
          for (i = 0; i < cnt; i++)
                  oldest_time = MIN(oldest_time, taskstats_vec[i].time_ns);
          update_database(taskstats_vec, cnt);
          start_pid = taskstats_vec[cnt - 1].ac_pid + 1;
   } while (cnt == 50);

Update: Get all taskstats for tasks that were active after 'oldest_time'

   new_oldest_time = INT_MAX;
   start_pid = 0;
   do {
          cnt = cmd_pids(start_pid, oldest_time, taskstats_vec, 50);
          for (i = 0; i < cnt; i++)
                  new_oldest_time = MIN(new_oldest_time,
                                        taskstats_vec[i].time_ns);
          update_database(taskstats_vec, cnt);
          start_pid = taskstats_vec[cnt - 1].ac_pid + 1;
   } while (cnt == 50);
   oldest_time = new_oldest_time;

The current approach assumes that sched_clock() can't wrap. If this
assumption does not hold, things become more difficult: the taskstats code
would have to detect that sched_clock() wrapped since the last query and
return a notification to user space. User space could then use
oldest_time=0 to resynchronize.

GOALS OF THIS PATCH
-------------------
Compared to the already existing taskstats command TASKSTATS_CMD_ATTR_PID,
the new command has the following advantages for implementing tools like top:
* No scan of procfs files necessary to find out running tasks.
* A consistent snapshot of task data is possible if all taskstats can be
  transferred to the socket buffer.
* When using the 'time' parameter, only active tasks have to be transferred.
  This could be used for a special 'low CPU overhead' monitoring mode.
* Fewer system calls are necessary, because only one command has to be sent
  to receive multiple taskstats.

OPEN ISSUES
-----------
* Because of the netlink socket buffer size restriction (default 64KB) it
  is not possible to transfer a consistent full taskstats snapshot that
  contains all tasks. See the procfs patch as a proposal to solve this problem.
* Is it a good idea to use the tasklist_lock for the snapshot? To get a
  really consistent snapshot this is probably necessary.
* Force only non-idle CPUs to do accounting.
* Possibly inconsistent data, because we first use taskstats_fill_atomic()
  and afterwards taskstats_fill_sleep().
* Complete the aggregation of swapper tasks.
* Add more filters besides the 'time' parameter?
* Find a solution for the case that sched_clock() can wrap.

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
 include/linux/sched.h          |    4 
 include/linux/taskstats.h      |   13 ++
 include/linux/taskstats_kern.h |    9 +
 include/linux/tsacct_kern.h    |    4 
 kernel/Makefile                |    2 
 kernel/sched.c                 |    6 +
 kernel/sched_stats.h           |    1 
 kernel/taskstats.c             |   89 +++++++++++++++++-
 kernel/taskstats_snap.c        |  199 +++++++++++++++++++++++++++++++++++++++++
 kernel/tsacct.c                |   32 ++++--
 10 files changed, 339 insertions(+), 20 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -718,7 +718,8 @@ struct sched_info {
 
 	/* timestamps */
 	unsigned long long last_arrival,/* when we last ran on a cpu */
-			   last_queued;	/* when we were last queued to run */
+			   last_queued,	/* when we were last queued to run */
+			   last_depart;	/* when we last departed from a cpu */
 #ifdef CONFIG_SCHEDSTATS
 	/* BKL stats */
 	unsigned int bkl_count;
@@ -1298,6 +1299,7 @@ struct task_struct {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
 	cputime_t prev_utime, prev_stime;
 #endif
+	unsigned long long acct_time;		/* Time for last accounting */
 	unsigned long nvcsw, nivcsw; /* context switch counts */
 	struct timespec start_time; 		/* monotonic time */
 	struct timespec real_start_time;	/* boot based time */
--- a/include/linux/taskstats.h
+++ b/include/linux/taskstats.h
@@ -33,7 +33,7 @@
  */
 
 
-#define TASKSTATS_VERSION	7
+#define TASKSTATS_VERSION	8
 #define TS_COMM_LEN		32	/* should be >= TASK_COMM_LEN
 					 * in linux/sched.h */
 
@@ -163,6 +163,10 @@ struct taskstats {
 	/* Delay waiting for memory reclaim */
 	__u64	freepages_count;
 	__u64	freepages_delay_total;
+	/* version 7 ends here */
+
+	/* Timestamp where data has been collected in ns since boot time */
+	__u64	time_ns;
 };
 
 
@@ -199,9 +203,16 @@ enum {
 	TASKSTATS_CMD_ATTR_TGID,
 	TASKSTATS_CMD_ATTR_REGISTER_CPUMASK,
 	TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK,
+	TASKSTATS_CMD_ATTR_PIDS,
 	__TASKSTATS_CMD_ATTR_MAX,
 };
 
+struct taskstats_cmd_pids {
+	__u64	time_ns;
+	__u32	pid;
+	__u32	cnt;
+};
+
 #define TASKSTATS_CMD_ATTR_MAX (__TASKSTATS_CMD_ATTR_MAX - 1)
 
 /* NETLINK_GENERIC related info */
--- a/include/linux/taskstats_kern.h
+++ b/include/linux/taskstats_kern.h
@@ -23,6 +23,15 @@ static inline void taskstats_tgid_free(s
 
 extern void taskstats_exit(struct task_struct *, int group_dead);
 extern void taskstats_init_early(void);
+extern void taskstats_fill(struct task_struct *tsk, struct taskstats *stats);
+extern int taskstats_snap(int pid_start, int cnt, u64 time_ns,
+			  struct taskstats *stats_vec);
+extern int taskstats_snap_user(int pid_start, int cnt, u64 time_ns,
+			       struct taskstats *stats_vec);
+extern void taskstats_fill_atomic(struct task_struct *tsk,
+				  struct taskstats *stats);
+extern void taskstats_fill_sleep(struct task_struct *tsk,
+				 struct taskstats *stats);
 #else
 static inline void taskstats_exit(struct task_struct *tsk, int group_dead)
 {}
--- a/include/linux/tsacct_kern.h
+++ b/include/linux/tsacct_kern.h
@@ -18,11 +18,15 @@ static inline void bacct_add_tsk(struct
 
 #ifdef CONFIG_TASK_XACCT
 extern void xacct_add_tsk(struct taskstats *stats, struct task_struct *p);
+extern void xacct_add_tsk_mem(struct taskstats *stats, struct task_struct *p);
 extern void acct_update_integrals(struct task_struct *tsk);
 extern void acct_clear_integrals(struct task_struct *tsk);
 #else
 static inline void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
 {}
+static inline void xacct_add_tsk_mem(struct taskstats *stats,
+				     struct task_struct *p)
+{}
 static inline void acct_update_integrals(struct task_struct *tsk)
 {}
 static inline void acct_clear_integrals(struct task_struct *tsk)
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -90,7 +90,7 @@ obj-$(CONFIG_TINY_PREEMPT_RCU) += rcutin
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
-obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
+obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o taskstats_snap.o
 obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
 obj-$(CONFIG_LATENCYTOP) += latencytop.o
 obj-$(CONFIG_BINFMT_ELF) += elfcore.o
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9400,4 +9400,10 @@ void synchronize_sched_expedited(void)
 }
 EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
 
+struct task_struct *get_swapper(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	return rq->idle;
+}
+
 #endif /* #else #ifndef CONFIG_SMP */
--- a/kernel/sched_stats.h
+++ b/kernel/sched_stats.h
@@ -218,6 +218,7 @@ static inline void sched_info_depart(str
 	unsigned long long delta = task_rq(t)->clock -
 					t->sched_info.last_arrival;
 
+	t->sched_info.last_depart = task_rq(t)->clock;
 	rq_sched_info_depart(task_rq(t), delta);
 
 	if (t->state == TASK_RUNNING)
--- a/kernel/taskstats.c
+++ b/kernel/taskstats.c
@@ -27,6 +27,7 @@
 #include <linux/cgroup.h>
 #include <linux/fs.h>
 #include <linux/file.h>
+#include <linux/vmalloc.h>
 #include <net/genetlink.h>
 #include <asm/atomic.h>
 
@@ -51,7 +52,8 @@ static const struct nla_policy taskstats
 	[TASKSTATS_CMD_ATTR_PID]  = { .type = NLA_U32 },
 	[TASKSTATS_CMD_ATTR_TGID] = { .type = NLA_U32 },
 	[TASKSTATS_CMD_ATTR_REGISTER_CPUMASK] = { .type = NLA_STRING },
-	[TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK] = { .type = NLA_STRING },};
+	[TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK] = { .type = NLA_STRING },
+	[TASKSTATS_CMD_ATTR_PIDS] = { .type = NLA_BINARY },};
 
 static const struct nla_policy cgroupstats_cmd_get_policy[CGROUPSTATS_CMD_ATTR_MAX+1] = {
 	[CGROUPSTATS_CMD_ATTR_FD] = { .type = NLA_U32 },
@@ -175,9 +177,12 @@ static void send_cpu_listeners(struct sk
 	up_write(&listeners->sem);
 }
 
-static void fill_stats(struct task_struct *tsk, struct taskstats *stats)
+void taskstats_fill_atomic(struct task_struct *tsk, struct taskstats *stats)
 {
 	memset(stats, 0, sizeof(*stats));
+	preempt_disable();
+	stats->time_ns = sched_clock();
+	preempt_enable();
 	/*
 	 * Each accounting subsystem adds calls to its functions to
 	 * fill in relevant parts of struct taskstsats as follows
@@ -197,6 +202,17 @@ static void fill_stats(struct task_struc
 	xacct_add_tsk(stats, tsk);
 }
 
+void taskstats_fill_sleep(struct task_struct *tsk, struct taskstats *stats)
+{
+	xacct_add_tsk_mem(stats, tsk);
+}
+
+void taskstats_fill(struct task_struct *tsk, struct taskstats *stats)
+{
+	taskstats_fill_atomic(tsk, stats);
+	taskstats_fill_sleep(tsk, stats);
+}
+
 static int fill_stats_for_pid(pid_t pid, struct taskstats *stats)
 {
 	struct task_struct *tsk;
@@ -208,7 +224,7 @@ static int fill_stats_for_pid(pid_t pid,
 	rcu_read_unlock();
 	if (!tsk)
 		return -ESRCH;
-	fill_stats(tsk, stats);
+	taskstats_fill(tsk, stats);
 	put_task_struct(tsk);
 	return 0;
 }
@@ -424,6 +440,68 @@ err:
 	return rc;
 }
 
+static int cmd_attr_pids(struct genl_info *info)
+{
+	struct taskstats_cmd_pids *cmd_pids;
+	struct taskstats *stats_vec;
+	struct sk_buff *rep_skb;
+	struct taskstats *stats;
+	unsigned int tsk_cnt, i;
+	size_t size;
+	int  rc;
+
+	size = nla_total_size(sizeof(u32)) +
+		nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
+	cmd_pids = nla_data(info->attrs[TASKSTATS_CMD_ATTR_PIDS]);
+
+	if (cmd_pids->cnt > 1000) /* XXX socket buffer size check */
+		return -EINVAL;
+
+	stats_vec = vmalloc(sizeof(struct taskstats) * cmd_pids->cnt);
+	if (!stats_vec)
+		return -ENOMEM;
+
+	rc = taskstats_snap(cmd_pids->pid, cmd_pids->cnt,
+			    cmd_pids->time_ns, stats_vec);
+	if (rc < 0)
+		goto fail_vfree;
+	tsk_cnt = rc;
+	for (i = 0; i < min(cmd_pids->cnt, tsk_cnt + 1); i++) {
+		rc = prepare_reply(info, TASKSTATS_CMD_NEW, &rep_skb, size);
+		if (rc < 0)
+			goto fail_vfree;
+		if (i < tsk_cnt) {
+			stats = mk_reply(rep_skb, TASKSTATS_TYPE_PID,
+					 stats_vec[i].ac_pid);
+			if (!stats) {
+				rc = -ENOMEM;
+				goto fail_nlmsg_free;
+			}
+			memcpy(stats, &stats_vec[i], sizeof(*stats));
+		} else {
+			/* zero taskstats marks end of transmission */
+			stats = mk_reply(rep_skb, TASKSTATS_TYPE_PID, 0);
+			if (!stats) {
+				rc = -ENOMEM;
+				goto fail_nlmsg_free;
+			}
+			memset(stats, 0, sizeof(*stats));
+		}
+		rc = send_reply(rep_skb, info);
+		if (rc)
+			goto fail_nlmsg_free;
+		info->snd_seq++;
+	}
+	vfree(stats_vec);
+	return 0;
+
+fail_nlmsg_free:
+	nlmsg_free(rep_skb);
+fail_vfree:
+	vfree(stats_vec);
+	return rc;
+}
+
 static int cmd_attr_register_cpumask(struct genl_info *info)
 {
 	cpumask_var_t mask;
@@ -526,6 +604,8 @@ static int taskstats_user_cmd(struct sk_
 		return cmd_attr_pid(info);
 	else if (info->attrs[TASKSTATS_CMD_ATTR_TGID])
 		return cmd_attr_tgid(info);
+	else if (info->attrs[TASKSTATS_CMD_ATTR_PIDS])
+		return cmd_attr_pids(info);
 	else
 		return -EINVAL;
 }
@@ -593,7 +673,7 @@ void taskstats_exit(struct task_struct *
 	if (!stats)
 		goto err;
 
-	fill_stats(tsk, stats);
+	taskstats_fill(tsk, stats);
 
 	/*
 	 * Doesn't matter if tsk is the leader or the last group member leaving
@@ -653,7 +733,6 @@ static int __init taskstats_init(void)
 	rc = genl_register_ops(&family, &cgroupstats_ops);
 	if (rc < 0)
 		goto err_cgroup_ops;
-
 	family_registered = 1;
 	printk("registered taskstats version %d\n", TASKSTATS_GENL_VERSION);
 	return 0;
--- /dev/null
+++ b/kernel/taskstats_snap.c
@@ -0,0 +1,199 @@
+/*
+ * taskstats_snap.c - Create exact taskstats snapshot
+ *
+ * Copyright IBM Corp. 2010
+ * Author(s): Michael Holzheu <holzheu@linux.vnet.ibm.com>
+ */
+
+#include <linux/taskstats_kern.h>
+#include <linux/pid_namespace.h>
+#include <linux/kernel_stat.h>
+#include <linux/vmalloc.h>
+#include <asm/uaccess.h>
+
+static DECLARE_WAIT_QUEUE_HEAD(snapshot_wait);
+static atomic_t snapshot_use;
+
+static int wait_snapshot(void)
+{
+	while (atomic_cmpxchg(&snapshot_use, 0, 1) != 0) {
+		if (wait_event_interruptible(snapshot_wait,
+					     atomic_read(&snapshot_use) == 0))
+			return -ERESTARTSYS;
+	}
+	return 0;
+}
+
+static void wake_up_snapshot(void)
+{
+	atomic_set(&snapshot_use, 0);
+	wake_up_interruptible(&snapshot_wait);
+}
+
+static void force_accounting(void *ptr)
+{
+	account_process_tick(current, 1);
+}
+
+/*
+ * TODO Do not force idle CPUs to do accounting
+ */
+static void account_online_cpus(void)
+{
+	smp_call_function(force_accounting, NULL, 1);
+}
+
+extern struct task_struct *get_swapper(int cpu);
+
+/*
+ * TODO Implement complete taskstats_add() function that aggregates
+ *      all fields.
+ */
+static void taskstats_add(struct taskstats *ts1, struct taskstats *ts2)
+{
+	ts1->ac_utime += ts2->ac_utime;
+	ts1->ac_stime += ts2->ac_stime;
+	ts1->ac_sttime += ts2->ac_sttime;
+}
+
+static struct task_struct *find_next_task(int pid_start, u64 time_ns)
+{
+	struct pid_namespace *ns = current->nsproxy->pid_ns;
+	struct task_struct *tsk;
+	struct pid *pid;
+
+	do {
+		pid = find_ge_pid(pid_start, ns);
+		if (!pid) {
+			tsk = NULL;
+			break;
+		}
+		tsk = pid_task(pid, PIDTYPE_PID);
+		if (tsk && (tsk->state == TASK_RUNNING ||
+		    tsk->sched_info.last_depart > time_ns)) {
+			get_task_struct(tsk);
+			break;
+		}
+		pid_start++;
+	} while (1);
+	return tsk;
+}
+
+int taskstats_snap(int pid_start, int cnt, u64 time_ns,
+		   struct taskstats *stats_vec)
+{
+	int tsk_cnt = 0, rc, i, cpu, first = 1;
+	struct task_struct *tsk, **tsk_vec;
+	u32 pid_curr = pid_start;
+	struct taskstats *ts;
+	u64 task_snap_time;
+
+	rc = wait_snapshot();
+	if (rc)
+		return rc;
+
+	rc = -ENOMEM;
+	tsk_vec = kmalloc(sizeof(struct task_struct *) * cnt, GFP_KERNEL);
+	if (!tsk_vec)
+		goto fail_wake_up_snapshot;
+	ts = kzalloc(sizeof(*ts), GFP_KERNEL);
+	if (!ts)
+		goto fail_free_tsk_vec;
+
+	task_snap_time = sched_clock();
+
+	/*
+	 * Force running CPUs to do accounting
+	 */
+	account_online_cpus();
+
+	read_lock(&tasklist_lock);
+
+	/*
+	 * Aggregate swapper tasks (pid = 0)
+	 */
+	if (pid_curr == 0) {
+		for_each_online_cpu(cpu) {
+			tsk = get_swapper(cpu);
+			if (tsk->state != TASK_RUNNING &&
+			    tsk->sched_info.last_depart < time_ns)
+				continue;
+			if (first) {
+				tsk_vec[0] = tsk;
+				taskstats_fill_atomic(tsk, &stats_vec[tsk_cnt]);
+				stats_vec[tsk_cnt].ac_pid = 0;
+				first = 0;
+			} else {
+				taskstats_fill_atomic(tsk, ts);
+				taskstats_add(&stats_vec[tsk_cnt], ts);
+			}
+			if (tsk->acct_time < task_snap_time)
+				stats_vec[tsk_cnt].time_ns = task_snap_time;
+			else
+				stats_vec[tsk_cnt].time_ns = tsk->acct_time;
+		}
+		if (!first)	/* count swapper slot only if one was found */
+			tsk_cnt++;
+		pid_curr++;
+	}
+	/*
+	 * Collect normal tasks (pid >=1)
+	 */
+	do {
+		tsk = find_next_task(pid_curr, time_ns);
+		if (!tsk)
+			break;
+		taskstats_fill_atomic(tsk, &stats_vec[tsk_cnt]);
+		tsk_vec[tsk_cnt] = tsk;
+		if (tsk->acct_time < task_snap_time)
+			stats_vec[tsk_cnt].time_ns = task_snap_time;
+		else
+			stats_vec[tsk_cnt].time_ns = tsk->acct_time;
+		tsk_cnt++;
+		pid_curr = next_pidmap(current->nsproxy->pid_ns, tsk->pid);
+	} while (tsk_cnt != cnt);
+
+	read_unlock(&tasklist_lock);
+
+	/*
+	 * Add rest of accounting information that we can't add under lock
+	 */
+	for (i = 0; i < tsk_cnt; i++) {
+		if (tsk_vec[i]->pid == 0)
+			continue;
+		taskstats_fill_sleep(tsk_vec[i], &stats_vec[i]);
+		put_task_struct(tsk_vec[i]);
+	}
+	rc = tsk_cnt;
+
+	kfree(ts);
+fail_free_tsk_vec:
+	kfree(tsk_vec);
+fail_wake_up_snapshot:
+	wake_up_snapshot();
+	return rc;
+}
+
+int taskstats_snap_user(int pid_start, int cnt, u64 time_ns,
+			struct taskstats *stats_vec)
+{
+	struct taskstats *stats_vec_int;
+	int i, tsk_cnt, rc;
+
+	stats_vec_int = vmalloc(sizeof(struct taskstats) * cnt);
+	if (!stats_vec_int)
+		return -ENOMEM;
+
+	tsk_cnt = taskstats_snap(pid_start, cnt, time_ns, stats_vec_int);
+
+	for (i = 0; i < tsk_cnt; i++) {
+		if (copy_to_user(&stats_vec[i], &stats_vec_int[i],
+				 sizeof(struct taskstats))) {
+			rc = -EFAULT;
+			goto out;
+		}
+	}
+	rc = tsk_cnt;
+out:
+	vfree(stats_vec_int);
+	return rc;
+}
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -83,18 +83,6 @@ void bacct_add_tsk(struct taskstats *sta
  */
 void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
 {
-	struct mm_struct *mm;
-
-	/* convert pages-usec to Mbyte-usec */
-	stats->coremem = p->acct_rss_mem1 * PAGE_SIZE / MB;
-	stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / MB;
-	mm = get_task_mm(p);
-	if (mm) {
-		/* adjust to KB unit */
-		stats->hiwater_rss   = get_mm_hiwater_rss(mm) * PAGE_SIZE / KB;
-		stats->hiwater_vm    = get_mm_hiwater_vm(mm)  * PAGE_SIZE / KB;
-		mmput(mm);
-	}
 	stats->read_char	= p->ioac.rchar;
 	stats->write_char	= p->ioac.wchar;
 	stats->read_syscalls	= p->ioac.syscr;
@@ -109,6 +97,26 @@ void xacct_add_tsk(struct taskstats *sta
 	stats->cancelled_write_bytes = 0;
 #endif
 }
+
+/*
+ * fill in memory data (function can sleep)
+ */
+void xacct_add_tsk_mem(struct taskstats *stats, struct task_struct *p)
+{
+	struct mm_struct *mm;
+
+	/* convert pages-usec to Mbyte-usec */
+	stats->coremem = p->acct_rss_mem1 * PAGE_SIZE / MB;
+	stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / MB;
+	mm = get_task_mm(p);
+	if (mm) {
+		/* adjust to KB unit */
+		stats->hiwater_rss   = get_mm_hiwater_rss(mm) * PAGE_SIZE / KB;
+		stats->hiwater_vm    = get_mm_hiwater_vm(mm)  * PAGE_SIZE / KB;
+		mmput(mm);
+	}
+}
+
 #undef KB
 #undef MB
 



* [RFC][PATCH v2 2/7] taskstats: Add "/proc/taskstats"
From: Michael Holzheu @ 2010-11-11 17:03 UTC
  To: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath
  Cc: linux-kernel, linux-s390

[-- Attachment #1: 02-taskstats-top-proc.patch --]
[-- Type: text/plain, Size: 5791 bytes --]

From: Michael Holzheu <holzheu@linux.vnet.ibm.com>

CHANGE HISTORY OF THIS PATCH
----------------------------
Version 2
---------
Replace ioctl interface with write interface (requested by Andrew Morton)

The taskstats command is now set by a single sys_write() with a buffer that
contains the taskstats command number (32 bit) followed directly by the
command payload.

DESCRIPTION
-----------
Add a procfs interface for the TASKSTATS_CMD_ATTR_PIDS taskstats command. A
new procfs file "/proc/taskstats" is introduced. A sys_write() defines the
taskstats command; a subsequent read system call executes the defined
command and transfers its result into the read buffer. This makes it
possible to get a complete and consistent snapshot of all tasks via two
system calls (write + read), when a sufficiently large buffer is provided.
This is not possible with the existing netlink interface, because there
the socket buffer size is the restricting factor.
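
A minimal user space sketch of this protocol (illustrative only; note
that the command payload must follow the 32-bit command number without
padding, hence the packed struct; error handling is mostly omitted):

  #include <fcntl.h>
  #include <stdint.h>
  #include <unistd.h>
  #include <linux/taskstats.h>

  static int snapshot(struct taskstats *vec, unsigned int max)
  {
          struct {
                  uint32_t no;                    /* command number */
                  struct taskstats_cmd_pids cmd;  /* command payload */
          } __attribute__((packed)) buf;
          int fd, rc;

          fd = open("/proc/taskstats", O_RDWR);
          if (fd < 0)
                  return -1;
          buf.no = TASKSTATS_CMD_ATTR_PIDS;
          buf.cmd.time_ns = 0;    /* full snapshot */
          buf.cmd.pid = 0;        /* include the swapper aggregate */
          buf.cmd.cnt = max;
          /* system call 1: define the command ... */
          if (write(fd, &buf, sizeof(buf)) != sizeof(buf)) {
                  close(fd);
                  return -1;
          }
          /* ... system call 2: execute it and read the taskstats array */
          rc = read(fd, vec, max * sizeof(*vec));
          close(fd);
          return rc < 0 ? -1 : rc / (int) sizeof(*vec);
  }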

GOALS OF THIS PATCH
-------------------
* Allow transfer of a complete and consistent taskstats snapshot to user space.
* Reduce CPU time for data transmission compared to the netlink mechanism,
  because the proc solution is much more lightweight.
* User space code is much easier to write than with the netlink mechanism.

OPEN ISSUES
-----------
Currently only the TASKSTATS_CMD_ATTR_PIDS command is implemented. Implement
the following missing taskstats commands:
* TASKSTATS_CMD_ATTR_PID
* TASKSTATS_CMD_ATTR_TGID

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
 include/linux/taskstats_kern.h |    3 +
 kernel/Makefile                |    3 -
 kernel/taskstats.c             |    1 
 kernel/taskstats_proc.c        |  107 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 113 insertions(+), 1 deletion(-)

--- a/include/linux/taskstats_kern.h
+++ b/include/linux/taskstats_kern.h
@@ -32,6 +32,7 @@ extern void taskstats_fill_atomic(struct
 				  struct taskstats *stats);
 extern void taskstats_fill_sleep(struct task_struct *tsk,
 				 struct taskstats *stats);
+extern void taskstats_proc_init(void);
 #else
 static inline void taskstats_exit(struct task_struct *tsk, int group_dead)
 {}
@@ -39,6 +40,8 @@ static inline void taskstats_tgid_free(s
 {}
 static inline void taskstats_init_early(void)
 {}
+static inline void taskstats_proc_init(void)
+{}
 #endif /* CONFIG_TASKSTATS */
 
 #endif
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -90,7 +90,8 @@ obj-$(CONFIG_TINY_PREEMPT_RCU) += rcutin
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
-obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o taskstats_snap.o
+obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o taskstats_snap.o \
+			   taskstats_proc.o
 obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
 obj-$(CONFIG_LATENCYTOP) += latencytop.o
 obj-$(CONFIG_BINFMT_ELF) += elfcore.o
--- a/kernel/taskstats.c
+++ b/kernel/taskstats.c
@@ -733,6 +733,7 @@ static int __init taskstats_init(void)
 	rc = genl_register_ops(&family, &cgroupstats_ops);
 	if (rc < 0)
 		goto err_cgroup_ops;
+	taskstats_proc_init();
 	family_registered = 1;
 	printk("registered taskstats version %d\n", TASKSTATS_GENL_VERSION);
 	return 0;
--- /dev/null
+++ b/kernel/taskstats_proc.c
@@ -0,0 +1,107 @@
+/*
+ * taskstats_proc.c - Export per-task statistics to userland using procfs
+ *
+ * Copyright IBM Corp. 2010
+ * Author(s): Michael Holzheu <holzheu@linux.vnet.ibm.com>
+ */
+
+#include <linux/taskstats_kern.h>
+#include <linux/proc_fs.h>
+#include <linux/kernel.h>
+#include <linux/file.h>
+#include <asm/uaccess.h>
+
+struct proc_cmd {
+	u32 no;
+	union {
+		struct taskstats_cmd_pids cmd_pids;
+		u32 pid;
+	} d;
+};
+
+static ssize_t cmd_attr_pids_proc(struct taskstats *stats_vec,
+				  struct taskstats_cmd_pids *cmd_pids)
+{
+	int rc;
+
+	rc = taskstats_snap_user(cmd_pids->pid, cmd_pids->cnt,
+				 cmd_pids->time_ns, stats_vec);
+	if (rc < 0)
+		return rc;
+	else
+		return rc * sizeof(struct taskstats);
+}
+
+static int proc_taskstats_open(struct inode *inode, struct file *file)
+{
+	struct proc_cmd *proc_cmd;
+
+	proc_cmd = kmalloc(sizeof(*proc_cmd), GFP_KERNEL);
+	if (!proc_cmd)
+		return -ENOMEM;
+	proc_cmd->no = -1;
+	file->private_data = proc_cmd;
+	return 0;
+}
+
+static int proc_taskstats_close(struct inode *inode, struct file *file)
+{
+	kfree(file->private_data);
+	return 0;
+}
+
+static ssize_t proc_taskstats_write(struct file *file,
+				    const char __user *buf, size_t count,
+				    loff_t *ppos)
+{
+	struct proc_cmd *proc_cmd = file->private_data;
+	u32 no;
+
+	if (*ppos != 0)
+		return -EINVAL;
+	if (count < sizeof(no))
+		return -EINVAL;
+	if (copy_from_user(&no, buf, sizeof(no)))
+		return -EFAULT;
+	switch (no) {
+	case TASKSTATS_CMD_ATTR_PIDS:
+		if (count - sizeof(no) < sizeof(proc_cmd->d.cmd_pids))
+			return -EINVAL;
+		if (copy_from_user(&proc_cmd->d.cmd_pids, buf + sizeof(no),
+				   sizeof(proc_cmd->d.cmd_pids)))
+			return -EFAULT;
+		break;
+	default:
+		return -EINVAL;
+	}
+	proc_cmd->no = no;
+	return count;
+}
+
+static ssize_t proc_taskstats_read(struct file *file, char __user *buf,
+				   size_t size, loff_t *ppos)
+{
+	struct proc_cmd *proc_cmd = file->private_data;
+
+	if (*ppos != 0)
+		return -EINVAL;
+	switch (proc_cmd->no) {
+	case TASKSTATS_CMD_ATTR_PIDS:
+		return cmd_attr_pids_proc((struct taskstats *) buf,
+					  &proc_cmd->d.cmd_pids);
+	default:
+		return -EINVAL;
+	}
+}
+
+static const struct file_operations proc_taskstats_ops = {
+	.open		= proc_taskstats_open,
+	.release	= proc_taskstats_close,
+	.read		= proc_taskstats_read,
+	.write		= proc_taskstats_write,
+};
+
+void __init taskstats_proc_init(void)
+{
+	proc_create("taskstats", 0666, NULL, &proc_taskstats_ops);
+}



* [RFC][PATCH v2 3/7] taskstats: Add thread group ID to taskstats structure
From: Michael Holzheu @ 2010-11-11 17:03 UTC
  To: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath
  Cc: linux-kernel, linux-s390

[-- Attachment #1: 03-taskstats-top-add-tgid.patch --]
[-- Type: text/plain, Size: 891 bytes --]

From: Michael Holzheu <holzheu@linux.vnet.ibm.com>

The tgid is important for aggregating threads in user space. Therefore we
add the tgid to the taskstats structure.

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
 include/linux/taskstats.h |    1 +
 kernel/tsacct.c           |    1 +
 2 files changed, 2 insertions(+)

--- a/include/linux/taskstats.h
+++ b/include/linux/taskstats.h
@@ -167,6 +167,7 @@ struct taskstats {
 
 	/* Timestamp where data has been collected in ns since boot time */
 	__u64	time_ns;
+	__u32	ac_tgid;		/* Thread group ID */
 };
 
 
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -56,6 +56,7 @@ void bacct_add_tsk(struct taskstats *sta
 	stats->ac_nice	 = task_nice(tsk);
 	stats->ac_sched	 = tsk->policy;
 	stats->ac_pid	 = tsk->pid;
+	stats->ac_tgid	 = tsk->tgid;
 	rcu_read_lock();
 	tcred = __task_cred(tsk);
 	stats->ac_uid	 = tcred->uid;



* [RFC][PATCH v2 4/7] taskstats: Add per task steal time accounting
From: Michael Holzheu @ 2010-11-11 17:03 UTC
  To: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath
  Cc: linux-kernel, linux-s390

[-- Attachment #1: 04-taskstats-top-add-sttime.patch --]
[-- Type: text/plain, Size: 13511 bytes --]

From: Michael Holzheu <holzheu@linux.vnet.ibm.com>

Currently steal time is only accounted for the whole system. With this
patch we add steal time to the per-task CPU time accounting.
The triplet "user time", "system time" and "steal time" then represents
all consumed CPU time on hypervisor-based systems.

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
 arch/s390/kernel/vtime.c    |   19 +++++++++++--------
 fs/proc/array.c             |    6 +++---
 include/linux/kernel_stat.h |    2 +-
 include/linux/sched.h       |   14 ++++++++------
 include/linux/taskstats.h   |    1 +
 kernel/exit.c               |    9 +++++++--
 kernel/fork.c               |    1 +
 kernel/posix-cpu-timers.c   |    3 +++
 kernel/sched.c              |   26 ++++++++++++++++++++------
 kernel/sys.c                |   10 +++++-----
 kernel/tsacct.c             |    1 +
 11 files changed, 61 insertions(+), 31 deletions(-)

--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -56,31 +56,34 @@ static void do_account_vtime(struct task
 {
 	struct thread_info *ti = task_thread_info(tsk);
 	__u64 timer, clock, user, system, steal;
+	unsigned char clk[16];
 
 	timer = S390_lowcore.last_update_timer;
 	clock = S390_lowcore.last_update_clock;
 	asm volatile ("  STPT %0\n"    /* Store current cpu timer value */
-		      "  STCK %1"      /* Store current tod clock value */
+		      "  STCKE 0(%2)"  /* Store current tod clock value */
 		      : "=m" (S390_lowcore.last_update_timer),
-		        "=m" (S390_lowcore.last_update_clock) );
+		        "=m" (clk) : "a" (clk));
+	S390_lowcore.last_update_clock = *(__u64 *) &clk[1];
+	tsk->acct_time = ((clock - sched_clock_base_cc) * 125) >> 9;
 	S390_lowcore.system_timer += timer - S390_lowcore.last_update_timer;
 	S390_lowcore.steal_timer += S390_lowcore.last_update_clock - clock;
 
 	user = S390_lowcore.user_timer - ti->user_timer;
-	S390_lowcore.steal_timer -= user;
 	ti->user_timer = S390_lowcore.user_timer;
 	account_user_time(tsk, user, user);
 
 	system = S390_lowcore.system_timer - ti->system_timer;
-	S390_lowcore.steal_timer -= system;
 	ti->system_timer = S390_lowcore.system_timer;
 	account_system_time(tsk, hardirq_offset, system, system);
 
 	steal = S390_lowcore.steal_timer;
-	if ((s64) steal > 0) {
-		S390_lowcore.steal_timer = 0;
-		account_steal_time(steal);
-	}
+	S390_lowcore.steal_timer = 0;
+	if (steal >= user + system)
+		steal -= user + system;
+	else
+		steal = 0;
+	account_steal_time(tsk, steal);
 }
 
 void account_vtime(struct task_struct *prev, struct task_struct *next)
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -375,7 +375,7 @@ static int do_task_stat(struct seq_file
 	unsigned long long start_time;
 	unsigned long cmin_flt = 0, cmaj_flt = 0;
 	unsigned long  min_flt = 0,  maj_flt = 0;
-	cputime_t cutime, cstime, utime, stime;
+	cputime_t cutime, cstime, utime, stime, sttime;
 	cputime_t cgtime, gtime;
 	unsigned long rsslim = 0;
 	char tcomm[sizeof(task->comm)];
@@ -432,7 +432,7 @@ static int do_task_stat(struct seq_file
 
 			min_flt += sig->min_flt;
 			maj_flt += sig->maj_flt;
-			thread_group_times(task, &utime, &stime);
+			thread_group_times(task, &utime, &stime, &sttime);
 			gtime = cputime_add(gtime, sig->gtime);
 		}
 
@@ -448,7 +448,7 @@ static int do_task_stat(struct seq_file
 	if (!whole) {
 		min_flt = task->min_flt;
 		maj_flt = task->maj_flt;
-		task_times(task, &utime, &stime);
+		task_times(task, &utime, &stime, &sttime);
 		gtime = task->gtime;
 	}
 
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -116,7 +116,7 @@ extern unsigned long long task_delta_exe
 
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
 extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
-extern void account_steal_time(cputime_t);
+extern void account_steal_time(struct task_struct *, cputime_t);
 extern void account_idle_time(cputime_t);
 
 extern void account_process_tick(struct task_struct *, int user);
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -470,6 +470,7 @@ struct cpu_itimer {
 struct task_cputime {
 	cputime_t utime;
 	cputime_t stime;
+	cputime_t sttime;
 	unsigned long long sum_exec_runtime;
 };
 /* Alternate field names when used to cache expirations. */
@@ -481,6 +482,7 @@ struct task_cputime {
 	(struct task_cputime) {					\
 		.utime = cputime_zero,				\
 		.stime = cputime_zero,				\
+		.sttime = cputime_zero,				\
 		.sum_exec_runtime = 0,				\
 	}
 
@@ -582,11 +584,11 @@ struct signal_struct {
 	 * Live threads maintain their own counters and add to these
 	 * in __exit_signal, except for the group leader.
 	 */
-	cputime_t utime, stime, cutime, cstime;
+	cputime_t utime, stime, sttime, cutime, cstime, csttime;
 	cputime_t gtime;
 	cputime_t cgtime;
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
-	cputime_t prev_utime, prev_stime;
+	cputime_t prev_utime, prev_stime, prev_sttime;
 #endif
 	unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
 	unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
@@ -1294,10 +1296,10 @@ struct task_struct {
 	int __user *set_child_tid;		/* CLONE_CHILD_SETTID */
 	int __user *clear_child_tid;		/* CLONE_CHILD_CLEARTID */
 
-	cputime_t utime, stime, utimescaled, stimescaled;
+	cputime_t utime, stime, sttime, utimescaled, stimescaled;
 	cputime_t gtime;
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
-	cputime_t prev_utime, prev_stime;
+	cputime_t prev_utime, prev_stime, prev_sttime;
 #endif
 	unsigned long long acct_time;		/* Time for last accounting */
 	unsigned long nvcsw, nivcsw; /* context switch counts */
@@ -1694,8 +1696,8 @@ static inline void put_task_struct(struc
 		__put_task_struct(t);
 }
 
-extern void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st);
-extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st);
+extern void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st, cputime_t *stt);
+extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st, cputime_t *stt);
 
 /*
  * Per process flags
--- a/include/linux/taskstats.h
+++ b/include/linux/taskstats.h
@@ -168,6 +168,7 @@ struct taskstats {
 	/* Timestamp where data has been collected in ns since boot time */
 	__u64	time_ns;
 	__u32	ac_tgid;		/* Thread group ID */
+	__u64	ac_sttime;		/* Steal CPU time [usec] */
 };
 
 
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -124,6 +124,7 @@ static void __exit_signal(struct task_st
 		 */
 		sig->utime = cputime_add(sig->utime, tsk->utime);
 		sig->stime = cputime_add(sig->stime, tsk->stime);
+		sig->sttime = cputime_add(sig->sttime, tsk->sttime);
 		sig->gtime = cputime_add(sig->gtime, tsk->gtime);
 		sig->min_flt += tsk->min_flt;
 		sig->maj_flt += tsk->maj_flt;
@@ -1228,7 +1229,7 @@ static int wait_task_zombie(struct wait_
 		struct signal_struct *psig;
 		struct signal_struct *sig;
 		unsigned long maxrss;
-		cputime_t tgutime, tgstime;
+		cputime_t tgutime, tgstime, tgsttime;
 
 		/*
 		 * The resource counters for the group leader are in its
@@ -1249,7 +1250,7 @@ static int wait_task_zombie(struct wait_
 		 * group, which consolidates times for all threads in the
 		 * group including the group leader.
 		 */
-		thread_group_times(p, &tgutime, &tgstime);
+		thread_group_times(p, &tgutime, &tgstime, &tgsttime);
 		spin_lock_irq(&p->real_parent->sighand->siglock);
 		psig = p->real_parent->signal;
 		sig = p->signal;
@@ -1261,6 +1262,10 @@ static int wait_task_zombie(struct wait_
 			cputime_add(psig->cstime,
 			cputime_add(tgstime,
 				    sig->cstime));
+		psig->csttime =
+			cputime_add(psig->csttime,
+			cputime_add(tgsttime,
+				    sig->csttime));
 		psig->cgtime =
 			cputime_add(psig->cgtime,
 			cputime_add(p->gtime,
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1062,6 +1062,7 @@ static struct task_struct *copy_process(
 
 	p->utime = cputime_zero;
 	p->stime = cputime_zero;
+	p->sttime = cputime_zero;
 	p->gtime = cputime_zero;
 	p->utimescaled = cputime_zero;
 	p->stimescaled = cputime_zero;
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -237,6 +237,7 @@ void thread_group_cputime(struct task_st
 
 	times->utime = sig->utime;
 	times->stime = sig->stime;
+	times->sttime = sig->sttime;
 	times->sum_exec_runtime = sig->sum_sched_runtime;
 
 	rcu_read_lock();
@@ -248,6 +249,7 @@ void thread_group_cputime(struct task_st
 	do {
 		times->utime = cputime_add(times->utime, t->utime);
 		times->stime = cputime_add(times->stime, t->stime);
+		times->sttime = cputime_add(times->sttime, t->sttime);
 		times->sum_exec_runtime += t->se.sum_exec_runtime;
 	} while_each_thread(tsk, t);
 out:
@@ -1276,6 +1278,7 @@ static inline int fastpath_timer_check(s
 		struct task_cputime task_sample = {
 			.utime = tsk->utime,
 			.stime = tsk->stime,
+			.sttime = tsk->sttime,
 			.sum_exec_runtime = tsk->se.sum_exec_runtime
 		};
 
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3542,11 +3542,15 @@ void account_system_time(struct task_str
  * Account for involuntary wait time.
  * @steal: the cpu time spent in involuntary wait
  */
-void account_steal_time(cputime_t cputime)
+void account_steal_time(struct task_struct *p, cputime_t cputime)
 {
 	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
 	cputime64_t cputime64 = cputime_to_cputime64(cputime);
 
+	/* Add steal time to process. */
+	p->sttime = cputime_add(p->sttime, cputime);
+
+	/* Add steal time to cpustat. */
 	cpustat->steal = cputime64_add(cpustat->steal, cputime64);
 }
 
@@ -3594,7 +3598,7 @@ void account_process_tick(struct task_st
  */
 void account_steal_ticks(unsigned long ticks)
 {
-	account_steal_time(jiffies_to_cputime(ticks));
+	account_steal_time(current, jiffies_to_cputime(ticks));
 }
 
 /*
@@ -3612,13 +3616,16 @@ void account_idle_ticks(unsigned long ti
  * Use precise platform statistics if available:
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING
-void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st,
+		cputime_t *stt)
 {
 	*ut = p->utime;
 	*st = p->stime;
+	*stt = p->sttime;
 }
 
-void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st,
+			cputime_t *stt)
 {
 	struct task_cputime cputime;
 
@@ -3626,6 +3633,7 @@ void thread_group_times(struct task_stru
 
 	*ut = cputime.utime;
 	*st = cputime.stime;
+	*stt = cputime.sttime;
 }
 #else
 
@@ -3633,7 +3641,8 @@ void thread_group_times(struct task_stru
 # define nsecs_to_cputime(__nsecs)	nsecs_to_jiffies(__nsecs)
 #endif
 
-void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st,
+		cputime_t *stt)
 {
 	cputime_t rtime, utime = p->utime, total = cputime_add(utime, p->stime);
 
@@ -3656,15 +3665,18 @@ void task_times(struct task_struct *p, c
 	 */
 	p->prev_utime = max(p->prev_utime, utime);
 	p->prev_stime = max(p->prev_stime, cputime_sub(rtime, p->prev_utime));
+	p->prev_sttime = cputime_zero;
 
 	*ut = p->prev_utime;
 	*st = p->prev_stime;
+	*stt = p->prev_sttime;
 }
 
 /*
  * Must be called with siglock held.
  */
-void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st,
+			cputime_t *stt)
 {
 	struct signal_struct *sig = p->signal;
 	struct task_cputime cputime;
@@ -3687,9 +3699,11 @@ void thread_group_times(struct task_stru
 	sig->prev_utime = max(sig->prev_utime, utime);
 	sig->prev_stime = max(sig->prev_stime,
 			      cputime_sub(rtime, sig->prev_utime));
+	sig->prev_sttime = cputime_zero;
 
 	*ut = sig->prev_utime;
 	*st = sig->prev_stime;
+	*stt = sig->prev_sttime;
 }
 #endif
 
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -880,10 +880,10 @@ change_okay:
 
 void do_sys_times(struct tms *tms)
 {
-	cputime_t tgutime, tgstime, cutime, cstime;
+	cputime_t tgutime, tgstime, tgsttime, cutime, cstime;
 
 	spin_lock_irq(&current->sighand->siglock);
-	thread_group_times(current, &tgutime, &tgstime);
+	thread_group_times(current, &tgutime, &tgstime, &tgsttime);
 	cutime = current->signal->cutime;
 	cstime = current->signal->cstime;
 	spin_unlock_irq(&current->sighand->siglock);
@@ -1488,14 +1488,14 @@ static void k_getrusage(struct task_stru
 {
 	struct task_struct *t;
 	unsigned long flags;
-	cputime_t tgutime, tgstime, utime, stime;
+	cputime_t tgutime, tgstime, tgsttime, utime, stime, sttime;
 	unsigned long maxrss = 0;
 
 	memset((char *) r, 0, sizeof *r);
 	utime = stime = cputime_zero;
 
 	if (who == RUSAGE_THREAD) {
-		task_times(current, &utime, &stime);
+		task_times(current, &utime, &stime, &sttime);
 		accumulate_thread_rusage(p, r);
 		maxrss = p->signal->maxrss;
 		goto out;
@@ -1521,7 +1521,7 @@ static void k_getrusage(struct task_stru
 				break;
 
 		case RUSAGE_SELF:
-			thread_group_times(p, &tgutime, &tgstime);
+			thread_group_times(p, &tgutime, &tgstime, &tgsttime);
 			utime = cputime_add(utime, tgutime);
 			stime = cputime_add(stime, tgstime);
 			r->ru_nvcsw += p->signal->nvcsw;
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -66,6 +66,7 @@ void bacct_add_tsk(struct taskstats *sta
 	rcu_read_unlock();
 	stats->ac_utime = cputime_to_usecs(tsk->utime);
 	stats->ac_stime = cputime_to_usecs(tsk->stime);
+	stats->ac_sttime = cputime_to_usecs(tsk->sttime);
 	stats->ac_utimescaled = cputime_to_usecs(tsk->utimescaled);
 	stats->ac_stimescaled = cputime_to_usecs(tsk->stimescaled);
 	stats->ac_minflt = tsk->min_flt;



* [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
From: Michael Holzheu @ 2010-11-11 17:03 UTC
  To: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath
  Cc: linux-kernel, linux-s390

[-- Attachment #1: 05-taskstats-top-improve-ctime.patch --]
[-- Type: text/plain, Size: 21325 bytes --]

From: Michael Holzheu <holzheu@linux.vnet.ibm.com>

CHANGE HISTORY OF THIS PATCH
----------------------------
Version 2
---------
* Add cumulative dead thread CPU times to taskstats
* Introduce parallel accounting process hierarchy (based on discussions
  with Oleg Nesterov and Roland McGrath)

DESCRIPTION
-----------
Currently the cumulative time accounting in Linux has two major drawbacks
for tools like the top command:

* Due to POSIX.1-2001, the CPU time of processes is not accounted
  to the cumulative time of the parents if the parents ignore SIGCHLD
  or have set SA_NOCLDWAIT. This behaviour has the major drawback that
  it is not possible to calculate all consumed CPU time of a system by
  looking at the current tasks: CPU time can be lost.

* When a parent process dies, its children get the init process as
  their new parent. For accounting this is suboptimal, because init
  then gets the CPU time of the tasks. For accounting it would be
  better if the CPU time were passed along the relationship tree using
  the cumulative time counters, as would have happened if the child had
  died before the parent. The main benefit of that approach is that
  between two task snapshots it is always clear which parent got the
  CPU time of dead tasks. Otherwise there are situations where we can't
  determine whether init or an older relative of a dead task has
  inherited the CPU time.

  Another benefit would be that it then becomes possible, e.g., to look
  at the cumulative times of a login shell process to get all CPU time
  that has been consumed by its children, grandchildren, etc. This would
  allow accounting without the need for exit events for all dead
  processes. Note that there is a small loophole here, because processes
  can daemonize to detach themselves from their parents.

This patch adds a new set of cumulative time counters. We then have two
cumulative counter sets:

* cdata_wait: Traditional cumulative time used e.g. by getrusage.
* cdata_acct: Cumulative time that also includes dead processes with
              parents that ignore SIGCHLD or have set SA_NOCLDWAIT.
              cdata_acct will be exported by taskstats.

Besides that, the patch adds an "acct_parent" pointer next to the parent
pointer and a "children_acct" list next to the children list to the
signal_struct in order to remember the accounting process relationship.

With this patch and the following time fields it is now possible to
calculate all the consumed CPU time of a system in an interval by looking at
the tasks of two taskstats snapshots:

* task->(u/s/st/g-time):
  Time that has been consumed by task itself

* task->signal->cdata_acct.(c-u/s/st/g-time):
  All time that has been consumed by dead children of the process. This
  also includes time from processes that reaped themselves, because the
  parent ignored SIGCHLD or had set SA_NOCLDWAIT.

* task->signal->(u/s/st/g-time):
  Time that has been consumed by dead threads of the process's thread group

Having this is prerequisite for the following use cases:

USE CASE I: A top command that shows exactly 100% of all consumed CPU time
between two task snapshots without using task exit events. Exit events are not
necessary, because if tasks die between the two snapshots, all their time can
be found in the cumulative counters of the parent processes or thread group
leaders.

To provide a consistent view, the top tool could show the following fields
(all values are per interval):
 * user:  task utime
 * sys:   task stime
 * ste:   task sttime
 * cuser: utime of exited children
 * csys:  stime of exited children
 * cste:  sttime of exited children
 * xuser: utime of exited threads in thread group
 * xsys:  stime of exited threads in thread group
 * xste:  sttime of exited threads in thread group
 * total: Sum of all above fields

If the top command notices that a pid disappeared between snapshot 1
and snapshot 2, it has to find its accounting parent and subtract the
dead child's CPU times from snapshot 1 from the parent's cumulative
times (see the sketch below).
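
A minimal sketch of that adjustment, reusing "struct snap" from the sketch
above (again an illustration under the same assumptions, not the actual
top tool code):

/*
 * "parent2" is the accounting parent's data from snapshot 2, "dead1" is
 * the dead child's data from snapshot 1.  The child's pre-interval CPU
 * time is already contained in the parent's cumulative counters of
 * snapshot 2, so it must be subtracted to get per-interval values.
 */
static void account_dead_child(struct snap *parent2,
			       const struct snap *dead1)
{
	parent2->cutime  -= dead1->utime + dead1->cutime + dead1->tutime;
	parent2->cstime  -= dead1->stime + dead1->cstime + dead1->tstime;
	parent2->csttime -= dead1->sttime + dead1->csttime + dead1->tsttime;
}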

Example:
--------
                  CPU time of dead tasks in last interval
                         VVVVV  VVVV VVVV
pid     user   sys  ste  cuser  csys cste xuser xsys xste total  Name
(#)      (%)   (%)  (%)    (%)   (%)  (%)   (%)  (%)  (%)   (%)  (str)
17944   0.10  0.01 0.00  54.29 14.36 0.22   0.0  0.0  0.0 68.98  make
18006   0.10  0.01 0.00  55.79 12.23 0.12   0.0  0.0  0.0 68.26  make
18041  48.18  1.51 0.29   0.00  0.00 0.00   0.0  0.0  0.0 49.98  cc1
...

The sum of all "total" CPU counters on a system that is 100% busy should
be exactly the number of CPUs multiplied by the interval time. A good
testcase for this is to start a loop program for each CPU and then start
a kernel build with "-j 5" in parallel.

USE CASE II: Implement a precise replacement for the "time" command that
gets all consumed CPU time of the children tree without using exit events.

Example:
--------
# cd linux-2.6
# tstime make -j 3
Cumulative CPU times:
---------------------
user..: 1000266540
system: 121721391
steal.: 3480230
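
A tstime-like wrapper could be sketched roughly as follows. This is an
illustration only; read_own_cdata_acct() is an assumed helper that queries
the caller's cdata_acct counters via taskstats, not an existing interface:

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumed helper: fills in the caller's cdata_acct times [usec] */
extern void read_own_cdata_acct(unsigned long long *cu,
				unsigned long long *cs,
				unsigned long long *cst);

int main(int argc, char *argv[])
{
	unsigned long long cu0, cs0, cst0, cu1, cs1, cst1;

	if (argc < 2)
		return 1;
	read_own_cdata_acct(&cu0, &cs0, &cst0);
	if (fork() == 0) {
		execvp(argv[1], &argv[1]);
		_exit(127);
	}
	wait(NULL);
	/* after wait() the child tree's CPU time is in our cdata_acct */
	read_own_cdata_acct(&cu1, &cs1, &cst1);
	printf("user..: %llu\nsystem: %llu\nsteal.: %llu\n",
	       cu1 - cu0, cs1 - cs0, cst1 - cst0);
	return 0;
}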

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
 fs/binfmt_elf.c           |    4 -
 fs/proc/array.c           |   10 +-
 fs/proc/base.c            |    3 
 include/linux/init_task.h |    2 
 include/linux/sched.h     |   39 +++++++---
 include/linux/taskstats.h |    6 +
 kernel/exit.c             |  175 ++++++++++++++++++++++++++++------------------
 kernel/fork.c             |    7 +
 kernel/sys.c              |   24 +++---
 kernel/tsacct.c           |   33 ++++++++
 10 files changed, 205 insertions(+), 98 deletions(-)

--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1296,8 +1296,8 @@ static void fill_prstatus(struct elf_prs
 		cputime_to_timeval(p->utime, &prstatus->pr_utime);
 		cputime_to_timeval(p->stime, &prstatus->pr_stime);
 	}
-	cputime_to_timeval(p->signal->cutime, &prstatus->pr_cutime);
-	cputime_to_timeval(p->signal->cstime, &prstatus->pr_cstime);
+	cputime_to_timeval(p->signal->cdata_wait.cutime, &prstatus->pr_cutime);
+	cputime_to_timeval(p->signal->cdata_wait.cstime, &prstatus->pr_cstime);
 }
 
 static int fill_psinfo(struct elf_prpsinfo *psinfo, struct task_struct *p,
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -413,11 +413,11 @@ static int do_task_stat(struct seq_file
 		num_threads = get_nr_threads(task);
 		collect_sigign_sigcatch(task, &sigign, &sigcatch);
 
-		cmin_flt = sig->cmin_flt;
-		cmaj_flt = sig->cmaj_flt;
-		cutime = sig->cutime;
-		cstime = sig->cstime;
-		cgtime = sig->cgtime;
+		cmin_flt = sig->cdata_wait.cmin_flt;
+		cmaj_flt = sig->cdata_wait.cmaj_flt;
+		cutime = sig->cdata_wait.cutime;
+		cstime = sig->cdata_wait.cstime;
+		cgtime = sig->cdata_wait.cgtime;
 		rsslim = ACCESS_ONCE(sig->rlim[RLIMIT_RSS].rlim_cur);
 
 		/* add up live thread stats at the group level */
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2671,7 +2671,8 @@ static int do_io_accounting(struct task_
 	if (whole && lock_task_sighand(task, &flags)) {
 		struct task_struct *t = task;
 
-		task_io_accounting_add(&acct, &task->signal->ioac);
+		task_io_accounting_add(&acct,
+				       &task->signal->cdata_wait.ioac);
 		while_each_thread(task, t)
 			task_io_accounting_add(&acct, &t->ioac);
 
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -31,6 +31,8 @@ extern struct fs_struct init_fs;
 	},								\
 	.cred_guard_mutex =						\
 		 __MUTEX_INITIALIZER(sig.cred_guard_mutex),		\
+	.acct_parent	= NULL,						\
+	.acct_children	= LIST_HEAD_INIT(sig.acct_children),		\
 }
 
 extern struct nsproxy init_nsproxy;
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -512,6 +512,20 @@ struct thread_group_cputimer {
 };
 
 /*
+ * Cumulative resource counters for reaped dead child processes.
+ * Live threads maintain their own counters and add to these
+ * in __exit_signal, except for the group leader.
+ */
+struct cdata {
+	cputime_t cutime, cstime, csttime, cgtime;
+	unsigned long cnvcsw, cnivcsw;
+	unsigned long cmin_flt, cmaj_flt;
+	unsigned long cinblock, coublock;
+	unsigned long cmaxrss;
+	struct task_io_accounting ioac;
+};
+
+/*
  * NOTE! "signal_struct" does not have it's own
  * locking, because a shared signal_struct always
  * implies a shared sighand_struct, so locking
@@ -578,23 +592,28 @@ struct signal_struct {
 
 	struct tty_struct *tty; /* NULL if no tty */
 
+	/* Cumulative resource counters for all dead child processes */
+	struct cdata cdata_wait; /* parents have done sys_wait() */
+	struct cdata cdata_acct; /* complete cumulative data from acct tree */
+
+	/* Parallel accounting tree */
+	struct signal_struct *acct_parent; /* accounting parent process */
+	struct list_head acct_children;    /* list of my accounting children */
+	struct list_head acct_sibling;     /* linkage in my parent's list */
+
 	/*
 	 * Cumulative resource counters for dead threads in the group,
-	 * and for reaped dead child processes forked by this group.
-	 * Live threads maintain their own counters and add to these
-	 * in __exit_signal, except for the group leader.
 	 */
-	cputime_t utime, stime, sttime, cutime, cstime, csttime;
+	cputime_t utime, stime, sttime;
+
 	cputime_t gtime;
-	cputime_t cgtime;
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
 	cputime_t prev_utime, prev_stime, prev_sttime;
 #endif
-	unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
-	unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
-	unsigned long inblock, oublock, cinblock, coublock;
-	unsigned long maxrss, cmaxrss;
-	struct task_io_accounting ioac;
+	unsigned long nvcsw, nivcsw;
+	unsigned long min_flt, maj_flt;
+	unsigned long inblock, oublock;
+	unsigned long maxrss;
 
 	/*
 	 * Cumulative ns of schedule CPU time fo dead threads in the
--- a/include/linux/taskstats.h
+++ b/include/linux/taskstats.h
@@ -169,6 +169,12 @@ struct taskstats {
 	__u64	time_ns;
 	__u32	ac_tgid;		/* Thread group ID */
 	__u64	ac_sttime;		/* Steal CPU time [usec] */
+	__u64	ac_cutime;		/* User CPU time of children [usec] */
+	__u64	ac_cstime;		/* System CPU time of children [usec] */
+	__u64	ac_csttime;		/* Steal CPU time of children [usec] */
+	__u64	ac_tutime;		/* User CPU time of threads [usec] */
+	__u64	ac_tstime;		/* System CPU time of threads [usec] */
+	__u64	ac_tsttime;		/* Steal CPU time of threads [usec] */
 };
 
 
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -51,6 +51,7 @@
 #include <trace/events/sched.h>
 #include <linux/hw_breakpoint.h>
 #include <linux/oom.h>
+#include <linux/kernel_stat.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -74,6 +75,85 @@ static void __unhash_process(struct task
 	list_del_rcu(&p->thread_group);
 }
 
+static void __account_ctime(struct task_struct *p, struct task_struct *parent,
+			    struct cdata *pcd, struct cdata *ccd)
+{
+	struct signal_struct *sig = p->signal;
+	cputime_t tgutime, tgstime, tgsttime;
+	unsigned long maxrss, flags;
+
+	/*
+	 * The resource counters for the group leader are in its
+	 * own task_struct.  Those for dead threads in the group
+	 * are in its signal_struct, as are those for the child
+	 * processes it has previously reaped.  All these
+	 * accumulate in the parent's signal_struct c* fields.
+	 *
+	 * We don't bother to take a lock here to protect these
+	 * p->signal fields, because they are only touched by
+	 * __exit_signal, which runs with tasklist_lock
+	 * write-locked anyway, and so is excluded here.  We do
+	 * need to protect the access to parent->signal fields,
+	 * as other threads in the parent group can be right
+	 * here reaping other children at the same time.
+	 *
+	 * We use thread_group_times() to get times for the thread
+	 * group, which consolidates times for all threads in the
+	 * group including the group leader.
+	 */
+	thread_group_times(p, &tgutime, &tgstime, &tgsttime);
+
+	spin_lock_irqsave(&parent->sighand->siglock, flags);
+	pcd->cutime = cputime_add(pcd->cutime,
+				  cputime_add(tgutime, ccd->cutime));
+	pcd->cstime = cputime_add(pcd->cstime,
+				  cputime_add(tgstime, ccd->cstime));
+	pcd->csttime = cputime_add(pcd->csttime,
+				   cputime_add(tgsttime, ccd->csttime));
+	pcd->cgtime = cputime_add(pcd->cgtime, cputime_add(p->gtime,
+			   cputime_add(sig->gtime, ccd->cgtime)));
+
+	pcd->cmin_flt += p->min_flt + sig->min_flt + ccd->cmin_flt;
+	pcd->cmaj_flt += p->maj_flt + sig->maj_flt + ccd->cmaj_flt;
+	pcd->cnvcsw += p->nvcsw + sig->nvcsw + ccd->cnvcsw;
+	pcd->cnivcsw += p->nivcsw + sig->nivcsw + ccd->cnivcsw;
+	pcd->cinblock += task_io_get_inblock(p) + sig->inblock + ccd->cinblock;
+	pcd->coublock += task_io_get_oublock(p) + sig->oublock + ccd->coublock;
+	maxrss = max(sig->maxrss, ccd->cmaxrss);
+	if (pcd->cmaxrss < maxrss)
+		pcd->cmaxrss = maxrss;
+
+	task_io_accounting_add(&pcd->ioac, &p->ioac);
+	task_io_accounting_add(&pcd->ioac, &ccd->ioac);
+	spin_unlock_irqrestore(&parent->sighand->siglock, flags);
+}
+
+static void reparent_acct(struct signal_struct *father)
+{
+	struct signal_struct *c, *n, *new_parent;
+	struct task_struct *tsk;
+
+	rcu_read_lock();
+	tsk = pid_task(father->leader_pid, PIDTYPE_PID);
+	if (!list_empty(&father->acct_children))
+		printk("XXX forget: %i (%s)\n", tsk->pid, tsk->comm);
+	new_parent = father->acct_parent;
+	list_for_each_entry_safe(c, n, &father->acct_children, acct_sibling) {
+		struct task_struct *ctsk = pid_task(c->leader_pid, PIDTYPE_PID);
+		struct task_struct *ptsk = pid_task(new_parent->leader_pid, PIDTYPE_PID);
+		printk("  XXX move: %i (%s) to %i (%s) %p\n",
+		       ctsk->pid, ctsk->comm, ptsk->pid, ptsk->comm, new_parent);
+		c->acct_parent = new_parent;
+		list_move_tail(&c->acct_sibling, &new_parent->acct_children);
+	}
+	rcu_read_unlock();
+}
+
 /*
  * This function expects the tasklist_lock write-locked.
  */
@@ -84,6 +164,29 @@ static void __exit_signal(struct task_st
 	struct sighand_struct *sighand;
 	struct tty_struct *uninitialized_var(tty);
 
+#ifdef __s390x__
+	/*
+	 * FIXME: On s390 we can call account_process_tick to update
+	 * CPU time information. This is probably not valid on other
+	 * architectures.
+	 */
+	if (current == tsk)
+		account_process_tick(current, 1);
+#endif
+	if (group_dead) {
+		struct task_struct *acct_parent =
+			pid_task(sig->acct_parent->leader_pid, PIDTYPE_PID);
+		/*
+		 * FIXME: This somehow has to be moved to
+		 * finish_task_switch(), because otherwise
+		 * if the process accounts itself, the CPU time
+		 * that is used for this code will be lost.
+		 */
+		__account_ctime(tsk, acct_parent, &sig->acct_parent->cdata_acct,
+				&sig->cdata_acct);
+		reparent_acct(sig);
+		list_del_init(&sig->acct_sibling);
+	}
 	sighand = rcu_dereference_check(tsk->sighand,
 					rcu_read_lock_held() ||
 					lockdep_tasklist_lock_is_held());
@@ -132,7 +235,8 @@ static void __exit_signal(struct task_st
 		sig->nivcsw += tsk->nivcsw;
 		sig->inblock += task_io_get_inblock(tsk);
 		sig->oublock += task_io_get_oublock(tsk);
-		task_io_accounting_add(&sig->ioac, &tsk->ioac);
+		task_io_accounting_add(&sig->cdata_wait.ioac,
+				       &tsk->ioac);
 		sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
 	}
 
@@ -1226,73 +1330,10 @@ static int wait_task_zombie(struct wait_
 	 * !task_detached() to filter out sub-threads.
 	 */
 	if (likely(!traced) && likely(!task_detached(p))) {
-		struct signal_struct *psig;
-		struct signal_struct *sig;
-		unsigned long maxrss;
-		cputime_t tgutime, tgstime, tgsttime;
-
-		/*
-		 * The resource counters for the group leader are in its
-		 * own task_struct.  Those for dead threads in the group
-		 * are in its signal_struct, as are those for the child
-		 * processes it has previously reaped.  All these
-		 * accumulate in the parent's signal_struct c* fields.
-		 *
-		 * We don't bother to take a lock here to protect these
-		 * p->signal fields, because they are only touched by
-		 * __exit_signal, which runs with tasklist_lock
-		 * write-locked anyway, and so is excluded here.  We do
-		 * need to protect the access to parent->signal fields,
-		 * as other threads in the parent group can be right
-		 * here reaping other children at the same time.
-		 *
-		 * We use thread_group_times() to get times for the thread
-		 * group, which consolidates times for all threads in the
-		 * group including the group leader.
-		 */
-		thread_group_times(p, &tgutime, &tgstime, &tgsttime);
-		spin_lock_irq(&p->real_parent->sighand->siglock);
-		psig = p->real_parent->signal;
-		sig = p->signal;
-		psig->cutime =
-			cputime_add(psig->cutime,
-			cputime_add(tgutime,
-				    sig->cutime));
-		psig->cstime =
-			cputime_add(psig->cstime,
-			cputime_add(tgstime,
-				    sig->cstime));
-		psig->csttime =
-			cputime_add(psig->csttime,
-			cputime_add(tgsttime,
-				    sig->csttime));
-		psig->cgtime =
-			cputime_add(psig->cgtime,
-			cputime_add(p->gtime,
-			cputime_add(sig->gtime,
-				    sig->cgtime)));
-		psig->cmin_flt +=
-			p->min_flt + sig->min_flt + sig->cmin_flt;
-		psig->cmaj_flt +=
-			p->maj_flt + sig->maj_flt + sig->cmaj_flt;
-		psig->cnvcsw +=
-			p->nvcsw + sig->nvcsw + sig->cnvcsw;
-		psig->cnivcsw +=
-			p->nivcsw + sig->nivcsw + sig->cnivcsw;
-		psig->cinblock +=
-			task_io_get_inblock(p) +
-			sig->inblock + sig->cinblock;
-		psig->coublock +=
-			task_io_get_oublock(p) +
-			sig->oublock + sig->coublock;
-		maxrss = max(sig->maxrss, sig->cmaxrss);
-		if (psig->cmaxrss < maxrss)
-			psig->cmaxrss = maxrss;
-		task_io_accounting_add(&psig->ioac, &p->ioac);
-		task_io_accounting_add(&psig->ioac, &sig->ioac);
-		spin_unlock_irq(&p->real_parent->sighand->siglock);
+		__account_ctime(p, p->real_parent,
+				&p->real_parent->signal->cdata_wait,
+				&p->signal->cdata_wait);
 	}
-
 	/*
 	 * Now we are sure this task is interesting, and no other
 	 * thread can reap it because we set its state to EXIT_DEAD.
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1289,6 +1289,13 @@ static struct task_struct *copy_process(
 		nr_threads++;
 	}
 
+	if (!(clone_flags & CLONE_THREAD)) {
+		p->signal->acct_parent = current->signal;
+		INIT_LIST_HEAD(&p->signal->acct_children);
+		INIT_LIST_HEAD(&p->signal->acct_sibling);
+		list_add_tail(&p->signal->acct_sibling,
+			      &current->signal->acct_children);
+	}
 	total_forks++;
 	spin_unlock(&current->sighand->siglock);
 	write_unlock_irq(&tasklist_lock);
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -884,8 +884,8 @@ void do_sys_times(struct tms *tms)
 
 	spin_lock_irq(&current->sighand->siglock);
 	thread_group_times(current, &tgutime, &tgstime, &tgsttime);
-	cutime = current->signal->cutime;
-	cstime = current->signal->cstime;
+	cutime = current->signal->cdata_wait.cutime;
+	cstime = current->signal->cdata_wait.cstime;
 	spin_unlock_irq(&current->sighand->siglock);
 	tms->tms_utime = cputime_to_clock_t(tgutime);
 	tms->tms_stime = cputime_to_clock_t(tgstime);
@@ -1490,6 +1490,7 @@ static void k_getrusage(struct task_stru
 	unsigned long flags;
 	cputime_t tgutime, tgstime, tgsttime, utime, stime, sttime;
 	unsigned long maxrss = 0;
+	struct cdata *cd;
 
 	memset((char *) r, 0, sizeof *r);
 	utime = stime = cputime_zero;
@@ -1507,15 +1508,16 @@ static void k_getrusage(struct task_stru
 	switch (who) {
 		case RUSAGE_BOTH:
 		case RUSAGE_CHILDREN:
-			utime = p->signal->cutime;
-			stime = p->signal->cstime;
-			r->ru_nvcsw = p->signal->cnvcsw;
-			r->ru_nivcsw = p->signal->cnivcsw;
-			r->ru_minflt = p->signal->cmin_flt;
-			r->ru_majflt = p->signal->cmaj_flt;
-			r->ru_inblock = p->signal->cinblock;
-			r->ru_oublock = p->signal->coublock;
-			maxrss = p->signal->cmaxrss;
+			cd = &p->signal->cdata_wait;
+			utime = cd->cutime;
+			stime = cd->cstime;
+			r->ru_nvcsw = cd->cnvcsw;
+			r->ru_nivcsw = cd->cnivcsw;
+			r->ru_minflt = cd->cmin_flt;
+			r->ru_majflt = cd->cmaj_flt;
+			r->ru_inblock = cd->cinblock;
+			r->ru_oublock = cd->coublock;
+			maxrss = cd->cmaxrss;
 
 			if (who == RUSAGE_CHILDREN)
 				break;
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -61,8 +61,37 @@ void bacct_add_tsk(struct taskstats *sta
 	tcred = __task_cred(tsk);
 	stats->ac_uid	 = tcred->uid;
 	stats->ac_gid	 = tcred->gid;
-	stats->ac_ppid	 = pid_alive(tsk) ?
-				rcu_dereference(tsk->real_parent)->tgid : 0;
+	/*
+	 * FIXME: Add new field "ac_acct_ppid" and "ac_acct_ctime" to
+	 * taskstats
+	 */
+	if (pid_alive(tsk) && tsk->signal) {
+		struct signal_struct *acct_parent = tsk->signal->acct_parent;
+		if (acct_parent && acct_parent->leader_pid)
+			stats->ac_ppid = pid_task(acct_parent->leader_pid,
+						  PIDTYPE_PID)->tgid;
+		else
+			stats->ac_ppid = 0;
+	} else {
+		stats->ac_ppid = 0;
+	}
+	if (tsk->signal && tsk->tgid == tsk->pid) {
+		struct cdata *cd = &tsk->signal->cdata_acct;
+
+		stats->ac_cutime = cputime_to_usecs(cd->cutime);
+		stats->ac_cstime = cputime_to_usecs(cd->cstime);
+		stats->ac_csttime = cputime_to_usecs(cd->csttime);
+		stats->ac_tutime = cputime_to_usecs(tsk->signal->utime);
+		stats->ac_tstime = cputime_to_usecs(tsk->signal->stime);
+		stats->ac_tsttime = cputime_to_usecs(tsk->signal->sttime);
+	} else {
+		stats->ac_cutime = 0;
+		stats->ac_cstime = 0;
+		stats->ac_csttime = 0;
+		stats->ac_tutime = 0;
+		stats->ac_tstime = 0;
+		stats->ac_tsttime = 0;
+	}
 	rcu_read_unlock();
 	stats->ac_utime = cputime_to_usecs(tsk->utime);
 	stats->ac_stime = cputime_to_usecs(tsk->stime);


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC][PATCH v2 6/7] taskstats: Fix accounting for non-leader thread exec
  2010-11-11 17:03 [RFC][PATCH v2 0/7] taskstats: Enhancements for precise process accounting (version 2) Michael Holzheu
                   ` (4 preceding siblings ...)
  2010-11-11 17:03 ` [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU " Michael Holzheu
@ 2010-11-11 17:03 ` Michael Holzheu
  2010-11-11 17:11 ` [RFC][PATCH v2 7/7] taskstats: Precise process accounting user space Michael Holzheu
  6 siblings, 0 replies; 52+ messages in thread
From: Michael Holzheu @ 2010-11-11 17:03 UTC (permalink / raw)
  To: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath
  Cc: linux-kernel, linux-s390

[-- Attachment #1: 06-taskstats-top-fix-exec.patch --]
[-- Type: text/plain, Size: 2453 bytes --]

From: Michael Holzheu <holzheu@linux.vnet.ibm.com>

If a non-leader thread calls exec(), in the de_thread() function it
gets some of the identity of the old thread group leader e.g. the PID
and the start time. But it keeps the old CPU times. This may lead to
confusion in user space, because CPU time can go backwards for the
thread group leader. For example the top command shows the following
for that test case:

# top -H
  PID USER  PR  NI  VIRT  RES  SHR S    %CPU      %MEM  TIME+   COMMAND  
17278 root  20   0 84128  664  500 R  >>9999.0<<  0.0  0:02.75 pthread_exec

To fix this problem, this patch exchanges the accounting data between the
exec() calling thread and the thread group leader in de_thread().
One problem with this patch could be that the scheduler might get confused
by the CPU time change.

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
 fs/exec.c |   22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

--- a/fs/exec.c
+++ b/fs/exec.c
@@ -844,16 +844,32 @@ static int de_thread(struct task_struct
 
 		/*
 		 * The only record we have of the real-time age of a
-		 * process, regardless of execs it's done, is start_time.
+		 * process, regardless of execs it's done, is start_time
+		 * and the accounting data.
 		 * All the past CPU time is accumulated in signal_struct
 		 * from sister threads now dead.  But in this non-leader
 		 * exec, nothing survives from the original leader thread,
 		 * whose birth marks the true age of this process now.
 		 * When we take on its identity by switching to its PID, we
-		 * also take its birthdate (always earlier than our own).
+		 * also take its birthdate (always earlier than our own)
+		 * and the accounting data.
 		 */
 		tsk->start_time = leader->start_time;
-
+		swap(tsk->nvcsw, leader->nvcsw);
+		swap(tsk->nivcsw, leader->nivcsw);
+		swap(tsk->min_flt, leader->min_flt);
+		swap(tsk->ioac, leader->ioac);
+		swap(tsk->utime, leader->utime);
+		swap(tsk->stime, leader->stime);
+		swap(tsk->gtime, leader->gtime);
+		swap(tsk->sttime, leader->sttime);
+		swap(tsk->acct_time, leader->acct_time);
+		swap(tsk->real_start_time, leader->real_start_time);
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+		swap(tsk->prev_utime, leader->prev_utime);
+		swap(tsk->prev_stime, leader->prev_stime);
+		swap(tsk->prev_sttime, leader->prev_sttime);
+#endif
 		BUG_ON(!same_thread_group(leader, tsk));
 		BUG_ON(has_group_leader_pid(tsk));
 		/*


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC][PATCH v2 7/7] taskstats: Precise process accounting user space
  2010-11-11 17:03 [RFC][PATCH v2 0/7] taskstats: Enhancements for precise process accounting (version 2) Michael Holzheu
                   ` (5 preceding siblings ...)
  2010-11-11 17:03 ` [RFC][PATCH v2 6/7] taskstats: Fix accounting for non-leader thread exec Michael Holzheu
@ 2010-11-11 17:11 ` Michael Holzheu
  6 siblings, 0 replies; 52+ messages in thread
From: Michael Holzheu @ 2010-11-11 17:11 UTC (permalink / raw)
  To: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath
  Cc: linux-kernel, linux-s390

[-- Attachment #1: Type: text/plain, Size: 5483 bytes --]

Taskstats user space

The attached tarball "s390-tools-taskstats.tar.bz2" contains user space code
that exploits the taskstats-top kernel patches. This is early code, and a
lot of work probably still has to be done here. The code should build
and work on all architectures, not only on s390.

libtaskstats user space library
-------------------------------
include/libtaskstats.h  API definition
libtaskstats_nl         API implementation based on libnl 1.1
libtaskstats_proc       Partial API implementation using new /proc/taskstats

libtaskstats snapshot user space library
----------------------------------------
include/libtaskstats_snap.h       API definition
libtaskstats_snap/snap_netlink.c  API implementation based on libtaskstats

Snapshot library test program
-----------------------------
ts_snap_test/ts_snap_test.c   Simple program that uses snapshot library

Precise top user space program (ptop)
-------------------------------------
ptop/dg_libtaskstats.c  Data gatherer using taskstats interface
                        To disable steal time calculation for non s390
                        modify l_calc_sttime_old() and replace "#if 1"
                        with "#if 0".
ptop/sd_core.c          Code for ctime accounting

HOWTO build:
============
1.Install libnl-1.1-5 and libnl-1.1-5-devel
  If this is not possible, you can still build the proc/taskstats based code:
  * Remove libtaskstats_nl from the top level Makefile
  * Remove ptop_old_nl, ptop_new_nl and ptop_snap_nl from the "ptop" Makefile
2.Build s390-tools:
  # tar xfv s390-tools.tar.bz2
  # cd s390-tools
  # make

HOWTO use ptop:
===============
Five versions of ptop are built in the ptop subdirectory:

* ptop_old_nl:    ptop using the old TASKSTATS_CMD_ATTR_PID netlink command
                  together with reading procfs to find running tasks
* ptop_new_nl:    ptop using the new TASKSTATS_CMD_ATTR_PIDS netlink command.
                  This tool only shows tasks that consumed CPU time in the
                  last interval.
* ptop_new_proc:  ptop using the new TASKSTATS_CMD_ATTR_PIDS ioctl on
                  /proc/taskstats.
                  This tool only shows tasks that consumed CPU time in the
                  last interval.
* ptop_snap_nl:   ptop using the snapshot library with underlying netlink
                  taskstats library
* ptop_snap_proc: ptop using the snapshot library with underlying taskstats
                  library that uses /proc/taskstats

First results (on s390):
========================

TEST1: System with many sleeping tasks
--------------------------------------

  for ((i=0; i < 1000; i++))
  do
         sleep 1000000 &
  done

  # ptop_new_proc

             VVVV
  pid   user  sys  ste  total  Name
  (#)    (%)  (%)  (%)    (%)  (str)
  541   0.37 2.39 0.10   2.87  top
  3645  2.13 1.12 0.14   3.39  ptop_old_nl
  3591  2.20 0.59 0.12   2.92  ptop_snap_nl
  3694  2.16 0.26 0.10   2.51  ptop_snap_proc
  3792  0.03 0.06 0.00   0.09  ptop_new_nl
  3743  0.03 0.05 0.00   0.07  ptop_new_proc
             ^^^^

The ptop user space code is not optimized for a large number of tasks;
therefore we should concentrate on the system (sys) time. The update time
is 2 seconds for all top programs.

* Old top command:
  Because top has to read about 1000 procfs directories, system time is very
  high (2.39%).

* ptop_new_xxx:
  Because only active tasks are transferred, the CPU consumption is very low
  (0.05-0.06% system time).

* ptop_snap_nl/ptop_old_nl:
  The new netlink TASKSTATS_CMD_ATTR_PIDS command consumes only about 50% of
  the CPU time (0.59%) compared to using multiple TASKSTATS_CMD_ATTR_PID
  commands and scanning procfs to find running tasks (ptop_old_nl / 1.12%).

* ptop_snap_proc/ptop_snap_nl:
  Using the proc/taskstats interface (0.26%) consumes much less system time
  than the netlink interface (0.59%).

TEST2: Show snapshot consistency with system that is 100% busy
--------------------------------------------------------------

  System with 2 CPUs:

  for ((i=0; i < $(cat /proc/cpuinfo  | grep "^processor" | wc -l); i++))
  do
       ./loop &
  done
  cd linux-2.6
  make -j 5

  # ptop_snap_proc
  pid     user  sys stea cuser  csys cstea xuser xsys xstea  total Name
  (#)      (%)  (%)  (%)   (%)   (%)   (%)   (%)  (%)   (%)    (%) (str)
  2802   43.22 0.35 0.22  0.00  0.00  0.00  0.00 0.00  0.00  43.79 loop
  2799   35.96 0.33 0.21  0.00  0.00  0.00  0.00 0.00  0.00  36.50 loop
  2811    0.04 0.05 0.00 23.22 12.97  0.19  0.00 0.00  0.00  36.46 make
  2796   35.80 0.32 0.19  0.00  0.00  0.00  0.00 0.00  0.00  36.30 loop
  2987   15.93 2.14 0.07  8.23  3.12  0.06  0.00 0.00  0.00  29.53 make
  3044   11.56 1.72 0.22  0.00  0.00  0.00  0.00 0.00  0.00  13.50 make
  3053    1.92 0.73 0.01  0.00  0.00  0.00  0.00 0.00  0.00   2.65 make
  ....
  V:V:S 144.76 6.24 1.22 31.44 16.09  0.25  0.00 0.00  0.00 200.00
                                                            ^^^^^^

  With the snapshot mechanism the sum of all tasks' CPU times is exactly
  200.00% in this testcase. The following CPU times are used:
  * user/sys/stea:    Time that has been consumed by the task itself
  * cuser/csys/cstea: All time that has been consumed by dead children of
                      the process
  * xuser/xsys/xstea: Time that has been consumed by dead threads of the
                      task's thread group

  Note that the CPU times on x86 are not as precise as on s390.



[-- Attachment #2: s390-tools-taskstats.tar.bz2 --]
[-- Type: application/x-bzip-compressed-tar, Size: 42754 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
  2010-11-11 17:03 ` [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU " Michael Holzheu
@ 2010-11-13 18:38   ` Oleg Nesterov
  2010-11-15 15:55     ` Martin Schwidefsky
                       ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Oleg Nesterov @ 2010-11-13 18:38 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Peter Zijlstra, Ingo Molnar, John stultz,
	Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath, linux-kernel, linux-s390

First of all, let me repeat, I am not going to discuss whether we need
these changes or not; I do not even know if I understand your motivation.

My only concern is correctness, but

On 11/11, Michael Holzheu wrote:
>
> * Add cumulative dead thread CPU times to taskstats
> * Introduce parallel accounting process hierarchy (based on discussions
>   with Oleg Nesterov and Roland McGrath)

Michael, I really think this patch does too many different things.
I forgot the details of our previous discussion, and I got lost
trying to understand the new version.

I already asked you to split these changes, perhaps you can do this?
Say, bacct_add_tsk() looks overcomplicated, the change in copy_process()
shouldn't introduce the new CLONE_THREAD check, not sure I understand
why release_task() was chosen for reparenting, other nits...

But it is not easy to discuss these completely different things
looking at the single patch.

Imho, it would be much better if you make a separate patch which
adds acct_parent/etc and implements the parallel hierarchy. This
also makes sense because it touches the core kernel.

Personally I think that even "struct cdata" + __account_ctime() helper
needs a separate patch, and perhaps this change makes sense by itself
as cleanup. And this way the "trivial" changes (like the changes in
k_getrusage) won't distract from the functional changes.

The final change should introduce cdata_acct and actually implement
the complete cumulative accounting.

At least, please distill parallel accounting process hierarchy.
We do not want bugs in this area ;)

What do you think?

Oleg.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC][PATCH v2 1/7] taskstats: Add new taskstats command TASKSTATS_CMD_ATTR_PIDS
  2010-11-11 17:03 ` [RFC][PATCH v2 1/7] taskstats: Add new taskstats command TASKSTATS_CMD_ATTR_PIDS Michael Holzheu
@ 2010-11-13 19:20   ` Peter Zijlstra
  2010-11-15 15:53     ` Michael Holzheu
  2010-11-13 19:39   ` Peter Zijlstra
  1 sibling, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2010-11-13 19:20 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Ingo Molnar, Oleg Nesterov, John stultz,
	Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath, linux-kernel, linux-s390

On Thu, 2010-11-11 at 18:03 +0100, Michael Holzheu wrote:
> As clock for 'now' and 'time' the sched_clock() function is used and the patch

> +       preempt_disable();
> +       stats->time_ns = sched_clock();
> +       preempt_enable();

> +       task_snap_time = sched_clock();

That's just plain broken...



> +       t->sched_info.last_depart = task_rq(t)->clock;

Are you sure you don't mean task_rq(t)->clock_task ? 




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC][PATCH v2 4/7] taskstats: Add per task steal time accounting
  2010-11-11 17:03 ` [RFC][PATCH v2 4/7] taskstats: Add per task steal time accounting Michael Holzheu
@ 2010-11-13 19:38   ` Peter Zijlstra
  2010-11-15 14:50     ` Martin Schwidefsky
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2010-11-13 19:38 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Ingo Molnar, Oleg Nesterov, John stultz,
	Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath, linux-kernel, linux-s390

On Thu, 2010-11-11 at 18:03 +0100, Michael Holzheu wrote:
> From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> 
> Currently steal time is only accounted for the whole system. With this
> patch we add steal time to the per task CPU time accounting.
> The triplet "user time", "system time" and "steal time" represents
> all consumed CPU time on hypervisor based systems. 

Does that really make sense? Its not like the hypervisor really knows
anything about tasks and won't steal from one? Its really a vcpu
feature.

What added benefit will all this extra accounting give?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC][PATCH v2 1/7] taskstats: Add new taskstats command TASKSTATS_CMD_ATTR_PIDS
  2010-11-11 17:03 ` [RFC][PATCH v2 1/7] taskstats: Add new taskstats command TASKSTATS_CMD_ATTR_PIDS Michael Holzheu
  2010-11-13 19:20   ` Peter Zijlstra
@ 2010-11-13 19:39   ` Peter Zijlstra
  2010-11-13 20:00     ` Balbir Singh
  2010-11-15 14:50     ` Michael Holzheu
  1 sibling, 2 replies; 52+ messages in thread
From: Peter Zijlstra @ 2010-11-13 19:39 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Ingo Molnar, Oleg Nesterov, John stultz,
	Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath, linux-kernel, linux-s390

On Thu, 2010-11-11 at 18:03 +0100, Michael Holzheu wrote:
> +       if (cmd_pids->cnt > 1000) // XXX socket buffer size check

What's the implication of this limit? Does that mean that if there's
more than 1000 tasks on the system the whole interface falls flat on its
face or does that mean we get at most 1000 tasks worth of information
per syscall?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC][PATCH v2 1/7] taskstats: Add new taskstats command TASKSTATS_CMD_ATTR_PIDS
  2010-11-13 19:39   ` Peter Zijlstra
@ 2010-11-13 20:00     ` Balbir Singh
  2010-11-15 14:50     ` Michael Holzheu
  1 sibling, 0 replies; 52+ messages in thread
From: Balbir Singh @ 2010-11-13 20:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Martin Schwidefsky, Heiko Carstens,
	Roland McGrath, linux-kernel, linux-s390

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2010-11-13 20:39:44]:

> On Thu, 2010-11-11 at 18:03 +0100, Michael Holzheu wrote:
> > +       if (cmd_pids->cnt > 1000) // XXX socket buffer size check
> 
> What's the implication of this limit? Does that mean that if there's
> more than 1000 tasks on the system the whole interface falls flat on its
> face or does that mean we get at most 1000 tasks worth of information
> per syscall?

Since the transport is not reliable, we need to ensure that data sent from
the kernel is not dropped, and does not arrive in user space only partially
or not at all.


-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC][PATCH v2 1/7] taskstats: Add new taskstats command TASKSTATS_CMD_ATTR_PIDS
  2010-11-13 19:39   ` Peter Zijlstra
  2010-11-13 20:00     ` Balbir Singh
@ 2010-11-15 14:50     ` Michael Holzheu
  1 sibling, 0 replies; 52+ messages in thread
From: Michael Holzheu @ 2010-11-15 14:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Ingo Molnar, Oleg Nesterov, John stultz,
	Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath, linux-kernel, linux-s390

Hello Peter,

On Sat, 2010-11-13 at 20:39 +0100, Peter Zijlstra wrote:
> On Thu, 2010-11-11 at 18:03 +0100, Michael Holzheu wrote:
> > +       if (cmd_pids->cnt > 1000) // XXX socket buffer size check
> 
> What's the implication of this limit? Does that mean that if there's
> more than 1000 tasks on the system the whole interface falls flat on its
> face or does that mean we get at most 1000 tasks worth of information
> per syscall?

In the final code I wanted to replace the check for 1000 with a check for
the available socket buffer size. Currently I am not sure if this is
possible.

"cmd_pids->cnt" is the number of taskstats that userspace requests from
the kernel. When the request is processed, the kernel sends cnt taskstats
structures to the netlink socket, where they are buffered until userspace
receives them. Because the socket buffer is limited in size, there is an
upper limit on the number of taskstats structures that can be received with
one TASKSTATS_CMD_ATTR_PIDS command. If the user wants to receive more than
that limit, he has to send multiple TASKSTATS_CMD_ATTR_PIDS commands, as in
the sketch below.
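
As an illustration (not the actual tool code), user space could page
through a large pid set roughly like this, where request_pids() is an
assumed helper that performs one TASKSTATS_CMD_ATTR_PIDS round trip:

#define PIDS_CHUNK 500	/* stay well below the socket buffer limit */

/* Assumed helper: one TASKSTATS_CMD_ATTR_PIDS round trip */
extern void request_pids(const int *pids, int cnt);

static void request_all(const int *pids, int npids)
{
	int i, cnt;

	for (i = 0; i < npids; i += PIDS_CHUNK) {
		cnt = npids - i < PIDS_CHUNK ? npids - i : PIDS_CHUNK;
		request_pids(&pids[i], cnt);
	}
}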

Michael





^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC][PATCH v2 4/7] taskstats: Add per task steal time accounting
  2010-11-13 19:38   ` Peter Zijlstra
@ 2010-11-15 14:50     ` Martin Schwidefsky
  2010-11-15 15:11       ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Martin Schwidefsky @ 2010-11-15 14:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Heiko Carstens,
	Roland McGrath, linux-kernel, linux-s390

On Sat, 13 Nov 2010 20:38:02 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Thu, 2010-11-11 at 18:03 +0100, Michael Holzheu wrote:
> > From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> > 
> > Currently steal time is only accounted for the whole system. With this
> > patch we add steal time to the per task CPU time accounting.
> > The triplet "user time", "system time" and "steal time" represents
> > all consumed CPU time on hypervisor based systems. 
> 
> Does that really make sense? Its not like the hypervisor really knows
> anything about tasks and won't steal from one? Its really a vcpu
> feature.
> 
> What added benefit will all this extra accounting give?

Currently the Linux kernel keeps track of used cpu cycles per task;
steal time is reported only per cpu. With the patch steal cycles are
reported per task just like used cpu cycles, giving the complete picture
on a per task basis. Without the patch you don't know if the task has
been waiting or got its cycles stolen. A matter of granularity.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC][PATCH v2 4/7] taskstats: Add per task steal time accounting
  2010-11-15 14:50     ` Martin Schwidefsky
@ 2010-11-15 15:11       ` Peter Zijlstra
  2010-11-15 17:42         ` Martin Schwidefsky
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2010-11-15 15:11 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Heiko Carstens,
	Roland McGrath, linux-kernel, linux-s390, jeremy.fitzhardinge

On Mon, 2010-11-15 at 15:50 +0100, Martin Schwidefsky wrote:
> On Sat, 13 Nov 2010 20:38:02 +0100
> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > On Thu, 2010-11-11 at 18:03 +0100, Michael Holzheu wrote:
> > > From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> > > 
> > > Currently steal time is only accounted for the whole system. With this
> > > patch we add steal time to the per task CPU time accounting.
> > > The triplet "user time", "system time" and "steal time" represents
> > > all consumed CPU time on hypervisor based systems. 
> > 
> > Does that really make sense? Its not like the hypervisor really knows
> > anything about tasks and won't steal from one? Its really a vcpu
> > feature.
> > 
> > What added benefit will all this extra accounting give?
> 
> Currently the Linux kernel keeps track of used cpu cycles per task;
> steal time is reported only per cpu. With the patch steal cycles are
> reported per task just like used cpu cycles, giving the complete picture
> on a per task basis. Without the patch you don't know if the task has
> been waiting or got its cycles stolen. A matter of granularity.

That doesn't answer my question at all. Why do you want to know? Also,
once we change the scheduler to not account steal time to tasks like it
currently does (as Jeremy has been proposing to do several times now)
this should become totally redundant as it will always be 0, no?

Thing is, all I'm seeing is overhead here, the vast majority of systems
simply don't have any steal time at all. So again, what does this buy us
except a gazillion wasted bytes and cycles?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC][PATCH v2 1/7] taskstats: Add new taskstats command TASKSTATS_CMD_ATTR_PIDS
  2010-11-13 19:20   ` Peter Zijlstra
@ 2010-11-15 15:53     ` Michael Holzheu
  2010-11-15 16:06       ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Michael Holzheu @ 2010-11-15 15:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Ingo Molnar, Oleg Nesterov, John stultz,
	Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath, linux-kernel, linux-s390

Hello Peter,

On Sat, 2010-11-13 at 20:20 +0100, Peter Zijlstra wrote:
> On Thu, 2010-11-11 at 18:03 +0100, Michael Holzheu wrote:
> > As clock for 'now' and 'time' the sched_clock() function is used and the patch
> 
> > +       preempt_disable();
> > +       stats->time_ns = sched_clock();
> > +       preempt_enable();
> 
> > +       task_snap_time = sched_clock();
> 
> That's just plain broken...

What exactly do you mean? Do you mean that we should not use
sched_clock() in general or that it is called twice?

> 
> 
> > +       t->sched_info.last_depart = task_rq(t)->clock;
> 
> Are you sure you don't mean task_rq(t)->clock_task ?

Maybe... I want to save in "last_depart" a sched_clock() timestamp that
is as precise as possible.

We use "last_depart" for the TASKSTATS_CMD_ATTR_PIDS command to find out
which tasks have been running on a CPU since the last taskstats
snapshot. We return all tasks where last_depart > MIN(stats->time_ns for
all tasks of last snapshot).

Michael




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
  2010-11-13 18:38   ` Oleg Nesterov
@ 2010-11-15 15:55     ` Martin Schwidefsky
  2010-11-15 16:03       ` Peter Zijlstra
  2010-11-16 16:57     ` Michael Holzheu
  2010-11-16 17:34     ` Michael Holzheu
  2 siblings, 1 reply; 52+ messages in thread
From: Martin Schwidefsky @ 2010-11-15 15:55 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Suresh Siddha, Peter Zijlstra, Ingo Molnar,
	John stultz, Thomas Gleixner, Balbir Singh, Heiko Carstens,
	Roland McGrath, linux-kernel, linux-s390

On Sat, 13 Nov 2010 19:38:10 +0100
Oleg Nesterov <oleg@redhat.com> wrote:

> First of all, let me repeat, I am not going to discuss whether we need
> these changes or not; I do not even know if I understand your motivation.

Hmm, is the motivation of the whole patch series unclear or just the details
in this one? What we want is a low-overhead tool that precisely shows
where the cpu spent its time (or didn't because of steal time). The
granularity target is tenths of microseconds, something that should be
possible with decent hardware. 
What makes life hard is the corner case of a task that has been reparented
to init. That is the reason for this complicated secondary accounting structure.

> My only concern is correctness, but

Your review effort to improve correctness is very much appreciated. We'll
try to split the patches to make review easier.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
  2010-11-15 15:55     ` Martin Schwidefsky
@ 2010-11-15 16:03       ` Peter Zijlstra
  2010-11-15 17:49         ` Martin Schwidefsky
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2010-11-15 16:03 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Oleg Nesterov, Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar, John stultz,
	Thomas Gleixner, Balbir Singh, Heiko Carstens, Roland McGrath,
	linux-kernel, linux-s390

On Mon, 2010-11-15 at 16:55 +0100, Martin Schwidefsky wrote:
>  What we want is a low-overhead tool that precisely shows
> where the cpu spent its time (or didn't because of steal time). The
> granularity target is tenths of microseconds, something that should be
> possible with decent hardware. 

To what purpose? 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC][PATCH v2 1/7] taskstats: Add new taskstats command TASKSTATS_CMD_ATTR_PIDS
  2010-11-15 15:53     ` Michael Holzheu
@ 2010-11-15 16:06       ` Peter Zijlstra
  2010-11-15 17:09         ` Michael Holzheu
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2010-11-15 16:06 UTC (permalink / raw)
  To: holzheu
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Ingo Molnar, Oleg Nesterov, John stultz,
	Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath, linux-kernel, linux-s390

On Mon, 2010-11-15 at 16:53 +0100, Michael Holzheu wrote:
> Hello Peter,
> 
> On Sat, 2010-11-13 at 20:20 +0100, Peter Zijlstra wrote:
> > On Thu, 2010-11-11 at 18:03 +0100, Michael Holzheu wrote:
> > > As clock for 'now' and 'time' the sched_clock() function is used and the patch
> > 
> > > +       preempt_disable();
> > > +       stats->time_ns = sched_clock();
> > > +       preempt_enable();
> > 
> > > +       task_snap_time = sched_clock();
> > 
> > That's just plain broken...
> 
> What exactly do you mean? Do you mean that we should not use
> sched_clock() in general or that it is called twice?

That you should not use sched_clock(), and if you do (you really
shouldn't) you should have disabled IRQs around it.

> > 
> > 
> > > +       t->sched_info.last_depart = task_rq(t)->clock;
> > 
> > Are you sure you don't mean task_rq(t)->clock_task ?
> 
> Maybe... I want to save in "last_depart" a sched_clock() timestamp that
> is as precise as possible.
> 
> We use "last_depart" for the TASKSTATS_CMD_ATTR_PIDS command to find out
> which tasks have been running on a CPU since the last taskstats
> snapshot. We return all tasks where last_depart > MIN(stats->time_ns for
> all tasks of last snapshot).

What does last departed mean? That is what timeline are you counting in?
Do you want time as tasks see it, or time as your wallclock sees it?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC][PATCH v2 1/7] taskstats: Add new taskstats command TASKSTATS_CMD_ATTR_PIDS
  2010-11-15 16:06       ` Peter Zijlstra
@ 2010-11-15 17:09         ` Michael Holzheu
  2010-11-15 17:21           ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Michael Holzheu @ 2010-11-15 17:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Ingo Molnar, Oleg Nesterov, John stultz,
	Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath, linux-kernel, linux-s390

Hello Peter,

On Mon, 2010-11-15 at 17:06 +0100, Peter Zijlstra wrote:
> On Mon, 2010-11-15 at 16:53 +0100, Michael Holzheu wrote:
> > Hello Peter,
> > 
> > On Sat, 2010-11-13 at 20:20 +0100, Peter Zijlstra wrote:
> > > On Thu, 2010-11-11 at 18:03 +0100, Michael Holzheu wrote:
> > > > As clock for 'now' and 'time' the sched_clock() function is used and the patch
> > > 
> > > > +       preempt_disable();
> > > > +       stats->time_ns = sched_clock();
> > > > +       preempt_enable();
> > > 
> > > > +       task_snap_time = sched_clock();
> > > 
> > > That's just plain broken...
> > 
> > What exactly do you mean? Do you mean that we should not use
> > sched_clock() in general or that it is called twice?
> 
> That you should not use sched_clock(),

What should we use instead?

> and if you do (you really
> shouldn't) you should have disabled IRQs around it.
> 
> > > 
> > > 
> > > > +       t->sched_info.last_depart = task_rq(t)->clock;
> > > 
> > > Are you sure you don't mean task_rq(t)->clock_task ?
> > 
> > Maybe... I want to save in "last_depart" a sched_clock() timestamp that
> > is as precise as possible.
> > 
> > We use "last_depart" for the TASKSTATS_CMD_ATTR_PIDS command to find out
> > which tasks have been running on a CPU since the last taskstats
> > snapshot. We return all tasks where last_depart > MIN(stats->time_ns for
> > all tasks of last snapshot).
> 
> What does last departed mean? That is what timeline are you counting in?
> Do you want time as tasks see it, or time as your wallclock sees it?

"last_depart" should be the timestamp where the task last left a CPU.

We assume that we can compare "last_depart" with "time_ns" in the
taskstats structure, if we use task_rq(t)->clock for last_depart and
sched_clock() for stats->time_ns. We also assume that we get wallclock
intervals in nanoseconds, if we look at two sched_clock() timestamps.

"stats->time_ns" is used as timestamp for the next snapshot query and
for calculation of the snapshot interval time. So there are three
important timestamps:
* struct task_struct:
  sched_info.last_depart: Last time task has left CPU
* struct taskstats:
  time_ns: Timestamp where taskstats data is generated
* struct cmd_pids:
  time_ns: Timestamp for the TASKSTATS_CMD_ATTR_PIDS command.

Example:
1. Get initial snapshot with cmd_pids->time_ns=0:
   - All tasks are returned.
     snapshot_time = MIN(stats->time_ns) for all received taskstats
2. Get second snapshot with cmd_pids->time_ns = snapshot_time
   - Only tasks that were active after "snapshot_time" are returned.

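In (hypothetical) user space code the loop could look like the following
sketch, where get_snapshot() is an assumed helper around
TASKSTATS_CMD_ATTR_PIDS and not an existing API:

#include <unistd.h>

struct snap_stats {
	unsigned long long time_ns;	/* from struct taskstats */
	/* ... remaining accounting fields ... */
};

/* Assumed helper: sends TASKSTATS_CMD_ATTR_PIDS with
 * cmd_pids->time_ns = since and returns the number of taskstats
 * structures stored in *out */
extern int get_snapshot(unsigned long long since, struct snap_stats **out);

static void snapshot_loop(void)
{
	unsigned long long snapshot_time = 0;

	for (;;) {
		struct snap_stats *s;
		unsigned long long min_time = ~0ULL;
		int i, n;

		n = get_snapshot(snapshot_time, &s);
		for (i = 0; i < n; i++)
			if (s[i].time_ns < min_time)
				min_time = s[i].time_ns;
		if (n > 0)
			snapshot_time = min_time;
		sleep(2);	/* update interval */
	}
}
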
Michael


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC][PATCH v2 1/7] taskstats: Add new taskstats command TASKSTATS_CMD_ATTR_PIDS
  2010-11-15 17:09         ` Michael Holzheu
@ 2010-11-15 17:21           ` Peter Zijlstra
  2010-11-16 12:16             ` Michael Holzheu
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2010-11-15 17:21 UTC (permalink / raw)
  To: holzheu
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Ingo Molnar, Oleg Nesterov, John stultz,
	Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath, linux-kernel, linux-s390

On Mon, 2010-11-15 at 18:09 +0100, Michael Holzheu wrote:

> > That you should not use sched_clock(),
> 
> What should we use instead?

Depends on what you want, look at kernel/sched_clock.c

> > What does last departed mean? That is what timeline are you counting in?
> > Do you want time as tasks see it, or time as your wallclock sees it?
> 
> "last_depart" should be the timestamp where the task last left a CPU.
> 
> We assume that we can compare "last_depart" with "time_ns" in the
> taskstats structure,

I think you assume I actually know anything about taskstats :-), it's the
thing I always say =n to in my config file and have so far happily
ignored all code of.

>  if we use task_rq(t)->clock for last_depart and
> sched_clock() for stats->time_ns.

Then you're up shit creek because rq->clock doesn't necessarily have any
correlation to sched_clock().

>  We also assume that we get wallclock
> intervals in nanoseconds, if we look at two sched_clock() timestamps.

Invalid assumption.

> "stats->time_ns" is used as timestamp for the next snapshot query and
> for calculation of the snapshot interval time. So there are three
> important timestamps:
> * struct task_struct:
>   sched_info.last_depart: Last time task has left CPU

So you're essentially replicating the data in
sched_entity::statistics::wait_start ?

> * struct taskstats:
>   time_ns: Timestamp where taskstats data is generated
> * sturuct cmd_pids:
>   time_ns: Timestamp for TASKSTATS_CMD_ATTR_PIDS command. 
> 
> Example:
> 1. Get initial snapshot with cmd_pids->time_ns=0:
>    - All tasks are returned.
>     snapshot_time = MIN(stats->time_ns) for all received taskstats
> 2. Get second snapshot with cmd_pids->time_ns = snapshot_time
>    - Only tasks that were active after "snapshot_time" are returned.

/me can only hope all this will only increase overhead for those of us
who actually use any of this..

I'm still upset over ->[us]timescaled existing, this patch set looks to
me like it will only upset me more.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC][PATCH v2 4/7] taskstats: Add per task steal time accounting
  2010-11-15 15:11       ` Peter Zijlstra
@ 2010-11-15 17:42         ` Martin Schwidefsky
  2010-11-15 17:45           ` Peter Zijlstra
                             ` (3 more replies)
  0 siblings, 4 replies; 52+ messages in thread
From: Martin Schwidefsky @ 2010-11-15 17:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Heiko Carstens,
	Roland McGrath, linux-kernel, linux-s390, jeremy.fitzhardinge

On Mon, 15 Nov 2010 16:11:23 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Mon, 2010-11-15 at 15:50 +0100, Martin Schwidefsky wrote:
> > On Sat, 13 Nov 2010 20:38:02 +0100
> > Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > 
> > > On Thu, 2010-11-11 at 18:03 +0100, Michael Holzheu wrote:
> > > > From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> > > > 
> > > > Currently steal time is only accounted for the whole system. With this
> > > > patch we add steal time to the per task CPU time accounting.
> > > > The triplet "user time", "system time" and "steal time" represents
> > > > all consumed CPU time on hypervisor based systems. 
> > > 
> > > Does that really make sense? Its not like the hypervisor really knows
> > > anything about tasks and won't steal from one? Its really a vcpu
> > > feature.
> > > 
> > > What added benefit will all this extra accounting give?
> > 
> > Currently the Linux kernel keeps track of used cpu cycles per task;
> > steal time is reported only per cpu. With the patch steal cycles are
> > reported per task just like used cpu cycles, giving the complete picture
> > on a per task basis. Without the patch you don't know if the task has
> > been waiting or got its cycles stolen. A matter of granularity.
> 
> That doesn't answer my question at all. Why do you want to know? Also,
> once we change the scheduler to not account steal time to tasks like it
> currently does (as Jeremy has been proposing to do several times now)
> this should become totally redundant as it will always be 0, no?

At least on s390 and powerpc we already do not account steal time to tasks;
the user and system time is "real" cpu time. I do not know if that is true for
ia64 as well which is the third architecture with VIRT_CPU_ACCOUNTING=y.
The steal time of a task tells us how much more progress a task could have
made if the hypervisor did not steal cpu. Now you could argue that the
steal time for a cpu is good enough for that purpose, but steal time is not
necessarily uniform over all tasks. And we already do calculate this number,
we just do not store it right now.

> Thing is, all I'm seeing is overhead here, the vast majority of systems
> simply don't have any steal time at all. So again, what does this buy us
> except a gazillion wasted bytes and cycles?

There are 40 bytes more in the task structure and a few instructions more
in account_steal_time. I would not call that gazillions wasted bytes and
cycles. It is a minimal overhead. Would you prefer another #ifdef orgy to
avoid the overhead for VIRT_CPU_ACCOUNTING=n? We can certainly do that.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC][PATCH v2 4/7] taskstats: Add per task steal time accounting
  2010-11-15 17:42         ` Martin Schwidefsky
@ 2010-11-15 17:45           ` Peter Zijlstra
  2010-11-15 17:47           ` Peter Zijlstra
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2010-11-15 17:45 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Heiko Carstens,
	Roland McGrath, linux-kernel, linux-s390, jeremy.fitzhardinge

On Mon, 2010-11-15 at 18:42 +0100, Martin Schwidefsky wrote:
> I do not know if that is true for
> ia64 as well, which is the third architecture with VIRT_CPU_ACCOUNTING=y.

IIRC ia64 uses VIRT_CPU_ACCOUNTING to get high-res task accounting; it
was done before CFS came around and used ns-based accounting for all
tasks.




* Re: [RFC][PATCH v2 4/7] taskstats: Add per task steal time accounting
  2010-11-15 17:42         ` Martin Schwidefsky
  2010-11-15 17:45           ` Peter Zijlstra
@ 2010-11-15 17:47           ` Peter Zijlstra
  2010-11-15 17:48           ` Peter Zijlstra
  2010-11-15 17:50           ` Peter Zijlstra
  3 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2010-11-15 17:47 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Heiko Carstens,
	Roland McGrath, linux-kernel, linux-s390, jeremy.fitzhardinge

On Mon, 2010-11-15 at 18:42 +0100, Martin Schwidefsky wrote:
> There are 40 bytes more in the task structure and a few instructions more
> in account_steal_time. I would not call that gazillions of wasted bytes and
> cycles. 

Taken over all machines running Linux on real hardware, that's gobs of
memory wasted and lots of cycles spent adding 0s.




* Re: [RFC][PATCH v2 4/7] taskstats: Add per task steal time accounting
  2010-11-15 17:42         ` Martin Schwidefsky
  2010-11-15 17:45           ` Peter Zijlstra
  2010-11-15 17:47           ` Peter Zijlstra
@ 2010-11-15 17:48           ` Peter Zijlstra
  2010-11-15 17:50           ` Peter Zijlstra
  3 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2010-11-15 17:48 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Heiko Carstens,
	Roland McGrath, linux-kernel, linux-s390, jeremy.fitzhardinge

On Mon, 2010-11-15 at 18:42 +0100, Martin Schwidefsky wrote:
> Would you prefer another #ifdef orgy to
> avoid the overhead for VIRT_CPU_ACCOUNTING=n? We can certainly do
> that.
> 
That still doesn't answer the question: why do you want this? What is
per task steal time accounting good for? What problem is being solved?


* Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
  2010-11-15 16:03       ` Peter Zijlstra
@ 2010-11-15 17:49         ` Martin Schwidefsky
  2010-11-15 17:51           ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Martin Schwidefsky @ 2010-11-15 17:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar, John stultz,
	Thomas Gleixner, Balbir Singh, Heiko Carstens, Roland McGrath,
	linux-kernel, linux-s390

On Mon, 15 Nov 2010 17:03:56 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Mon, 2010-11-15 at 16:55 +0100, Martin Schwidefsky wrote:
> >  What we want is a low-overhead tool that precisely shows
> > where the cpu spent its time (or didn't because of steal time). The
> > granularity target is tenths of microseconds, something that should be
> > possible with decent hardware. 
> 
> To what purpose? 
 
Is that a trick question? Why do we have tools like "top"? Or process
accounting? The point is that the quality of the numbers we get right
now is rather bad; the overhead of scanning /proc is horrendous and
the 10ms granularity is rather coarse.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



* Re: [RFC][PATCH v2 4/7] taskstats: Add per task steal time accounting
  2010-11-15 17:42         ` Martin Schwidefsky
                             ` (2 preceding siblings ...)
  2010-11-15 17:48           ` Peter Zijlstra
@ 2010-11-15 17:50           ` Peter Zijlstra
  2010-11-15 17:59             ` Martin Schwidefsky
  3 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2010-11-15 17:50 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Heiko Carstens,
	Roland McGrath, linux-kernel, linux-s390, jeremy.fitzhardinge

On Mon, 2010-11-15 at 18:42 +0100, Martin Schwidefsky wrote:
> The steal time of a task tells us how much more progress a task could have
> made if the hypervisor had not stolen cpu. Now you could argue that the
> steal time for a cpu is good enough for that purpose, but steal time is not
> necessarily uniform over all tasks. And we already calculate this number,
> we just do not store it right now. 

If you make the scheduler take steal time into account like Jeremy
proposed, then you schedule on serviced time and the steal time gain is
proportional to the existing service distribution.

Still, once you know, then what are you going to do about it? Are you
going to prevent the hypervisor from scheduling when that one task is
running?

What good is knowing something you cannot do anything about?


* Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
  2010-11-15 17:49         ` Martin Schwidefsky
@ 2010-11-15 17:51           ` Peter Zijlstra
  2010-11-15 18:00             ` Martin Schwidefsky
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2010-11-15 17:51 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Oleg Nesterov, Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar, John stultz,
	Thomas Gleixner, Balbir Singh, Heiko Carstens, Roland McGrath,
	linux-kernel, linux-s390

On Mon, 2010-11-15 at 18:49 +0100, Martin Schwidefsky wrote:
> > To what purpose? 
>  
> Is that a trick question? Why do we have tools like "top"? Or process
> accounting? The point is that the quality of the numbers we get right
> now is rather bad; the overhead of scanning /proc is horrendous and
> the 10ms granularity is rather coarse. 

But you're not just replacing top, you're adding all kinds of new
accounting crap all over the place.


* Re: [RFC][PATCH v2 4/7] taskstats: Add per task steal time accounting
  2010-11-15 17:50           ` Peter Zijlstra
@ 2010-11-15 17:59             ` Martin Schwidefsky
  2010-11-15 18:08               ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Martin Schwidefsky @ 2010-11-15 17:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Heiko Carstens,
	Roland McGrath, linux-kernel, linux-s390, jeremy.fitzhardinge

On Mon, 15 Nov 2010 18:50:41 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Mon, 2010-11-15 at 18:42 +0100, Martin Schwidefsky wrote:
> > The steal time of a task tells us how much more progress a task could have
> > made if the hypervisor had not stolen cpu. Now you could argue that the
> > steal time for a cpu is good enough for that purpose, but steal time is not
> > necessarily uniform over all tasks. And we already calculate this number,
> > we just do not store it right now. 
> 
> If you make the scheduler take steal time into account like Jeremy
> proposed, then you schedule on serviced time and the steal time gain is
> proportional to the existing service distribution.
> 
> Still, once you know, then what are you going to do about it? Are you
> going to prevent the hypervisor from scheduling when that one task is
> running?
> 
> What good is knowing something you cannot do anything about?

Steal time per task is at least good for performance problem analysis.
Sometimes knowing what is not the cause of a performance problem can help you
tremendously. If a task is slow and has no steal time, well then the hypervisor
is likely not the culprit. On the other hand, seeing lots of steal time
for a task while the rest of the system doesn't show any steal time can tell
you something as well. That task might hit a specific function which causes
hypervisor overhead. The usefulness depends on the situation; it is another
data point which may or may not help you.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



* Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
  2010-11-15 17:51           ` Peter Zijlstra
@ 2010-11-15 18:00             ` Martin Schwidefsky
  2010-11-15 18:10               ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Martin Schwidefsky @ 2010-11-15 18:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar, John stultz,
	Thomas Gleixner, Balbir Singh, Heiko Carstens, Roland McGrath,
	linux-kernel, linux-s390

On Mon, 15 Nov 2010 18:51:57 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Mon, 2010-11-15 at 18:49 +0100, Martin Schwidefsky wrote:
> > > To what purpose? 
> >  
> > Is that a trick question? Why do we have tools like "top"? Or process
> > accounting? The point is that the quality of the numbers we get right
> > now is rather bad; the overhead of scanning /proc is horrendous and
> > the 10ms granularity is rather coarse. 
> 
> But you're not just replacing top, you're adding all kinds of new
> accounting crap all over the place.

We DO replace top. Patch #7 of 7.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



* Re: [RFC][PATCH v2 4/7] taskstats: Add per task steal time accounting
  2010-11-15 17:59             ` Martin Schwidefsky
@ 2010-11-15 18:08               ` Peter Zijlstra
  2010-11-16  8:51                 ` Martin Schwidefsky
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2010-11-15 18:08 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Heiko Carstens,
	Roland McGrath, linux-kernel, linux-s390, jeremy.fitzhardinge

On Mon, 2010-11-15 at 18:59 +0100, Martin Schwidefsky wrote:
> Steal time per task is at least good for performance problem analysis.
> Sometimes knowing what is not the cause of a performance problem can help you
> tremendously. If a task is slow and has no steal time, well then the hypervisor
> is likely not the culprit. On the other hand, seeing lots of steal time
> for a task while the rest of the system doesn't show any steal time can tell
> you something as well. That task might hit a specific function which causes
> hypervisor overhead. The usefulness depends on the situation; it is another
> data point which may or may not help you. 

If performance analysis is the only reason, why not add a tracepoint on
vcpu enter that reports the duration the vcpu was out for and use perf
to gather said data? It can tell you what process was running and what
instruction it was at when the vcpu went away.

No need to add 40 bytes per task for that.
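
For illustration, a guest-side tracepoint of that kind could look roughly
like this sketch (the event name and the way stolen_ns is obtained are
made up, and the usual trace header boilerplate is omitted):

	/* fired when the guest notices it got the cpu back; stolen_ns is
	 * the interval derived from the steal clock */
	TRACE_EVENT(guest_vcpu_resume,

		TP_PROTO(u64 stolen_ns),

		TP_ARGS(stolen_ns),

		TP_STRUCT__entry(
			__field(u64, stolen_ns)
		),

		TP_fast_assign(
			__entry->stolen_ns = stolen_ns;
		),

		TP_printk("stolen_ns=%llu",
			  (unsigned long long)__entry->stolen_ns)
	);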


* Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
  2010-11-15 18:00             ` Martin Schwidefsky
@ 2010-11-15 18:10               ` Peter Zijlstra
  2010-11-16  8:54                 ` Martin Schwidefsky
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2010-11-15 18:10 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Oleg Nesterov, Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar, John stultz,
	Thomas Gleixner, Balbir Singh, Heiko Carstens, Roland McGrath,
	linux-kernel, linux-s390

On Mon, 2010-11-15 at 19:00 +0100, Martin Schwidefsky wrote:
> On Mon, 15 Nov 2010 18:51:57 +0100
> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > On Mon, 2010-11-15 at 18:49 +0100, Martin Schwidefsky wrote:
> > > > To what purpose? 
> > >  
> > > Is that a trick question? Why do we have tools like "top"? Or process
> > > accounting? The point is that the quality of the numbers we get right
> > > now is rather bad; the overhead of scanning /proc is horrendous and
> > > the 10ms granularity is rather coarse. 
> > 
> > But you're not just replacing top, you're adding all kinds of new
> > accounting crap all over the place.
> 
> We DO replace top. Patch #7 of 7.

You _also_ replace top, but it's far from the only thing you do. If you
simply replaced top you wouldn't need to add tons of extra accounting,
would you?


* Re: [RFC][PATCH v2 4/7] taskstats: Add per task steal time accounting
  2010-11-15 18:08               ` Peter Zijlstra
@ 2010-11-16  8:51                 ` Martin Schwidefsky
  2010-11-16 12:16                   ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Martin Schwidefsky @ 2010-11-16  8:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Heiko Carstens,
	Roland McGrath, linux-kernel, linux-s390, jeremy.fitzhardinge

On Mon, 15 Nov 2010 19:08:44 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Mon, 2010-11-15 at 18:59 +0100, Martin Schwidefsky wrote:
> > Steal time per task is at least good for performance problem analysis.
> > Sometimes knowing what is not the cause of a performance problem can help you
> > tremendously. If a task is slow and has no steal time, well then the hypervisor
> > is likely not the culprit. On the other hand, seeing lots of steal time
> > for a task while the rest of the system doesn't show any steal time can tell
> > you something as well. That task might hit a specific function which causes
> > hypervisor overhead. The usefulness depends on the situation; it is another
> > data point which may or may not help you. 
> 
> If performance analysis is the only reason, why not add a tracepoint on
> vcpu enter that reports the duration the vcpu was out for and use perf
> to gather said data? It can tell you what process was running and what
> instruction it was at when the vcpu went away.
> 
> No need to add 40 bytes per task for that.

Which vcpu enter? We usually have z/VM as our hypervisor and want to be able
to do performance analysis with the data we gather inside the guest. There
is no vcpu enter we could put a tracepoint on. We would have to put tracepoints
on every possible interaction point with z/VM to get this data. To me it seems
a lot simpler to add the per-task steal time.

And if it is really the additional 40 bytes on x86 that bother you so much,
we could put them behind #ifdef CONFIG_VIRT_CPU_ACCOUNTING. There already
is one in the task_struct for prev_utime and prev_stime. 

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



* Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
  2010-11-15 18:10               ` Peter Zijlstra
@ 2010-11-16  8:54                 ` Martin Schwidefsky
  0 siblings, 0 replies; 52+ messages in thread
From: Martin Schwidefsky @ 2010-11-16  8:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar, John stultz,
	Thomas Gleixner, Balbir Singh, Heiko Carstens, Roland McGrath,
	linux-kernel, linux-s390

On Mon, 15 Nov 2010 19:10:06 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Mon, 2010-11-15 at 19:00 +0100, Martin Schwidefsky wrote:
> > On Mon, 15 Nov 2010 18:51:57 +0100
> > Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > 
> > > On Mon, 2010-11-15 at 18:49 +0100, Martin Schwidefsky wrote:
> > > > > To what purpose? 
> > > >  
> > > > Is that a trick question? Why do we have tools like "top"? Or process
> > > > accounting? The point is that the quality of the numbers we get right
> > > > now is rather bad; the overhead of scanning /proc is horrendous and
> > > > the 10ms granularity is rather coarse. 
> > > 
> > > But you're not just replacing top, you're adding all kinds of new
> > > accounting crap all over the place.
> > 
> > We DO replace top. Patch #7 of 7.
> 
> You _also_ replace top, but it's far from the only thing you do. If you
> simply replaced top you wouldn't need to add tons of extra accounting,
> would you?

There are basically two things we want to accomplish:
1) Make top faster by replacing the /proc based data gathering with taskstats.
   To really make it go fast with taskstats we filter with last_depart to
   read only tasks that have changed since the last snapshot.
2) Make top more precise. That is where all of the extra accounting comes
   into play.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



* Re: [RFC][PATCH v2 4/7] taskstats: Add per task steal time accounting
  2010-11-16  8:51                 ` Martin Schwidefsky
@ 2010-11-16 12:16                   ` Peter Zijlstra
  2010-11-16 15:33                     ` Martin Schwidefsky
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2010-11-16 12:16 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Heiko Carstens,
	Roland McGrath, linux-kernel, linux-s390, jeremy.fitzhardinge,
	Avi Kivity

On Tue, 2010-11-16 at 09:51 +0100, Martin Schwidefsky wrote:
> On Mon, 15 Nov 2010 19:08:44 +0100
> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > On Mon, 2010-11-15 at 18:59 +0100, Martin Schwidefsky wrote:
> > > Steal time per task is at least good for performance problem analysis.
> > > Sometimes knowing what is not the cause of a performance problem can help you
> > > tremendously. If a task is slow and has no steal time, well then the hypervisor
> > > is likely not the culprit. On the other hand, seeing lots of steal time
> > > for a task while the rest of the system doesn't show any steal time can tell
> > > you something as well. That task might hit a specific function which causes
> > > hypervisor overhead. The usefulness depends on the situation; it is another
> > > data point which may or may not help you. 
> > 
> > If performance analysis is the only reason, why not add a tracepoint on
> > vcpu enter that reports the duration the vcpu was out for and use perf
> > to gather said data? It can tell you what process was running and what
> > instruction it was at when the vcpu went away.
> > 
> > No need to add 40 bytes per task for that.
> 
> Which vcpu enter? We usually have z/VM as our hypervisor and want to be able
> to do performance analysis with the data we gather inside the guest. There
> is no vcpu enter we could put a tracepoint on. We would have to put tracepoints
> on every possible interaction point with z/VM to get this data. To me it seems
> a lot simpler to add the per-task steal time.

Oh, you guys don't have a hypercall wrapper to exploit? Because from
what I heard from the kvm/xen/lguest people I gathered they could in
fact do something like I proposed.

In fact, kvm seems to already have these tracepoints: kvm_exit/kvm_entry
and it has a separate explicit hypercall tracepoint as well:
kvm_hypercall.

Except that the per-task steal time gives you a lot less detail; being
able to profile on vcpu exit/enter gives you a much more powerful
performance tool. Aside from being able to measure the steal-time it
allows you to instantly find hypercalls (both explicit and
implicit), so you can also measure the hypercall-induced steal-time as
well.

> And if it is really the additional 40 bytes on x86 that bother you so much,
> we could put them behind #ifdef CONFIG_VIRT_CPU_ACCOUNTING. There already
> is one in the task_struct for prev_utime and prev_stime. 

Making it configurable would definitely help the embedded people; not
sure about VIRT_CPU_ACCOUNTING though, I bet the x86 virt weird^Wpeople
would like it too -- if only to strive for feature parity if nothing
else :/

It's just that I'm not at all convinced it's the best approach to solve
the problem posed, and once it's committed we're stuck with it due to
ABI.

We should be very careful not to die a death of a thousand cuts with all
this accounting madness; there's way too much weird-ass process
accounting junk that adds ABI constraints as it is.

I think it's definitely worth investing extra time to implement these
tracepoints if at all possible on your architecture before committing
yourself to something like this.



* Re: [RFC][PATCH v2 1/7] taskstats: Add new taskstats command TASKSTATS_CMD_ATTR_PIDS
  2010-11-15 17:21           ` Peter Zijlstra
@ 2010-11-16 12:16             ` Michael Holzheu
  2010-11-16 12:36               ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Michael Holzheu @ 2010-11-16 12:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Ingo Molnar, Oleg Nesterov, John stultz,
	Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath, linux-kernel, linux-s390

Hello Peter,

On Mon, 2010-11-15 at 18:21 +0100, Peter Zijlstra wrote:
> On Mon, 2010-11-15 at 18:09 +0100, Michael Holzheu wrote:
> 
> > > That you should not use sched_clock(),
> > 
> > What should we use instead?
> 
> Depends on what you want, look at kernel/sched_clock.c
> 
> > > What does last departed mean? That is, what timeline are you counting in?
> > > Do you want time as tasks see it, or time as your wallclock sees it?
> > 
> > "last_depart" should be the time stamp, where the task has left a CPU
> > the last time.
> > 
> > We assume that we can compare "last_depart" with "time_ns" in the
> > taskstats structure,
> 
> I think you assume I actually know anything about taskstat :-); it's the
> thing I always say =n to in my config file and have so far happily
> ignored all code of.
> 
> >  if we use task_rq(t)->clock for last_depart and
> > sched_clock() for stats->time_ns.
> 
> Then you're up shit creek because rq->clock doesn't necessarily have any
> correlation to sched_clock().
> 
> >  We also assume that we get wallclock
> > intervals in nanoseconds, if we look at two sched_clock() timestamps.
> 
> Invalid assumption.

Ok, thanks. So sched_clock() seems to be a bad idea for our purposes.

An alternative approach could be to have a global counter for the task
snapshots, which is increased each time a snapshot is created for
userspace. In addition to that we would have to add a snapshot counter field to
the task_struct that is set to the current value of the global counter
each time a task leaves a CPU. Then userspace could ask for all tasks
that have been active after snapshot number x. In the response userspace
gets all tasks that have a snapshot number bigger than x together with
the new snapshot number y that can be used for the next query.
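
A rough sketch of that idea (all names made up, locking only sketched):

	/* global generation counter, bumped when a snapshot is taken */
	static atomic64_t snapshot_gen = ATOMIC64_INIT(0);

	/* context switch path: remember the current generation in the
	 * (hypothetical) new task_struct field snapshot_gen */
	static inline void task_record_activity(struct task_struct *prev)
	{
		prev->snapshot_gen = atomic64_read(&snapshot_gen);
	}

	/* snapshot request: report all tasks active since the caller's
	 * last snapshot x, return the new generation for the next query */
	static u64 taskstats_take_snapshot(u64 x)
	{
		struct task_struct *g, *t;
		u64 y = atomic64_inc_return(&snapshot_gen);

		rcu_read_lock();
		do_each_thread(g, t) {
			if (t->snapshot_gen >= x)
				;	/* add t's taskstats to the reply */
		} while_each_thread(g, t);
		rcu_read_unlock();
		return y;
	}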

Still, it would be useful to add a timestamp of the creation of the
taskstats data in the response to userspace for calculating the interval
time between two snapshots. Would the usage of ktime_get() be valid for
that purpose?

Michael



* Re: [RFC][PATCH v2 1/7] taskstats: Add new taskstats command TASKSTATS_CMD_ATTR_PIDS
  2010-11-16 12:16             ` Michael Holzheu
@ 2010-11-16 12:36               ` Peter Zijlstra
  0 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2010-11-16 12:36 UTC (permalink / raw)
  To: holzheu
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Ingo Molnar, Oleg Nesterov, John stultz,
	Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath, linux-kernel, linux-s390

On Tue, 2010-11-16 at 13:16 +0100, Michael Holzheu wrote:

> Ok, thanks. So sched_clock() seems to be a bad idea for our purposes.
> 
> An alternative approach could be to have a global counter for the task
> snapshots, which is increased each time a snapshot is created for
> userspace. In addition to that we would have to add a snapshot counter field to
> the task_struct that is set to the current value of the global counter
> each time a task leaves a CPU. Then userspace could ask for all tasks
> that have been active after snapshot number x. In the response userspace
> gets all tasks that have a snapshot number bigger than x together with
> the new snapshot number y that can be used for the next query.
> 
> Still, it would be useful to add a timestamp of the creation of the
> taskstats data in the response to userspace for calculating the interval
> time between two snapshots. Would the usage of ktime_get() be valid for
> that purpose?

ktime_get() can be insanely slow, but yes that's an option. Another
option is using local_clock() and living with the fact that it may be
out of sync (up to a jiffy or so) between CPUs.

The advantage of using ktime_get() as opposed to any other clock is that
userspace has access to it as well through clock_gettime(CLOCK_MONOTONIC),
although that's arguably not too relevant when all you're interested in
is deltas.
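
For reference, the userspace side of such an interval measurement is
trivial (link with -lrt on older glibc):

	#include <stdio.h>
	#include <time.h>

	int main(void)
	{
		struct timespec t1, t2;

		clock_gettime(CLOCK_MONOTONIC, &t1);
		/* ... take taskstats snapshot 1, wait, take snapshot 2 ... */
		clock_gettime(CLOCK_MONOTONIC, &t2);

		printf("interval: %lld ns\n",
		       (long long)(t2.tv_sec - t1.tv_sec) * 1000000000LL +
		       (t2.tv_nsec - t1.tv_nsec));
		return 0;
	}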





* Re: [RFC][PATCH v2 4/7] taskstats: Add per task steal time accounting
  2010-11-16 12:16                   ` Peter Zijlstra
@ 2010-11-16 15:33                     ` Martin Schwidefsky
  2010-11-16 15:45                       ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Martin Schwidefsky @ 2010-11-16 15:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Heiko Carstens,
	Roland McGrath, linux-kernel, linux-s390, jeremy.fitzhardinge,
	Avi Kivity

On Tue, 16 Nov 2010 13:16:08 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Tue, 2010-11-16 at 09:51 +0100, Martin Schwidefsky wrote:
> > On Mon, 15 Nov 2010 19:08:44 +0100
> > Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > 
> > > On Mon, 2010-11-15 at 18:59 +0100, Martin Schwidefsky wrote:
> > > > Steal time per task is at least good for performance problem analysis.
> > > > Sometimes knowing what is not the cause of a performance problem can help you
> > > > tremendously. If a task is slow and has no steal time, well then the hypervisor
> > > > is likely not the culprit. On the other hand if you do see lots of steal time
> > > > for a task while the rest of the system doesn't cause any steal time can tell
> > > > you something as well. That task might hit a specific function which causes
> > > > hypervisor overhead. The usefulness depends on the situation, it is another
> > > > data point which may or may not help you. 
> > > 
> > > If performance analysis is the only reason, why not add a tracepoint on
> > > vcpu enter that reports the duration the vcpu was out for and use perf
> > > to gather said data? It can tell you what process was running and what
> > > instruction it was at when the vcpu went away.
> > > 
> > > No need to add 40 bytes per task for that.
> > 
> > Which vcpu enter? We usually have z/VM as our hypervisor and want to be able
> > to do performance analysis with the data we gather inside the guest. There
> > is no vcpu enter we could put a tracepoint on. We would have to put tracepoints
> > on every possible interaction point with z/VM to get this data. To me it seems
> > a lot simpler to add the per-task steal time.
> 
> Oh, you guys don't have a hypercall wrapper to exploit? Because from
> what I heard from the kvm/xen/lguest people I gathered they could in
> fact do something like I proposed.

We could do something along the lines of a hypercall wrapper (we would call
it a diagnose wrapper, same thing). The diagnoses we have on s390 do vastly
different things, so it is not easy to have a common diagnose wrapper.
It would be easier to add a tracepoint for each diagnose inline assembly.

> In fact, kvm seems to already have these tracepoints: kvm_exit/kvm_entry
> and it has a separate explicit hypercall tracepoint as well:
> kvm_hypercall.

But the kvm tracepoints are used when Linux is the hypervisor, no? For our
situation that would be a tracepoint in z/VM - or the equivalent. This is
out of scope of this patch.

> Except that the per-task steal time gives you a lot less detail; being
> able to profile on vcpu exit/enter gives you a much more powerful
> performance tool. Aside from being able to measure the steal-time it
> allows you to instantly find hypercalls (both explicit and
> implicit), so you can also measure the hypercall-induced steal-time as
> well.

Yes and no. The tracepoint idea looks interesting in itself. But that does
not completely replace the per-task steal time. The hypervisor can take
away the cpu anytime; it is still interesting to know which task was hit
hardest by that. You could view the cpu time lost by a hypercall as
"synchronous" steal time for the task, and the remaining delta to the total
per-task steal time as "asynchronous" steal time.
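
As a sketch (field names made up), the decomposition would simply be:

	u64 steal_sync  = t->steal_hcall;	/* observed at traced hypercalls */
	u64 steal_async = t->steal_total - steal_sync; /* stolen at arbitrary points */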

> > And if it is really the additional 40 bytes on x86 that bother you so much,
> > we could put them behind #ifdef CONFIG_VIRT_CPU_ACCOUNTING. There already
> > is one in the task_struct for prev_utime and prev_stime. 
> 
> Making it configurable would definitely help the embedded people, not
> sure about VIRT_CPU_ACCOUNTING though, I bet the x86 virt weird^Wpeople
> would like it too -- if only to strive for feature parity if nothing
> else :/
> 
> Its just that I'm not at all convinced its the best approach to solve
> the problem posed, and once its committed we're stuck with it due to
> ABI.
> 
> We should be very careful not to die a death of thousand cuts with all
> this accounting madness, there's way too many weird-ass process
> accounting junk that adds ABI constraints as it is.
> 
> I think its definitely worth investing extra time to implement these
> tracepoints if at all possible on your architecture before committing
> yourself to something like this.

I don't think it is a choice between per-task steal time and tracepoints.
Ideally we'd have both. But I understand that you want to be careful about
committing to an ABI. From my viewpoint per-task steal is a logical
extension; the data it is based on is already there.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



* Re: [RFC][PATCH v2 4/7] taskstats: Add per task steal time accounting
  2010-11-16 15:33                     ` Martin Schwidefsky
@ 2010-11-16 15:45                       ` Peter Zijlstra
  2010-11-16 16:05                         ` Martin Schwidefsky
  2010-11-16 16:38                         ` Avi Kivity
  0 siblings, 2 replies; 52+ messages in thread
From: Peter Zijlstra @ 2010-11-16 15:45 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Heiko Carstens,
	Roland McGrath, linux-kernel, linux-s390, jeremy.fitzhardinge,
	Avi Kivity

On Tue, 2010-11-16 at 16:33 +0100, Martin Schwidefsky wrote:
> > Oh, you guys don't have a hypercall wrapper to exploit? Because from
> > what I heard from the kvm/xen/lguest people I gathered they could in
> > fact do something like I proposed.
> 
> We could do something along the lines of a hypercall wrapper (we would call
> it a diagnose wrapper, same thing). The diagnoses we have on s390 do vastly
> different things, so it is not easy to have a common diagnose wrapper.
> It would be easier to add a tracepoint for each diagnose inline assembly.

Right, bummer that.

> > In fact, kvm seems to already have these tracepoints: kvm_exit/kvm_entry
> > and it has a separate explicit hypercall tracepoint as well:
> > kvm_hypercall.
> 
> But the kvm tracepoints are used when Linux is the hypervisor, no? For our
> situation that would be a tracepoint in z/VM - or the equivalent. This is
> out of scope of this patch.

Ah crud, you might be right... Avi, could a kvm guest generate events on
vcpu enter/exit?

> > Except that the per-task steal time gives you a lot less detail; being
> > able to profile on vcpu exit/enter gives you a much more powerful
> > performance tool. Aside from being able to measure the steal-time it
> > allows you to instantly find hypercalls (both explicit and
> > implicit), so you can also measure the hypercall-induced steal-time as
> > well.
> 
> Yes and no. The tracepoint idea looks interesting in itself. But that does
> not completely replace the per-task steal time. The hypervisor can take
> away the cpu anytime; it is still interesting to know which task was hit
> hardest by that. You could view the cpu time lost by a hypercall as
> "synchronous" steal time for the task, and the remaining delta to the total
> per-task steal time as "asynchronous" steal time.

Right, so there is no way the guest knows about the vcpu getting
scheduled; it can only derive the fact from hardware clocks after the
fact?


* Re: [RFC][PATCH v2 4/7] taskstats: Add per task steal time accounting
  2010-11-16 15:45                       ` Peter Zijlstra
@ 2010-11-16 16:05                         ` Martin Schwidefsky
  2010-11-16 18:39                           ` Jeremy Fitzhardinge
  2010-11-16 16:38                         ` Avi Kivity
  1 sibling, 1 reply; 52+ messages in thread
From: Martin Schwidefsky @ 2010-11-16 16:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Heiko Carstens,
	Roland McGrath, linux-kernel, linux-s390, jeremy.fitzhardinge,
	Avi Kivity

On Tue, 16 Nov 2010 16:45:29 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> > > Except that the per-task steal time gives you a lot less detail; being
> > > able to profile on vcpu exit/enter gives you a much more powerful
> > > performance tool. Aside from being able to measure the steal-time it
> > > allows you to instantly find hypercalls (both explicit and
> > > implicit), so you can also measure the hypercall-induced steal-time as
> > > well.
> > 
> > Yes and no. The tracepoint idea looks interesting in itself. But that does
> > not completely replace the per-task steal time. The hypervisor can take
> > away the cpu anytime; it is still interesting to know which task was hit
> > hardest by that. You could view the cpu time lost by a hypercall as
> > "synchronous" steal time for the task, and the remaining delta to the total
> > per-task steal time as "asynchronous" steal time.
> 
> Right, so there is no way the guest knows about the vcpu getting
> scheduled; it can only derive the fact from hardware clocks after the
> fact?

Correct.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



* Re: [RFC][PATCH v2 4/7] taskstats: Add per task steal time accounting
  2010-11-16 15:45                       ` Peter Zijlstra
  2010-11-16 16:05                         ` Martin Schwidefsky
@ 2010-11-16 16:38                         ` Avi Kivity
  2010-11-16 16:43                           ` Peter Zijlstra
  1 sibling, 1 reply; 52+ messages in thread
From: Avi Kivity @ 2010-11-16 16:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Martin Schwidefsky, Michael Holzheu, Shailabh Nagar,
	Andrew Morton, Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar,
	Oleg Nesterov, John stultz, Thomas Gleixner, Balbir Singh,
	Heiko Carstens, Roland McGrath, linux-kernel, linux-s390,
	jeremy.fitzhardinge

On 11/16/2010 05:45 PM, Peter Zijlstra wrote:
>
>>> In fact, kvm seems to already have these tracepoints: kvm_exit/kvm_entry
>>> and it has a separate explicit hypercall tracepoint as well:
>>> kvm_hypercall.
>> But the kvm tracepoints are used when Linux is the hypervisor, no? For our
>> situation that would be a tracepoint in z/VM - or the equivalent. This is
>> out of scope of this patch.
> Ah crud, you might be right... Avi, could a kvm guest generate events on
> vcpu enter/exit?

No.  Hypercalls are voluntary and known, but most exits are involuntary 
and unknown to the guest.  Any memory access can generate a page fault, 
and any host interrupt will exit the guest.


* Re: [RFC][PATCH v2 4/7] taskstats: Add per task steal time accounting
  2010-11-16 16:38                         ` Avi Kivity
@ 2010-11-16 16:43                           ` Peter Zijlstra
  2010-11-16 16:56                             ` Avi Kivity
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2010-11-16 16:43 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Martin Schwidefsky, Michael Holzheu, Shailabh Nagar,
	Andrew Morton, Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar,
	Oleg Nesterov, John stultz, Thomas Gleixner, Balbir Singh,
	Heiko Carstens, Roland McGrath, linux-kernel, linux-s390,
	jeremy.fitzhardinge

On Tue, 2010-11-16 at 18:38 +0200, Avi Kivity wrote:
> On 11/16/2010 05:45 PM, Peter Zijlstra wrote:
> >
> >>> In fact, kvm seems to already have these tracepoints: kvm_exit/kvm_entry
> >>> and it has a separate explicit hypercall tracepoint as well:
> >>> kvm_hypercall.
> >> But the kvm tracepoints are used when Linux is the hypervisor, no? For our
> >> situation that would be a tracepoint in z/VM - or the equivalent. This is
> >> out of scope of this patch.
> > Ah crud, you might be right... Avi, could a kvm guest generate events on
> > vcpu enter/exit?
> 
> No.  Hypercalls are voluntary and known, but most exits are involuntary 
> and unknown to the guest.  Any memory access can generate a page fault, 
> and any host interrupt will exit the guest.

Right, but couldn't we make the guest jump to a special stub on vcpu
enter? I guess we could, simply because we have the hypervisor under
control.


* Re: [RFC][PATCH v2 4/7] taskstats: Add per task steal time accounting
  2010-11-16 16:43                           ` Peter Zijlstra
@ 2010-11-16 16:56                             ` Avi Kivity
  2010-11-16 17:06                               ` Avi Kivity
  0 siblings, 1 reply; 52+ messages in thread
From: Avi Kivity @ 2010-11-16 16:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Martin Schwidefsky, Michael Holzheu, Shailabh Nagar,
	Andrew Morton, Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar,
	Oleg Nesterov, John stultz, Thomas Gleixner, Balbir Singh,
	Heiko Carstens, Roland McGrath, linux-kernel, linux-s390,
	jeremy.fitzhardinge

On 11/16/2010 06:43 PM, Peter Zijlstra wrote:
>
>> No.  Hypercalls are voluntary and known, but most exits are involuntary
>> and unknown to the guest.  Any memory access can generate a page fault,
>> and any host interrupt will exit the guest.
> Right, but couldn't we make the guest jump to a special stub on vcpu
> enter? I guess we could, simply because we have the hypervisor under
> control.

No, some entries inject an interrupt or exception, so the next rip value 
is unknown (without doing a lot of extra slow calculations).


* Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
  2010-11-13 18:38   ` Oleg Nesterov
  2010-11-15 15:55     ` Martin Schwidefsky
@ 2010-11-16 16:57     ` Michael Holzheu
  2010-11-18 17:10       ` Oleg Nesterov
  2010-11-16 17:34     ` Michael Holzheu
  2 siblings, 1 reply; 52+ messages in thread
From: Michael Holzheu @ 2010-11-16 16:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Peter Zijlstra, Ingo Molnar, John stultz,
	Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath, linux-kernel, linux-s390

Hello Oleg,

On Sat, 2010-11-13 at 19:38 +0100, Oleg Nesterov wrote:
> First of all, let me repeat, I am not going to discuss whether we need
> these changes or not, I do not know even if I understand your motivation.

Sorry if I was not clear enough with my descriptions. Let me try to
describe my motivation again:

The cumulative accounting patch implements an infrastructure that makes
it possible to collect all CPU time usage between two task snapshots
without using task exit events. The main idea is to show cumulative CPU
time fields for each task in the top command that contain the CPU time
of children that died in the last interval.

Example 1 (simple case):
A "make" process forked several gcc processes after snapshot 1 and all
of them exited before snapshot 2. We subtract the cumulative CPU times
of "make" of snapshot 2 from the cumulative times of "make" of snapshot
1. The result will be the consumed CPU time of the dead gcc processes in
the last interval. The value is that we can see in top that "make" is
"responsible" for this CPU time.

Example 2:
We have the gcc processes in snapshot 1 but not in snapshot 2. Then the
top command has to find the nearest relative (e.g. the parent process)
that is still alive in snapshot 2, compute the delta of the cumulative
time for this process and subtract the CPU time of the gcc processes of
snapshot 1. This gives you the CPU time that was consumed by the gcc
processes between snapshot 1 and 2.

With your help we identified two problems that make this approach
impossible or at least not exact with the current Linux implementation:
1. Cumulative CPU time counters are not complete (SA_NOCLDWAIT)
2. Because of reparent to init, there are situations where it is
   not clear to which tasks the CPU time of dead tasks between
   two snapshots has been accounted. This is a problem for example 2.

The patch tries to solve the problem by adding a second set of
cumulative data that contains all CPU time of dead children, and by adding
the parallel hierarchy to make it unambiguous which parent got the CPU time
of dead tasks (needed for example 2).
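
As a sketch, the change in the signal_struct would have roughly this
shape (cdata_acct and acct_parent are the names used in this discussion;
cdata_wait is made up):

	struct signal_struct {
		...
		struct cdata cdata_wait;	/* children waited for (POSIX semantics) */
		struct cdata cdata_acct;	/* all dead children, always complete */
		struct task_struct *acct_parent; /* parallel accounting hierarchy */
		...
	};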

I hope that you now understand the value and the motivation for fixing
the two problems.

I know that new userspace APIs should be added with care and should be
avoided when the value of a new function is not big enough. I can also
understand Peter's objections and your concerns. So is the value
of the new function big enough?

I will answer your other technical comments in a separate mail.

Michael



* Re: [RFC][PATCH v2 4/7] taskstats: Add per task steal time accounting
  2010-11-16 16:56                             ` Avi Kivity
@ 2010-11-16 17:06                               ` Avi Kivity
  0 siblings, 0 replies; 52+ messages in thread
From: Avi Kivity @ 2010-11-16 17:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Martin Schwidefsky, Michael Holzheu, Shailabh Nagar,
	Andrew Morton, Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar,
	Oleg Nesterov, John stultz, Thomas Gleixner, Balbir Singh,
	Heiko Carstens, Roland McGrath, linux-kernel, linux-s390,
	jeremy.fitzhardinge

On 11/16/2010 06:56 PM, Avi Kivity wrote:
> On 11/16/2010 06:43 PM, Peter Zijlstra wrote:
>>
>>> No.  Hypercalls are voluntary and known, but most exits are involuntary
>>> and unknown to the guest.  Any memory access can generate a page fault,
>>> and any host interrupt will exit the guest.
>> Right, but couldn't we make the guest jump to a special stub on vcpu
>> enter? I guess we could, simply because we have the hypervisor under
>> control.
>
> No, some entries inject an interrupt or exception, so the next rip 
> value is unknown (without doing a lot of extra slow calculations).

Also, consider an exit within the stub (if the stub is paged out, for 
example).

-- 
error compiling committee.c: too many arguments to function



* Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
  2010-11-13 18:38   ` Oleg Nesterov
  2010-11-15 15:55     ` Martin Schwidefsky
  2010-11-16 16:57     ` Michael Holzheu
@ 2010-11-16 17:34     ` Michael Holzheu
  2010-11-16 17:50       ` Oleg Nesterov
  2010-11-18 16:34       ` Oleg Nesterov
  2 siblings, 2 replies; 52+ messages in thread
From: Michael Holzheu @ 2010-11-16 17:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Peter Zijlstra, Ingo Molnar, John stultz,
	Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath, linux-kernel, linux-s390

Hello Oleg,

On Sat, 2010-11-13 at 19:38 +0100, Oleg Nesterov wrote:
> I already asked you to split these changes, perhaps you can do this?
> Say, bacct_add_tsk() looks overcomplicated, the change in copy_process()
> shouldn't introduce the new CLONE_THREAD check, not sure I understand
> why release_task() was chosen for reparenting, other nits...

I want to establish the new hierarchy when a new process is forked and
not for new threads, therefore the check for CLONE_THREAD in
copy_process(). I do the reparenting with reparent_acct() when a process
dies, therefore the check for "group_dead" in exit_signal().

> But it is not easy to discuss these completely different things
> looking at the single patch.
> 
> Imho, it would be much better if you make a separate patch which
> adds acct_parent/etc and implements the parallel hierarchy. This
> also makes sense because it touches the core kernel.
> 
> Personally I think that even "struct cdata" + __account_ctime() helper
> needs a separate patch, and perhaps this change makes sense by itself
> as cleanup. And this way the "trivial" changes (like the changes in
> k_getrusage) won't distract from the functional changes.
>
> The final change should introduce cdata_acct and actually implement
> the complete cumulative accounting.

So you want to have the following three patches:
* Introduce "struct cdata" + __account_ctime() (no functional change)
* Add cdata_acct accounting + parallel accounting hierarchy
* Add taskstats interface to export the data to userspace

Correct?

Michael



* Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
  2010-11-16 17:34     ` Michael Holzheu
@ 2010-11-16 17:50       ` Oleg Nesterov
  2010-11-18 16:34       ` Oleg Nesterov
  1 sibling, 0 replies; 52+ messages in thread
From: Oleg Nesterov @ 2010-11-16 17:50 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Peter Zijlstra, Ingo Molnar, John stultz,
	Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath, linux-kernel, linux-s390

Hi Michael,

I can't read this discussion today; I will try to do so tomorrow and reply.
Just one note,

On 11/16, Michael Holzheu wrote:
>
> So you want to have the following three patches:
> * Introduce "struct cdata" + __account_ctime() (no functional change)
> * Add cdata_acct accounting + parallel accounting hierarchy
> * Add taskstats interface to export the data to userspace

Something like this (if possible).

But my main concern is the parallel hierarchy itself. So, if you ask me,
I'd prefer to see a separate patch which only adds acct_parent/etc
and implements this reparenting, even without cdata_acct accounting.

Again, if you think this is possible. I understand that it is not
easy to maintain this series.

Oleg.



* Re: [RFC][PATCH v2 4/7] taskstats: Add per task steal time accounting
  2010-11-16 16:05                         ` Martin Schwidefsky
@ 2010-11-16 18:39                           ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 52+ messages in thread
From: Jeremy Fitzhardinge @ 2010-11-16 18:39 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Peter Zijlstra, Michael Holzheu, Shailabh Nagar, Andrew Morton,
	Venkatesh Pallipadi, Suresh Siddha, Ingo Molnar, Oleg Nesterov,
	John stultz, Thomas Gleixner, Balbir Singh, Heiko Carstens,
	Roland McGrath, linux-kernel, linux-s390, Avi Kivity

On 11/16/2010 08:05 AM, Martin Schwidefsky wrote:
>>> Yes and no. The tracepoint idea looks interesting in itself. But that does
>>> not completely replace the per-task steal time. The hypervisor can take
>>> away the cpu anytime; it is still interesting to know which task was hit
>>> hardest by that. You could view the cpu time lost by a hypercall as
>>> "synchronous" steal time for the task, and the remaining delta to the total
>>> per-task steal time as "asynchronous" steal time.
>> Right, so there is no way the guest knows about the vcpu getting
>> scheduled; it can only derive the fact from hardware clocks after the
>> fact?
> Correct.

Yes, same for Xen.

    J



* Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
  2010-11-16 17:34     ` Michael Holzheu
  2010-11-16 17:50       ` Oleg Nesterov
@ 2010-11-18 16:34       ` Oleg Nesterov
  1 sibling, 0 replies; 52+ messages in thread
From: Oleg Nesterov @ 2010-11-18 16:34 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Peter Zijlstra, Ingo Molnar, John stultz,
	Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath, linux-kernel, linux-s390

On 11/16, Michael Holzheu wrote:
>
> On Sat, 2010-11-13 at 19:38 +0100, Oleg Nesterov wrote:
> > I already asked you to split these changes, perhaps you can do this?
> > Say, bacct_add_tsk() looks overcomplicated, the change in copy_process()
> > shouldn't introduce the new CLONE_THREAD check, not sure I understand
> > why release_task() was chosen for reparenting, other nits...
>
> I want to establish the new hierarchy when a new process is forked and
> not for new threads, therefore the check for CLONE_THREAD in
> copy_process().

Yes, but copy_process() already checks CLONE_THREAD many times. No
need to introduce the new check.

> I do the reparenting with reparent_acct() when a process
> dies, therefore the check for "group_dead" in exit_signal().

And it is not clear to me why release_task() is better than
exit_notify().


That said, perhaps I'll understand this when reading the next version.
That is why I asked to split.

Oleg.



* Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
  2010-11-16 16:57     ` Michael Holzheu
@ 2010-11-18 17:10       ` Oleg Nesterov
  2010-11-19 19:46         ` Michael Holzheu
  0 siblings, 1 reply; 52+ messages in thread
From: Oleg Nesterov @ 2010-11-18 17:10 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Peter Zijlstra, Ingo Molnar, John stultz,
	Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath, linux-kernel, linux-s390

Sorry for the delay.

On 11/16, Michael Holzheu wrote:
>
> On Sat, 2010-11-13 at 19:38 +0100, Oleg Nesterov wrote:
> > First of all, let me repeat, I am not going to discuss whether we need
> > these changes or not; I do not even know if I understand your motivation.
>
> Sorry if I was not clear enough with my descriptions. Let me try to
> describe my motivation again:

Yes, more or less I see what this patch does (I hope ;).

> 2. Because of reparent to init, there are situations where it is
>    not clear to which tasks the CPU time of dead tasks between
>    two snapshots has been accounted. This is a problem for example 2.

Yes, I see.

But I must admit, _personally_ I am not sure this problem is worth the
trouble. And I already mentioned the daemonize() case, IOW I am not sure
it is _always_ right to choose exiting_parent->parent for accounting.
To me, this can be equally confusing. A user sees the running daemon
with ppid = 1, then this daemon exits and top reports the "unrelated"
process as the cpu consumer.


But once again: I am _not_ against this patch. I never know when it
comes to new features. Certainly you know better whether this suits top.

What I actually meant is: dear CC list, please look at this change
and comment ;)

Oleg.



* Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
  2010-11-18 17:10       ` Oleg Nesterov
@ 2010-11-19 19:46         ` Michael Holzheu
  0 siblings, 0 replies; 52+ messages in thread
From: Michael Holzheu @ 2010-11-19 19:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi,
	Suresh Siddha, Peter Zijlstra, Ingo Molnar, John stultz,
	Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
	Heiko Carstens, Roland McGrath, linux-kernel, linux-s390

Hello Oleg,

On Thu, 2010-11-18 at 18:10 +0100, Oleg Nesterov wrote:
> > 2. Because of reparent to init, there are situations where it is
> >    not clear to which tasks the CPU time of dead tasks between
> >    two snapshots has been accounted. This is a problem for example 2.
> 
> Yes, I see.
> 
> But I must admit, _personally_ I am not sure this problem is worth the
> trouble.

I think you are right...

For that function, the introduced overhead, the additional code and
especially the possible confusion with two process hierarchies are not
worth the effort. Maybe we have a chance to solve the reparent problem
by introducing task exit event filters (e.g. give me all exit events for
processes that have init as parent).

So I will send no more patches with the parallel hierarchy:
good news for you :-)

But for the second problem with the forgotten CPU time, I would like to
send you a new patch set, separated as you have requested. Although I
personally think that there, too, we probably have little chance of getting
them accepted upstream, because the signal_struct size is increased and
some cycles are added to the task exit processing.

It would be nice if you could find some time to look at the patches,
but no hurry!

Michael






