Subject: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
From: Michael Holzheu
Reply-To: holzheu@linux.vnet.ibm.com
To: Shailabh Nagar, Andrew Morton, Venkatesh Pallipadi, Suresh Siddha,
    Peter Zijlstra, Ingo Molnar, Oleg Nesterov, John Stultz,
    Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens
Cc: linux-kernel@vger.kernel.org, linux-s390@vger.kernel.org
Organization: IBM
Date: Thu, 23 Sep 2010 15:48:01 +0200
Message-ID: <1285249681.1837.28.camel@holzheu-laptop>

Currently tools like "top" gather the task information by reading procfs
files. This has several disadvantages:

* It is very CPU intensive, because a lot of system calls (readdir, open,
  read, close) are necessary.
* No real task snapshot can be provided, because while the procfs files are
  read the system continues running.
* The procfs times granularity is restricted to jiffies.

In parallel to procfs there exists the taskstats binary interface that uses
netlink sockets as transport mechanism to deliver task information to user
space. There exists a taskstats command "TASKSTATS_CMD_ATTR_PID" to get task
information for a given PID. This command can already be used for tools like
top, but it also has several disadvantages:

* You first have to find out which PIDs are available in the system.
  Currently we have to use procfs again to do this.
* For each task two system calls have to be issued (first send the command
  and then receive the reply).
* No snapshot mechanism is available.

GOALS OF THIS PATCH SET
-----------------------
The intention of this patch set is to provide better support for tools like
top. The goal is to:

* provide a task snapshot mechanism where we can get a consistent view of
  all running tasks.
* provide a transport mechanism that does not require a lot of system calls
  and that allows implementing low CPU overhead task monitoring.
* provide microsecond CPU time granularity.

FIRST RESULTS
-------------
Together with this kernel patch set, user space code for a new top utility
(ptop) is also provided that exploits the new kernel infrastructure. See
patch 10 for more details.

TEST1: System with many sleeping tasks

  for ((i=0; i < 1000; i++))
  do
         sleep 1000000 &
  done

  # ptop_new_proc

               VVVV
    pid   user  sys   ste  total  Name
    (#)    (%)  (%)   (%)    (%)  (str)
    541   0.37 2.39  0.10   2.87  top
   3743   0.03 0.05  0.00   0.07  ptop_new_proc
               ^^^^

Compared to the old top command that has to scan more than 1000 proc
directories the new ptop consumes much less CPU time (0.05% system time on
my s390 system).

TEST2: Show snapshot consistency with system that is 100% busy

  System with 3 CPUs:

  for ((i=0; i < $(cat /proc/cpuinfo | grep "^processor" | wc -l); i++))
  do
       ./loop &
  done

  # ptop_snap_proc

            VVVV  VVV  VVV                          VVVVV
    pid     user  sys  ste cuser  csys  cste delay  total Elap+ Name
    (#)      (%)  (%)  (%)   (%)   (%)   (%)   (%)    (%)  (hm) (str)
  23891    99.84 0.06 0.09  0.00  0.00  0.00  0.01  99.99  0:00 loop
  23881    99.66 0.06 0.09  0.00  0.00  0.00  0.20  99.81  0:00 loop
  23886    99.65 0.06 0.09  0.00  0.00  0.00  0.20  99.80  0:00 loop
   2413     0.00 0.00 0.00  0.00  0.00  0.00  0.00   0.01  4:17 sshd
  ...
  V:V:S   299.36 0.36 0.27  0.00  0.00  0.00  0.40 300.00  4:22
                                                   ^^^^^^

With the snapshot mechanism the sum of all tasks' CPU times (user + system +
steal) will be exactly 300.00% CPU time with this testcase. Using
ptop_snap_proc (see patch 10) this works fine on s390.

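The consistency property from TEST2 can be checked mechanically. The
following is only a minimal sketch, not code from the patch set: the
function name snapshot_is_consistent() and the tolerance value are
assumptions made for illustration. The numbers are copied from the "V:V:S"
totals row of the ptop_snap_proc output. Because that output is rounded to
two decimals, the sketch compares against 100% per CPU with a small
tolerance instead of testing for exact equality.

```python
# Totals copied from the "V:V:S" row of the TEST2 output (3 CPUs, 100% busy).
NUM_CPUS = 3
TOTALS_ROW = {"user": 299.36, "sys": 0.36, "ste": 0.27}


def snapshot_is_consistent(totals, num_cpus, tolerance=0.05):
    """Return True if user + sys + steal accounts for all CPU capacity.

    'tolerance' is a hypothetical allowance (in percent) for the rounding
    of the displayed values to two decimals.
    """
    accounted = totals["user"] + totals["sys"] + totals["ste"]
    expected = 100.0 * num_cpus
    return abs(accounted - expected) <= tolerance


if __name__ == "__main__":
    print(snapshot_is_consistent(TOTALS_ROW, NUM_CPUS))
```

With values gathered by scanning procfs task by task, such a check is not
expected to hold in general, because the tasks are sampled at different
points in time.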
PATCHSET OVERVIEW
-----------------
The code is not final and still has a few TODOs. But it is good enough for a
first round of review. The following kernel patches are provided:

[01] Prepare-0: Use real microsecond granularity for taskstats CPU times.
[02] Prepare-1: Restructure taskstats.c in order to be able to add new
     commands more easily.
[03] Prepare-2: Separate the finding of a task_struct by PID or TGID from
     filling the taskstats.
[04] Add new command "TASKSTATS_CMD_ATTR_PIDS" to get a snapshot of multiple
     tasks.
[05] Add procfs interface for taskstats commands. This allows getting a
     complete and consistent snapshot with all tasks using two system calls
     (ioctl and read). Transferring a snapshot of all running tasks is not
     possible using the existing netlink interface, because there the socket
     buffer size is the restricting factor.
[06] Add TGID to taskstats.
[07] Add steal time per task accounting.
[08] Add cumulative CPU time (user, system and steal) to taskstats.
[09] Fix exit CPU time accounting.
[10] Besides the kernel patches, user space code is also provided that
     exploits the new kernel infrastructure. The user space code provides
     the following:
     1. A proposal for a taskstats user space library:
        1.1 Based on netlink (requires libnl-devel-1.1-5)
        1.2 Based on the new /proc/taskstats interface (see [05])
     2. A proposal for a task snapshot library based on taskstats library
        (1.1)
     3. A new tool "ptop" (precise top) that uses the libraries
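
For comparison, the conventional procfs scan that [04] and [05] are meant to
replace can be sketched roughly as follows. This is an illustration only and
not part of the patch set: scan_proc() and ticks_to_usec() are hypothetical
helper names, and the system call count is a rough lower bound (one
directory scan plus open, read and close per task). The sketch relies on
utime and stime being fields 14 and 15 of /proc/<pid>/stat, reported in
clock ticks.

```python
import os


def ticks_to_usec(ticks, hz):
    """Convert clock ticks to microseconds for a tick rate of 'hz'."""
    return ticks * 1_000_000 // hz


def scan_proc(proc_root="/proc"):
    """Read utime/stime of every task from <proc_root>/<pid>/stat.

    Returns (tasks, syscalls): tasks is a list of (pid, utime, stime) in
    clock ticks, syscalls is a rough lower bound on the system calls used.
    """
    tasks = []
    syscalls = 1  # at least one directory scan to find the PIDs
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                data = f.read()
        except OSError:
            continue  # task exited between the directory scan and open()
        syscalls += 3  # open + read + close for this task
        # The command name (field 2) is in parentheses and may contain
        # spaces, so split only what follows the last ')'.
        fields = data[data.rindex(")") + 2:].split()
        # fields[0] is field 3 (state), so field 14 is at index 11.
        tasks.append((int(entry), int(fields[11]), int(fields[12])))
    return tasks, syscalls


if __name__ == "__main__" and os.path.isdir("/proc"):
    hz = os.sysconf("SC_CLK_TCK")
    tasks, syscalls = scan_proc()
    print(f"{len(tasks)} tasks, at least {syscalls} system calls, "
          f"one tick = {ticks_to_usec(1, hz)} usec")
```

The system call count grows linearly with the number of tasks, and the tasks
are read one after another while the system keeps running, so the result is
not a consistent snapshot. With a tick rate of 100 per second the times
cannot be resolved finer than 10000 microseconds.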