linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [0/1][ANNOUNCE] nproc v2: netlink access to /proc information
@ 2004-09-08 18:40 Roger Luethi
  2004-09-08 18:41 ` [1/1][PATCH] " Roger Luethi
  2004-09-16 21:43 ` nproc: So? Roger Luethi
  0 siblings, 2 replies; 63+ messages in thread
From: Roger Luethi @ 2004-09-08 18:40 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Albert Cahalan, William Lee Irwin III, Martin J. Bligh, Paul Jackson

I am submitting nproc, a new netlink interface to process information,
for review and a possible inclusion in mainline.

The problems with /proc as far as parsers go are widely known. Parsing is
both difficult and slow (including a more detailed discussion by reference:
http://marc.theaimsgroup.com/?l=linux-kernel&m=109361019528995). What
follows is an overview showing how nproc fares in those areas.

Roger

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Clean Interface
---------------
The main motivation was to clean up the mess that are /proc semantics
and provide a clean interface for tools to gather process information.

Nproc does not add new knowledge to the kernel (some redundancy remains
until routines are shared with /proc). Instead, it offers existing
information in a form that works for tools. In fact, a tool can pass
the buffer read from the netlink directly as a va_list to vprintf
(strings require a trivial extra operation).

A small user-space app can present a view like the one below based on
zero prior knowledge about the fields the kernel has to offer. While I
don't envision that as common for tools in the future, it demonstrates
what can be done with little effort. This is not a mock-up, by the way,
the nprocdemo tool exists (lines truncated to fit 80 chars).

MemFree |PageSize|Jiffies   |nr_dirty|nr_writeback|nr_unstable|[...]
____page|____byte|__________|____page|________page|_______page|[...]
    7546|    4096|   1917203|       1|           0|          0|[...]

PID  |Name           |VmSize  |VmLock  |VmRSS   |VmData  |VmStack |[...]
_____|_______________|_____KiB|_____KiB|_____KiB|_____KiB|_____KiB|[...]
    1|init           |    1340|       0|     468|     144|       4|[...]
    2|ksoftirqd/0    |       0|       0|       0|       0|       0|[...]
    3|events/0       |       0|       0|       0|       0|       0|[...]
    4|khelper        |       0|       0|       0|       0|       0|[...]
    5|netlink/0      |       0|       0|       0|       0|       0|[...]
    6|kacpid         |       0|       0|       0|       0|       0|[...]
   23|kblockd/0      |       0|       0|       0|       0|       0|[...]
   24|khubd          |       0|       0|       0|       0|       0|[...]
   36|pdflush        |       0|       0|       0|       0|       0|[...]
   37|pdflush        |       0|       0|       0|       0|       0|[...]
   38|kswapd0        |       0|       0|       0|       0|       0|[...]
   39|aio/0          |       0|       0|       0|       0|       0|[...]
  671|kseriod        |       0|       0|       0|       0|       0|[...]
  686|reiserfs/0     |       0|       0|       0|       0|       0|[...]
  851|udevd          |    1320|       0|     360|     144|       4|[...]
 9159|syslogd        |    1516|       0|     588|     272|      16|[...]
 9382|gpm            |    1540|       0|     468|     152|       4|[...]
 9452|klogd          |    1468|       0|     432|     276|       8|[...]
 9478|hddtemp        |    1692|       0|     848|     472|      16|[...]
 9486|login          |    2152|       0|    1204|     392|      36|[...]
 9487|agetty         |    1340|       0|     488|     156|       4|[...]
 9488|agetty         |    1340|       0|     488|     156|       4|[...]
 9489|agetty         |    1340|       0|     488|     156|       4|[...]
 9490|agetty         |    1340|       0|     488|     156|       4|[...]
 9491|agetty         |    1340|       0|     488|     156|       4|[...]
 9598|zsh            |    4748|       0|    1688|     532|      20|[...]
[...]

Performance
-----------
I measured the time to write a complete process table dump for 5000
tasks to /dev/null 100 times for "ps ax" and nprocdemo.

ps ax     (5 process fields):
real    1m0.472s
user    0m18.227s
sys     0m28.545s

nprocdemo (automatic field discovery, reading and printing 11 process
           fields + 9 global fields):
real    0m9.064s
user    0m2.491s
sys     0m1.554s

The details of resource usage for the benchmarks show that /proc based
tools are suffering badly from the inefficiency of three(!) conversions
between data and strings (kernel produces strings from numbers, app
converts back to numbers, app converts numbers again to strings for
printing).

For nproc based tools, only one conversion remains.

# ps ax > /dev/null
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        image name           app name   symbol name
6524     14.0613  vmlinux              ps         number
4828     10.4058  libc-2.3.3.so        ps         _IO_vfscanf_internal
2740      5.9056  vmlinux              ps         vsnprintf
2689      5.7956  vmlinux              ps         proc_pid_stat
1807      3.8946  vmlinux              ps         __d_lookup
1676      3.6123  libc-2.3.3.so        ps         ____strtol_l_internal
1335      2.8773  vmlinux              ps         link_path_walk
1133      2.4420  libproc-3.2.3.so     ps         status2proc
1094      2.3579  vmlinux              ps         render_sigset_t
1088      2.3450  libc-2.3.3.so        ps         _IO_vfprintf_internal
1086      2.3407  libc-2.3.3.so        ps         __GI_strchr
885       1.9075  libc-2.3.3.so        ps         ____strtoul_l_internal
800       1.7242  vmlinux              ps         pid_revalidate
581       1.2522  vmlinux              ps         proc_pid_status
551       1.1876  libc-2.3.3.so        ps         _IO_sputbackc_internal
529       1.1402  vmlinux              ps         system_call
524       1.1294  libc-2.3.3.so        ps         _IO_default_xsputn_internal
476       1.0259  libc-2.3.3.so        ps         __i686.get_pc_thunk.bx
466       1.0044  vmlinux              ps         get_tgid_list
442       0.9526  vmlinux              ps         atomic_dec_and_lock
373       0.8039  vmlinux              ps         dput
311       0.6703  libc-2.3.3.so        ps         __GI___strtol_internal
274       0.5906  vmlinux              ps         __copy_to_user_ll
272       0.5862  vmlinux              ps         path_lookup
270       0.5819  vmlinux              ps         strncpy_from_user
262       0.5647  libproc-3.2.3.so     ps         escape_str
259       0.5582  vmlinux              ps         page_address
249       0.5367  libc-2.3.3.so        ps         __GI_____strtoull_l_internal
244       0.5259  libc-2.3.3.so        ps         __GI_strlen

# nprocdemo > /dev/null
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        image name           app name   symbol name
1142     15.9208  libc-2.3.3.so        nprocdemo  _IO_vfprintf_internal
1072     14.9449  vmlinux              vmlinux    __task_mem
611       8.5181  libc-2.3.3.so        nprocdemo  _IO_new_file_xsputn
445       6.2038  vmlinux              vmlinux    nproc_pid_fields
244       3.4016  vmlinux              vmlinux    get_wchan
235       3.2762  vmlinux              nprocdemo  __copy_to_user_ll
233       3.2483  vmlinux              vmlinux    find_pid
215       2.9974  vmlinux              vmlinux    finish_task_switch
208       2.8998  vmlinux              nprocdemo  netlink_recvmsg
158       2.2027  vmlinux              nprocdemo  __wake_up
153       2.1330  libc-2.3.3.so        nprocdemo  __find_specmb
149       2.0772  vmlinux              nprocdemo  finish_task_switch
146       2.0354  libc-2.3.3.so        nprocdemo  __i686.get_pc_thunk.bx
114       1.5893  vmlinux              vmlinux    get_task_mm
94        1.3105  vmlinux              nprocdemo  skb_release_data
87        1.2129  vmlinux              vmlinux    nproc_ps_do_pid
76        1.0595  vmlinux              vmlinux    alloc_skb
72        1.0038  vmlinux              nprocdemo  system_call
68        0.9480  libc-2.3.3.so        nprocdemo  _IO_padn_internal
65        0.9062  libc-2.3.3.so        nprocdemo  read_int
64        0.8922  libc-2.3.3.so        nprocdemo  __recv
63        0.8783  vmlinux              vmlinux    netlink_attachskb
61        0.8504  vmlinux              nprocdemo  kfree
56        0.7807  vmlinux              vmlinux    __kmalloc
55        0.7668  vmlinux              vmlinux    schedule
47        0.6552  vmlinux              vmlinux    __task_mem_cheap
42        0.5855  vmlinux              nprocdemo  sys_socketcall
40        0.5576  vmlinux              nprocdemo  fget
37        0.5158  nprocdemo            nprocdemo  nproc_get_reply

EOT

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-08 18:40 [0/1][ANNOUNCE] nproc v2: netlink access to /proc information Roger Luethi
@ 2004-09-08 18:41 ` Roger Luethi
  2004-09-09  0:35   ` William Lee Irwin III
  2004-09-09 11:53   ` Stephen Smalley
  2004-09-16 21:43 ` nproc: So? Roger Luethi
  1 sibling, 2 replies; 63+ messages in thread
From: Roger Luethi @ 2004-09-08 18:41 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Albert Cahalan, William Lee Irwin III, Martin J. Bligh, Paul Jackson


A few notes:
- Access control can be implemented easily. Right now it would be bloat,
  though -- the vast majority of fields in /proc are world-readable
  (/proc/pid/environ being the notable exception).

- Additional process selectors (e.g. select by UID) are not hard to
  add, either, should there ever be a need.

- There are a few things I'm not sure about: For instance, what is a good
  return value for mm_struct related fields wrt kernel threads? I picked
  0, but ~(0) might be preferable because it's distinct.

Signed-off-by: Roger Luethi <rl@hellgate.ch>

diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-bk13/include/linux/netlink.h linux-2.6.9-rc1-bk13-nproc/include/linux/netlink.h
--- linux-2.6.9-rc1-bk13/include/linux/netlink.h	2004-09-06 18:53:17.000000000 +0200
+++ linux-2.6.9-rc1-bk13-nproc/include/linux/netlink.h	2004-09-06 19:50:56.000000000 +0200
@@ -15,6 +15,7 @@
 #define NETLINK_ARPD		8
 #define NETLINK_AUDIT		9	/* auditing */
 #define NETLINK_ROUTE6		11	/* af_inet6 route comm channel */
+#define NETLINK_NPROC		12	/* /proc information */
 #define NETLINK_IP6_FW		13
 #define NETLINK_DNRTMSG		14	/* DECnet routing messages */
 #define NETLINK_TAPBASE		16	/* 16 to 31 are ethertap */
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-bk13/include/linux/nproc.h linux-2.6.9-rc1-bk13-nproc/include/linux/nproc.h
--- linux-2.6.9-rc1-bk13/include/linux/nproc.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.9-rc1-bk13-nproc/include/linux/nproc.h	2004-09-08 18:56:41.763526856 +0200
@@ -0,0 +1,119 @@
+#ifndef _LINUX_NPROC_H
+#define _LINUX_NPROC_H
+
+#include <linux/config.h>
+
+#ifndef __KERNEL__
+#define CONFIG_NPROC
+#endif
+
+#ifdef CONFIG_NPROC
+
+/* Request types */
+#define NPROC_BASE		0x10
+#define NPROC_GET_FIELD_LIST	(NPROC_BASE+0)
+#define NPROC_GET_LABEL		(NPROC_BASE+1)
+#define NPROC_GET_GLOBAL	(NPROC_BASE+2)
+#define NPROC_GET_PS		(NPROC_BASE+3)
+#define NPROC_GET_PID_LIST	(NPROC_BASE+4)
+
+/* Request flags */
+
+
+/* Field scopes */
+#define NPROC_SCOPE_MASK	0x70000000
+#define NPROC_SCOPE_GLOBAL	0x10000000	/* Global w/o arguments */
+#define NPROC_SCOPE_PROCESS	0x20000000
+#define NPROC_SCOPE_LABEL	0x30000000
+
+/* Data types */
+#define NPROC_TYPE_MASK		0x07000000
+#define NPROC_TYPE_STRING	0x01000000
+#define NPROC_TYPE_U32		0x02000000
+#define NPROC_TYPE_UL		0x03000000
+#define NPROC_TYPE_U64		0x04000000
+
+/* Access control (unused) */
+#define NPROC_PERM_MASK		0x00300000
+#define NPROC_PERM_USER		0x00100000
+#define NPROC_PERM_ROOT		0x00200000
+
+/* Selectors */
+#define NPROC_SELECT_ALL	0x00000001
+#define NPROC_SELECT_PID	0x00000002
+#define NPROC_SELECT_UID	0x00000003
+
+/* Labels */
+#define NPROC_LABEL_FIELD_NAME	0x00000001
+#define NPROC_LABEL_FIELD_FMT	0x00000002
+#define NPROC_LABEL_FIELD_UNIT	0x00000003
+#define NPROC_LABEL_WCHAN	0x00000004
+
+/* Field IDs (unique key in bits 0 - 15) */
+#define NPROC_NOP_UL		(0x00000020 | NPROC_TYPE_UL)
+#define NPROC_PID		(0x00000001 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_NAME		(0x00000002 | NPROC_TYPE_STRING | NPROC_SCOPE_PROCESS)
+/* Amount of free memory (pages) */
+#define NPROC_MEMFREE		(0x00000004 | NPROC_TYPE_U32    | NPROC_SCOPE_GLOBAL)
+/* Size of a page (bytes) */
+#define NPROC_PAGESIZE		(0x00000005 | NPROC_TYPE_U32    | NPROC_SCOPE_GLOBAL)
+/* There's no guarantee about anything with jiffies. Still useful for some. */
+#define NPROC_JIFFIES		(0x00000006 | NPROC_TYPE_U64    | NPROC_SCOPE_GLOBAL)
+/* Process: VM size (KiB) */
+#define NPROC_VMSIZE		(0x00000010 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+/* Process: locked memory (KiB) */
+#define NPROC_VMLOCK		(0x00000011 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+/* Process: Memory resident size (KiB) */
+#define NPROC_VMRSS		(0x00000012 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_VMDATA		(0x00000013 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_VMSTACK		(0x00000014 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_VMEXE		(0x00000015 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_VMLIB		(0x00000016 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_UID		(0x00000018 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_NR_DIRTY		(0x00000051 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_WRITEBACK	(0x00000052 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_UNSTABLE	(0x00000053 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_PG_TABLE_PGS	(0x00000054 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_MAPPED		(0x00000055 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_SLAB		(0x00000056 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_WCHAN		(0x00000080 | NPROC_TYPE_UL     | NPROC_SCOPE_PROCESS)
+#define NPROC_WCHAN_NAME	(0x00000081 | NPROC_TYPE_STRING)
+
+#ifdef __KERNEL__
+struct nproc_field {
+	__u32 id;
+	const char *label;
+	const char *fmt;
+	const char *unit;
+};
+
+static struct nproc_field labels[] = {
+	{ NPROC_PID,			"PID",		"%5u",	"" },
+	{ NPROC_NAME,			"Name",		"%-15s","" },
+	{ NPROC_MEMFREE,		"MemFree",	"%8u",	"page" },
+	{ NPROC_PAGESIZE,		"PageSize",	"%4u",	"byte" },
+	{ NPROC_JIFFIES,		"Jiffies",	"%10u",	"" },
+	{ NPROC_VMSIZE,			"VmSize",	"%8u",	"KiB" },
+	{ NPROC_VMLOCK,			"VmLock",	"%8u",	"KiB" },
+	{ NPROC_VMRSS,			"VmRSS",	"%8u",	"KiB" },
+	{ NPROC_VMDATA,			"VmData",	"%8u",	"KiB" },
+	{ NPROC_VMSTACK,		"VmStack",	"%8u",	"KiB" },
+	{ NPROC_VMEXE,			"VmExe",	"%8u",	"KiB" },
+	{ NPROC_VMLIB,			"VmLib",	"%8u",	"KiB" },
+	{ NPROC_UID,			"UID",		"%5u",	"" },
+	{ NPROC_NR_DIRTY,		"nr_dirty",	"%8d",	"page" },
+	{ NPROC_NR_WRITEBACK,		"nr_writeback",	"%8u",	"page" },
+	{ NPROC_NR_UNSTABLE,		"nr_unstable",	"%8u",	"page" },
+	{ NPROC_NR_PG_TABLE_PGS,	"nr_page_table_pages",	"%8u", "page" },
+	{ NPROC_NR_MAPPED,		"nr_mapped",	"%8u",	"page" },
+	{ NPROC_NR_SLAB,		"nr_slab",	"%8u",	"page" },
+	{ NPROC_WCHAN,			"wchan",	"%p",	"" },
+#ifdef CONFIG_KALLSYMS
+	{ NPROC_WCHAN_NAME,		"wchan_symbol",	"%s"},
+#endif
+};
+#endif /* __KERNEL__ */
+
+#endif /* CONFIG_NPROC */
+
+#endif /* _LINUX_NPROC_H */
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-bk13/include/linux/pid.h linux-2.6.9-rc1-bk13-nproc/include/linux/pid.h
--- linux-2.6.9-rc1-bk13/include/linux/pid.h	2004-09-06 18:53:17.000000000 +0200
+++ linux-2.6.9-rc1-bk13-nproc/include/linux/pid.h	2004-09-06 19:50:56.000000000 +0200
@@ -37,6 +37,7 @@ extern void FASTCALL(detach_pid(struct t
 extern struct pid *FASTCALL(find_pid(enum pid_type, int));
 
 extern int alloc_pidmap(void);
+extern void *get_pid_map(int);
 extern void FASTCALL(free_pidmap(int));
 extern void switch_exec_pids(struct task_struct *leader, struct task_struct *thread);
 
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-bk13/kernel/Makefile linux-2.6.9-rc1-bk13-nproc/kernel/Makefile
--- linux-2.6.9-rc1-bk13/kernel/Makefile	2004-09-06 18:53:17.000000000 +0200
+++ linux-2.6.9-rc1-bk13-nproc/kernel/Makefile	2004-09-06 19:50:56.000000000 +0200
@@ -15,6 +15,7 @@ obj-$(CONFIG_SMP) += cpu.o spinlock.o
 obj-$(CONFIG_UID16) += uid16.o
 obj-$(CONFIG_MODULES) += module.o
 obj-$(CONFIG_KALLSYMS) += kallsyms.o
+obj-$(CONFIG_NPROC) += nproc.o
 obj-$(CONFIG_PM) += power/
 obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
 obj-$(CONFIG_COMPAT) += compat.o
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-bk13/kernel/nproc.c linux-2.6.9-rc1-bk13-nproc/kernel/nproc.c
--- linux-2.6.9-rc1-bk13/kernel/nproc.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.9-rc1-bk13-nproc/kernel/nproc.c	2004-09-08 18:34:49.000000000 +0200
@@ -0,0 +1,851 @@
+/*
+ * nproc.c
+ *
+ * netlink interface to /proc information.
+ */
+
+#include <linux/skbuff.h>
+#include <net/sock.h>
+#include <linux/swap.h>		/* nr_free_pages() */
+#include <linux/kallsyms.h>	/* kallsyms_lookup() */
+#include <linux/pid.h>		/* get_pid_map() */
+#include <linux/nproc.h>
+#include <asm/bitops.h>
+
+//#define DEBUG
+
+/* There must be like 5 million dprintk definitions, so let's add some more */
+#ifdef DEBUG
+#define pdebug(x,args...) printk(KERN_DEBUG "%s:%d " x, __func__ , __LINE__, ##args)
+#define pwarn(x,args...) printk(KERN_WARNING "%s:%d " x, __func__ , __LINE__, ##args)
+#else
+#define pdebug(x,args...)
+#define pwarn(x,args...)
+#endif
+
+#define perror(x,args...) printk(KERN_ERR "%s:%d " x, __func__ , __LINE__, ##args)
+
+static struct sock *nproc_sock = NULL;
+
+struct task_mem {
+	u32	vmdata;
+	u32	vmstack;
+	u32	vmexe;
+	u32	vmlib;
+};
+
+struct task_mem_cheap {
+	u32	vmsize;
+	u32	vmlock;
+	u32	vmrss;
+};
+
+/*
+ * __task_mem/__task_mem_cheap basically duplicate the MMU version of
+ * task_mem, but they are split by cost and work on structs.
+ */
+
+static void __task_mem(struct task_struct *tsk, struct task_mem *res)
+{
+	struct mm_struct *mm = get_task_mm(tsk);
+	if (mm) {
+		unsigned long data = 0, stack = 0, exec = 0, lib = 0;
+		struct vm_area_struct *vma;
+
+		down_read(&mm->mmap_sem);
+		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			unsigned long len = (vma->vm_end - vma->vm_start) >> 10;
+			if (!vma->vm_file) {
+				data += len;
+				if (vma->vm_flags & VM_GROWSDOWN)
+					stack += len;
+				continue;
+			}
+			if (vma->vm_flags & VM_WRITE)
+				continue;
+			if (vma->vm_flags & VM_EXEC) {
+				exec += len;
+				if (vma->vm_flags & VM_EXECUTABLE)
+					continue;
+				lib += len;
+			}
+		}
+		res->vmdata = data - stack;
+		res->vmstack = stack;
+		res->vmexe = exec - lib;
+		res->vmlib = lib;
+		up_read(&mm->mmap_sem);
+
+		mmput(mm);
+	} else {
+		res->vmdata = 0;
+		res->vmstack = 0;
+		res->vmexe = 0;
+		res->vmlib = 0;
+	}
+}
+
+static void __task_mem_cheap(struct task_struct *tsk, struct task_mem_cheap *res)
+{
+	struct mm_struct *mm = get_task_mm(tsk);
+	if (mm) {
+		res->vmsize = mm->total_vm << (PAGE_SHIFT-10);
+		res->vmlock = mm->locked_vm << (PAGE_SHIFT-10);
+		res->vmrss = mm->rss << (PAGE_SHIFT-10);
+		mmput(mm);
+	} else {
+		res->vmsize = 0;
+		res->vmlock = 0;
+		res->vmrss = 0;
+	}
+}
+
+/*
+ * page_alloc.c already has an extra function broken out to fill a
+ * struct with information. Cool. Not sure whether pgpgin/pgpgout
+ * should be left as is or nailed down as kbytes.
+ */
+static struct page_state *__vmstat(void)
+{
+	struct page_state *ps;
+	ps = kmalloc(sizeof(*ps), GFP_KERNEL);
+	if (!ps)
+		return ERR_PTR(-ENOMEM);
+	get_full_page_state(ps);
+	ps->pgpgin /= 2;	/* sectors -> kbytes */
+	ps->pgpgout /= 2;
+	return ps;
+}
+
+/*
+ * Allocate and prefill an skb. The nlmsghdr provided to the function
+ * is a pointer to the respective struct in the request message.
+ */
+static struct sk_buff *nproc_alloc_nlmsg(struct nlmsghdr *nlh, u32 len)
+{
+	__u32 seq = nlh->nlmsg_seq;
+	__u16 type = nlh->nlmsg_type;
+	__u32 pid = nlh->nlmsg_pid;
+	struct sk_buff *skb2 = 0;
+
+	skb2 = alloc_skb(NLMSG_SPACE(len), GFP_KERNEL);
+	if (!skb2) {
+		skb2 = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+
+	NLMSG_PUT(skb2, pid, seq, type, NLMSG_ALIGN(len));
+out:
+	return skb2;
+
+nlmsg_failure:				/* Used by NLMSG_PUT */
+	kfree_skb(skb2);
+	return NULL;
+}
+
+#define mstore(value, id, buf)						\
+({									\
+	u32 _type = id & NPROC_TYPE_MASK;				\
+	switch (_type) {						\
+		case NPROC_TYPE_U32: {					\
+			__u32 *p = (u32 *)buf;				\
+			*p = value;					\
+			buf = (char *)++p;				\
+			break;						\
+		}							\
+		case NPROC_TYPE_UL: {					\
+			unsigned long *p = (unsigned long *)buf;	\
+			*p = value;					\
+			buf = (char *)++p;				\
+			break;						\
+		}							\
+		case NPROC_TYPE_U64: {					\
+			__u64 *p = (u64 *)buf;				\
+			*p = value;					\
+			buf = (char *)++p;				\
+			break;						\
+		}							\
+		default:						\
+			perror("Huh? Bad type!\n");			\
+	}								\
+})
+
+static char *nproc_ps_field(u32 id, char *buf, task_t *tsk)
+{
+	struct task_mem tsk_mem;
+	struct task_mem_cheap tsk_mem_cheap;
+
+	tsk_mem.vmdata = (~0);
+	tsk_mem_cheap.vmsize = (~0);
+
+	switch (id) {
+		case NPROC_PID:
+			mstore(tsk->pid, NPROC_PID, buf);
+			break;
+		case NPROC_UID:
+			mstore(tsk->uid, NPROC_UID, buf);
+			break;
+		case NPROC_VMSIZE:
+		case NPROC_VMLOCK:
+		case NPROC_VMRSS:
+			if (tsk_mem_cheap.vmsize == (~0))
+				__task_mem_cheap(tsk, &tsk_mem_cheap);
+
+			switch (id) {
+				case NPROC_VMSIZE:
+					mstore(tsk_mem_cheap.vmsize,
+							NPROC_VMSIZE, buf);
+					break;
+				case NPROC_VMLOCK:
+					mstore(tsk_mem_cheap.vmlock,
+							NPROC_VMLOCK, buf);
+					break;
+				case NPROC_VMRSS:
+					mstore(tsk_mem_cheap.vmrss,
+							NPROC_VMRSS, buf);
+					break;
+			}
+			break;
+		case NPROC_VMDATA:
+		case NPROC_VMSTACK:
+		case NPROC_VMEXE:
+		case NPROC_VMLIB:
+			if (tsk_mem.vmdata == (~0))
+					__task_mem(tsk, &tsk_mem);
+
+			switch (id) {
+				case NPROC_VMDATA:
+					mstore(tsk_mem.vmdata, NPROC_VMDATA,
+							buf);
+					break;
+				case NPROC_VMSTACK:
+					mstore(tsk_mem.vmstack, NPROC_VMSTACK,
+							buf);
+					break;
+				case NPROC_VMEXE:
+					mstore(tsk_mem.vmexe, NPROC_VMEXE, buf);
+					break;
+				case NPROC_VMLIB:
+					mstore(tsk_mem.vmlib, NPROC_VMLIB, buf);
+					break;
+			}
+			break;
+		case NPROC_JIFFIES:
+			mstore(get_jiffies_64(), NPROC_JIFFIES, buf);
+			break;
+		case NPROC_WCHAN:
+			mstore(get_wchan(tsk), NPROC_WCHAN, buf);
+			break;
+		case NPROC_NAME:
+			mstore(sizeof(tsk->comm), NPROC_TYPE_U32, buf);
+			strncpy(buf, tsk->comm, sizeof(tsk->comm));
+			buf += sizeof(tsk->comm);
+			break;
+		case NPROC_NOP_UL:
+			mstore(0, NPROC_TYPE_UL, buf);
+			break;
+		default:
+			pwarn("Unknown field ID %#x.\n", id);
+			goto err_inval;
+	}
+	return buf;
+err_inval:
+	return ERR_PTR(-EINVAL);
+}
+
+/*
+ * Build and send a netlink msg for one PID.
+ */
+static int nproc_pid_msg(struct nlmsghdr *nlh, u32 *fdata, u32 len, task_t *tsk)
+{
+	int i;
+	int err = 0;
+	struct sk_buff *skb2;
+	char *buf;
+	struct nlmsghdr *nlh2;
+	u32 fcnt, *fields;
+
+	fcnt = fdata[0];
+	fields = &fdata[1];
+
+	skb2 = nproc_alloc_nlmsg(nlh, len);
+	if (IS_ERR(skb2)) {
+		err = PTR_ERR(skb2);
+		goto out;
+	}
+	nlh2 = (struct nlmsghdr *)skb2->data;
+	buf = NLMSG_DATA(nlh2);
+
+	for (i = 0; i < fcnt; i++) {
+		buf = nproc_ps_field(fields[i], buf, tsk);
+		if (IS_ERR(buf)) {
+			err = PTR_ERR(buf);
+			goto out_free;
+		}
+	}
+	err = netlink_unicast(nproc_sock, skb2, nlh2->nlmsg_pid, 0);
+	if (err > 0)
+		err = 0;
+	return err;
+out_free:
+	kfree_skb(skb2);
+out:
+	return err;
+}
+
+/*
+ * Find task for given pid, grab task lock (caller must unlock).
+ */
+static task_t *nproc_ps_get_task(int pid)
+{
+	task_t *tsk;
+
+	read_lock(&tasklist_lock);
+	tsk = find_task_by_pid(pid);
+	if (tsk)
+		get_task_struct(tsk);
+	read_unlock(&tasklist_lock);
+	return tsk;
+}
+
+/*
+ * Iterate over a list of PIDs.
+ */
+static int nproc_ps_select_pid(struct nlmsghdr *nlh, u32 *fdata, u32 len, u32 left, u32 *sdata)
+{
+	int i;
+	int err = 0;
+	u32 tcnt;
+	u32 *pids;
+
+	if (left < sizeof(tcnt))
+		goto err_inval;
+	left -= sizeof(tcnt);
+
+	tcnt = sdata[0];
+
+	if (left < (tcnt * sizeof(u32)))
+		goto err_inval;
+	left -= tcnt * sizeof(u32);
+
+	if (left)
+		pwarn("%d bytes left.\n", left);
+
+	pids = &sdata[1];
+
+	for (i = 0; i < tcnt; i++) {
+		task_t *tsk;
+		tsk = nproc_ps_get_task(pids[i]);
+		if (!tsk)
+			continue;
+		err = nproc_pid_msg(nlh, fdata, len, tsk);
+		put_task_struct(tsk);
+		if (err)
+			goto out;
+	}
+
+out:
+	return err;
+
+err_inval:
+	return -EINVAL;
+}
+
+#define PIDMAP_ENTRIES (PID_MAX_LIMIT/PAGE_SIZE/8)
+#define BITS_PER_PAGE (PAGE_SIZE*8)
+
+/*
+ * Iterate over all PIDs.
+ */
+static int nproc_ps_select_all(struct nlmsghdr *nlh, u32 *fdata, u32 len)
+{
+	void *map;
+	int offset, i;
+	int err = 0;
+
+	for (i = 0; i < PIDMAP_ENTRIES; i++) {
+
+		map = get_pid_map(i);
+		if (!map)	/* done -- there are no holes in pidmap_array */
+			break;
+		if (IS_ERR(map))	/* No PIDs used in this map */
+			continue;
+		offset = 0;
+		for ( ; ; ) {
+			int pid;
+			task_t *tsk;
+			offset = find_next_bit(map, BITS_PER_PAGE, ++offset);
+			if (offset >= BITS_PER_PAGE)
+				break;
+			pid = offset + i * BITS_PER_PAGE;
+			tsk = nproc_ps_get_task(pid);
+			if (!tsk)
+				continue;
+			err = nproc_pid_msg(nlh, fdata, len, tsk);
+			put_task_struct(tsk);
+			if (err)
+				goto out;
+		}
+	}
+
+out:
+	return err;
+}
+
+static u32 __reply_size_special(u32 id)
+{
+	u32 len = 0;
+
+	switch (id) {
+		case NPROC_NAME:
+			len = sizeof(u32) +
+				sizeof(((struct task_struct*)0)->comm);
+			break;
+		default:
+			pwarn("Unknown field size in %#x.\n", id);
+	}
+	return len;
+}
+
+/*
+ * Calculates the size of a reply message payload. Alternatively, we could have
+ * the user space caller supply a number along with the request and bail
+ * out or realloc later if we find the allocation was too small. More
+ * responsibility in user space, but faster.
+ */
+static u32 *__reply_size (u32 *data, u32 *left, u32 *len)
+{
+	u32 *fields;
+	u32 fcnt;
+	int i;
+	*len = 0;
+
+	if (*left < sizeof(fcnt))
+		goto err_inval;
+	*left -= sizeof(fcnt);
+
+	fcnt = data[0];
+
+	if (*left < (fcnt * sizeof(u32)))
+		goto err_inval;
+	*left -= fcnt * sizeof(u32);
+
+	fields = &data[1];
+
+	for (i = 0; i < fcnt; i++) {
+		u32 id = fields[i];
+		u32 type = id & NPROC_TYPE_MASK;
+		pdebug("        %#8.8x.\n", fields[i]);
+		switch (type) {
+			case NPROC_TYPE_U32:
+				*len += sizeof(u32);
+				break;
+			case NPROC_TYPE_UL:
+				*len += sizeof(unsigned long);
+				break;
+			case NPROC_TYPE_U64:
+				*len += sizeof(u64);
+				break;
+			default: {		/* Special cases */
+				u32 slen;
+				slen = __reply_size_special(id);
+				if (slen)
+					*len += slen;
+				else
+					goto err_inval;
+			}
+		}
+	}
+
+	return &fields[fcnt];
+
+err_inval:
+	return ERR_PTR(-EINVAL);
+}
+
+/*
+ * Call the chosen process selector. Adding additional selectors
+ * (e.g. select by uid) is easy, but is there a need?
+ */
+static int nproc_get_ps(struct nlmsghdr *nlh, uid_t uid)
+{
+	int err;
+	u32 len;
+	u32 *data = NLMSG_DATA(nlh);
+	u32 *sdata;
+	u32 left = nlh->nlmsg_len - sizeof(*nlh);
+
+
+	sdata = __reply_size(data, &left, &len);
+	if (IS_ERR(sdata)) {
+		err = PTR_ERR(sdata);
+		goto out;
+	}
+
+	if (left < sizeof(u32))
+		goto err_inval;
+	left -= sizeof(u32);
+
+	switch (*sdata) {
+		case NPROC_SELECT_ALL:
+			if (left)
+				pwarn("%d bytes left.\n", left);
+			err = nproc_ps_select_all(nlh, data, len);
+			break;
+		case NPROC_SELECT_PID:
+			err = nproc_ps_select_pid(nlh, data, len,
+					left, sdata + 1);
+			break;
+		default:
+			pwarn("Unknown selection method %#x.\n", *sdata);
+			goto err_inval;
+	}
+
+out:
+	return err;
+
+err_inval:
+	return -EINVAL;
+}
+
+static char *nproc_global_field(u32 id, char *buf)
+{
+	struct page_state *ps = NULL;
+
+	switch (id) {
+		case NPROC_NR_DIRTY:
+		case NPROC_NR_WRITEBACK:
+		case NPROC_NR_UNSTABLE:
+		case NPROC_NR_PG_TABLE_PGS:
+		case NPROC_NR_MAPPED:
+		case NPROC_NR_SLAB:
+			if (!ps) {
+				ps = __vmstat();
+				if (IS_ERR(ps)) {	/* Just pass it on */
+					buf = (void *)ps;
+					ps = NULL;
+					goto out;
+				}
+			}
+			switch (id) {
+				case NPROC_NR_DIRTY:
+					mstore(ps->nr_dirty, NPROC_NR_DIRTY,
+							buf);
+					break;
+				case NPROC_NR_WRITEBACK:
+					mstore(ps->nr_writeback,
+							NPROC_NR_WRITEBACK,
+							buf);
+					break;
+				case NPROC_NR_UNSTABLE:
+					mstore(ps->nr_unstable,
+							NPROC_NR_UNSTABLE,
+							buf);
+					break;
+				case NPROC_NR_PG_TABLE_PGS:
+					mstore(ps->nr_page_table_pages,
+							NPROC_NR_PG_TABLE_PGS,
+							buf);
+					break;
+				case NPROC_NR_MAPPED:
+					mstore(ps->nr_mapped, NPROC_NR_MAPPED,
+							buf);
+					break;
+				case NPROC_NR_SLAB:
+					mstore(ps->nr_slab, NPROC_NR_SLAB, buf);
+					break;
+			}
+			break;
+		case NPROC_MEMFREE:
+			mstore(nr_free_pages(), NPROC_MEMFREE, buf);
+			break;
+		case NPROC_PAGESIZE:
+			mstore(PAGE_SIZE, NPROC_PAGESIZE, buf);
+			break;
+		case NPROC_JIFFIES:
+			mstore(get_jiffies_64(), NPROC_JIFFIES, buf);
+			break;
+		default:
+			pwarn("Unknown field ID %#x.\n", id);
+			buf = ERR_PTR(-EINVAL);
+			goto out;
+	}
+	kfree(ps);
+out:
+	return buf;
+}
+
+static int nproc_get_global(struct nlmsghdr *nlh)
+{
+	int err, i;
+	void *errp;
+	struct sk_buff *skb2;
+	char *buf;
+	u32 fcnt, len;
+	u32 *data = NLMSG_DATA(nlh);
+	u32 *fields;
+	u32 left = nlh->nlmsg_len - sizeof(*nlh);
+
+	errp = __reply_size(data, &left, &len);
+	if (IS_ERR(errp)) {
+		err = PTR_ERR(errp);
+		goto out;
+	}
+	if (left)
+		pwarn("%d bytes left.\n", left);
+
+	fcnt = data[0];
+	fields = &data[1];
+
+	skb2 = nproc_alloc_nlmsg(nlh, len);
+	if (IS_ERR(skb2)) {
+		err = PTR_ERR(skb2);
+		goto out;
+	}
+
+	buf = NLMSG_DATA((struct nlmsghdr *)skb2->data);
+
+	for (i = 0; i < fcnt; i++) {
+		buf = nproc_global_field(fields[i], buf);
+		if (IS_ERR(buf)) {
+			err = PTR_ERR(buf);
+			kfree_skb(skb2);
+			goto out;
+		}
+	}
+
+	err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0);
+	if (err > 0)
+		err = 0;
+out:
+	return err;
+}
+
+static int find_id(__u32 *data, __u32 *left)
+{
+	int i;
+	u32 id;
+
+	if (*left < sizeof(id))
+		goto err_inval;
+	*left -= sizeof(sizeof(id));
+
+	if (*left)
+		pwarn("%d bytes left.\n", *left);
+	id = data[1];
+
+	for (i = 0; i < ARRAY_SIZE(labels) && labels[i].id != id; i++)
+		;	/* Do nothing */
+
+	if (labels[i].id != id) {
+		pwarn("No matching label found for %#x.\n", id);
+		goto err_inval;
+	}
+
+	return i;
+
+err_inval:
+	return -EINVAL;
+}
+
+
+static int nproc_get_label(struct nlmsghdr *nlh)
+{
+	int err;
+	struct sk_buff *skb2;
+	const char *label;
+	char *buf;
+	int len;
+	u32 ltype;
+	u32 *data = NLMSG_DATA(nlh);
+	u32 left = nlh->nlmsg_len - sizeof(*nlh);
+
+	if (left < sizeof(ltype))
+		goto err_inval;
+	left -= sizeof(ltype);
+
+	ltype = data[0];
+
+	if (ltype == NPROC_LABEL_FIELD_NAME) {
+		int idx;
+		idx = find_id(data, &left);
+		if (idx < 0)
+			goto err_inval;
+		label = labels[idx].label;
+	}
+	else if (ltype == NPROC_LABEL_FIELD_UNIT) {
+		int idx;
+		idx = find_id(data, &left);
+		if (idx < 0)
+			goto err_inval;
+		label = labels[idx].unit;
+	}
+	else if (ltype == NPROC_LABEL_FIELD_FMT) {
+		int idx;
+		idx = find_id(data, &left);
+		if (idx < 0)
+			goto err_inval;
+		label = labels[idx].fmt;
+	}
+	else if (ltype == NPROC_LABEL_WCHAN) {
+		char *modname;
+		unsigned long wchan, size, offset;
+		char namebuf[128];
+
+		if (left < sizeof(unsigned long))
+			goto err_inval;
+		left -= sizeof(unsigned long);
+
+		if (left)
+			pwarn("%d bytes left.\n", left);
+
+		wchan = (unsigned long)data[1];
+		label = kallsyms_lookup(wchan, &size, &offset, &modname,
+				namebuf);
+
+		if (!label) {
+			pwarn("No ksym found for %#lx.\n", wchan);
+			goto err_inval;
+		}
+	}
+	else {
+		pwarn("Unknown label type %#x.\n", ltype);
+		goto err_inval;
+	}
+
+	len = strlen(label) + 1;
+
+	skb2 = nproc_alloc_nlmsg(nlh, len);
+	if (IS_ERR(skb2)) {
+		err = PTR_ERR(skb2);
+		goto out;
+	}
+
+	buf = NLMSG_DATA((struct nlmsghdr *)skb2->data);
+
+	strncpy(buf, label, len);
+
+	err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0);
+	if (err > 0)
+		err = 0;
+out:
+	return err;
+
+err_inval:
+	return -EINVAL;
+}
+
+static int nproc_get_list(struct nlmsghdr *nlh)
+{
+	int err, i, cnt, len;
+	struct sk_buff *skb2;
+	u32 *buf;
+
+	cnt = ARRAY_SIZE(labels);
+	len = (cnt + 1) * sizeof(u32);
+
+	skb2 = nproc_alloc_nlmsg(nlh, len);
+	if (IS_ERR(skb2)) {
+		err = PTR_ERR(skb2);
+		goto out;
+	}
+
+	buf = NLMSG_DATA((struct nlmsghdr *)skb2->data);
+	buf[0] = cnt;
+	for (i = 0; i < cnt; i++)
+		buf[i + 1] = labels[i].id;
+
+	err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0);
+	if (err > 0)
+		err = 0;
+out:
+	return err;
+}
+
+static __inline__ int nproc_process_msg(struct sk_buff *skb,
+		struct nlmsghdr *nlh)
+{
+	int err = 0;
+	uid_t uid;
+	kernel_cap_t caps;
+
+	if (!(nlh->nlmsg_flags & NLM_F_REQUEST))
+		goto out;
+
+	nlh->nlmsg_pid = NETLINK_CB(skb).pid;
+	uid = NETLINK_CB(skb).creds.uid;
+	caps = NETLINK_CB(skb).eff_cap;
+
+	switch (nlh->nlmsg_type) {
+		case NPROC_GET_FIELD_LIST:
+			err = nproc_get_list(nlh);
+			break;
+		case NPROC_GET_LABEL:
+			err = nproc_get_label(nlh);
+			break;
+		case NPROC_GET_GLOBAL:
+			err = nproc_get_global(nlh);
+			break;
+		case NPROC_GET_PS:
+			err = nproc_get_ps(nlh, uid);
+			break;
+		default:
+			pwarn("Unknown msg type %#x.\n", nlh->nlmsg_type);
+			err = -EINVAL;
+	}
+out:
+	return err;
+
+}
+
+static int nproc_receive_skb(struct sk_buff *skb)
+{
+	int err = 0;
+	struct nlmsghdr *nlh;
+
+	if (skb->len < NLMSG_LENGTH(0))
+		goto err_inval;
+
+	nlh = (struct nlmsghdr *)skb->data;
+	if (skb->len < nlh->nlmsg_len || nlh->nlmsg_len < sizeof(*nlh)){
+		pwarn("Invalid packet.\n");
+		goto err_inval;
+	}
+
+	err = nproc_process_msg(skb, nlh);
+	if (err || nlh->nlmsg_flags & NLM_F_ACK) {
+		pwarn("err %d, type %#x, flags %#x, seq %#x.\n", err,
+				nlh->nlmsg_type, nlh->nlmsg_flags,
+				nlh->nlmsg_seq);
+		netlink_ack(skb, nlh, err);
+	}
+
+	return err;
+
+err_inval:
+	return -EINVAL;
+}
+
+static void nproc_receive(struct sock *sk, int len)
+{
+	struct sk_buff *skb;
+
+	while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) {
+		nproc_receive_skb(skb);
+		kfree_skb(skb);
+	}
+}
+
+static int nproc_init(void)
+{
+	nproc_sock = netlink_kernel_create(NETLINK_NPROC, nproc_receive);
+
+	if (!nproc_sock) {
+		pwarn("No netlink socket for nproc.\n");
+		return -ENODEV;
+	}
+
+	return 0;
+}
+
+module_init(nproc_init);
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-bk13/kernel/pid.c linux-2.6.9-rc1-bk13-nproc/kernel/pid.c
--- linux-2.6.9-rc1-bk13/kernel/pid.c	2004-09-06 18:53:17.000000000 +0200
+++ linux-2.6.9-rc1-bk13-nproc/kernel/pid.c	2004-09-06 19:52:59.000000000 +0200
@@ -146,6 +146,17 @@ failure:
 	return -1;
 }
 
+void *get_pid_map(int idx)
+{
+	pidmap_t *map = pidmap_array + idx;
+	if (!map->page)
+		return NULL;
+	else if (atomic_read(&map->nr_free) == BITS_PER_PAGE)
+		return ERR_PTR(-1);
+	else
+		return map->page;
+}
+
 struct pid * fastcall find_pid(enum pid_type type, int nr)
 {
 	struct hlist_node *elem;
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-bk13/init/Kconfig linux-2.6.9-rc1-bk13-nproc/init/Kconfig
--- linux-2.6.9-rc1-bk13/init/Kconfig	2004-09-06 18:53:17.000000000 +0200
+++ linux-2.6.9-rc1-bk13-nproc/init/Kconfig	2004-09-06 19:50:56.000000000 +0200
@@ -139,6 +139,13 @@ config SYSCTL
 	  building a kernel for install/rescue disks or your system is very
 	  limited in memory.
 
+config NPROC
+	bool "Netlink interface to /proc information"
+	depends on PROC_FS && EXPERIMENTAL
+	default y
+	help
+	  Nproc is a netlink interface to /proc information.
+
 config AUDIT
 	bool "Auditing support"
 	default y if SECURITY_SELINUX


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-08 18:41 ` [1/1][PATCH] " Roger Luethi
@ 2004-09-09  0:35   ` William Lee Irwin III
  2004-09-09  0:43     ` William Lee Irwin III
  2004-09-09 18:43     ` [1/1][PATCH] nproc v2: netlink access to /proc information Roger Luethi
  2004-09-09 11:53   ` Stephen Smalley
  1 sibling, 2 replies; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-09  0:35 UTC (permalink / raw)
  To: Roger Luethi; +Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Wed, Sep 08, 2004 at 08:41:30PM +0200, Roger Luethi wrote:
> A few notes:
> - Access control can be implemented easily. Right now it would be bloat,
>   though -- the vast majority of fields in /proc are world-readable
>   (/proc/pid/environ being the notable exception).
> - Additional process selectors (e.g. select by UID) are not hard to
>   add, either, should there ever be a need.
> - There are a few things I'm not sure about: For instance, what is a good
>   return value for mm_struct related fields wrt kernel threads? I picked
>   0, but ~(0) might be preferable because it's distinct.
> Signed-off-by: Roger Luethi <rl@hellgate.ch>

Any chance you could convert these to use the new vm statistics
accounting?


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-09  0:35   ` William Lee Irwin III
@ 2004-09-09  0:43     ` William Lee Irwin III
  2004-09-09  1:15       ` William Lee Irwin III
  2004-09-09 18:43     ` [1/1][PATCH] nproc v2: netlink access to /proc information Roger Luethi
  1 sibling, 1 reply; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-09  0:43 UTC (permalink / raw)
  To: Roger Luethi; +Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Wed, Sep 08, 2004 at 05:35:29PM -0700, William Lee Irwin III wrote:
> Any chance you could convert these to use the new vm statistics
> accounting?

Hmm, there's a more serious issue; CONFIG_MMU=n will barf on these.
For that you will need to #ifdef on CONFIG_MMU and use the methods
in fs/proc/task_nommu.c and so on.


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-09  0:43     ` William Lee Irwin III
@ 2004-09-09  1:15       ` William Lee Irwin III
  2004-09-09  1:17         ` [1/2] rediff nproc v2 vs. 2.6.9-rc1-mm4 William Lee Irwin III
  0 siblings, 1 reply; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-09  1:15 UTC (permalink / raw)
  To: Roger Luethi; +Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Wed, Sep 08, 2004 at 05:35:29PM -0700, William Lee Irwin III wrote:
>> Any chance you could convert these to use the new vm statistics
>> accounting?

On Wed, Sep 08, 2004 at 05:43:20PM -0700, William Lee Irwin III wrote:
> Hmm, there's a more serious issue; CONFIG_MMU=n will barf on these.
> For that you will need to #ifdef on CONFIG_MMU and use the methods
> in fs/proc/task_nommu.c and so on.

This is a straight rediff of nproc vs. 2.6.9-rc1-mm4. No changes
whatsoever to the underlying code were made; rather, this merely
resolves offsets so it applies cleanly.

Compiletested on ia64.


-- wli

Index: mm4-2.6.9-rc1/include/linux/netlink.h
===================================================================
--- mm4-2.6.9-rc1.orig/include/linux/netlink.h	2004-09-08 06:10:50.000000000 -0700
+++ mm4-2.6.9-rc1/include/linux/netlink.h	2004-09-08 17:45:27.500658296 -0700
@@ -15,6 +15,7 @@
 #define NETLINK_ARPD		8
 #define NETLINK_AUDIT		9	/* auditing */
 #define NETLINK_ROUTE6		11	/* af_inet6 route comm channel */
+#define NETLINK_NPROC		12	/* /proc information */
 #define NETLINK_IP6_FW		13
 #define NETLINK_DNRTMSG		14	/* DECnet routing messages */
 #define NETLINK_KEVENT		15	/* Kernel messages to userspace */
Index: mm4-2.6.9-rc1/include/linux/nproc.h
===================================================================
--- mm4-2.6.9-rc1.orig/include/linux/nproc.h	2004-04-25 12:31:02.000000000 -0700
+++ mm4-2.6.9-rc1/include/linux/nproc.h	2004-09-08 17:45:27.501634858 -0700
@@ -0,0 +1,119 @@
+#ifndef _LINUX_NPROC_H
+#define _LINUX_NPROC_H
+
+#include <linux/config.h>
+
+#ifndef __KERNEL__
+#define CONFIG_NPROC
+#endif
+
+#ifdef CONFIG_NPROC
+
+/* Request types */
+#define NPROC_BASE		0x10
+#define NPROC_GET_FIELD_LIST	(NPROC_BASE+0)
+#define NPROC_GET_LABEL		(NPROC_BASE+1)
+#define NPROC_GET_GLOBAL	(NPROC_BASE+2)
+#define NPROC_GET_PS		(NPROC_BASE+3)
+#define NPROC_GET_PID_LIST	(NPROC_BASE+4)
+
+/* Request flags */
+
+
+/* Field scopes */
+#define NPROC_SCOPE_MASK	0x70000000
+#define NPROC_SCOPE_GLOBAL	0x10000000	/* Global w/o arguments */
+#define NPROC_SCOPE_PROCESS	0x20000000
+#define NPROC_SCOPE_LABEL	0x30000000
+
+/* Data types */
+#define NPROC_TYPE_MASK		0x07000000
+#define NPROC_TYPE_STRING	0x01000000
+#define NPROC_TYPE_U32		0x02000000
+#define NPROC_TYPE_UL		0x03000000
+#define NPROC_TYPE_U64		0x04000000
+
+/* Access control (unused) */
+#define NPROC_PERM_MASK		0x00300000
+#define NPROC_PERM_USER		0x00100000
+#define NPROC_PERM_ROOT		0x00200000
+
+/* Selectors */
+#define NPROC_SELECT_ALL	0x00000001
+#define NPROC_SELECT_PID	0x00000002
+#define NPROC_SELECT_UID	0x00000003
+
+/* Labels */
+#define NPROC_LABEL_FIELD_NAME	0x00000001
+#define NPROC_LABEL_FIELD_FMT	0x00000002
+#define NPROC_LABEL_FIELD_UNIT	0x00000003
+#define NPROC_LABEL_WCHAN	0x00000004
+
+/* Field IDs (unique key in bits 0 - 15) */
+#define NPROC_NOP_UL		(0x00000020 | NPROC_TYPE_UL)
+#define NPROC_PID		(0x00000001 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_NAME		(0x00000002 | NPROC_TYPE_STRING | NPROC_SCOPE_PROCESS)
+/* Amount of free memory (pages) */
+#define NPROC_MEMFREE		(0x00000004 | NPROC_TYPE_U32    | NPROC_SCOPE_GLOBAL)
+/* Size of a page (bytes) */
+#define NPROC_PAGESIZE		(0x00000005 | NPROC_TYPE_U32    | NPROC_SCOPE_GLOBAL)
+/* There's no guarantee about anything with jiffies. Still useful for some. */
+#define NPROC_JIFFIES		(0x00000006 | NPROC_TYPE_U64    | NPROC_SCOPE_GLOBAL)
+/* Process: VM size (KiB) */
+#define NPROC_VMSIZE		(0x00000010 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+/* Process: locked memory (KiB) */
+#define NPROC_VMLOCK		(0x00000011 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+/* Process: Memory resident size (KiB) */
+#define NPROC_VMRSS		(0x00000012 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_VMDATA		(0x00000013 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_VMSTACK		(0x00000014 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_VMEXE		(0x00000015 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_VMLIB		(0x00000016 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_UID		(0x00000018 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_NR_DIRTY		(0x00000051 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_WRITEBACK	(0x00000052 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_UNSTABLE	(0x00000053 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_PG_TABLE_PGS	(0x00000054 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_MAPPED		(0x00000055 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_SLAB		(0x00000056 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_WCHAN		(0x00000080 | NPROC_TYPE_UL     | NPROC_SCOPE_PROCESS)
+#define NPROC_WCHAN_NAME	(0x00000081 | NPROC_TYPE_STRING)
+
+#ifdef __KERNEL__
+struct nproc_field {
+	__u32 id;
+	const char *label;
+	const char *fmt;
+	const char *unit;
+};
+
+static struct nproc_field labels[] = {
+	{ NPROC_PID,			"PID",		"%5u",	"" },
+	{ NPROC_NAME,			"Name",		"%-15s","" },
+	{ NPROC_MEMFREE,		"MemFree",	"%8u",	"page" },
+	{ NPROC_PAGESIZE,		"PageSize",	"%4u",	"byte" },
+	{ NPROC_JIFFIES,		"Jiffies",	"%10u",	"" },
+	{ NPROC_VMSIZE,			"VmSize",	"%8u",	"KiB" },
+	{ NPROC_VMLOCK,			"VmLock",	"%8u",	"KiB" },
+	{ NPROC_VMRSS,			"VmRSS",	"%8u",	"KiB" },
+	{ NPROC_VMDATA,			"VmData",	"%8u",	"KiB" },
+	{ NPROC_VMSTACK,		"VmStack",	"%8u",	"KiB" },
+	{ NPROC_VMEXE,			"VmExe",	"%8u",	"KiB" },
+	{ NPROC_VMLIB,			"VmLib",	"%8u",	"KiB" },
+	{ NPROC_UID,			"UID",		"%5u",	"" },
+	{ NPROC_NR_DIRTY,		"nr_dirty",	"%8d",	"page" },
+	{ NPROC_NR_WRITEBACK,		"nr_writeback",	"%8u",	"page" },
+	{ NPROC_NR_UNSTABLE,		"nr_unstable",	"%8u",	"page" },
+	{ NPROC_NR_PG_TABLE_PGS,	"nr_page_table_pages",	"%8u", "page" },
+	{ NPROC_NR_MAPPED,		"nr_mapped",	"%8u",	"page" },
+	{ NPROC_NR_SLAB,		"nr_slab",	"%8u",	"page" },
+	{ NPROC_WCHAN,			"wchan",	"%p",	"" },
+#ifdef CONFIG_KALLSYMS
+	{ NPROC_WCHAN_NAME,		"wchan_symbol",	"%s"},
+#endif
+};
+#endif /* __KERNEL__ */
+
+#endif /* CONFIG_NPROC */
+
+#endif /* _LINUX_NPROC_H */
Index: mm4-2.6.9-rc1/include/linux/pid.h
===================================================================
--- mm4-2.6.9-rc1.orig/include/linux/pid.h	2004-09-08 06:10:36.000000000 -0700
+++ mm4-2.6.9-rc1/include/linux/pid.h	2004-09-08 17:45:27.501634858 -0700
@@ -37,6 +37,7 @@
 extern struct pid *FASTCALL(find_pid(enum pid_type, int));
 
 extern int alloc_pidmap(void);
+extern void *get_pid_map(int);
 extern void FASTCALL(free_pidmap(int));
 extern void switch_exec_pids(struct task_struct *leader, struct task_struct *thread);
 
Index: mm4-2.6.9-rc1/init/Kconfig
===================================================================
--- mm4-2.6.9-rc1.orig/init/Kconfig	2004-09-08 06:10:50.000000000 -0700
+++ mm4-2.6.9-rc1/init/Kconfig	2004-09-08 17:45:27.504564546 -0700
@@ -139,6 +139,13 @@
 	  building a kernel for install/rescue disks or your system is very
 	  limited in memory.
 
+config NPROC
+	bool "Netlink interface to /proc information"
+	depends on PROC_FS && EXPERIMENTAL
+	default y
+	help
+	  Nproc is a netlink interface to /proc information.
+
 config AUDIT
 	bool "Auditing support"
 	default y if SECURITY_SELINUX
Index: mm4-2.6.9-rc1/kernel/Makefile
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/Makefile	2004-09-08 06:10:50.000000000 -0700
+++ mm4-2.6.9-rc1/kernel/Makefile	2004-09-08 17:45:27.501634858 -0700
@@ -15,6 +15,7 @@
 obj-$(CONFIG_UID16) += uid16.o
 obj-$(CONFIG_MODULES) += module.o
 obj-$(CONFIG_KALLSYMS) += kallsyms.o
+obj-$(CONFIG_NPROC) += nproc.o
 obj-$(CONFIG_PM) += power/
 obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
 obj-$(CONFIG_KEXEC) += kexec.o
Index: mm4-2.6.9-rc1/kernel/nproc.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/nproc.c	2004-04-25 12:31:02.000000000 -0700
+++ mm4-2.6.9-rc1/kernel/nproc.c	2004-09-08 17:45:27.503587983 -0700
@@ -0,0 +1,851 @@
+/*
+ * nproc.c
+ *
+ * netlink interface to /proc information.
+ */
+
+#include <linux/skbuff.h>
+#include <net/sock.h>
+#include <linux/swap.h>		/* nr_free_pages() */
+#include <linux/kallsyms.h>	/* kallsyms_lookup() */
+#include <linux/pid.h>		/* get_pid_map() */
+#include <linux/nproc.h>
+#include <asm/bitops.h>
+
+//#define DEBUG
+
+/* There must be like 5 million dprintk definitions, so let's add some more */
+#ifdef DEBUG
+#define pdebug(x,args...) printk(KERN_DEBUG "%s:%d " x, __func__ , __LINE__, ##args)
+#define pwarn(x,args...) printk(KERN_WARNING "%s:%d " x, __func__ , __LINE__, ##args)
+#else
+#define pdebug(x,args...)
+#define pwarn(x,args...)
+#endif
+
+#define perror(x,args...) printk(KERN_ERR "%s:%d " x, __func__ , __LINE__, ##args)
+
+static struct sock *nproc_sock = NULL;
+
+struct task_mem {
+	u32	vmdata;
+	u32	vmstack;
+	u32	vmexe;
+	u32	vmlib;
+};
+
+struct task_mem_cheap {
+	u32	vmsize;
+	u32	vmlock;
+	u32	vmrss;
+};
+
+/*
+ * __task_mem/__task_mem_cheap basically duplicate the MMU version of
+ * task_mem, but they are split by cost and work on structs.
+ */
+
+static void __task_mem(struct task_struct *tsk, struct task_mem *res)
+{
+	struct mm_struct *mm = get_task_mm(tsk);
+	if (mm) {
+		unsigned long data = 0, stack = 0, exec = 0, lib = 0;
+		struct vm_area_struct *vma;
+
+		down_read(&mm->mmap_sem);
+		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			unsigned long len = (vma->vm_end - vma->vm_start) >> 10;
+			if (!vma->vm_file) {
+				data += len;
+				if (vma->vm_flags & VM_GROWSDOWN)
+					stack += len;
+				continue;
+			}
+			if (vma->vm_flags & VM_WRITE)
+				continue;
+			if (vma->vm_flags & VM_EXEC) {
+				exec += len;
+				if (vma->vm_flags & VM_EXECUTABLE)
+					continue;
+				lib += len;
+			}
+		}
+		res->vmdata = data - stack;
+		res->vmstack = stack;
+		res->vmexe = exec - lib;
+		res->vmlib = lib;
+		up_read(&mm->mmap_sem);
+
+		mmput(mm);
+	} else {
+		res->vmdata = 0;
+		res->vmstack = 0;
+		res->vmexe = 0;
+		res->vmlib = 0;
+	}
+}
+
+static void __task_mem_cheap(struct task_struct *tsk, struct task_mem_cheap *res)
+{
+	struct mm_struct *mm = get_task_mm(tsk);
+	if (mm) {
+		res->vmsize = mm->total_vm << (PAGE_SHIFT-10);
+		res->vmlock = mm->locked_vm << (PAGE_SHIFT-10);
+		res->vmrss = mm->rss << (PAGE_SHIFT-10);
+		mmput(mm);
+	} else {
+		res->vmsize = 0;
+		res->vmlock = 0;
+		res->vmrss = 0;
+	}
+}
+
+/*
+ * page_alloc.c already has an extra function broken out to fill a
+ * struct with information. Cool. Not sure whether pgpgin/pgpgout
+ * should be left as is or nailed down as kbytes.
+ */
+static struct page_state *__vmstat(void)
+{
+	struct page_state *ps;
+	ps = kmalloc(sizeof(*ps), GFP_KERNEL);
+	if (!ps)
+		return ERR_PTR(-ENOMEM);
+	get_full_page_state(ps);
+	ps->pgpgin /= 2;	/* sectors -> kbytes */
+	ps->pgpgout /= 2;
+	return ps;
+}
+
+/*
+ * Allocate and prefill an skb. The nlmsghdr provided to the function
+ * is a pointer to the respective struct in the request message.
+ */
+static struct sk_buff *nproc_alloc_nlmsg(struct nlmsghdr *nlh, u32 len)
+{
+	__u32 seq = nlh->nlmsg_seq;
+	__u16 type = nlh->nlmsg_type;
+	__u32 pid = nlh->nlmsg_pid;
+	struct sk_buff *skb2 = 0;
+
+	skb2 = alloc_skb(NLMSG_SPACE(len), GFP_KERNEL);
+	if (!skb2) {
+		skb2 = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+
+	NLMSG_PUT(skb2, pid, seq, type, NLMSG_ALIGN(len));
+out:
+	return skb2;
+
+nlmsg_failure:				/* Used by NLMSG_PUT */
+	kfree_skb(skb2);
+	return NULL;
+}
+
+#define mstore(value, id, buf)						\
+({									\
+	u32 _type = id & NPROC_TYPE_MASK;				\
+	switch (_type) {						\
+		case NPROC_TYPE_U32: {					\
+			__u32 *p = (u32 *)buf;				\
+			*p = value;					\
+			buf = (char *)++p;				\
+			break;						\
+		}							\
+		case NPROC_TYPE_UL: {					\
+			unsigned long *p = (unsigned long *)buf;	\
+			*p = value;					\
+			buf = (char *)++p;				\
+			break;						\
+		}							\
+		case NPROC_TYPE_U64: {					\
+			__u64 *p = (u64 *)buf;				\
+			*p = value;					\
+			buf = (char *)++p;				\
+			break;						\
+		}							\
+		default:						\
+			perror("Huh? Bad type!\n");			\
+	}								\
+})
+
+static char *nproc_ps_field(u32 id, char *buf, task_t *tsk)
+{
+	struct task_mem tsk_mem;
+	struct task_mem_cheap tsk_mem_cheap;
+
+	tsk_mem.vmdata = (~0);
+	tsk_mem_cheap.vmsize = (~0);
+
+	switch (id) {
+		case NPROC_PID:
+			mstore(tsk->pid, NPROC_PID, buf);
+			break;
+		case NPROC_UID:
+			mstore(tsk->uid, NPROC_UID, buf);
+			break;
+		case NPROC_VMSIZE:
+		case NPROC_VMLOCK:
+		case NPROC_VMRSS:
+			if (tsk_mem_cheap.vmsize == (~0))
+				__task_mem_cheap(tsk, &tsk_mem_cheap);
+
+			switch (id) {
+				case NPROC_VMSIZE:
+					mstore(tsk_mem_cheap.vmsize,
+							NPROC_VMSIZE, buf);
+					break;
+				case NPROC_VMLOCK:
+					mstore(tsk_mem_cheap.vmlock,
+							NPROC_VMLOCK, buf);
+					break;
+				case NPROC_VMRSS:
+					mstore(tsk_mem_cheap.vmrss,
+							NPROC_VMRSS, buf);
+					break;
+			}
+			break;
+		case NPROC_VMDATA:
+		case NPROC_VMSTACK:
+		case NPROC_VMEXE:
+		case NPROC_VMLIB:
+			if (tsk_mem.vmdata == (~0))
+					__task_mem(tsk, &tsk_mem);
+
+			switch (id) {
+				case NPROC_VMDATA:
+					mstore(tsk_mem.vmdata, NPROC_VMDATA,
+							buf);
+					break;
+				case NPROC_VMSTACK:
+					mstore(tsk_mem.vmstack, NPROC_VMSTACK,
+							buf);
+					break;
+				case NPROC_VMEXE:
+					mstore(tsk_mem.vmexe, NPROC_VMEXE, buf);
+					break;
+				case NPROC_VMLIB:
+					mstore(tsk_mem.vmlib, NPROC_VMLIB, buf);
+					break;
+			}
+			break;
+		case NPROC_JIFFIES:
+			mstore(get_jiffies_64(), NPROC_JIFFIES, buf);
+			break;
+		case NPROC_WCHAN:
+			mstore(get_wchan(tsk), NPROC_WCHAN, buf);
+			break;
+		case NPROC_NAME:
+			mstore(sizeof(tsk->comm), NPROC_TYPE_U32, buf);
+			strncpy(buf, tsk->comm, sizeof(tsk->comm));
+			buf += sizeof(tsk->comm);
+			break;
+		case NPROC_NOP_UL:
+			mstore(0, NPROC_TYPE_UL, buf);
+			break;
+		default:
+			pwarn("Unknown field ID %#x.\n", id);
+			goto err_inval;
+	}
+	return buf;
+err_inval:
+	return ERR_PTR(-EINVAL);
+}
+
+/*
+ * Build and send a netlink msg for one PID.
+ */
+static int nproc_pid_msg(struct nlmsghdr *nlh, u32 *fdata, u32 len, task_t *tsk)
+{
+	int i;
+	int err = 0;
+	struct sk_buff *skb2;
+	char *buf;
+	struct nlmsghdr *nlh2;
+	u32 fcnt, *fields;
+
+	fcnt = fdata[0];
+	fields = &fdata[1];
+
+	skb2 = nproc_alloc_nlmsg(nlh, len);
+	if (IS_ERR(skb2)) {
+		err = PTR_ERR(skb2);
+		goto out;
+	}
+	nlh2 = (struct nlmsghdr *)skb2->data;
+	buf = NLMSG_DATA(nlh2);
+
+	for (i = 0; i < fcnt; i++) {
+		buf = nproc_ps_field(fields[i], buf, tsk);
+		if (IS_ERR(buf)) {
+			err = PTR_ERR(buf);
+			goto out_free;
+		}
+	}
+	err = netlink_unicast(nproc_sock, skb2, nlh2->nlmsg_pid, 0);
+	if (err > 0)
+		err = 0;
+	return err;
+out_free:
+	kfree_skb(skb2);
+out:
+	return err;
+}
+
+/*
+ * Find task for given pid, grab task lock (caller must unlock).
+ */
+static task_t *nproc_ps_get_task(int pid)
+{
+	task_t *tsk;
+
+	read_lock(&tasklist_lock);
+	tsk = find_task_by_pid(pid);
+	if (tsk)
+		get_task_struct(tsk);
+	read_unlock(&tasklist_lock);
+	return tsk;
+}
+
+/*
+ * Iterate over a list of PIDs.
+ */
+static int nproc_ps_select_pid(struct nlmsghdr *nlh, u32 *fdata, u32 len, u32 left, u32 *sdata)
+{
+	int i;
+	int err = 0;
+	u32 tcnt;
+	u32 *pids;
+
+	if (left < sizeof(tcnt))
+		goto err_inval;
+	left -= sizeof(tcnt);
+
+	tcnt = sdata[0];
+
+	if (left < (tcnt * sizeof(u32)))
+		goto err_inval;
+	left -= tcnt * sizeof(u32);
+
+	if (left)
+		pwarn("%d bytes left.\n", left);
+
+	pids = &sdata[1];
+
+	for (i = 0; i < tcnt; i++) {
+		task_t *tsk;
+		tsk = nproc_ps_get_task(pids[i]);
+		if (!tsk)
+			continue;
+		err = nproc_pid_msg(nlh, fdata, len, tsk);
+		put_task_struct(tsk);
+		if (err)
+			goto out;
+	}
+
+out:
+	return err;
+
+err_inval:
+	return -EINVAL;
+}
+
+#define PIDMAP_ENTRIES (PID_MAX_LIMIT/PAGE_SIZE/8)
+#define BITS_PER_PAGE (PAGE_SIZE*8)
+
+/*
+ * Iterate over all PIDs.
+ */
+static int nproc_ps_select_all(struct nlmsghdr *nlh, u32 *fdata, u32 len)
+{
+	void *map;
+	int offset, i;
+	int err = 0;
+
+	for (i = 0; i < PIDMAP_ENTRIES; i++) {
+
+		map = get_pid_map(i);
+		if (!map)	/* done -- there are no holes in pidmap_array */
+			break;
+		if (IS_ERR(map))	/* No PIDs used in this map */
+			continue;
+		offset = 0;
+		for ( ; ; ) {
+			int pid;
+			task_t *tsk;
+			offset = find_next_bit(map, BITS_PER_PAGE, ++offset);
+			if (offset >= BITS_PER_PAGE)
+				break;
+			pid = offset + i * BITS_PER_PAGE;
+			tsk = nproc_ps_get_task(pid);
+			if (!tsk)
+				continue;
+			err = nproc_pid_msg(nlh, fdata, len, tsk);
+			put_task_struct(tsk);
+			if (err)
+				goto out;
+		}
+	}
+
+out:
+	return err;
+}
+
+static u32 __reply_size_special(u32 id)
+{
+	u32 len = 0;
+
+	switch (id) {
+		case NPROC_NAME:
+			len = sizeof(u32) +
+				sizeof(((struct task_struct*)0)->comm);
+			break;
+		default:
+			pwarn("Unknown field size in %#x.\n", id);
+	}
+	return len;
+}
+
+/*
+ * Calculates the size of a reply message payload. Alternatively, we could have
+ * the user space caller supply a number along with the request and bail
+ * out or realloc later if we find the allocation was too small. More
+ * responsibility in user space, but faster.
+ */
+static u32 *__reply_size (u32 *data, u32 *left, u32 *len)
+{
+	u32 *fields;
+	u32 fcnt;
+	int i;
+	*len = 0;
+
+	if (*left < sizeof(fcnt))
+		goto err_inval;
+	*left -= sizeof(fcnt);
+
+	fcnt = data[0];
+
+	if (*left < (fcnt * sizeof(u32)))
+		goto err_inval;
+	*left -= fcnt * sizeof(u32);
+
+	fields = &data[1];
+
+	for (i = 0; i < fcnt; i++) {
+		u32 id = fields[i];
+		u32 type = id & NPROC_TYPE_MASK;
+		pdebug("        %#8.8x.\n", fields[i]);
+		switch (type) {
+			case NPROC_TYPE_U32:
+				*len += sizeof(u32);
+				break;
+			case NPROC_TYPE_UL:
+				*len += sizeof(unsigned long);
+				break;
+			case NPROC_TYPE_U64:
+				*len += sizeof(u64);
+				break;
+			default: {		/* Special cases */
+				u32 slen;
+				slen = __reply_size_special(id);
+				if (slen)
+					*len += slen;
+				else
+					goto err_inval;
+			}
+		}
+	}
+
+	return &fields[fcnt];
+
+err_inval:
+	return ERR_PTR(-EINVAL);
+}
+
+/*
+ * Call the chosen process selector. Adding additional selectors
+ * (e.g. select by uid) is easy, but is there a need?
+ */
+static int nproc_get_ps(struct nlmsghdr *nlh, uid_t uid)
+{
+	int err;
+	u32 len;
+	u32 *data = NLMSG_DATA(nlh);
+	u32 *sdata;
+	u32 left = nlh->nlmsg_len - sizeof(*nlh);
+
+
+	sdata = __reply_size(data, &left, &len);
+	if (IS_ERR(sdata)) {
+		err = PTR_ERR(sdata);
+		goto out;
+	}
+
+	if (left < sizeof(u32))
+		goto err_inval;
+	left -= sizeof(u32);
+
+	switch (*sdata) {
+		case NPROC_SELECT_ALL:
+			if (left)
+				pwarn("%d bytes left.\n", left);
+			err = nproc_ps_select_all(nlh, data, len);
+			break;
+		case NPROC_SELECT_PID:
+			err = nproc_ps_select_pid(nlh, data, len,
+					left, sdata + 1);
+			break;
+		default:
+			pwarn("Unknown selection method %#x.\n", *sdata);
+			goto err_inval;
+	}
+
+out:
+	return err;
+
+err_inval:
+	return -EINVAL;
+}
+
+static char *nproc_global_field(u32 id, char *buf)
+{
+	struct page_state *ps = NULL;
+
+	switch (id) {
+		case NPROC_NR_DIRTY:
+		case NPROC_NR_WRITEBACK:
+		case NPROC_NR_UNSTABLE:
+		case NPROC_NR_PG_TABLE_PGS:
+		case NPROC_NR_MAPPED:
+		case NPROC_NR_SLAB:
+			if (!ps) {
+				ps = __vmstat();
+				if (IS_ERR(ps)) {	/* Just pass it on */
+					buf = (void *)ps;
+					ps = NULL;
+					goto out;
+				}
+			}
+			switch (id) {
+				case NPROC_NR_DIRTY:
+					mstore(ps->nr_dirty, NPROC_NR_DIRTY,
+							buf);
+					break;
+				case NPROC_NR_WRITEBACK:
+					mstore(ps->nr_writeback,
+							NPROC_NR_WRITEBACK,
+							buf);
+					break;
+				case NPROC_NR_UNSTABLE:
+					mstore(ps->nr_unstable,
+							NPROC_NR_UNSTABLE,
+							buf);
+					break;
+				case NPROC_NR_PG_TABLE_PGS:
+					mstore(ps->nr_page_table_pages,
+							NPROC_NR_PG_TABLE_PGS,
+							buf);
+					break;
+				case NPROC_NR_MAPPED:
+					mstore(ps->nr_mapped, NPROC_NR_MAPPED,
+							buf);
+					break;
+				case NPROC_NR_SLAB:
+					mstore(ps->nr_slab, NPROC_NR_SLAB, buf);
+					break;
+			}
+			break;
+		case NPROC_MEMFREE:
+			mstore(nr_free_pages(), NPROC_MEMFREE, buf);
+			break;
+		case NPROC_PAGESIZE:
+			mstore(PAGE_SIZE, NPROC_PAGESIZE, buf);
+			break;
+		case NPROC_JIFFIES:
+			mstore(get_jiffies_64(), NPROC_JIFFIES, buf);
+			break;
+		default:
+			pwarn("Unknown field ID %#x.\n", id);
+			buf = ERR_PTR(-EINVAL);
+			goto out;
+	}
+	kfree(ps);
+out:
+	return buf;
+}
+
+static int nproc_get_global(struct nlmsghdr *nlh)
+{
+	int err, i;
+	void *errp;
+	struct sk_buff *skb2;
+	char *buf;
+	u32 fcnt, len;
+	u32 *data = NLMSG_DATA(nlh);
+	u32 *fields;
+	u32 left = nlh->nlmsg_len - sizeof(*nlh);
+
+	errp = __reply_size(data, &left, &len);
+	if (IS_ERR(errp)) {
+		err = PTR_ERR(errp);
+		goto out;
+	}
+	if (left)
+		pwarn("%d bytes left.\n", left);
+
+	fcnt = data[0];
+	fields = &data[1];
+
+	skb2 = nproc_alloc_nlmsg(nlh, len);
+	if (IS_ERR(skb2)) {
+		err = PTR_ERR(skb2);
+		goto out;
+	}
+
+	buf = NLMSG_DATA((struct nlmsghdr *)skb2->data);
+
+	for (i = 0; i < fcnt; i++) {
+		buf = nproc_global_field(fields[i], buf);
+		if (IS_ERR(buf)) {
+			err = PTR_ERR(buf);
+			kfree_skb(skb2);
+			goto out;
+		}
+	}
+
+	err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0);
+	if (err > 0)
+		err = 0;
+out:
+	return err;
+}
+
+static int find_id(__u32 *data, __u32 *left)
+{
+	int i;
+	u32 id;
+
+	if (*left < sizeof(id))
+		goto err_inval;
+	*left -= sizeof(sizeof(id));
+
+	if (*left)
+		pwarn("%d bytes left.\n", *left);
+	id = data[1];
+
+	for (i = 0; i < ARRAY_SIZE(labels) && labels[i].id != id; i++)
+		;	/* Do nothing */
+
+	if (labels[i].id != id) {
+		pwarn("No matching label found for %#x.\n", id);
+		goto err_inval;
+	}
+
+	return i;
+
+err_inval:
+	return -EINVAL;
+}
+
+
+static int nproc_get_label(struct nlmsghdr *nlh)
+{
+	int err;
+	struct sk_buff *skb2;
+	const char *label;
+	char *buf;
+	int len;
+	u32 ltype;
+	u32 *data = NLMSG_DATA(nlh);
+	u32 left = nlh->nlmsg_len - sizeof(*nlh);
+
+	if (left < sizeof(ltype))
+		goto err_inval;
+	left -= sizeof(ltype);
+
+	ltype = data[0];
+
+	if (ltype == NPROC_LABEL_FIELD_NAME) {
+		int idx;
+		idx = find_id(data, &left);
+		if (idx < 0)
+			goto err_inval;
+		label = labels[idx].label;
+	}
+	else if (ltype == NPROC_LABEL_FIELD_UNIT) {
+		int idx;
+		idx = find_id(data, &left);
+		if (idx < 0)
+			goto err_inval;
+		label = labels[idx].unit;
+	}
+	else if (ltype == NPROC_LABEL_FIELD_FMT) {
+		int idx;
+		idx = find_id(data, &left);
+		if (idx < 0)
+			goto err_inval;
+		label = labels[idx].fmt;
+	}
+	else if (ltype == NPROC_LABEL_WCHAN) {
+		char *modname;
+		unsigned long wchan, size, offset;
+		char namebuf[128];
+
+		if (left < sizeof(unsigned long))
+			goto err_inval;
+		left -= sizeof(unsigned long);
+
+		if (left)
+			pwarn("%d bytes left.\n", left);
+
+		wchan = (unsigned long)data[1];
+		label = kallsyms_lookup(wchan, &size, &offset, &modname,
+				namebuf);
+
+		if (!label) {
+			pwarn("No ksym found for %#lx.\n", wchan);
+			goto err_inval;
+		}
+	}
+	else {
+		pwarn("Unknown label type %#x.\n", ltype);
+		goto err_inval;
+	}
+
+	len = strlen(label) + 1;
+
+	skb2 = nproc_alloc_nlmsg(nlh, len);
+	if (IS_ERR(skb2)) {
+		err = PTR_ERR(skb2);
+		goto out;
+	}
+
+	buf = NLMSG_DATA((struct nlmsghdr *)skb2->data);
+
+	strncpy(buf, label, len);
+
+	err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0);
+	if (err > 0)
+		err = 0;
+out:
+	return err;
+
+err_inval:
+	return -EINVAL;
+}
+
+static int nproc_get_list(struct nlmsghdr *nlh)
+{
+	int err, i, cnt, len;
+	struct sk_buff *skb2;
+	u32 *buf;
+
+	cnt = ARRAY_SIZE(labels);
+	len = (cnt + 1) * sizeof(u32);
+
+	skb2 = nproc_alloc_nlmsg(nlh, len);
+	if (IS_ERR(skb2)) {
+		err = PTR_ERR(skb2);
+		goto out;
+	}
+
+	buf = NLMSG_DATA((struct nlmsghdr *)skb2->data);
+	buf[0] = cnt;
+	for (i = 0; i < cnt; i++)
+		buf[i + 1] = labels[i].id;
+
+	err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0);
+	if (err > 0)
+		err = 0;
+out:
+	return err;
+}
+
+static __inline__ int nproc_process_msg(struct sk_buff *skb,
+		struct nlmsghdr *nlh)
+{
+	int err = 0;
+	uid_t uid;
+	kernel_cap_t caps;
+
+	if (!(nlh->nlmsg_flags & NLM_F_REQUEST))
+		goto out;
+
+	nlh->nlmsg_pid = NETLINK_CB(skb).pid;
+	uid = NETLINK_CB(skb).creds.uid;
+	caps = NETLINK_CB(skb).eff_cap;
+
+	switch (nlh->nlmsg_type) {
+		case NPROC_GET_FIELD_LIST:
+			err = nproc_get_list(nlh);
+			break;
+		case NPROC_GET_LABEL:
+			err = nproc_get_label(nlh);
+			break;
+		case NPROC_GET_GLOBAL:
+			err = nproc_get_global(nlh);
+			break;
+		case NPROC_GET_PS:
+			err = nproc_get_ps(nlh, uid);
+			break;
+		default:
+			pwarn("Unknown msg type %#x.\n", nlh->nlmsg_type);
+			err = -EINVAL;
+	}
+out:
+	return err;
+
+}
+
+static int nproc_receive_skb(struct sk_buff *skb)
+{
+	int err = 0;
+	struct nlmsghdr *nlh;
+
+	if (skb->len < NLMSG_LENGTH(0))
+		goto err_inval;
+
+	nlh = (struct nlmsghdr *)skb->data;
+	if (skb->len < nlh->nlmsg_len || nlh->nlmsg_len < sizeof(*nlh)){
+		pwarn("Invalid packet.\n");
+		goto err_inval;
+	}
+
+	err = nproc_process_msg(skb, nlh);
+	if (err || nlh->nlmsg_flags & NLM_F_ACK) {
+		pwarn("err %d, type %#x, flags %#x, seq %#x.\n", err,
+				nlh->nlmsg_type, nlh->nlmsg_flags,
+				nlh->nlmsg_seq);
+		netlink_ack(skb, nlh, err);
+	}
+
+	return err;
+
+err_inval:
+	return -EINVAL;
+}
+
+static void nproc_receive(struct sock *sk, int len)
+{
+	struct sk_buff *skb;
+
+	while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) {
+		nproc_receive_skb(skb);
+		kfree_skb(skb);
+	}
+}
+
+static int nproc_init(void)
+{
+	nproc_sock = netlink_kernel_create(NETLINK_NPROC, nproc_receive);
+
+	if (!nproc_sock) {
+		pwarn("No netlink socket for nproc.\n");
+		return -ENODEV;
+	}
+
+	return 0;
+}
+
+module_init(nproc_init);
Index: mm4-2.6.9-rc1/kernel/pid.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/pid.c	2004-09-08 06:10:54.000000000 -0700
+++ mm4-2.6.9-rc1/kernel/pid.c	2004-09-08 17:45:27.504564546 -0700
@@ -148,6 +148,17 @@
 	return -1;
 }
 
+void *get_pid_map(int idx)
+{
+	pidmap_t *map = pidmap_array + idx;
+	if (!map->page)
+		return NULL;
+	else if (atomic_read(&map->nr_free) == BITS_PER_PAGE)
+		return ERR_PTR(-1);
+	else
+		return map->page;
+}
+
 struct pid * fastcall find_pid(enum pid_type type, int nr)
 {
 	struct hlist_node *elem;

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [1/2] rediff nproc v2 vs. 2.6.9-rc1-mm4
  2004-09-09  1:15       ` William Lee Irwin III
@ 2004-09-09  1:17         ` William Lee Irwin III
  2004-09-09  1:21           ` [2/2] handle CONFIG_MMU=n and use new vm stats for CONFIG_MMU=y William Lee Irwin III
  0 siblings, 1 reply; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-09  1:17 UTC (permalink / raw)
  To: Roger Luethi; +Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Wed, Sep 08, 2004 at 05:35:29PM -0700, William Lee Irwin III wrote:
>>> Any chance you could convert these to use the new vm statistics
>>> accounting?

On Wed, Sep 08, 2004 at 05:43:20PM -0700, William Lee Irwin III wrote:
>> Hmm, there's a more serious issue; CONFIG_MMU=n will barf on these.
>> For that you will need to #ifdef on CONFIG_MMU and use the methods
>> in fs/proc/task_nommu.c and so on.

On Wed, Sep 08, 2004 at 06:15:49PM -0700, William Lee Irwin III wrote:
> This is a straight rediff of nproc vs. 2.6.9-rc1-mm4. No changes
> whatsoever to the underlying code were made; rather, this merely
> resolves offsets so it applies cleanly.
> Compiletested on ia64.

Repost with appropriate Subject: line.


-- wli

Index: mm4-2.6.9-rc1/include/linux/netlink.h
===================================================================
--- mm4-2.6.9-rc1.orig/include/linux/netlink.h	2004-09-08 06:10:50.000000000 -0700
+++ mm4-2.6.9-rc1/include/linux/netlink.h	2004-09-08 17:45:27.500658296 -0700
@@ -15,6 +15,7 @@
 #define NETLINK_ARPD		8
 #define NETLINK_AUDIT		9	/* auditing */
 #define NETLINK_ROUTE6		11	/* af_inet6 route comm channel */
+#define NETLINK_NPROC		12	/* /proc information */
 #define NETLINK_IP6_FW		13
 #define NETLINK_DNRTMSG		14	/* DECnet routing messages */
 #define NETLINK_KEVENT		15	/* Kernel messages to userspace */
Index: mm4-2.6.9-rc1/include/linux/nproc.h
===================================================================
--- mm4-2.6.9-rc1.orig/include/linux/nproc.h	2004-04-25 12:31:02.000000000 -0700
+++ mm4-2.6.9-rc1/include/linux/nproc.h	2004-09-08 17:45:27.501634858 -0700
@@ -0,0 +1,119 @@
+#ifndef _LINUX_NPROC_H
+#define _LINUX_NPROC_H
+
+#include <linux/config.h>
+
+#ifndef __KERNEL__
+#define CONFIG_NPROC
+#endif
+
+#ifdef CONFIG_NPROC
+
+/* Request types */
+#define NPROC_BASE		0x10
+#define NPROC_GET_FIELD_LIST	(NPROC_BASE+0)
+#define NPROC_GET_LABEL		(NPROC_BASE+1)
+#define NPROC_GET_GLOBAL	(NPROC_BASE+2)
+#define NPROC_GET_PS		(NPROC_BASE+3)
+#define NPROC_GET_PID_LIST	(NPROC_BASE+4)
+
+/* Request flags */
+
+
+/* Field scopes */
+#define NPROC_SCOPE_MASK	0x70000000
+#define NPROC_SCOPE_GLOBAL	0x10000000	/* Global w/o arguments */
+#define NPROC_SCOPE_PROCESS	0x20000000
+#define NPROC_SCOPE_LABEL	0x30000000
+
+/* Data types */
+#define NPROC_TYPE_MASK		0x07000000
+#define NPROC_TYPE_STRING	0x01000000
+#define NPROC_TYPE_U32		0x02000000
+#define NPROC_TYPE_UL		0x03000000
+#define NPROC_TYPE_U64		0x04000000
+
+/* Access control (unused) */
+#define NPROC_PERM_MASK		0x00300000
+#define NPROC_PERM_USER		0x00100000
+#define NPROC_PERM_ROOT		0x00200000
+
+/* Selectors */
+#define NPROC_SELECT_ALL	0x00000001
+#define NPROC_SELECT_PID	0x00000002
+#define NPROC_SELECT_UID	0x00000003
+
+/* Labels */
+#define NPROC_LABEL_FIELD_NAME	0x00000001
+#define NPROC_LABEL_FIELD_FMT	0x00000002
+#define NPROC_LABEL_FIELD_UNIT	0x00000003
+#define NPROC_LABEL_WCHAN	0x00000004
+
+/* Field IDs (unique key in bits 0 - 15) */
+#define NPROC_NOP_UL		(0x00000020 | NPROC_TYPE_UL)
+#define NPROC_PID		(0x00000001 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_NAME		(0x00000002 | NPROC_TYPE_STRING | NPROC_SCOPE_PROCESS)
+/* Amount of free memory (pages) */
+#define NPROC_MEMFREE		(0x00000004 | NPROC_TYPE_U32    | NPROC_SCOPE_GLOBAL)
+/* Size of a page (bytes) */
+#define NPROC_PAGESIZE		(0x00000005 | NPROC_TYPE_U32    | NPROC_SCOPE_GLOBAL)
+/* There's no guarantee about anything with jiffies. Still useful for some. */
+#define NPROC_JIFFIES		(0x00000006 | NPROC_TYPE_U64    | NPROC_SCOPE_GLOBAL)
+/* Process: VM size (KiB) */
+#define NPROC_VMSIZE		(0x00000010 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+/* Process: locked memory (KiB) */
+#define NPROC_VMLOCK		(0x00000011 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+/* Process: Memory resident size (KiB) */
+#define NPROC_VMRSS		(0x00000012 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_VMDATA		(0x00000013 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_VMSTACK		(0x00000014 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_VMEXE		(0x00000015 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_VMLIB		(0x00000016 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_UID		(0x00000018 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_NR_DIRTY		(0x00000051 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_WRITEBACK	(0x00000052 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_UNSTABLE	(0x00000053 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_PG_TABLE_PGS	(0x00000054 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_MAPPED		(0x00000055 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_SLAB		(0x00000056 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_WCHAN		(0x00000080 | NPROC_TYPE_UL     | NPROC_SCOPE_PROCESS)
+#define NPROC_WCHAN_NAME	(0x00000081 | NPROC_TYPE_STRING)
+
+#ifdef __KERNEL__
+struct nproc_field {
+	__u32 id;
+	const char *label;
+	const char *fmt;
+	const char *unit;
+};
+
+static struct nproc_field labels[] = {
+	{ NPROC_PID,			"PID",		"%5u",	"" },
+	{ NPROC_NAME,			"Name",		"%-15s","" },
+	{ NPROC_MEMFREE,		"MemFree",	"%8u",	"page" },
+	{ NPROC_PAGESIZE,		"PageSize",	"%4u",	"byte" },
+	{ NPROC_JIFFIES,		"Jiffies",	"%10u",	"" },
+	{ NPROC_VMSIZE,			"VmSize",	"%8u",	"KiB" },
+	{ NPROC_VMLOCK,			"VmLock",	"%8u",	"KiB" },
+	{ NPROC_VMRSS,			"VmRSS",	"%8u",	"KiB" },
+	{ NPROC_VMDATA,			"VmData",	"%8u",	"KiB" },
+	{ NPROC_VMSTACK,		"VmStack",	"%8u",	"KiB" },
+	{ NPROC_VMEXE,			"VmExe",	"%8u",	"KiB" },
+	{ NPROC_VMLIB,			"VmLib",	"%8u",	"KiB" },
+	{ NPROC_UID,			"UID",		"%5u",	"" },
+	{ NPROC_NR_DIRTY,		"nr_dirty",	"%8d",	"page" },
+	{ NPROC_NR_WRITEBACK,		"nr_writeback",	"%8u",	"page" },
+	{ NPROC_NR_UNSTABLE,		"nr_unstable",	"%8u",	"page" },
+	{ NPROC_NR_PG_TABLE_PGS,	"nr_page_table_pages",	"%8u", "page" },
+	{ NPROC_NR_MAPPED,		"nr_mapped",	"%8u",	"page" },
+	{ NPROC_NR_SLAB,		"nr_slab",	"%8u",	"page" },
+	{ NPROC_WCHAN,			"wchan",	"%p",	"" },
+#ifdef CONFIG_KALLSYMS
+	{ NPROC_WCHAN_NAME,		"wchan_symbol",	"%s"},
+#endif
+};
+#endif /* __KERNEL__ */
+
+#endif /* CONFIG_NPROC */
+
+#endif /* _LINUX_NPROC_H */
Index: mm4-2.6.9-rc1/include/linux/pid.h
===================================================================
--- mm4-2.6.9-rc1.orig/include/linux/pid.h	2004-09-08 06:10:36.000000000 -0700
+++ mm4-2.6.9-rc1/include/linux/pid.h	2004-09-08 17:45:27.501634858 -0700
@@ -37,6 +37,7 @@
 extern struct pid *FASTCALL(find_pid(enum pid_type, int));
 
 extern int alloc_pidmap(void);
+extern void *get_pid_map(int);
 extern void FASTCALL(free_pidmap(int));
 extern void switch_exec_pids(struct task_struct *leader, struct task_struct *thread);
 
Index: mm4-2.6.9-rc1/init/Kconfig
===================================================================
--- mm4-2.6.9-rc1.orig/init/Kconfig	2004-09-08 06:10:50.000000000 -0700
+++ mm4-2.6.9-rc1/init/Kconfig	2004-09-08 17:45:27.504564546 -0700
@@ -139,6 +139,13 @@
 	  building a kernel for install/rescue disks or your system is very
 	  limited in memory.
 
+config NPROC
+	bool "Netlink interface to /proc information"
+	depends on PROC_FS && EXPERIMENTAL
+	default y
+	help
+	  Nproc is a netlink interface to /proc information.
+
 config AUDIT
 	bool "Auditing support"
 	default y if SECURITY_SELINUX
Index: mm4-2.6.9-rc1/kernel/Makefile
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/Makefile	2004-09-08 06:10:50.000000000 -0700
+++ mm4-2.6.9-rc1/kernel/Makefile	2004-09-08 17:45:27.501634858 -0700
@@ -15,6 +15,7 @@
 obj-$(CONFIG_UID16) += uid16.o
 obj-$(CONFIG_MODULES) += module.o
 obj-$(CONFIG_KALLSYMS) += kallsyms.o
+obj-$(CONFIG_NPROC) += nproc.o
 obj-$(CONFIG_PM) += power/
 obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
 obj-$(CONFIG_KEXEC) += kexec.o
Index: mm4-2.6.9-rc1/kernel/nproc.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/nproc.c	2004-04-25 12:31:02.000000000 -0700
+++ mm4-2.6.9-rc1/kernel/nproc.c	2004-09-08 17:45:27.503587983 -0700
@@ -0,0 +1,851 @@
+/*
+ * nproc.c
+ *
+ * netlink interface to /proc information.
+ */
+
+#include <linux/skbuff.h>
+#include <net/sock.h>
+#include <linux/swap.h>		/* nr_free_pages() */
+#include <linux/kallsyms.h>	/* kallsyms_lookup() */
+#include <linux/pid.h>		/* get_pid_map() */
+#include <linux/nproc.h>
+#include <asm/bitops.h>
+
+//#define DEBUG
+
+/* There must be like 5 million dprintk definitions, so let's add some more */
+#ifdef DEBUG
+#define pdebug(x,args...) printk(KERN_DEBUG "%s:%d " x, __func__ , __LINE__, ##args)
+#define pwarn(x,args...) printk(KERN_WARNING "%s:%d " x, __func__ , __LINE__, ##args)
+#else
+#define pdebug(x,args...)
+#define pwarn(x,args...)
+#endif
+
+#define perror(x,args...) printk(KERN_ERR "%s:%d " x, __func__ , __LINE__, ##args)
+
+static struct sock *nproc_sock = NULL;
+
+struct task_mem {
+	u32	vmdata;
+	u32	vmstack;
+	u32	vmexe;
+	u32	vmlib;
+};
+
+struct task_mem_cheap {
+	u32	vmsize;
+	u32	vmlock;
+	u32	vmrss;
+};
+
+/*
+ * __task_mem/__task_mem_cheap basically duplicate the MMU version of
+ * task_mem, but they are split by cost and work on structs.
+ */
+
+static void __task_mem(struct task_struct *tsk, struct task_mem *res)
+{
+	struct mm_struct *mm = get_task_mm(tsk);
+	if (mm) {
+		unsigned long data = 0, stack = 0, exec = 0, lib = 0;
+		struct vm_area_struct *vma;
+
+		down_read(&mm->mmap_sem);
+		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			unsigned long len = (vma->vm_end - vma->vm_start) >> 10;
+			if (!vma->vm_file) {
+				data += len;
+				if (vma->vm_flags & VM_GROWSDOWN)
+					stack += len;
+				continue;
+			}
+			if (vma->vm_flags & VM_WRITE)
+				continue;
+			if (vma->vm_flags & VM_EXEC) {
+				exec += len;
+				if (vma->vm_flags & VM_EXECUTABLE)
+					continue;
+				lib += len;
+			}
+		}
+		res->vmdata = data - stack;
+		res->vmstack = stack;
+		res->vmexe = exec - lib;
+		res->vmlib = lib;
+		up_read(&mm->mmap_sem);
+
+		mmput(mm);
+	} else {
+		res->vmdata = 0;
+		res->vmstack = 0;
+		res->vmexe = 0;
+		res->vmlib = 0;
+	}
+}
+
+static void __task_mem_cheap(struct task_struct *tsk, struct task_mem_cheap *res)
+{
+	struct mm_struct *mm = get_task_mm(tsk);
+	if (mm) {
+		res->vmsize = mm->total_vm << (PAGE_SHIFT-10);
+		res->vmlock = mm->locked_vm << (PAGE_SHIFT-10);
+		res->vmrss = mm->rss << (PAGE_SHIFT-10);
+		mmput(mm);
+	} else {
+		res->vmsize = 0;
+		res->vmlock = 0;
+		res->vmrss = 0;
+	}
+}
+
+/*
+ * page_alloc.c already has an extra function broken out to fill a
+ * struct with information. Cool. Not sure whether pgpgin/pgpgout
+ * should be left as is or nailed down as kbytes.
+ */
+static struct page_state *__vmstat(void)
+{
+	struct page_state *ps;
+	ps = kmalloc(sizeof(*ps), GFP_KERNEL);
+	if (!ps)
+		return ERR_PTR(-ENOMEM);
+	get_full_page_state(ps);
+	ps->pgpgin /= 2;	/* sectors -> kbytes */
+	ps->pgpgout /= 2;
+	return ps;
+}
+
+/*
+ * Allocate and prefill an skb. The nlmsghdr provided to the function
+ * is a pointer to the respective struct in the request message.
+ */
+static struct sk_buff *nproc_alloc_nlmsg(struct nlmsghdr *nlh, u32 len)
+{
+	__u32 seq = nlh->nlmsg_seq;
+	__u16 type = nlh->nlmsg_type;
+	__u32 pid = nlh->nlmsg_pid;
+	struct sk_buff *skb2 = 0;
+
+	skb2 = alloc_skb(NLMSG_SPACE(len), GFP_KERNEL);
+	if (!skb2) {
+		skb2 = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+
+	NLMSG_PUT(skb2, pid, seq, type, NLMSG_ALIGN(len));
+out:
+	return skb2;
+
+nlmsg_failure:				/* Used by NLMSG_PUT */
+	kfree_skb(skb2);
+	return NULL;
+}
+
+#define mstore(value, id, buf)						\
+({									\
+	u32 _type = id & NPROC_TYPE_MASK;				\
+	switch (_type) {						\
+		case NPROC_TYPE_U32: {					\
+			__u32 *p = (u32 *)buf;				\
+			*p = value;					\
+			buf = (char *)++p;				\
+			break;						\
+		}							\
+		case NPROC_TYPE_UL: {					\
+			unsigned long *p = (unsigned long *)buf;	\
+			*p = value;					\
+			buf = (char *)++p;				\
+			break;						\
+		}							\
+		case NPROC_TYPE_U64: {					\
+			__u64 *p = (u64 *)buf;				\
+			*p = value;					\
+			buf = (char *)++p;				\
+			break;						\
+		}							\
+		default:						\
+			perror("Huh? Bad type!\n");			\
+	}								\
+})
+
+static char *nproc_ps_field(u32 id, char *buf, task_t *tsk)
+{
+	struct task_mem tsk_mem;
+	struct task_mem_cheap tsk_mem_cheap;
+
+	tsk_mem.vmdata = (~0);
+	tsk_mem_cheap.vmsize = (~0);
+
+	switch (id) {
+		case NPROC_PID:
+			mstore(tsk->pid, NPROC_PID, buf);
+			break;
+		case NPROC_UID:
+			mstore(tsk->uid, NPROC_UID, buf);
+			break;
+		case NPROC_VMSIZE:
+		case NPROC_VMLOCK:
+		case NPROC_VMRSS:
+			if (tsk_mem_cheap.vmsize == (~0))
+				__task_mem_cheap(tsk, &tsk_mem_cheap);
+
+			switch (id) {
+				case NPROC_VMSIZE:
+					mstore(tsk_mem_cheap.vmsize,
+							NPROC_VMSIZE, buf);
+					break;
+				case NPROC_VMLOCK:
+					mstore(tsk_mem_cheap.vmlock,
+							NPROC_VMLOCK, buf);
+					break;
+				case NPROC_VMRSS:
+					mstore(tsk_mem_cheap.vmrss,
+							NPROC_VMRSS, buf);
+					break;
+			}
+			break;
+		case NPROC_VMDATA:
+		case NPROC_VMSTACK:
+		case NPROC_VMEXE:
+		case NPROC_VMLIB:
+			if (tsk_mem.vmdata == (~0))
+					__task_mem(tsk, &tsk_mem);
+
+			switch (id) {
+				case NPROC_VMDATA:
+					mstore(tsk_mem.vmdata, NPROC_VMDATA,
+							buf);
+					break;
+				case NPROC_VMSTACK:
+					mstore(tsk_mem.vmstack, NPROC_VMSTACK,
+							buf);
+					break;
+				case NPROC_VMEXE:
+					mstore(tsk_mem.vmexe, NPROC_VMEXE, buf);
+					break;
+				case NPROC_VMLIB:
+					mstore(tsk_mem.vmlib, NPROC_VMLIB, buf);
+					break;
+			}
+			break;
+		case NPROC_JIFFIES:
+			mstore(get_jiffies_64(), NPROC_JIFFIES, buf);
+			break;
+		case NPROC_WCHAN:
+			mstore(get_wchan(tsk), NPROC_WCHAN, buf);
+			break;
+		case NPROC_NAME:
+			mstore(sizeof(tsk->comm), NPROC_TYPE_U32, buf);
+			strncpy(buf, tsk->comm, sizeof(tsk->comm));
+			buf += sizeof(tsk->comm);
+			break;
+		case NPROC_NOP_UL:
+			mstore(0, NPROC_TYPE_UL, buf);
+			break;
+		default:
+			pwarn("Unknown field ID %#x.\n", id);
+			goto err_inval;
+	}
+	return buf;
+err_inval:
+	return ERR_PTR(-EINVAL);
+}
+
+/*
+ * Build and send a netlink msg for one PID.
+ */
+static int nproc_pid_msg(struct nlmsghdr *nlh, u32 *fdata, u32 len, task_t *tsk)
+{
+	int i;
+	int err = 0;
+	struct sk_buff *skb2;
+	char *buf;
+	struct nlmsghdr *nlh2;
+	u32 fcnt, *fields;
+
+	fcnt = fdata[0];
+	fields = &fdata[1];
+
+	skb2 = nproc_alloc_nlmsg(nlh, len);
+	if (IS_ERR(skb2)) {
+		err = PTR_ERR(skb2);
+		goto out;
+	}
+	nlh2 = (struct nlmsghdr *)skb2->data;
+	buf = NLMSG_DATA(nlh2);
+
+	for (i = 0; i < fcnt; i++) {
+		buf = nproc_ps_field(fields[i], buf, tsk);
+		if (IS_ERR(buf)) {
+			err = PTR_ERR(buf);
+			goto out_free;
+		}
+	}
+	err = netlink_unicast(nproc_sock, skb2, nlh2->nlmsg_pid, 0);
+	if (err > 0)
+		err = 0;
+	return err;
+out_free:
+	kfree_skb(skb2);
+out:
+	return err;
+}
+
+/*
+ * Find task for given pid, grab task lock (caller must unlock).
+ */
+static task_t *nproc_ps_get_task(int pid)
+{
+	task_t *tsk;
+
+	read_lock(&tasklist_lock);
+	tsk = find_task_by_pid(pid);
+	if (tsk)
+		get_task_struct(tsk);
+	read_unlock(&tasklist_lock);
+	return tsk;
+}
+
+/*
+ * Iterate over a list of PIDs.
+ */
+static int nproc_ps_select_pid(struct nlmsghdr *nlh, u32 *fdata, u32 len, u32 left, u32 *sdata)
+{
+	int i;
+	int err = 0;
+	u32 tcnt;
+	u32 *pids;
+
+	if (left < sizeof(tcnt))
+		goto err_inval;
+	left -= sizeof(tcnt);
+
+	tcnt = sdata[0];
+
+	if (left < (tcnt * sizeof(u32)))
+		goto err_inval;
+	left -= tcnt * sizeof(u32);
+
+	if (left)
+		pwarn("%d bytes left.\n", left);
+
+	pids = &sdata[1];
+
+	for (i = 0; i < tcnt; i++) {
+		task_t *tsk;
+		tsk = nproc_ps_get_task(pids[i]);
+		if (!tsk)
+			continue;
+		err = nproc_pid_msg(nlh, fdata, len, tsk);
+		put_task_struct(tsk);
+		if (err)
+			goto out;
+	}
+
+out:
+	return err;
+
+err_inval:
+	return -EINVAL;
+}
+
+#define PIDMAP_ENTRIES (PID_MAX_LIMIT/PAGE_SIZE/8)
+#define BITS_PER_PAGE (PAGE_SIZE*8)
+
+/*
+ * Iterate over all PIDs.
+ */
+static int nproc_ps_select_all(struct nlmsghdr *nlh, u32 *fdata, u32 len)
+{
+	void *map;
+	int offset, i;
+	int err = 0;
+
+	for (i = 0; i < PIDMAP_ENTRIES; i++) {
+
+		map = get_pid_map(i);
+		if (!map)	/* done -- there are no holes in pidmap_array */
+			break;
+		if (IS_ERR(map))	/* No PIDs used in this map */
+			continue;
+		offset = 0;
+		for ( ; ; ) {
+			int pid;
+			task_t *tsk;
+			offset = find_next_bit(map, BITS_PER_PAGE, ++offset);
+			if (offset >= BITS_PER_PAGE)
+				break;
+			pid = offset + i * BITS_PER_PAGE;
+			tsk = nproc_ps_get_task(pid);
+			if (!tsk)
+				continue;
+			err = nproc_pid_msg(nlh, fdata, len, tsk);
+			put_task_struct(tsk);
+			if (err)
+				goto out;
+		}
+	}
+
+out:
+	return err;
+}
+
+static u32 __reply_size_special(u32 id)
+{
+	u32 len = 0;
+
+	switch (id) {
+		case NPROC_NAME:
+			len = sizeof(u32) +
+				sizeof(((struct task_struct*)0)->comm);
+			break;
+		default:
+			pwarn("Unknown field size in %#x.\n", id);
+	}
+	return len;
+}
+
+/*
+ * Calculates the size of a reply message payload. Alternatively, we could have
+ * the user space caller supply a number along with the request and bail
+ * out or realloc later if we find the allocation was too small. More
+ * responsibility in user space, but faster.
+ */
+static u32 *__reply_size (u32 *data, u32 *left, u32 *len)
+{
+	u32 *fields;
+	u32 fcnt;
+	int i;
+	*len = 0;
+
+	if (*left < sizeof(fcnt))
+		goto err_inval;
+	*left -= sizeof(fcnt);
+
+	fcnt = data[0];
+
+	if (*left < (fcnt * sizeof(u32)))
+		goto err_inval;
+	*left -= fcnt * sizeof(u32);
+
+	fields = &data[1];
+
+	for (i = 0; i < fcnt; i++) {
+		u32 id = fields[i];
+		u32 type = id & NPROC_TYPE_MASK;
+		pdebug("        %#8.8x.\n", fields[i]);
+		switch (type) {
+			case NPROC_TYPE_U32:
+				*len += sizeof(u32);
+				break;
+			case NPROC_TYPE_UL:
+				*len += sizeof(unsigned long);
+				break;
+			case NPROC_TYPE_U64:
+				*len += sizeof(u64);
+				break;
+			default: {		/* Special cases */
+				u32 slen;
+				slen = __reply_size_special(id);
+				if (slen)
+					*len += slen;
+				else
+					goto err_inval;
+			}
+		}
+	}
+
+	return &fields[fcnt];
+
+err_inval:
+	return ERR_PTR(-EINVAL);
+}
+
+/*
+ * Call the chosen process selector. Adding additional selectors
+ * (e.g. select by uid) is easy, but is there a need?
+ */
+static int nproc_get_ps(struct nlmsghdr *nlh, uid_t uid)
+{
+	int err;
+	u32 len;
+	u32 *data = NLMSG_DATA(nlh);
+	u32 *sdata;
+	u32 left = nlh->nlmsg_len - sizeof(*nlh);
+
+
+	sdata = __reply_size(data, &left, &len);
+	if (IS_ERR(sdata)) {
+		err = PTR_ERR(sdata);
+		goto out;
+	}
+
+	if (left < sizeof(u32))
+		goto err_inval;
+	left -= sizeof(u32);
+
+	switch (*sdata) {
+		case NPROC_SELECT_ALL:
+			if (left)
+				pwarn("%d bytes left.\n", left);
+			err = nproc_ps_select_all(nlh, data, len);
+			break;
+		case NPROC_SELECT_PID:
+			err = nproc_ps_select_pid(nlh, data, len,
+					left, sdata + 1);
+			break;
+		default:
+			pwarn("Unknown selection method %#x.\n", *sdata);
+			goto err_inval;
+	}
+
+out:
+	return err;
+
+err_inval:
+	return -EINVAL;
+}
+
+static char *nproc_global_field(u32 id, char *buf)
+{
+	struct page_state *ps = NULL;
+
+	switch (id) {
+		case NPROC_NR_DIRTY:
+		case NPROC_NR_WRITEBACK:
+		case NPROC_NR_UNSTABLE:
+		case NPROC_NR_PG_TABLE_PGS:
+		case NPROC_NR_MAPPED:
+		case NPROC_NR_SLAB:
+			if (!ps) {
+				ps = __vmstat();
+				if (IS_ERR(ps)) {	/* Just pass it on */
+					buf = (void *)ps;
+					ps = NULL;
+					goto out;
+				}
+			}
+			switch (id) {
+				case NPROC_NR_DIRTY:
+					mstore(ps->nr_dirty, NPROC_NR_DIRTY,
+							buf);
+					break;
+				case NPROC_NR_WRITEBACK:
+					mstore(ps->nr_writeback,
+							NPROC_NR_WRITEBACK,
+							buf);
+					break;
+				case NPROC_NR_UNSTABLE:
+					mstore(ps->nr_unstable,
+							NPROC_NR_UNSTABLE,
+							buf);
+					break;
+				case NPROC_NR_PG_TABLE_PGS:
+					mstore(ps->nr_page_table_pages,
+							NPROC_NR_PG_TABLE_PGS,
+							buf);
+					break;
+				case NPROC_NR_MAPPED:
+					mstore(ps->nr_mapped, NPROC_NR_MAPPED,
+							buf);
+					break;
+				case NPROC_NR_SLAB:
+					mstore(ps->nr_slab, NPROC_NR_SLAB, buf);
+					break;
+			}
+			break;
+		case NPROC_MEMFREE:
+			mstore(nr_free_pages(), NPROC_MEMFREE, buf);
+			break;
+		case NPROC_PAGESIZE:
+			mstore(PAGE_SIZE, NPROC_PAGESIZE, buf);
+			break;
+		case NPROC_JIFFIES:
+			mstore(get_jiffies_64(), NPROC_JIFFIES, buf);
+			break;
+		default:
+			pwarn("Unknown field ID %#x.\n", id);
+			buf = ERR_PTR(-EINVAL);
+			goto out;
+	}
+	kfree(ps);
+out:
+	return buf;
+}
+
+static int nproc_get_global(struct nlmsghdr *nlh)
+{
+	int err, i;
+	void *errp;
+	struct sk_buff *skb2;
+	char *buf;
+	u32 fcnt, len;
+	u32 *data = NLMSG_DATA(nlh);
+	u32 *fields;
+	u32 left = nlh->nlmsg_len - sizeof(*nlh);
+
+	errp = __reply_size(data, &left, &len);
+	if (IS_ERR(errp)) {
+		err = PTR_ERR(errp);
+		goto out;
+	}
+	if (left)
+		pwarn("%d bytes left.\n", left);
+
+	fcnt = data[0];
+	fields = &data[1];
+
+	skb2 = nproc_alloc_nlmsg(nlh, len);
+	if (IS_ERR(skb2)) {
+		err = PTR_ERR(skb2);
+		goto out;
+	}
+
+	buf = NLMSG_DATA((struct nlmsghdr *)skb2->data);
+
+	for (i = 0; i < fcnt; i++) {
+		buf = nproc_global_field(fields[i], buf);
+		if (IS_ERR(buf)) {
+			err = PTR_ERR(buf);
+			kfree_skb(skb2);
+			goto out;
+		}
+	}
+
+	err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0);
+	if (err > 0)
+		err = 0;
+out:
+	return err;
+}
+
+static int find_id(__u32 *data, __u32 *left)
+{
+	int i;
+	u32 id;
+
+	if (*left < sizeof(id))
+		goto err_inval;
+	*left -= sizeof(sizeof(id));
+
+	if (*left)
+		pwarn("%d bytes left.\n", *left);
+	id = data[1];
+
+	for (i = 0; i < ARRAY_SIZE(labels) && labels[i].id != id; i++)
+		;	/* Do nothing */
+
+	if (labels[i].id != id) {
+		pwarn("No matching label found for %#x.\n", id);
+		goto err_inval;
+	}
+
+	return i;
+
+err_inval:
+	return -EINVAL;
+}
+
+
+static int nproc_get_label(struct nlmsghdr *nlh)
+{
+	int err;
+	struct sk_buff *skb2;
+	const char *label;
+	char *buf;
+	int len;
+	u32 ltype;
+	u32 *data = NLMSG_DATA(nlh);
+	u32 left = nlh->nlmsg_len - sizeof(*nlh);
+
+	if (left < sizeof(ltype))
+		goto err_inval;
+	left -= sizeof(ltype);
+
+	ltype = data[0];
+
+	if (ltype == NPROC_LABEL_FIELD_NAME) {
+		int idx;
+		idx = find_id(data, &left);
+		if (idx < 0)
+			goto err_inval;
+		label = labels[idx].label;
+	}
+	else if (ltype == NPROC_LABEL_FIELD_UNIT) {
+		int idx;
+		idx = find_id(data, &left);
+		if (idx < 0)
+			goto err_inval;
+		label = labels[idx].unit;
+	}
+	else if (ltype == NPROC_LABEL_FIELD_FMT) {
+		int idx;
+		idx = find_id(data, &left);
+		if (idx < 0)
+			goto err_inval;
+		label = labels[idx].fmt;
+	}
+	else if (ltype == NPROC_LABEL_WCHAN) {
+		char *modname;
+		unsigned long wchan, size, offset;
+		char namebuf[128];
+
+		if (left < sizeof(unsigned long))
+			goto err_inval;
+		left -= sizeof(unsigned long);
+
+		if (left)
+			pwarn("%d bytes left.\n", left);
+
+		wchan = (unsigned long)data[1];
+		label = kallsyms_lookup(wchan, &size, &offset, &modname,
+				namebuf);
+
+		if (!label) {
+			pwarn("No ksym found for %#lx.\n", wchan);
+			goto err_inval;
+		}
+	}
+	else {
+		pwarn("Unknown label type %#x.\n", ltype);
+		goto err_inval;
+	}
+
+	len = strlen(label) + 1;
+
+	skb2 = nproc_alloc_nlmsg(nlh, len);
+	if (IS_ERR(skb2)) {
+		err = PTR_ERR(skb2);
+		goto out;
+	}
+
+	buf = NLMSG_DATA((struct nlmsghdr *)skb2->data);
+
+	strncpy(buf, label, len);
+
+	err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0);
+	if (err > 0)
+		err = 0;
+out:
+	return err;
+
+err_inval:
+	return -EINVAL;
+}
+
+static int nproc_get_list(struct nlmsghdr *nlh)
+{
+	int err, i, cnt, len;
+	struct sk_buff *skb2;
+	u32 *buf;
+
+	cnt = ARRAY_SIZE(labels);
+	len = (cnt + 1) * sizeof(u32);
+
+	skb2 = nproc_alloc_nlmsg(nlh, len);
+	if (IS_ERR(skb2)) {
+		err = PTR_ERR(skb2);
+		goto out;
+	}
+
+	buf = NLMSG_DATA((struct nlmsghdr *)skb2->data);
+	buf[0] = cnt;
+	for (i = 0; i < cnt; i++)
+		buf[i + 1] = labels[i].id;
+
+	err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0);
+	if (err > 0)
+		err = 0;
+out:
+	return err;
+}
+
+static __inline__ int nproc_process_msg(struct sk_buff *skb,
+		struct nlmsghdr *nlh)
+{
+	int err = 0;
+	uid_t uid;
+	kernel_cap_t caps;
+
+	if (!(nlh->nlmsg_flags & NLM_F_REQUEST))
+		goto out;
+
+	nlh->nlmsg_pid = NETLINK_CB(skb).pid;
+	uid = NETLINK_CB(skb).creds.uid;
+	caps = NETLINK_CB(skb).eff_cap;
+
+	switch (nlh->nlmsg_type) {
+		case NPROC_GET_FIELD_LIST:
+			err = nproc_get_list(nlh);
+			break;
+		case NPROC_GET_LABEL:
+			err = nproc_get_label(nlh);
+			break;
+		case NPROC_GET_GLOBAL:
+			err = nproc_get_global(nlh);
+			break;
+		case NPROC_GET_PS:
+			err = nproc_get_ps(nlh, uid);
+			break;
+		default:
+			pwarn("Unknown msg type %#x.\n", nlh->nlmsg_type);
+			err = -EINVAL;
+	}
+out:
+	return err;
+
+}
+
+static int nproc_receive_skb(struct sk_buff *skb)
+{
+	int err = 0;
+	struct nlmsghdr *nlh;
+
+	if (skb->len < NLMSG_LENGTH(0))
+		goto err_inval;
+
+	nlh = (struct nlmsghdr *)skb->data;
+	if (skb->len < nlh->nlmsg_len || nlh->nlmsg_len < sizeof(*nlh)){
+		pwarn("Invalid packet.\n");
+		goto err_inval;
+	}
+
+	err = nproc_process_msg(skb, nlh);
+	if (err || nlh->nlmsg_flags & NLM_F_ACK) {
+		pwarn("err %d, type %#x, flags %#x, seq %#x.\n", err,
+				nlh->nlmsg_type, nlh->nlmsg_flags,
+				nlh->nlmsg_seq);
+		netlink_ack(skb, nlh, err);
+	}
+
+	return err;
+
+err_inval:
+	return -EINVAL;
+}
+
+static void nproc_receive(struct sock *sk, int len)
+{
+	struct sk_buff *skb;
+
+	while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) {
+		nproc_receive_skb(skb);
+		kfree_skb(skb);
+	}
+}
+
+static int nproc_init(void)
+{
+	nproc_sock = netlink_kernel_create(NETLINK_NPROC, nproc_receive);
+
+	if (!nproc_sock) {
+		pwarn("No netlink socket for nproc.\n");
+		return -ENODEV;
+	}
+
+	return 0;
+}
+
+module_init(nproc_init);
Index: mm4-2.6.9-rc1/kernel/pid.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/pid.c	2004-09-08 06:10:54.000000000 -0700
+++ mm4-2.6.9-rc1/kernel/pid.c	2004-09-08 17:45:27.504564546 -0700
@@ -148,6 +148,17 @@
 	return -1;
 }
 
+void *get_pid_map(int idx)
+{
+	pidmap_t *map = pidmap_array + idx;
+	if (!map->page)
+		return NULL;
+	else if (atomic_read(&map->nr_free) == BITS_PER_PAGE)
+		return ERR_PTR(-1);
+	else
+		return map->page;
+}
+
 struct pid * fastcall find_pid(enum pid_type type, int nr)
 {
 	struct hlist_node *elem;

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [2/2] handle CONFIG_MMU=n and use new vm stats for CONFIG_MMU=y
  2004-09-09  1:17         ` [1/2] rediff nproc v2 vs. 2.6.9-rc1-mm4 William Lee Irwin III
@ 2004-09-09  1:21           ` William Lee Irwin III
  2004-09-09  1:22             ` William Lee Irwin III
  2004-09-09  1:26             ` [3/2] round up text memory to the nearest page in fs/proc/task_mmu.c William Lee Irwin III
  0 siblings, 2 replies; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-09  1:21 UTC (permalink / raw)
  To: Roger Luethi; +Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Wed, Sep 08, 2004 at 06:15:49PM -0700, William Lee Irwin III wrote:
>> This is a straight rediff of nproc vs. 2.6.9-rc1-mm4. No changes
>> whatsoever to the underlying code were made; rather, this merely
>> resolves offsets so it applies cleanly.
>> Compiletested on ia64.

On Wed, Sep 08, 2004 at 06:17:08PM -0700, William Lee Irwin III wrote:
> Repost with appropriate Subject: line.

Make __task_mem() and __task_mem_cheap() use the appropriate methods
for CONFIG_MMU=y and add some attempt at correct code for CONFIG_MMU=n.
The new methods for /proc/ accounting involve using counters kept in
the mm instead of iteration over vmas. For the CONFIG_MMU=y case this
does not involve acquiring mm->mmap_sem for any per-mm statistics. The
CONFIG_MMU=n case still needs iteration over tblocks to calculate them.


-- wli

Index: mm4-2.6.9-rc1/kernel/nproc.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/nproc.c	2004-09-08 17:45:27.503587983 -0700
+++ mm4-2.6.9-rc1/kernel/nproc.c	2004-09-08 18:11:24.826811093 -0700
@@ -44,44 +44,20 @@
  * __task_mem/__task_mem_cheap basically duplicate the MMU version of
  * task_mem, but they are split by cost and work on structs.
  */
-
+#ifdef CONFIG_MMU
 static void __task_mem(struct task_struct *tsk, struct task_mem *res)
 {
 	struct mm_struct *mm = get_task_mm(tsk);
-	if (mm) {
-		unsigned long data = 0, stack = 0, exec = 0, lib = 0;
-		struct vm_area_struct *vma;
-
-		down_read(&mm->mmap_sem);
-		for (vma = mm->mmap; vma; vma = vma->vm_next) {
-			unsigned long len = (vma->vm_end - vma->vm_start) >> 10;
-			if (!vma->vm_file) {
-				data += len;
-				if (vma->vm_flags & VM_GROWSDOWN)
-					stack += len;
-				continue;
-			}
-			if (vma->vm_flags & VM_WRITE)
-				continue;
-			if (vma->vm_flags & VM_EXEC) {
-				exec += len;
-				if (vma->vm_flags & VM_EXECUTABLE)
-					continue;
-				lib += len;
-			}
-		}
-		res->vmdata = data - stack;
-		res->vmstack = stack;
-		res->vmexe = exec - lib;
-		res->vmlib = lib;
-		up_read(&mm->mmap_sem);
 
+	if (!mm)
+		memset(res, 0, sizeof(struct task_mem));
+	else {
+		res->vmdata = (mm->total_vm - mm->shared_vm - mm->stack_vm)
+							<< (PAGE_SHIFT - 10);
+		res->vmstack = mm->stack_vm << (PAGE_SHIFT - 10);
+		res->vmexe = PAGE_ALIGN(mm->end_code - mm->start_code) >> 10;
+		res->vmlib = (mm->exec_vm << (PAGE_SHIFT - 10)) - res->vmexe;
 		mmput(mm);
-	} else {
-		res->vmdata = 0;
-		res->vmstack = 0;
-		res->vmexe = 0;
-		res->vmlib = 0;
 	}
 }
 
@@ -99,6 +75,80 @@
 		res->vmrss = 0;
 	}
 }
+#else /* !CONFIG_MMU */
+static void __task_mem(task_t *task, struct task_mem *stats)
+{
+	struct mm_struct *mm = get_task_mm(task)
+
+	if (!mm)
+		memset(stats, 0, sizeof(struct task_mem));
+	else {
+		unsigned long bytes = 0, sbytes = 0, slack = 0;
+		struct mm_tblk_struct *tblk;
+
+		down_read(&mm->mmap_sem);
+		for (tblk = &mm->context.tblk; tblk; tblk = tblk->next) {
+			if (!tblk->rblock)
+				continue;
+			bytes += kobjsize(tblk);
+			if (atomic_read(&mm->mm_count) > 1) ||
+					tblk->rblock->refcount > 1) {
+				sbytes += kobjsize(tblk->rblock->kblock);
+				sbytes += kobjsize(tblk->rblock);
+			} else {
+				bytes += kobjsize(tblk->rblock->kblock);
+				bytes += kobjsize(tblk->rblock);
+				slack += kobjsize(tblock->rblock->kblock);
+			}
+		}
+		if (atomic_read(&mm->mm_count) > 1)
+			sbytes += kobjsize(mm);
+		else
+			bytes += kobjsize(mm);
+		up_read(&mm->mmap_sem);
+		mmput(mm);
+		if (task->fs && atomic_read(&task->fs->count) > 1)
+			sbytes += kobjsize(task->files);
+		else
+			bytes += kobjsize(task->files);
+		if (task->sighand && atomic_read(&task->sighand->count) > 1)
+			sbytes += kobjsize(task->sighand);
+		else
+			bytes += kobjsize(task->sighand);
+		bytes += kobjsize(task);
+		/* some interpretation is needed */
+		stats->vmdata = bytes;
+		stats->vmstack = sbytes;
+		stats->vmexe = stats->vmlib = 0;
+	}
+}
+
+static void __task_mem_cheap(task_t *task, struct task_mem_cheap *stats)
+{
+	struct mm_struct *mm = get_task_mm(task);
+	struct mm_tblock_struct *tblk;
+	int size;
+
+	memset(stats, 0, sizeof(struct task_mem_cheap));
+	stats->vmrss += kobjsize(mm);
+	down_read(&mm->mmap_sem);
+	for (tblk = &mm->context.block; tblk; tblk = tblk->next) {
+		if (tblk->next)
+			stats->vmrss += kobjsize(tblk->next);
+		if (tblk->rblock) {
+			stats->vmsize += kobjsize(tblk->rblock);
+			stats->vmrss += kobjsize(tblk->rblock);
+			stats->vmrss += kobjsize(tblk->rblock->kblock);
+		}
+	}
+	stats->vmrss += mm->end_code - mm->start_code;
+	stats->vmrss += mm->start_stack - mm->start_data;
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+	stats->vmrss >>= 10;
+	stats->vmsize >>= 10;
+}
+#endif /* !CONFIG_MMU */
 
 /*
  * page_alloc.c already has an extra function broken out to fill a

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [2/2] handle CONFIG_MMU=n and use new vm stats for CONFIG_MMU=y
  2004-09-09  1:21           ` [2/2] handle CONFIG_MMU=n and use new vm stats for CONFIG_MMU=y William Lee Irwin III
@ 2004-09-09  1:22             ` William Lee Irwin III
  2004-09-09  1:26             ` [3/2] round up text memory to the nearest page in fs/proc/task_mmu.c William Lee Irwin III
  1 sibling, 0 replies; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-09  1:22 UTC (permalink / raw)
  To: Roger Luethi; +Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Wed, Sep 08, 2004 at 06:21:37PM -0700, William Lee Irwin III wrote:
> Make __task_mem() and __task_mem_cheap() use the appropriate methods
> for CONFIG_MMU=y and add some attempt at correct code for CONFIG_MMU=n.
> The new methods for /proc/ accounting involve using counters kept in
> the mm instead of iteration over vmas. For the CONFIG_MMU=y case this
> does not involve acquiring mm->mmap_sem for any per-mm statistics. The
> CONFIG_MMU=n case still needs iteration over tblocks to calculate them.

Once again, compiletested only on ia64.


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [3/2] round up text memory to the nearest page in fs/proc/task_mmu.c
  2004-09-09  1:21           ` [2/2] handle CONFIG_MMU=n and use new vm stats for CONFIG_MMU=y William Lee Irwin III
  2004-09-09  1:22             ` William Lee Irwin III
@ 2004-09-09  1:26             ` William Lee Irwin III
  1 sibling, 0 replies; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-09  1:26 UTC (permalink / raw)
  To: Roger Luethi; +Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Wed, Sep 08, 2004 at 06:21:37PM -0700, William Lee Irwin III wrote:
> Make __task_mem() and __task_mem_cheap() use the appropriate methods
> for CONFIG_MMU=y and add some attempt at correct code for CONFIG_MMU=n.
> The new methods for /proc/ accounting involve using counters kept in
> the mm instead of iteration over vmas. For the CONFIG_MMU=y case this
> does not involve acquiring mm->mmap_sem for any per-mm statistics. The
> CONFIG_MMU=n case still needs iteration over tblocks to calculate them.

Round up text memory to the nearest page to resolve potential alignment
anomalies in reported statistics. Compiletested on ia64.


-- wli

Index: mm4-2.6.9-rc1/fs/proc/task_mmu.c
===================================================================
--- mm4-2.6.9-rc1.orig/fs/proc/task_mmu.c	2004-09-08 06:10:35.000000000 -0700
+++ mm4-2.6.9-rc1/fs/proc/task_mmu.c	2004-09-08 18:27:39.401017905 -0700
@@ -9,7 +9,7 @@
 	unsigned long data, text, lib;
 
 	data = mm->total_vm - mm->shared_vm - mm->stack_vm;
-	text = (mm->end_code - mm->start_code) >> 10;
+	text = PAGE_ALIGN(mm->end_code - mm->start_code) >> 10;
 	lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
 	buffer += sprintf(buffer,
 		"VmSize:\t%8lu kB\n"

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-08 18:41 ` [1/1][PATCH] " Roger Luethi
  2004-09-09  0:35   ` William Lee Irwin III
@ 2004-09-09 11:53   ` Stephen Smalley
  2004-09-09 17:22     ` William Lee Irwin III
  1 sibling, 1 reply; 63+ messages in thread
From: Stephen Smalley @ 2004-09-09 11:53 UTC (permalink / raw)
  To: Roger Luethi
  Cc: Andrew Morton, lkml, Albert Cahalan, William Lee Irwin III,
	Martin J. Bligh, Paul Jackson

On Wed, 2004-09-08 at 14:41, Roger Luethi wrote:
> A few notes:
> - Access control can be implemented easily. Right now it would be bloat,
>   though -- the vast majority of fields in /proc are world-readable
>   (/proc/pid/environ being the notable exception).

They aren't world readable when using a security module like SELinux;
they are then typically only accessible by processes in the same
security domain, aside from processes in privileged domains. 
security_task_to_inode() hook sets the security attributes on the
/proc/pid inodes based on their security context, and then
security_inode_permission() hook controls access to them.  So you need
at least comparable controls.

-- 
Stephen Smalley <sds@epoch.ncsc.mil>
National Security Agency


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-09 11:53   ` Stephen Smalley
@ 2004-09-09 17:22     ` William Lee Irwin III
  2004-09-09 17:53       ` Roger Luethi
  0 siblings, 1 reply; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-09 17:22 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: Roger Luethi, Andrew Morton, lkml, Albert Cahalan,
	Martin J. Bligh, Paul Jackson

On Wed, 2004-09-08 at 14:41, Roger Luethi wrote:
>> A few notes:
>> - Access control can be implemented easily. Right now it would be bloat,
>>   though -- the vast majority of fields in /proc are world-readable
>>   (/proc/pid/environ being the notable exception).

On Thu, Sep 09, 2004 at 07:53:31AM -0400, Stephen Smalley wrote:
> They aren't world readable when using a security module like SELinux;
> they are then typically only accessible by processes in the same
> security domain, aside from processes in privileged domains. 
> security_task_to_inode() hook sets the security attributes on the
> /proc/pid inodes based on their security context, and then
> security_inode_permission() hook controls access to them.  So you need
> at least comparable controls.

Can you make a more specific suggestion regarding the controls to use?
It's a bit awkward for those highly unfamiliar with the subsystem to
invent new methods for the security layer independently, so it's likely
best some guidance (e.g. function prototype) be given.


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-09 17:22     ` William Lee Irwin III
@ 2004-09-09 17:53       ` Roger Luethi
  2004-09-09 20:01         ` Stephen Smalley
  2004-09-09 20:44         ` Chris Wright
  0 siblings, 2 replies; 63+ messages in thread
From: Roger Luethi @ 2004-09-09 17:53 UTC (permalink / raw)
  To: William Lee Irwin III, Stephen Smalley, Andrew Morton, lkml,
	Albert Cahalan, Martin J. Bligh, Paul Jackson

On Thu, 09 Sep 2004 10:22:00 -0700, William Lee Irwin III wrote:
> On Thu, Sep 09, 2004 at 07:53:31AM -0400, Stephen Smalley wrote:
> > They aren't world readable when using a security module like SELinux;
> > they are then typically only accessible by processes in the same
> > security domain, aside from processes in privileged domains. 
> > security_task_to_inode() hook sets the security attributes on the
> > /proc/pid inodes based on their security context, and then
> > security_inode_permission() hook controls access to them.  So you need
> > at least comparable controls.
> 
> Can you make a more specific suggestion regarding the controls to use?
> It's a bit awkward for those highly unfamiliar with the subsystem to

For the same reason, I'm not comfortable with implementing SELinux type
access controls myself. How about:

config NPROC
	depends on !SECURITY_SELINUX

Adding access control later won't be a problem for anyone who groks
SELinux.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-09  0:35   ` William Lee Irwin III
  2004-09-09  0:43     ` William Lee Irwin III
@ 2004-09-09 18:43     ` Roger Luethi
  2004-09-09 18:49       ` William Lee Irwin III
  1 sibling, 1 reply; 63+ messages in thread
From: Roger Luethi @ 2004-09-09 18:43 UTC (permalink / raw)
  To: William Lee Irwin III, Andrew Morton, linux-kernel,
	Albert Cahalan, Paul Jackson

On Wed, 08 Sep 2004 17:35:29 -0700, William Lee Irwin III wrote:
> On Wed, Sep 08, 2004 at 08:41:30PM +0200, Roger Luethi wrote:
> > A few notes:
> > - Access control can be implemented easily. Right now it would be bloat,
> >   though -- the vast majority of fields in /proc are world-readable
> >   (/proc/pid/environ being the notable exception).
> > - Additional process selectors (e.g. select by UID) are not hard to
> >   add, either, should there ever be a need.
> > - There are a few things I'm not sure about: For instance, what is a good
> >   return value for mm_struct related fields wrt kernel threads? I picked
> >   0, but ~(0) might be preferable because it's distinct.
> > Signed-off-by: Roger Luethi <rl@hellgate.ch>
> 
> Any chance you could convert these to use the new vm statistics
> accounting?

Mea culpa. I copied the routines wholesale from 2.6.7 when I started
work on nproc. They still seemed to work with 2.6.9-rc1-bk13, I hadn't
noticed the work that had gone into field computation already. So for
CONFIG_MMU, values in both __task_mem and __task_mem_cheap are cheap
now. The routines can be merged.

!CONFIG_MMU is a different story. Presumably, it needs a change in the
fields that are offered (cp. task_mem in fs/proc/task_nommu.c).

FWIW, my prefered solution would be to have only one routine task_mem
to fill the respective struct for nproc and /proc.

There seems to be a discrepancy between current task_mem in
fs/proc/task_nommu.c and the __task_mem{,_cheap} routines you wrote
for the nproc !CONFIG_MMU case. Can you explain?

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-09 18:43     ` [1/1][PATCH] nproc v2: netlink access to /proc information Roger Luethi
@ 2004-09-09 18:49       ` William Lee Irwin III
  2004-09-09 19:00         ` William Lee Irwin III
  2004-09-09 19:11         ` [1/1][PATCH] nproc v2: netlink access to /proc information Roger Luethi
  0 siblings, 2 replies; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-09 18:49 UTC (permalink / raw)
  To: Roger Luethi; +Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Wed, 08 Sep 2004 17:35:29 -0700, William Lee Irwin III wrote:
>> Any chance you could convert these to use the new vm statistics
>> accounting?

On Thu, Sep 09, 2004 at 08:43:01PM +0200, Roger Luethi wrote:
> Mea culpa. I copied the routines wholesale from 2.6.7 when I started
> work on nproc. They still seemed to work with 2.6.9-rc1-bk13, I hadn't
> noticed the work that had gone into field computation already. So for
> CONFIG_MMU, values in both __task_mem and __task_mem_cheap are cheap
> now. The routines can be merged.
> !CONFIG_MMU is a different story. Presumably, it needs a change in the
> fields that are offered (cp. task_mem in fs/proc/task_nommu.c).
> FWIW, my prefered solution would be to have only one routine task_mem
> to fill the respective struct for nproc and /proc.

I'll follow up shortly with a task_mem()/task_mem_cheap() consolidation
patch atop the others I sent.


On Thu, Sep 09, 2004 at 08:43:01PM +0200, Roger Luethi wrote:
> There seems to be a discrepancy between current task_mem in
> fs/proc/task_nommu.c and the __task_mem{,_cheap} routines you wrote
> for the nproc !CONFIG_MMU case. Can you explain?

I'm not aware of a discrepancy with the fs/proc/task_nommu.c code; I
did, however, have to mangle the things via guesswork to avoid adding
the new fields, which I really wanted you to arrange for or comment on
as they are a matter of interface. Also, could you be more specific
about these discrepancies?


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-09 18:49       ` William Lee Irwin III
@ 2004-09-09 19:00         ` William Lee Irwin III
  2004-09-09 19:02           ` [4/2] consolidate __task_mem() and __task_mem_cheap() William Lee Irwin III
  2004-09-09 19:11         ` [1/1][PATCH] nproc v2: netlink access to /proc information Roger Luethi
  1 sibling, 1 reply; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-09 19:00 UTC (permalink / raw)
  To: Roger Luethi; +Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Thu, Sep 09, 2004 at 11:49:33AM -0700, William Lee Irwin III wrote:
> I'll follow up shortly with a task_mem()/task_mem_cheap() consolidation
> patch atop the others I sent.

Consolidate __task_mem() and __task_mem_cheap() now that both have been
made cheap, and also combine struct task_mem with struct task_mem_cheap.
Also adjust various users of *_cheap to the new terminology so no trace
of the *_cheap bits remains. Compiletested on ia64.


Index: mm4-2.6.9-rc1/kernel/nproc.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/nproc.c	2004-09-08 18:11:24.826811093 -0700
+++ mm4-2.6.9-rc1/kernel/nproc.c	2004-09-09 12:00:44.649267323 -0700
@@ -32,17 +32,14 @@
 	u32	vmstack;
 	u32	vmexe;
 	u32	vmlib;
-};
-
-struct task_mem_cheap {
 	u32	vmsize;
 	u32	vmlock;
 	u32	vmrss;
 };
 
 /*
- * __task_mem/__task_mem_cheap basically duplicate the MMU version of
- * task_mem, but they are split by cost and work on structs.
+ * __task_mem() basically duplicates() the MMU and nommu versions of
+ * task_mem() from fs/proc/task_mmu.c and fs/proc/task_nommu.c
  */
 #ifdef CONFIG_MMU
 static void __task_mem(struct task_struct *tsk, struct task_mem *res)
@@ -57,22 +54,10 @@
 		res->vmstack = mm->stack_vm << (PAGE_SHIFT - 10);
 		res->vmexe = PAGE_ALIGN(mm->end_code - mm->start_code) >> 10;
 		res->vmlib = (mm->exec_vm << (PAGE_SHIFT - 10)) - res->vmexe;
-		mmput(mm);
-	}
-}
-
-static void __task_mem_cheap(struct task_struct *tsk, struct task_mem_cheap *res)
-{
-	struct mm_struct *mm = get_task_mm(tsk);
-	if (mm) {
 		res->vmsize = mm->total_vm << (PAGE_SHIFT-10);
 		res->vmlock = mm->locked_vm << (PAGE_SHIFT-10);
 		res->vmrss = mm->rss << (PAGE_SHIFT-10);
 		mmput(mm);
-	} else {
-		res->vmsize = 0;
-		res->vmlock = 0;
-		res->vmrss = 0;
 	}
 }
 #else /* !CONFIG_MMU */
@@ -86,9 +71,16 @@
 		unsigned long bytes = 0, sbytes = 0, slack = 0;
 		struct mm_tblk_struct *tblk;
 
+		stats->vmrss += kobjsize(mm);
 		down_read(&mm->mmap_sem);
 		for (tblk = &mm->context.tblk; tblk; tblk = tblk->next) {
-			if (!tblk->rblock)
+			if (tblk->next)
+				stats->vmrss += kobjsize(tblk->next);
+			if (tblk->rblock) {
+				stats->vmsize += kobjsize(tblk->rblock);
+				stats->vmrss += kobjsize(tblk->rblock);
+				stats->vmrss += kobjsize(tblk->rblock->kblock);
+			} else
 				continue;
 			bytes += kobjsize(tblk);
 			if (atomic_read(&mm->mm_count) > 1) ||
@@ -120,34 +112,12 @@
 		stats->vmdata = bytes;
 		stats->vmstack = sbytes;
 		stats->vmexe = stats->vmlib = 0;
+		stats->vmrss += mm->end_code - mm->start_code;
+		stats->vmrss += mm->start_stack - mm->start_data;
+		stats->vmrss >>= 10;
+		stats->vmsize >>= 10;
 	}
 }
-
-static void __task_mem_cheap(task_t *task, struct task_mem_cheap *stats)
-{
-	struct mm_struct *mm = get_task_mm(task);
-	struct mm_tblock_struct *tblk;
-	int size;
-
-	memset(stats, 0, sizeof(struct task_mem_cheap));
-	stats->vmrss += kobjsize(mm);
-	down_read(&mm->mmap_sem);
-	for (tblk = &mm->context.block; tblk; tblk = tblk->next) {
-		if (tblk->next)
-			stats->vmrss += kobjsize(tblk->next);
-		if (tblk->rblock) {
-			stats->vmsize += kobjsize(tblk->rblock);
-			stats->vmrss += kobjsize(tblk->rblock);
-			stats->vmrss += kobjsize(tblk->rblock->kblock);
-		}
-	}
-	stats->vmrss += mm->end_code - mm->start_code;
-	stats->vmrss += mm->start_stack - mm->start_data;
-	up_read(&mm->mmap_sem);
-	mmput(mm);
-	stats->vmrss >>= 10;
-	stats->vmsize >>= 10;
-}
 #endif /* !CONFIG_MMU */
 
 /*
@@ -223,10 +193,9 @@
 static char *nproc_ps_field(u32 id, char *buf, task_t *tsk)
 {
 	struct task_mem tsk_mem;
-	struct task_mem_cheap tsk_mem_cheap;
 
 	tsk_mem.vmdata = (~0);
-	tsk_mem_cheap.vmsize = (~0);
+	tsk_mem.vmsize = (~0);
 
 	switch (id) {
 		case NPROC_PID:
@@ -238,20 +207,20 @@
 		case NPROC_VMSIZE:
 		case NPROC_VMLOCK:
 		case NPROC_VMRSS:
-			if (tsk_mem_cheap.vmsize == (~0))
-				__task_mem_cheap(tsk, &tsk_mem_cheap);
+			if (tsk_mem.vmsize == (~0))
+				__task_mem(tsk, &tsk_mem);
 
 			switch (id) {
 				case NPROC_VMSIZE:
-					mstore(tsk_mem_cheap.vmsize,
+					mstore(tsk_mem.vmsize,
 							NPROC_VMSIZE, buf);
 					break;
 				case NPROC_VMLOCK:
-					mstore(tsk_mem_cheap.vmlock,
+					mstore(tsk_mem.vmlock,
 							NPROC_VMLOCK, buf);
 					break;
 				case NPROC_VMRSS:
-					mstore(tsk_mem_cheap.vmrss,
+					mstore(tsk_mem.vmrss,
 							NPROC_VMRSS, buf);
 					break;
 			}

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [4/2] consolidate __task_mem() and __task_mem_cheap()
  2004-09-09 19:00         ` William Lee Irwin III
@ 2004-09-09 19:02           ` William Lee Irwin III
  2004-09-09 19:07             ` Roger Luethi
  0 siblings, 1 reply; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-09 19:02 UTC (permalink / raw)
  To: Roger Luethi; +Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Thu, Sep 09, 2004 at 12:00:24PM -0700, William Lee Irwin III wrote:
> Consolidate __task_mem() and __task_mem_cheap() now that both have been
> made cheap, and also combine struct task_mem with struct task_mem_cheap.
> Also adjust various users of *_cheap to the new terminology so no trace
> of the *_cheap bits remains. Compiletested on ia64.

Repost with appropriate Subject: line.


Index: mm4-2.6.9-rc1/kernel/nproc.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/nproc.c	2004-09-08 18:11:24.826811093 -0700
+++ mm4-2.6.9-rc1/kernel/nproc.c	2004-09-09 12:00:44.649267323 -0700
@@ -32,17 +32,14 @@
 	u32	vmstack;
 	u32	vmexe;
 	u32	vmlib;
-};
-
-struct task_mem_cheap {
 	u32	vmsize;
 	u32	vmlock;
 	u32	vmrss;
 };
 
 /*
- * __task_mem/__task_mem_cheap basically duplicate the MMU version of
- * task_mem, but they are split by cost and work on structs.
+ * __task_mem() basically duplicates() the MMU and nommu versions of
+ * task_mem() from fs/proc/task_mmu.c and fs/proc/task_nommu.c
  */
 #ifdef CONFIG_MMU
 static void __task_mem(struct task_struct *tsk, struct task_mem *res)
@@ -57,22 +54,10 @@
 		res->vmstack = mm->stack_vm << (PAGE_SHIFT - 10);
 		res->vmexe = PAGE_ALIGN(mm->end_code - mm->start_code) >> 10;
 		res->vmlib = (mm->exec_vm << (PAGE_SHIFT - 10)) - res->vmexe;
-		mmput(mm);
-	}
-}
-
-static void __task_mem_cheap(struct task_struct *tsk, struct task_mem_cheap *res)
-{
-	struct mm_struct *mm = get_task_mm(tsk);
-	if (mm) {
 		res->vmsize = mm->total_vm << (PAGE_SHIFT-10);
 		res->vmlock = mm->locked_vm << (PAGE_SHIFT-10);
 		res->vmrss = mm->rss << (PAGE_SHIFT-10);
 		mmput(mm);
-	} else {
-		res->vmsize = 0;
-		res->vmlock = 0;
-		res->vmrss = 0;
 	}
 }
 #else /* !CONFIG_MMU */
@@ -86,9 +71,16 @@
 		unsigned long bytes = 0, sbytes = 0, slack = 0;
 		struct mm_tblk_struct *tblk;
 
+		stats->vmrss += kobjsize(mm);
 		down_read(&mm->mmap_sem);
 		for (tblk = &mm->context.tblk; tblk; tblk = tblk->next) {
-			if (!tblk->rblock)
+			if (tblk->next)
+				stats->vmrss += kobjsize(tblk->next);
+			if (tblk->rblock) {
+				stats->vmsize += kobjsize(tblk->rblock);
+				stats->vmrss += kobjsize(tblk->rblock);
+				stats->vmrss += kobjsize(tblk->rblock->kblock);
+			} else
 				continue;
 			bytes += kobjsize(tblk);
 			if (atomic_read(&mm->mm_count) > 1) ||
@@ -120,34 +112,12 @@
 		stats->vmdata = bytes;
 		stats->vmstack = sbytes;
 		stats->vmexe = stats->vmlib = 0;
+		stats->vmrss += mm->end_code - mm->start_code;
+		stats->vmrss += mm->start_stack - mm->start_data;
+		stats->vmrss >>= 10;
+		stats->vmsize >>= 10;
 	}
 }
-
-static void __task_mem_cheap(task_t *task, struct task_mem_cheap *stats)
-{
-	struct mm_struct *mm = get_task_mm(task);
-	struct mm_tblock_struct *tblk;
-	int size;
-
-	memset(stats, 0, sizeof(struct task_mem_cheap));
-	stats->vmrss += kobjsize(mm);
-	down_read(&mm->mmap_sem);
-	for (tblk = &mm->context.block; tblk; tblk = tblk->next) {
-		if (tblk->next)
-			stats->vmrss += kobjsize(tblk->next);
-		if (tblk->rblock) {
-			stats->vmsize += kobjsize(tblk->rblock);
-			stats->vmrss += kobjsize(tblk->rblock);
-			stats->vmrss += kobjsize(tblk->rblock->kblock);
-		}
-	}
-	stats->vmrss += mm->end_code - mm->start_code;
-	stats->vmrss += mm->start_stack - mm->start_data;
-	up_read(&mm->mmap_sem);
-	mmput(mm);
-	stats->vmrss >>= 10;
-	stats->vmsize >>= 10;
-}
 #endif /* !CONFIG_MMU */
 
 /*
@@ -223,10 +193,9 @@
 static char *nproc_ps_field(u32 id, char *buf, task_t *tsk)
 {
 	struct task_mem tsk_mem;
-	struct task_mem_cheap tsk_mem_cheap;
 
 	tsk_mem.vmdata = (~0);
-	tsk_mem_cheap.vmsize = (~0);
+	tsk_mem.vmsize = (~0);
 
 	switch (id) {
 		case NPROC_PID:
@@ -238,20 +207,20 @@
 		case NPROC_VMSIZE:
 		case NPROC_VMLOCK:
 		case NPROC_VMRSS:
-			if (tsk_mem_cheap.vmsize == (~0))
-				__task_mem_cheap(tsk, &tsk_mem_cheap);
+			if (tsk_mem.vmsize == (~0))
+				__task_mem(tsk, &tsk_mem);
 
 			switch (id) {
 				case NPROC_VMSIZE:
-					mstore(tsk_mem_cheap.vmsize,
+					mstore(tsk_mem.vmsize,
 							NPROC_VMSIZE, buf);
 					break;
 				case NPROC_VMLOCK:
-					mstore(tsk_mem_cheap.vmlock,
+					mstore(tsk_mem.vmlock,
 							NPROC_VMLOCK, buf);
 					break;
 				case NPROC_VMRSS:
-					mstore(tsk_mem_cheap.vmrss,
+					mstore(tsk_mem.vmrss,
 							NPROC_VMRSS, buf);
 					break;
 			}

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [4/2] consolidate __task_mem() and __task_mem_cheap()
  2004-09-09 19:02           ` [4/2] consolidate __task_mem() and __task_mem_cheap() William Lee Irwin III
@ 2004-09-09 19:07             ` Roger Luethi
  2004-09-09 19:15               ` [5/2] fix nommu VSZ reporting in consolidated task_mem() William Lee Irwin III
  0 siblings, 1 reply; 63+ messages in thread
From: Roger Luethi @ 2004-09-09 19:07 UTC (permalink / raw)
  To: William Lee Irwin III, Andrew Morton, linux-kernel,
	Albert Cahalan, Paul Jackson

On Thu, 09 Sep 2004 12:02:14 -0700, William Lee Irwin III wrote:
> +		stats->vmrss += mm->end_code - mm->start_code;

s/vmrss/vmsize/ ?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-09 18:49       ` William Lee Irwin III
  2004-09-09 19:00         ` William Lee Irwin III
@ 2004-09-09 19:11         ` Roger Luethi
  2004-09-09 19:23           ` William Lee Irwin III
  2004-09-11 22:25           ` Albert Cahalan
  1 sibling, 2 replies; 63+ messages in thread
From: Roger Luethi @ 2004-09-09 19:11 UTC (permalink / raw)
  To: William Lee Irwin III, Andrew Morton, linux-kernel,
	Albert Cahalan, Paul Jackson

On Thu, 09 Sep 2004 11:49:33 -0700, William Lee Irwin III wrote:
> I'll follow up shortly with a task_mem()/task_mem_cheap() consolidation
> patch atop the others I sent.

I have a few minor changes coming up as well.

One nitpick: As vmexe and vmlib are always 0 for !CONFIG_MMU, we should
ifdef them out of the list of offered fields for that configuration (and
maybe in nproc_ps_field as well).

> On Thu, Sep 09, 2004 at 08:43:01PM +0200, Roger Luethi wrote:
> > There seems to be a discrepancy between current task_mem in
> > fs/proc/task_nommu.c and the __task_mem{,_cheap} routines you wrote
> > for the nproc !CONFIG_MMU case. Can you explain?
> 
> I'm not aware of a discrepancy with the fs/proc/task_nommu.c code; I
> did, however, have to mangle the things via guesswork to avoid adding
> the new fields, which I really wanted you to arrange for or comment on
> as they are a matter of interface. Also, could you be more specific
> about these discrepancies?

task_nommu.c offers Mem, Slack, and Shared. __task_mem for !CONFIG_MMU
offers VmData, VmStack, VmRSS, VmSize.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [5/2] fix nommu VSZ reporting in consolidated task_mem()
  2004-09-09 19:07             ` Roger Luethi
@ 2004-09-09 19:15               ` William Lee Irwin III
  0 siblings, 0 replies; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-09 19:15 UTC (permalink / raw)
  To: Roger Luethi; +Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Thu, 09 Sep 2004 12:02:14 -0700, William Lee Irwin III wrote:
>> +		stats->vmrss += mm->end_code - mm->start_code;

On Thu, Sep 09, 2004 at 09:07:57PM +0200, Roger Luethi wrote:
> s/vmrss/vmsize/ ?

This follows fs/proc/task_nommu.c:task_statm, which ->vmsize would not.
vmsize would be the sum of kobjsize(tblk->rblock->kblock) for each
tblock, which actually does need fixing in the above.


-- wli

Index: mm4-2.6.9-rc1/kernel/nproc.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/nproc.c	2004-09-09 12:00:44.649267323 -0700
+++ mm4-2.6.9-rc1/kernel/nproc.c	2004-09-09 12:18:01.876793680 -0700
@@ -77,7 +77,7 @@
 			if (tblk->next)
 				stats->vmrss += kobjsize(tblk->next);
 			if (tblk->rblock) {
-				stats->vmsize += kobjsize(tblk->rblock);
+				stats->vmsize += kobjsize(tblk->rblock->kblock);
 				stats->vmrss += kobjsize(tblk->rblock);
 				stats->vmrss += kobjsize(tblk->rblock->kblock);
 			} else

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-09 19:11         ` [1/1][PATCH] nproc v2: netlink access to /proc information Roger Luethi
@ 2004-09-09 19:23           ` William Lee Irwin III
  2004-09-09 21:19             ` Roger Luethi
  2004-09-10 15:30             ` Roger Luethi
  2004-09-11 22:25           ` Albert Cahalan
  1 sibling, 2 replies; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-09 19:23 UTC (permalink / raw)
  To: Roger Luethi; +Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Thu, 09 Sep 2004 11:49:33 -0700, William Lee Irwin III wrote:
>> I'll follow up shortly with a task_mem()/task_mem_cheap() consolidation
>> patch atop the others I sent.

On Thu, Sep 09, 2004 at 09:11:42PM +0200, Roger Luethi wrote:
> I have a few minor changes coming up as well.

I rest assured that nothing I've written thus far will apply to or be
included in any of it, as a matter of course (nothing specific to you).


On Thu, Sep 09, 2004 at 09:11:42PM +0200, Roger Luethi wrote:
> One nitpick: As vmexe and vmlib are always 0 for !CONFIG_MMU, we should
> ifdef them out of the list of offered fields for that configuration (and
> maybe in nproc_ps_field as well).

This may be; I'll leave that decision to you as the interface designer.


On Thu, 09 Sep 2004 11:49:33 -0700, William Lee Irwin III wrote:
>> I'm not aware of a discrepancy with the fs/proc/task_nommu.c code; I
>> did, however, have to mangle the things via guesswork to avoid adding
>> the new fields, which I really wanted you to arrange for or comment on
>> as they are a matter of interface. Also, could you be more specific
>> about these discrepancies?

On Thu, Sep 09, 2004 at 09:11:42PM +0200, Roger Luethi wrote:
> task_nommu.c offers Mem, Slack, and Shared. __task_mem for !CONFIG_MMU
> offers VmData, VmStack, VmRSS, VmSize.

I took the structure fields to be just an argument passing convention
giving the nommu case an identical prototype much like the helpers in
fs/proc/task_{no,}mmu.c. Using different field names and etc. is also
feasible, of course. I'll wait for your updates to follow up further.


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-09 17:53       ` Roger Luethi
@ 2004-09-09 20:01         ` Stephen Smalley
  2004-09-09 20:48           ` Chris Wright
  2004-09-09 20:55           ` Roger Luethi
  2004-09-09 20:44         ` Chris Wright
  1 sibling, 2 replies; 63+ messages in thread
From: Stephen Smalley @ 2004-09-09 20:01 UTC (permalink / raw)
  To: Roger Luethi
  Cc: William Lee Irwin III, Andrew Morton, lkml, Albert Cahalan,
	Martin J. Bligh, Paul Jackson, James Morris, Chris Wright

On Thu, 2004-09-09 at 13:53, Roger Luethi wrote:
> On Thu, 09 Sep 2004 10:22:00 -0700, William Lee Irwin III wrote:
> > On Thu, Sep 09, 2004 at 07:53:31AM -0400, Stephen Smalley wrote:
> > > They aren't world readable when using a security module like SELinux;
> > > they are then typically only accessible by processes in the same
> > > security domain, aside from processes in privileged domains. 
> > > security_task_to_inode() hook sets the security attributes on the
> > > /proc/pid inodes based on their security context, and then
> > > security_inode_permission() hook controls access to them.  So you need
> > > at least comparable controls.
> > 
> > Can you make a more specific suggestion regarding the controls to use?
> > It's a bit awkward for those highly unfamiliar with the subsystem to
> 
> For the same reason, I'm not comfortable with implementing SELinux type
> access controls myself. How about:
> 
> config NPROC
> 	depends on !SECURITY_SELINUX
> 
> Adding access control later won't be a problem for anyone who groks
> SELinux.

Well, it isn't that easy, or at least I don't think it is.  The problem
is that there is no way presently to convey the sender's security
credentials (beyond the existing uid, cap information), since the LSM
patches for adding security fields and hooks for managing skb security
fields were rejected.  The best we can do at present is pass along the
sender pid, uid, and cap, and the security module can look up the pid if
it chooses to get the security field (but is naturally subject to races
in that situation).

Most obvious place to hook would be nproc_ps_get_task; we could then
perform a check based on the sender's credentials and the target task's
credentials, and simply return NULL if permission is not granted for
that pair, thus skipping that task as if it didn't exist.  That requires
propagating the sender's credentials down to that function.

Untested patch below.

Index: linux-2.6/include/linux/security.h
===================================================================
RCS file: /nfshome/pal/CVS/linux-2.6/include/linux/security.h,v
retrieving revision 1.37
diff -u -p -r1.37 security.h
--- linux-2.6/include/linux/security.h	16 Jun 2004 14:49:42 -0000	1.37
+++ linux-2.6/include/linux/security.h	9 Sep 2004 19:38:23 -0000
@@ -632,6 +632,13 @@ struct swap_info_struct;
  * 	security attributes, e.g. for /proc/pid inodes.
  *	@p contains the task_struct for the task.
  *	@inode contains the inode structure for the inode.
+ * @task_getstate:
+ * 	Check permission before getting the state of a task.
+ *      @pid contains the pid of the requesting process.
+ *	@p contains the task_struct for the target task.
+ *      @uid contains the uid of the requesting process.
+ *      @caps contains the capability set of the requesting process.
+ *      Return 0 if permission is granted.
  *
  * Security hooks for Netlink messaging.
  *
@@ -1153,6 +1160,7 @@ struct security_operations {
 			   unsigned long arg5);
 	void (*task_reparent_to_init) (struct task_struct * p);
 	void (*task_to_inode)(struct task_struct *p, struct inode *inode);
+	int (*task_getstate)(pid_t pid, struct task_struct *p, uid_t uid, kernel_cap_t caps);
 
 	int (*ipc_permission) (struct kern_ipc_perm * ipcp, short flag);
 
@@ -1756,6 +1764,11 @@ static inline void security_task_to_inod
 	security_ops->task_to_inode(p, inode);
 }
 
+static inline int security_task_getstate(pid_t pid, struct task_struct *p, uid_t uid, kernel_cap_t caps)
+{
+	return security_ops->task_getstate(pid, p, uid, caps);
+}
+
 static inline int security_ipc_permission (struct kern_ipc_perm *ipcp,
 					   short flag)
 {
@@ -2389,6 +2402,11 @@ static inline void security_task_reparen
 static inline void security_task_to_inode(struct task_struct *p, struct inode *inode)
 { }
 
+static inline int security_task_getstate(pid_t pid, struct task_struct *p, uid_t uid, kernel_cap_t caps)
+{
+	return 0;
+}
+
 static inline int security_ipc_permission (struct kern_ipc_perm *ipcp,
 					   short flag)
 {
Index: linux-2.6/security/dummy.c
===================================================================
RCS file: /nfshome/pal/CVS/linux-2.6/security/dummy.c,v
retrieving revision 1.34
diff -u -p -r1.34 dummy.c
--- linux-2.6/security/dummy.c	16 Jun 2004 14:49:42 -0000	1.34
+++ linux-2.6/security/dummy.c	9 Sep 2004 19:39:01 -0000
@@ -619,6 +619,12 @@ static void dummy_task_reparent_to_init 
 static void dummy_task_to_inode(struct task_struct *p, struct inode *inode)
 { }
 
+
+static int dummy_task_getstate(pid_t pid, struct task_struct *p, uid_t uid, kernel_cap_t caps)
+{
+	return 0;
+}
+
 static int dummy_ipc_permission (struct kern_ipc_perm *ipcp, short flag)
 {
 	return 0;
@@ -979,6 +985,7 @@ void security_fixup_ops (struct security
 	set_to_dummy_if_null(ops, task_prctl);
 	set_to_dummy_if_null(ops, task_reparent_to_init);
  	set_to_dummy_if_null(ops, task_to_inode);
+ 	set_to_dummy_if_null(ops, task_getstate);
 	set_to_dummy_if_null(ops, ipc_permission);
 	set_to_dummy_if_null(ops, msg_msg_alloc_security);
 	set_to_dummy_if_null(ops, msg_msg_free_security);

--- linux-2.6/kernel/nproc.c.orig	2004-09-09 15:51:25.727833776 -0400
+++ linux-2.6/kernel/nproc.c	2004-09-09 15:30:19.171379624 -0400
@@ -296,7 +296,7 @@ out:
 /*
  * Find task for given pid, grab task lock (caller must unlock).
  */
-static task_t *nproc_ps_get_task(int pid)
+static task_t *nproc_ps_get_task(struct nlmsghdr *nlh, int pid, uid_t uid, kernel_cap_t caps)
 {
 	task_t *tsk;
 
@@ -305,13 +305,17 @@ static task_t *nproc_ps_get_task(int pid
 	if (tsk)
 		get_task_struct(tsk);
 	read_unlock(&tasklist_lock);
+	if (tsk && security_task_getstate(nlh->nlmsg_pid, tsk, uid, caps)) {
+		put_task_struct(tsk);
+		return NULL;
+	}
 	return tsk;
 }
 
 /*
  * Iterate over a list of PIDs.
  */
-static int nproc_ps_select_pid(struct nlmsghdr *nlh, u32 *fdata, u32 len, u32 left, u32 *sdata)
+static int nproc_ps_select_pid(struct nlmsghdr *nlh, u32 *fdata, u32 len, u32 left, u32 *sdata, uid_t uid, kernel_cap_t caps)
 {
 	int i;
 	int err = 0;
@@ -335,7 +339,7 @@ static int nproc_ps_select_pid(struct nl
 
 	for (i = 0; i < tcnt; i++) {
 		task_t *tsk;
-		tsk = nproc_ps_get_task(pids[i]);
+		tsk = nproc_ps_get_task(nlh, pids[i], uid, caps);
 		if (!tsk)
 			continue;
 		err = nproc_pid_msg(nlh, fdata, len, tsk);
@@ -357,7 +361,7 @@ err_inval:
 /*
  * Iterate over all PIDs.
  */
-static int nproc_ps_select_all(struct nlmsghdr *nlh, u32 *fdata, u32 len)
+static int nproc_ps_select_all(struct nlmsghdr *nlh, u32 *fdata, u32 len, uid_t uid, kernel_cap_t caps)
 {
 	void *map;
 	int offset, i;
@@ -378,7 +382,7 @@ static int nproc_ps_select_all(struct nl
 			if (offset >= BITS_PER_PAGE)
 				break;
 			pid = offset + i * BITS_PER_PAGE;
-			tsk = nproc_ps_get_task(pid);
+			tsk = nproc_ps_get_task(nlh, pid, uid, caps);
 			if (!tsk)
 				continue;
 			err = nproc_pid_msg(nlh, fdata, len, tsk);
@@ -467,7 +471,7 @@ err_inval:
  * Call the chosen process selector. Adding additional selectors
  * (e.g. select by uid) is easy, but is there a need?
  */
-static int nproc_get_ps(struct nlmsghdr *nlh, uid_t uid)
+static int nproc_get_ps(struct nlmsghdr *nlh, uid_t uid, kernel_cap_t caps)
 {
 	int err;
 	u32 len;
@@ -490,11 +494,11 @@ static int nproc_get_ps(struct nlmsghdr 
 		case NPROC_SELECT_ALL:
 			if (left)
 				pwarn("%d bytes left.\n", left);
-			err = nproc_ps_select_all(nlh, data, len);
+			err = nproc_ps_select_all(nlh, data, len, uid, caps);
 			break;
 		case NPROC_SELECT_PID:
 			err = nproc_ps_select_pid(nlh, data, len,
-					left, sdata + 1);
+					left, sdata + 1, uid, caps);
 			break;
 		default:
 			pwarn("Unknown selection method %#x.\n", *sdata);
@@ -787,7 +791,7 @@ static __inline__ int nproc_process_msg(
 			err = nproc_get_global(nlh);
 			break;
 		case NPROC_GET_PS:
-			err = nproc_get_ps(nlh, uid);
+			err = nproc_get_ps(nlh, uid, caps);
 			break;
 		default:
 			pwarn("Unknown msg type %#x.\n", nlh->nlmsg_type);


-- 
Stephen Smalley <sds@epoch.ncsc.mil>
National Security Agency


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-09 17:53       ` Roger Luethi
  2004-09-09 20:01         ` Stephen Smalley
@ 2004-09-09 20:44         ` Chris Wright
  1 sibling, 0 replies; 63+ messages in thread
From: Chris Wright @ 2004-09-09 20:44 UTC (permalink / raw)
  To: William Lee Irwin III, Stephen Smalley, Andrew Morton, lkml,
	Albert Cahalan, Martin J. Bligh, Paul Jackson

* Roger Luethi (rl@hellgate.ch) wrote:
> On Thu, 09 Sep 2004 10:22:00 -0700, William Lee Irwin III wrote:
> > On Thu, Sep 09, 2004 at 07:53:31AM -0400, Stephen Smalley wrote:
> > > They aren't world readable when using a security module like SELinux;
> > > they are then typically only accessible by processes in the same
> > > security domain, aside from processes in privileged domains. 
> > > security_task_to_inode() hook sets the security attributes on the
> > > /proc/pid inodes based on their security context, and then
> > > security_inode_permission() hook controls access to them.  So you need
> > > at least comparable controls.
> > 
> > Can you make a more specific suggestion regarding the controls to use?
> > It's a bit awkward for those highly unfamiliar with the subsystem to
> 
> For the same reason, I'm not comfortable with implementing SELinux type
> access controls myself. How about:
> 
> config NPROC
> 	depends on !SECURITY_SELINUX
> 
It's not just SELinux, it's any security module (i.e. CONFIG_SECURITY for
starters).

thanks,
-chris
-- 
Linux Security Modules     http://lsm.immunix.org     http://lsm.bkbits.net

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-09 20:01         ` Stephen Smalley
@ 2004-09-09 20:48           ` Chris Wright
  2004-09-10 12:11             ` Stephen Smalley
  2004-09-09 20:55           ` Roger Luethi
  1 sibling, 1 reply; 63+ messages in thread
From: Chris Wright @ 2004-09-09 20:48 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: Roger Luethi, William Lee Irwin III, Andrew Morton, lkml,
	Albert Cahalan, Martin J. Bligh, Paul Jackson, James Morris,
	Chris Wright

* Stephen Smalley (sds@epoch.ncsc.mil) wrote:
> Well, it isn't that easy, or at least I don't think it is.  The problem
> is that there is no way presently to convey the sender's security
> credentials (beyond the existing uid, cap information), since the LSM
> patches for adding security fields and hooks for managing skb security
> fields were rejected.  The best we can do at present is pass along the
> sender pid, uid, and cap, and the security module can look up the pid if
> it chooses to get the security field (but is naturally subject to races
> in that situation).
> 
> Most obvious place to hook would be nproc_ps_get_task; we could then
> perform a check based on the sender's credentials and the target task's
> credentials, and simply return NULL if permission is not granted for
> that pair, thus skipping that task as if it didn't exist.  That requires
> propagating the sender's credentials down to that function.
> 
> Untested patch below.
> 
> Index: linux-2.6/include/linux/security.h
> ===================================================================
> RCS file: /nfshome/pal/CVS/linux-2.6/include/linux/security.h,v
> retrieving revision 1.37
> diff -u -p -r1.37 security.h
> --- linux-2.6/include/linux/security.h	16 Jun 2004 14:49:42 -0000	1.37
> +++ linux-2.6/include/linux/security.h	9 Sep 2004 19:38:23 -0000
> @@ -632,6 +632,13 @@ struct swap_info_struct;
>   * 	security attributes, e.g. for /proc/pid inodes.
>   *	@p contains the task_struct for the task.
>   *	@inode contains the inode structure for the inode.
> + * @task_getstate:
> + * 	Check permission before getting the state of a task.
> + *      @pid contains the pid of the requesting process.
> + *	@p contains the task_struct for the target task.
> + *      @uid contains the uid of the requesting process.
> + *      @caps contains the capability set of the requesting process.
> + *      Return 0 if permission is granted.

Why caps?


thanks,
-chris
-- 
Linux Security Modules     http://lsm.immunix.org     http://lsm.bkbits.net

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-09 20:01         ` Stephen Smalley
  2004-09-09 20:48           ` Chris Wright
@ 2004-09-09 20:55           ` Roger Luethi
  2004-09-09 21:05             ` Chris Wright
  2004-09-09 21:25             ` Roger Luethi
  1 sibling, 2 replies; 63+ messages in thread
From: Roger Luethi @ 2004-09-09 20:55 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: William Lee Irwin III, Andrew Morton, lkml, Albert Cahalan,
	Martin J. Bligh, Paul Jackson, James Morris, Chris Wright

On Thu, 09 Sep 2004 16:01:06 -0400, Stephen Smalley wrote:
> > For the same reason, I'm not comfortable with implementing SELinux type
> > access controls myself. How about:
> > 
> > config NPROC
> > 	depends on !SECURITY_SELINUX
> > 
> > Adding access control later won't be a problem for anyone who groks
> > SELinux.
> 
[...]
> Most obvious place to hook would be nproc_ps_get_task; we could then
> perform a check based on the sender's credentials and the target task's
> credentials, and simply return NULL if permission is not granted for
> that pair, thus skipping that task as if it didn't exist.  That requires
> propagating the sender's credentials down to that function.
> 
> Untested patch below.

I used a somewhat different approach in my development tree (not
SELinuxy, though): Most fields were world readable, some required
credentials.

I don't have any strong feelings on access control, so I'd be happy
with any mechanism that doesn't completely botch performance. Anyway,
I do not consider lack of access controls to be a showstopper.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-09 20:55           ` Roger Luethi
@ 2004-09-09 21:05             ` Chris Wright
  2004-09-09 21:25             ` Roger Luethi
  1 sibling, 0 replies; 63+ messages in thread
From: Chris Wright @ 2004-09-09 21:05 UTC (permalink / raw)
  To: Stephen Smalley, William Lee Irwin III, Andrew Morton, lkml,
	Albert Cahalan, Martin J. Bligh, Paul Jackson, James Morris,
	Chris Wright

* Roger Luethi (rl@hellgate.ch) wrote:
> On Thu, 09 Sep 2004 16:01:06 -0400, Stephen Smalley wrote:
> > > For the same reason, I'm not comfortable with implementing SELinux type
> > > access controls myself. How about:
> > > 
> > > config NPROC
> > > 	depends on !SECURITY_SELINUX
> > > 
> > > Adding access control later won't be a problem for anyone who groks
> > > SELinux.
> > 
> [...]
> > Most obvious place to hook would be nproc_ps_get_task; we could then
> > perform a check based on the sender's credentials and the target task's
> > credentials, and simply return NULL if permission is not granted for
> > that pair, thus skipping that task as if it didn't exist.  That requires
> > propagating the sender's credentials down to that function.
> > 
> > Untested patch below.
> 
> I used a somewhat different approach in my development tree (not
> SELinuxy, though): Most fields were world readable, some required
> credentials.
> 
> I don't have any strong feelings on access control, so I'd be happy
> with any mechanism that doesn't completely botch performance. Anyway,
> I do not consider lack of access controls to be a showstopper.

Some of these things become quite sensitive, esp across setuid, etc.
For prototyping, I agree, not a showstopper.  For merging, it should be
figured out properly.

thanks,
-chris
-- 
Linux Security Modules     http://lsm.immunix.org     http://lsm.bkbits.net

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-09 19:23           ` William Lee Irwin III
@ 2004-09-09 21:19             ` Roger Luethi
  2004-09-10 15:30             ` Roger Luethi
  1 sibling, 0 replies; 63+ messages in thread
From: Roger Luethi @ 2004-09-09 21:19 UTC (permalink / raw)
  To: William Lee Irwin III, Andrew Morton, linux-kernel,
	Albert Cahalan, Paul Jackson

On Thu, 09 Sep 2004 12:23:13 -0700, William Lee Irwin III wrote:
> I took the structure fields to be just an argument passing convention
> giving the nommu case an identical prototype much like the helpers in

That seems rather confusing. We must special-case for !CONFIG_MMU
anyway because field IDs are tied to meaning, i.e. systems export
different sets of fields depending on this configuration setting. The
proc filesystem does the same, the difference is that a changing set
is easier to handle with nproc.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-09 20:55           ` Roger Luethi
  2004-09-09 21:05             ` Chris Wright
@ 2004-09-09 21:25             ` Roger Luethi
  2004-09-11 22:36               ` Albert Cahalan
  1 sibling, 1 reply; 63+ messages in thread
From: Roger Luethi @ 2004-09-09 21:25 UTC (permalink / raw)
  To: Stephen Smalley, William Lee Irwin III, Andrew Morton, lkml,
	Albert Cahalan, Martin J. Bligh, Paul Jackson, James Morris,
	Chris Wright

On Thu, 09 Sep 2004 22:55:31 +0200, Roger Luethi wrote:
> I used a somewhat different approach in my development tree (not
> SELinuxy, though): Most fields were world readable, some required
> credentials.

I forgot to mention that you can see the remnants of that approach in
<linux/nproc.h>: I used two bits of the field ID to define per-field
access restrictions (NPROC_PERM_USER, NPROC_PERM_ROOT).

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-09 20:48           ` Chris Wright
@ 2004-09-10 12:11             ` Stephen Smalley
  0 siblings, 0 replies; 63+ messages in thread
From: Stephen Smalley @ 2004-09-10 12:11 UTC (permalink / raw)
  To: Chris Wright
  Cc: Roger Luethi, William Lee Irwin III, Andrew Morton, lkml,
	Albert Cahalan, Martin J. Bligh, Paul Jackson, James Morris

On Thu, 2004-09-09 at 16:48, Chris Wright wrote:
> > + * @task_getstate:
> > + * 	Check permission before getting the state of a task.
> > + *      @pid contains the pid of the requesting process.
> > + *	@p contains the task_struct for the target task.
> > + *      @uid contains the uid of the requesting process.
> > + *      @caps contains the capability set of the requesting process.
> > + *      Return 0 if permission is granted.
> 
> Why caps?

It is readily available in the netlink skb parms, and someone might want
to use it, e.g. a security module might limit a requesting process to
only getting state of other processes with the same uid unless the
requesting process has some capability.

-- 
Stephen Smalley <sds@epoch.ncsc.mil>
National Security Agency


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-09 19:23           ` William Lee Irwin III
  2004-09-09 21:19             ` Roger Luethi
@ 2004-09-10 15:30             ` Roger Luethi
  1 sibling, 0 replies; 63+ messages in thread
From: Roger Luethi @ 2004-09-10 15:30 UTC (permalink / raw)
  To: William Lee Irwin III, Andrew Morton, linux-kernel,
	Albert Cahalan, Paul Jackson

On Thu, 09 Sep 2004 12:23:13 -0700, William Lee Irwin III wrote:
> feasible, of course. I'll wait for your updates to follow up further.

Incremental update below. It contains a reorganization of the field
IDs (something I expected to do based on feedback) and minor tweaks in
error handling.

I'll post a full patch once the MMU stuff is sorted out.

Roger

diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-mm4.01/include/linux/nproc.h linux-2.6.9-rc1-mm4.02/include/linux/nproc.h
--- linux-2.6.9-rc1-mm4.01/include/linux/nproc.h	2004-09-10 17:19:34.018727960 +0200
+++ linux-2.6.9-rc1-mm4.02/include/linux/nproc.h	2004-09-10 14:43:13.000000000 +0200
@@ -49,35 +49,57 @@
 #define NPROC_LABEL_FIELD_UNIT	0x00000003
 #define NPROC_LABEL_WCHAN	0x00000004
 
-/* Field IDs (unique key in bits 0 - 15) */
-#define NPROC_NOP_UL		(0x00000020 | NPROC_TYPE_UL)
-#define NPROC_PID		(0x00000001 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
-#define NPROC_NAME		(0x00000002 | NPROC_TYPE_STRING | NPROC_SCOPE_PROCESS)
-/* Amount of free memory (pages) */
-#define NPROC_MEMFREE		(0x00000004 | NPROC_TYPE_U32    | NPROC_SCOPE_GLOBAL)
-/* Size of a page (bytes) */
-#define NPROC_PAGESIZE		(0x00000005 | NPROC_TYPE_U32    | NPROC_SCOPE_GLOBAL)
+/* --------------------------------------------------------------------- misc */
 /* There's no guarantee about anything with jiffies. Still useful for some. */
-#define NPROC_JIFFIES		(0x00000006 | NPROC_TYPE_U64    | NPROC_SCOPE_GLOBAL)
-/* Process: VM size (KiB) */
-#define NPROC_VMSIZE		(0x00000010 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
-/* Process: locked memory (KiB) */
-#define NPROC_VMLOCK		(0x00000011 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
-/* Process: Memory resident size (KiB) */
-#define NPROC_VMRSS		(0x00000012 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
-#define NPROC_VMDATA		(0x00000013 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
-#define NPROC_VMSTACK		(0x00000014 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
-#define NPROC_VMEXE		(0x00000015 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
-#define NPROC_VMLIB		(0x00000016 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
-#define NPROC_UID		(0x00000018 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
-#define NPROC_NR_DIRTY		(0x00000051 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
-#define NPROC_NR_WRITEBACK	(0x00000052 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
-#define NPROC_NR_UNSTABLE	(0x00000053 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
-#define NPROC_NR_PG_TABLE_PGS	(0x00000054 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
-#define NPROC_NR_MAPPED		(0x00000055 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
-#define NPROC_NR_SLAB		(0x00000056 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
-#define NPROC_WCHAN		(0x00000080 | NPROC_TYPE_UL     | NPROC_SCOPE_PROCESS)
-#define NPROC_WCHAN_NAME	(0x00000081 | NPROC_TYPE_STRING)
+#define NPROC_JIFFIES		(0x00000001 | NPROC_TYPE_U64    | NPROC_SCOPE_GLOBAL)
+/* Field IDs (unique key in bits 0 - 15) */
+#define NPROC_NOP_UL		(0x00000002 | NPROC_TYPE_UL)
+/* Size of a page */
+#define NPROC_PAGESIZE		(0x00000003 | NPROC_TYPE_U32    | NPROC_SCOPE_GLOBAL)
+/* --------------------------------------------------------- /proc/PID/status */
+#define NPROC_NAME		(0x00000100 | NPROC_TYPE_STRING | NPROC_SCOPE_PROCESS)
+#define NPROC_STATE		(0x00000101 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_STATE_NAME	(0x00000102 | NPROC_TYPE_STRING)
+#define NPROC_SLEEP_TIME	(0x00000103 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_TOTAL_TIME	(0x00000104 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_PID		(0x00000105 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_TGID		(0x00000106 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_PPID		(0x00000107 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_TRACER_PID	(0x00000108 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_UID		(0x00000109 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_EUID		(0x00000110 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_SUID		(0x00000111 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_FSUID		(0x00000112 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_GID		(0x00000113 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_EGID		(0x00000114 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_SGID		(0x00000115 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_FSGID		(0x00000116 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+/* Process: VM size */
+#define NPROC_VMSIZE		(0x00000117 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+/* Process: locked memory */
+#define NPROC_VMLOCK		(0x00000118 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+/* Process: Memory resident size */
+#define NPROC_VMRSS		(0x00000119 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_VMDATA		(0x00000120 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_VMSTACK		(0x00000121 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_VMEXE		(0x00000122 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_VMLIB		(0x00000123 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+/* ------------------------------------------------------------- /proc/vmstat */
+#define NPROC_NR_DIRTY		(0x00000214 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_WRITEBACK	(0x00000215 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_UNSTABLE	(0x00000216 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_PG_TABLE_PGS	(0x00000217 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_MAPPED		(0x00000218 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_SLAB		(0x00000219 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+/* ------------------------------------------------------------ /proc/meminfo */
+/* Amount of free memory */
+#define NPROC_MEMFREE		(0x00000320 | NPROC_TYPE_U32    | NPROC_SCOPE_GLOBAL)
+/* ---------------------------------------------------------- /proc/PID/wchan */
+#define NPROC_WCHAN		(0x00000421 | NPROC_TYPE_UL     | NPROC_SCOPE_PROCESS)
+#define NPROC_WCHAN_NAME	(0x00000422 | NPROC_TYPE_STRING)
+/* ----------------------------------------------------------- /proc/PID/stat */
+/* ---------------------------------------------------------- /proc/PID/statm */
+
 
 #ifdef __KERNEL__
 struct nproc_field {
@@ -88,11 +110,11 @@ struct nproc_field {
 };
 
 static struct nproc_field labels[] = {
-	{ NPROC_PID,			"PID",		"%5u",	"" },
-	{ NPROC_NAME,			"Name",		"%-15s","" },
-	{ NPROC_MEMFREE,		"MemFree",	"%8u",	"page" },
-	{ NPROC_PAGESIZE,		"PageSize",	"%4u",	"byte" },
 	{ NPROC_JIFFIES,		"Jiffies",	"%10u",	"" },
+	{ NPROC_PAGESIZE,		"PageSize",	"%4u",	"byte" },
+	{ NPROC_NAME,			"Name",		"%-15s","" },
+	{ NPROC_PID,			"PID",		"%5u",	"" },
+	{ NPROC_UID,			"UID",		"%5u",	"" },
 	{ NPROC_VMSIZE,			"VmSize",	"%8u",	"KiB" },
 	{ NPROC_VMLOCK,			"VmLock",	"%8u",	"KiB" },
 	{ NPROC_VMRSS,			"VmRSS",	"%8u",	"KiB" },
@@ -100,16 +122,16 @@ static struct nproc_field labels[] = {
 	{ NPROC_VMSTACK,		"VmStack",	"%8u",	"KiB" },
 	{ NPROC_VMEXE,			"VmExe",	"%8u",	"KiB" },
 	{ NPROC_VMLIB,			"VmLib",	"%8u",	"KiB" },
-	{ NPROC_UID,			"UID",		"%5u",	"" },
 	{ NPROC_NR_DIRTY,		"nr_dirty",	"%8d",	"page" },
 	{ NPROC_NR_WRITEBACK,		"nr_writeback",	"%8u",	"page" },
 	{ NPROC_NR_UNSTABLE,		"nr_unstable",	"%8u",	"page" },
 	{ NPROC_NR_PG_TABLE_PGS,	"nr_page_table_pages",	"%8u", "page" },
 	{ NPROC_NR_MAPPED,		"nr_mapped",	"%8u",	"page" },
 	{ NPROC_NR_SLAB,		"nr_slab",	"%8u",	"page" },
+	{ NPROC_MEMFREE,		"MemFree",	"%8u",	"page" },
 	{ NPROC_WCHAN,			"wchan",	"%p",	"" },
 #ifdef CONFIG_KALLSYMS
-	{ NPROC_WCHAN_NAME,		"wchan_symbol",	"%s"},
+	{ NPROC_WCHAN_NAME,		"wchan_symbol",	"%s",	""},
 #endif
 };
 #endif /* __KERNEL__ */
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-mm4.01/kernel/nproc.c linux-2.6.9-rc1-mm4.02/kernel/nproc.c
--- linux-2.6.9-rc1-mm4.01/kernel/nproc.c	2004-09-10 17:19:34.034725528 +0200
+++ linux-2.6.9-rc1-mm4.02/kernel/nproc.c	2004-09-10 12:04:28.000000000 +0200
@@ -17,12 +17,11 @@
 /* There must be like 5 million dprintk definitions, so let's add some more */
 #ifdef DEBUG
 #define pdebug(x,args...) printk(KERN_DEBUG "%s:%d " x, __func__ , __LINE__, ##args)
-#define pwarn(x,args...) printk(KERN_WARNING "%s:%d " x, __func__ , __LINE__, ##args)
 #else
 #define pdebug(x,args...)
-#define pwarn(x,args...)
 #endif
 
+#define pwarn(x,args...) printk(KERN_WARNING "%s:%d " x, __func__ , __LINE__, ##args)
 #define perror(x,args...) printk(KERN_ERR "%s:%d " x, __func__ , __LINE__, ##args)
 
 static struct sock *nproc_sock = NULL;
@@ -129,18 +128,18 @@ static struct sk_buff *nproc_alloc_nlmsg
 	struct sk_buff *skb2 = 0;
 
 	skb2 = alloc_skb(NLMSG_SPACE(len), GFP_KERNEL);
-	if (!skb2) {
-		skb2 = ERR_PTR(-ENOMEM);
-		goto out;
-	}
+	if (!skb2)
+		goto err_out;
 
 	NLMSG_PUT(skb2, pid, seq, type, NLMSG_ALIGN(len));
-out:
-	return skb2;
+	goto out;
 
 nlmsg_failure:				/* Used by NLMSG_PUT */
 	kfree_skb(skb2);
-	return NULL;
+err_out:
+	skb2 = ERR_PTR(-ENOMEM);
+out:
+	return skb2;
 }
 
 #define mstore(value, id, buf)						\
@@ -634,18 +633,17 @@ static int find_id(__u32 *data, __u32 *l
 		pwarn("%d bytes left.\n", *left);
 	id = data[1];
 
-	for (i = 0; i < ARRAY_SIZE(labels) && labels[i].id != id; i++)
-		;	/* Do nothing */
-
-	if (labels[i].id != id) {
-		pwarn("No matching label found for %#x.\n", id);
-		goto err_inval;
+	for (i = 0; i < ARRAY_SIZE(labels); i++) {
+		if (labels[i].id == id)
+			goto out;
 	}
 
-	return i;
+	pwarn("No matching label found for %#x.\n", id);
 
 err_inval:
 	return -EINVAL;
+out:
+	return i;
 }
 
 
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-mm4.01/init/Kconfig linux-2.6.9-rc1-mm4.02/init/Kconfig
--- linux-2.6.9-rc1-mm4.01/init/Kconfig	2004-09-10 17:19:34.040724616 +0200
+++ linux-2.6.9-rc1-mm4.02/init/Kconfig	2004-09-10 00:32:36.000000000 +0200
@@ -141,10 +141,11 @@ config SYSCTL
 
 config NPROC
 	bool "Netlink interface to /proc information"
-	depends on PROC_FS && EXPERIMENTAL
+	depends on EXPERIMENTAL && !CONFIG_SECURITY
 	default y
 	help
-	  Nproc is a netlink interface to /proc information.
+	  Nproc is a netlink interface to /proc information. Its benefits
+	  are clean semantics and high performance.
 
 config AUDIT
 	bool "Auditing support"

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-09 19:11         ` [1/1][PATCH] nproc v2: netlink access to /proc information Roger Luethi
  2004-09-09 19:23           ` William Lee Irwin III
@ 2004-09-11 22:25           ` Albert Cahalan
  2004-09-12  4:58             ` William Lee Irwin III
  2004-09-14  5:59             ` Roger Luethi
  1 sibling, 2 replies; 63+ messages in thread
From: Albert Cahalan @ 2004-09-11 22:25 UTC (permalink / raw)
  To: Roger Luethi
  Cc: William Lee Irwin III, Andrew Morton OSDL,
	linux-kernel mailing list, Paul Jackson

On Thu, 2004-09-09 at 15:11, Roger Luethi wrote:
> On Thu, 09 Sep 2004 11:49:33 -0700, William Lee Irwin III wrote:
> > I'll follow up shortly with a task_mem()/task_mem_cheap() consolidation
> > patch atop the others I sent.
> 
> I have a few minor changes coming up as well.
> 
> One nitpick: As vmexe and vmlib are always 0 for !CONFIG_MMU, we should
> ifdef them out of the list of offered fields for that configuration (and
> maybe in nproc_ps_field as well).

No. First of all, I think they can be offered. Until proven
otherwise, I'll assume that the !CONFIG_MMU case is buggy.

Second of all, removal will make the !CONFIG_MMU systems
less compatible with the rest of the world. This will
mean that fewer apps can run on !CONFIG_MMU boxes. It's
same problem as "All the world's a VAX". It's better that
the apps work; an author working on a Pentium 4 Xeon is
likely to write code that relies on the fields and might
not really understand what "no MMU" is all about.




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-09 21:25             ` Roger Luethi
@ 2004-09-11 22:36               ` Albert Cahalan
  2004-09-12  5:00                 ` William Lee Irwin III
  2004-09-14  6:44                 ` Roger Luethi
  0 siblings, 2 replies; 63+ messages in thread
From: Albert Cahalan @ 2004-09-11 22:36 UTC (permalink / raw)
  To: Roger Luethi
  Cc: Stephen Smalley, William Lee Irwin III, Andrew Morton OSDL, lkml,
	Albert Cahalan, Martin J. Bligh, Paul Jackson, James Morris,
	Chris Wright

On Thu, 2004-09-09 at 17:25, Roger Luethi wrote:
> On Thu, 09 Sep 2004 22:55:31 +0200, Roger Luethi wrote:
> > I used a somewhat different approach in my development tree (not
> > SELinuxy, though): Most fields were world readable, some required
> > credentials.
> 
> I forgot to mention that you can see the remnants of that approach in
> <linux/nproc.h>: I used two bits of the field ID to define per-field
> access restrictions (NPROC_PERM_USER, NPROC_PERM_ROOT).

Besides the low-security and high-security choices,
I'd like to see a medium-security choice.

low: everybody sees everything
medium: everybody sees something; privileged user sees all
high: must be privileged

This might mean that asking for stuff like EIP and WCHAN
causes you to see fewer processes.

If partial info is returned for a process, I'd like to
also get a bitmap of valid fields. Special "not valid"
values are a pain to deal with.



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-11 22:25           ` Albert Cahalan
@ 2004-09-12  4:58             ` William Lee Irwin III
  2004-09-14  5:59             ` Roger Luethi
  1 sibling, 0 replies; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-12  4:58 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: Roger Luethi, Andrew Morton OSDL, linux-kernel mailing list,
	Paul Jackson

On Thu, 2004-09-09 at 15:11, Roger Luethi wrote:
>> I have a few minor changes coming up as well.
>> One nitpick: As vmexe and vmlib are always 0 for !CONFIG_MMU, we should
>> ifdef them out of the list of offered fields for that configuration (and
>> maybe in nproc_ps_field as well).

On Sat, Sep 11, 2004 at 06:25:56PM -0400, Albert Cahalan wrote:
> No. First of all, I think they can be offered. Until proven
> otherwise, I'll assume that the !CONFIG_MMU case is buggy.
> Second of all, removal will make the !CONFIG_MMU systems
> less compatible with the rest of the world. This will
> mean that fewer apps can run on !CONFIG_MMU boxes. It's
> same problem as "All the world's a VAX". It's better that
> the apps work; an author working on a Pentium 4 Xeon is
> likely to write code that relies on the fields and might
> not really understand what "no MMU" is all about.

Would the nommu bits I wrote be satisfactory for you?


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-11 22:36               ` Albert Cahalan
@ 2004-09-12  5:00                 ` William Lee Irwin III
  2004-09-14  6:44                 ` Roger Luethi
  1 sibling, 0 replies; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-12  5:00 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: Roger Luethi, Stephen Smalley, Andrew Morton OSDL, lkml,
	Albert Cahalan, Martin J. Bligh, Paul Jackson, James Morris,
	Chris Wright

On Thu, 2004-09-09 at 17:25, Roger Luethi wrote:
>> I forgot to mention that you can see the remnants of that approach in
>> <linux/nproc.h>: I used two bits of the field ID to define per-field
>> access restrictions (NPROC_PERM_USER, NPROC_PERM_ROOT).

On Sat, Sep 11, 2004 at 06:36:53PM -0400, Albert Cahalan wrote:
> Besides the low-security and high-security choices,
> I'd like to see a medium-security choice.
> low: everybody sees everything
> medium: everybody sees something; privileged user sees all
> high: must be privileged
> This might mean that asking for stuff like EIP and WCHAN
> causes you to see fewer processes.
> If partial info is returned for a process, I'd like to
> also get a bitmap of valid fields. Special "not valid"
> values are a pain to deal with.

That's an interesting observation. Perhaps the union of the mmu and
nommu fields should be nominally reported alongside a bitmap of the
useful fields?


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-11 22:25           ` Albert Cahalan
  2004-09-12  4:58             ` William Lee Irwin III
@ 2004-09-14  5:59             ` Roger Luethi
  2004-09-14  6:18               ` William Lee Irwin III
  1 sibling, 1 reply; 63+ messages in thread
From: Roger Luethi @ 2004-09-14  5:59 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: William Lee Irwin III, Andrew Morton OSDL,
	linux-kernel mailing list, Paul Jackson

On Sat, 11 Sep 2004 18:25:56 -0400, Albert Cahalan wrote:
> > One nitpick: As vmexe and vmlib are always 0 for !CONFIG_MMU, we should
> > ifdef them out of the list of offered fields for that configuration (and
> > maybe in nproc_ps_field as well).
> 
> No. First of all, I think they can be offered. Until proven
> otherwise, I'll assume that the !CONFIG_MMU case is buggy.

I agree with you that those specific fields should be offered for
!CONFIG_MMU. However, if for some reason they cannot carry a value
that fits the field description, they should not be offered at all. The
ambiguity of having 0 mean either "0" or "this field is not available"
is bad. Trying to read a specific field _can_ fail, and applications
had better handle that case (it's still trivial compared to having to
parse different /proc file layouts depending on the configuration).

> mean that fewer apps can run on !CONFIG_MMU boxes. It's
> same problem as "All the world's a VAX". It's better that
> the apps work; an author working on a Pentium 4 Xeon is
> likely to write code that relies on the fields and might
> not really understand what "no MMU" is all about.

The presumed wrong assumptions underlying broken tools of the future
are not a good base for designing a new interface. My interest is in
making it easy to write correct applications (or in fixing broken apps
that won't work, say, on !CONFIG_MMU systems).

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-14  5:59             ` Roger Luethi
@ 2004-09-14  6:18               ` William Lee Irwin III
  2004-09-14  6:23                 ` William Lee Irwin III
  0 siblings, 1 reply; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-14  6:18 UTC (permalink / raw)
  To: Roger Luethi
  Cc: Albert Cahalan, Andrew Morton OSDL, linux-kernel mailing list,
	Paul Jackson

On Sat, 11 Sep 2004 18:25:56 -0400, Albert Cahalan wrote:
>> No. First of all, I think they can be offered. Until proven
>> otherwise, I'll assume that the !CONFIG_MMU case is buggy.

On Tue, Sep 14, 2004 at 07:59:46AM +0200, Roger Luethi wrote:
> I agree with you that those specific fields should be offered for
> !CONFIG_MMU. However, if for some reason they cannot carry a value
> that fits the field description, they should not be offered at all. The
> ambiguity of having 0 mean either "0" or "this field is not available"
> is bad. Trying to read a specific field _can_ fail, and applications
> had better handle that case (it's still trivial compared to having to
> parse different /proc file layouts depending on the configuration).

Apart from doing something it's supposed to for !CONFIG_MMU and using
the internal kernel accounting I set up for the CONFIG_MMU=y case I'm
not very concerned about this. I have a vague notion there should
probably be some consistency with the /proc/ precedent but am not
particularly tied to it. We should probably ask Greg Ungerer (the
maintainer of the external MMU-less patches) about what he prefers
since it's likely we can't anticipate all of the !CONFIG_MMU concerns.


On Sat, 11 Sep 2004 18:25:56 -0400, Albert Cahalan wrote:
>> mean that fewer apps can run on !CONFIG_MMU boxes. It's
>> same problem as "All the world's a VAX". It's better that
>> the apps work; an author working on a Pentium 4 Xeon is
>> likely to write code that relies on the fields and might
>> not really understand what "no MMU" is all about.

On Tue, Sep 14, 2004 at 07:59:46AM +0200, Roger Luethi wrote:
> The presumed wrong assumptions underlying broken tools of the future
> are not a good base for designing a new interface. My interest is in
> making it easy to write correct applications (or in fixing broken apps
> that won't work, say, on !CONFIG_MMU systems).

I don't really know what the approach to app compatibility used by
userspace for !CONFIG_MMU is; I'll refer you to Greg Ungerer as my
knowledge of the CONFIG_MMU usage models and/or whatever userspace
is used in tandem with it outside the VM's internals is rather scant.


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-14  6:18               ` William Lee Irwin III
@ 2004-09-14  6:23                 ` William Lee Irwin III
  2004-09-14  7:47                   ` Greg Ungerer
  0 siblings, 1 reply; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-14  6:23 UTC (permalink / raw)
  To: Greg Ungerer
  Cc: Albert Cahalan, Andrew Morton OSDL, linux-kernel mailing list,
	Paul Jackson, Roger Luethi

Greg, could you comment on this since there are some people having
trouble figuring out what's going on with VM-related /proc/ fields for
!CONFIG_MMU. Please forgive the top-posting, it made more sense to
quote the text below in this instance.

On Tue, Sep 14, 2004 at 07:59:46AM +0200, Roger Luethi wrote:
>> I agree with you that those specific fields should be offered for
>> !CONFIG_MMU. However, if for some reason they cannot carry a value
>> that fits the field description, they should not be offered at all. The
>> ambiguity of having 0 mean either "0" or "this field is not available"
>> is bad. Trying to read a specific field _can_ fail, and applications
>> had better handle that case (it's still trivial compared to having to
>> parse different /proc file layouts depending on the configuration).

On Mon, Sep 13, 2004 at 11:18:00PM -0700, William Lee Irwin III wrote:
> Apart from doing something it's supposed to for !CONFIG_MMU and using
> the internal kernel accounting I set up for the CONFIG_MMU=y case I'm
> not very concerned about this. I have a vague notion there should
> probably be some consistency with the /proc/ precedent but am not
> particularly tied to it. We should probably ask Greg Ungerer (the
> maintainer of the external MMU-less patches) about what he prefers
> since it's likely we can't anticipate all of the !CONFIG_MMU concerns.

On Tue, Sep 14, 2004 at 07:59:46AM +0200, Roger Luethi wrote:
>> The presumed wrong assumptions underlying broken tools of the future
>> are not a good base for designing a new interface. My interest is in
>> making it easy to write correct applications (or in fixing broken apps
>> that won't work, say, on !CONFIG_MMU systems).

On Mon, Sep 13, 2004 at 11:18:00PM -0700, William Lee Irwin III wrote:
> I don't really know what the approach to app compatibility used by
> userspace for !CONFIG_MMU is; I'll refer you to Greg Ungerer as my
> knowledge of the CONFIG_MMU usage models and/or whatever userspace
> is used in tandem with it outside the VM's internals is rather scant.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-11 22:36               ` Albert Cahalan
  2004-09-12  5:00                 ` William Lee Irwin III
@ 2004-09-14  6:44                 ` Roger Luethi
  2004-09-14  7:10                   ` William Lee Irwin III
  1 sibling, 1 reply; 63+ messages in thread
From: Roger Luethi @ 2004-09-14  6:44 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: Stephen Smalley, William Lee Irwin III, Andrew Morton OSDL, lkml,
	Albert Cahalan, Martin J. Bligh, Paul Jackson, James Morris,
	Chris Wright

On Sat, 11 Sep 2004 18:36:53 -0400, Albert Cahalan wrote:
> > I forgot to mention that you can see the remnants of that approach in
> > <linux/nproc.h>: I used two bits of the field ID to define per-field
> > access restrictions (NPROC_PERM_USER, NPROC_PERM_ROOT).
> 
> Besides the low-security and high-security choices,
> I'd like to see a medium-security choice.
> 
> low: everybody sees everything
> medium: everybody sees something; privileged user sees all
> high: must be privileged
> 
> This might mean that asking for stuff like EIP and WCHAN
> causes you to see fewer processes.

I'm not sure I understand you correctly, but the combination of
NPROC_PERM_USER and NPROC_PERM_ROOT already seems to fit your
description:

- If the access control bits for a field are cleared, any process/user
  can get that field information for any process.

- If the access control bits are set to NPROC_PERM_USER, only root and
  the owner of a process can read the field for that process.

- For NPROC_PERM_ROOT, only root can ever read such a field.

I picked that design because it captures the essence of what /proc
does today.

> If partial info is returned for a process, I'd like to
> also get a bitmap of valid fields. Special "not valid"
> values are a pain to deal with.

If an app asks for a field it has no or partial permission for, the set
of processes returned is trimmed accordingly. Since an application will
expect this behavior based on the access control bits, no guessing is
involved here.

If an app asks for a non-existant field (not supported on this
architecture or obsolete), it will get an error back. No guessing
involved here, either. We could report the bad field ID back, but it's
easy for user-space to figure out and it's not in the fast path (for
user space).

The tricky case is if an app asks for an offered field without permission
problems, but the field is not available in that particular context. The
only instance of this that comes to mind are mm_struct related fields
and kernel threads. Neither returning an error nor skipping affected
processes seems a good solution. In this special case, the current
nproc code returns 0, but that's probably not optimal. Currently,
my preferred solution would be to return ~(0).

I'm not convinced yet that making message formats more complex (adding
bitmaps or lists of applicaple fields or something) for one special
case is a better idea.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-14  6:44                 ` Roger Luethi
@ 2004-09-14  7:10                   ` William Lee Irwin III
  2004-09-14  7:55                     ` Roger Luethi
  0 siblings, 1 reply; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-14  7:10 UTC (permalink / raw)
  To: Roger Luethi
  Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml,
	Albert Cahalan, Paul Jackson, James Morris, Chris Wright

On Sat, 11 Sep 2004 18:36:53 -0400, Albert Cahalan wrote:
>> This might mean that asking for stuff like EIP and WCHAN
>> causes you to see fewer processes.

On Tue, Sep 14, 2004 at 08:44:03AM +0200, Roger Luethi wrote:
> I'm not sure I understand you correctly, but the combination of
> NPROC_PERM_USER and NPROC_PERM_ROOT already seems to fit your
> description:
> - If the access control bits for a field are cleared, any process/user
>   can get that field information for any process.
> - If the access control bits are set to NPROC_PERM_USER, only root and
>   the owner of a process can read the field for that process.
> - For NPROC_PERM_ROOT, only root can ever read such a field.
> I picked that design because it captures the essence of what /proc
> does today.

The concern appears to be that the tools might interpret failed
permission checks as indications of process nonexistence. I don't
regard this as particularly pressing, as properly-written apps should
check the specific value of errno (in particular to retry when EAGAIN
is received in numerous contexts).


On Sat, 11 Sep 2004 18:36:53 -0400, Albert Cahalan wrote:
>> If partial info is returned for a process, I'd like to
>> also get a bitmap of valid fields. Special "not valid"
>> values are a pain to deal with.

On Tue, Sep 14, 2004 at 08:44:03AM +0200, Roger Luethi wrote:
> If an app asks for a field it has no or partial permission for, the set
> of processes returned is trimmed accordingly. Since an application will
> expect this behavior based on the access control bits, no guessing is
> involved here.
> If an app asks for a non-existant field (not supported on this
> architecture or obsolete), it will get an error back. No guessing
> involved here, either. We could report the bad field ID back, but it's
> easy for user-space to figure out and it's not in the fast path (for
> user space).
> The tricky case is if an app asks for an offered field without permission
> problems, but the field is not available in that particular context. The
> only instance of this that comes to mind are mm_struct related fields
> and kernel threads. Neither returning an error nor skipping affected
> processes seems a good solution. In this special case, the current
> nproc code returns 0, but that's probably not optimal. Currently,
> my preferred solution would be to return ~(0).
> I'm not convinced yet that making message formats more complex (adding
> bitmaps or lists of applicaple fields or something) for one special
> case is a better idea.

Distinguishing between EPERM, ENOSYS, ENOENT, etc. could probably be
done if the fields are measured in units such that the top bit is never
set for any feasible value, then a fully qualified error return could
simply be returned as (unsigned long)(-err). I suspect VSZ may be
problematic wrt. overflows even for 32-bit, not just for 31-bit.


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-14  6:23                 ` William Lee Irwin III
@ 2004-09-14  7:47                   ` Greg Ungerer
  2004-09-14  8:27                     ` Roger Luethi
  0 siblings, 1 reply; 63+ messages in thread
From: Greg Ungerer @ 2004-09-14  7:47 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Albert Cahalan, Andrew Morton OSDL, linux-kernel mailing list,
	Paul Jackson, Roger Luethi

Hi William, Roger,

William Lee Irwin III wrote:
> Greg, could you comment on this since there are some people having
> trouble figuring out what's going on with VM-related /proc/ fields for
> !CONFIG_MMU. Please forgive the top-posting, it made more sense to
> quote the text below in this instance.

Yeah, the !CONFIG_MMU code behind this is probably a little stale.
The thinking has mostly been to keep things as much the same as
possible, even if the fields didn't have a sensible meaning in
non-mmu space.


> On Tue, Sep 14, 2004 at 07:59:46AM +0200, Roger Luethi wrote:
> 
>>>I agree with you that those specific fields should be offered for
>>>!CONFIG_MMU. However, if for some reason they cannot carry a value
>>>that fits the field description, they should not be offered at all. The
>>>ambiguity of having 0 mean either "0" or "this field is not available"
>>>is bad. Trying to read a specific field _can_ fail, and applications
>>>had better handle that case (it's still trivial compared to having to
>>>parse different /proc file layouts depending on the configuration).

In at least one case this is true now, as you mention for the
VmXxx fields. But looking at these now I think we could actually
implement most of them in a sensible way for the no-mmu case.
Size, Exe, Lib, Stk, etc  all apply with their conventional
meanings.


> On Mon, Sep 13, 2004 at 11:18:00PM -0700, William Lee Irwin III wrote:
> 
>>Apart from doing something it's supposed to for !CONFIG_MMU and using
>>the internal kernel accounting I set up for the CONFIG_MMU=y case I'm
>>not very concerned about this. I have a vague notion there should
>>probably be some consistency with the /proc/ precedent but am not
>>particularly tied to it. We should probably ask Greg Ungerer (the
>>maintainer of the external MMU-less patches) about what he prefers
>>since it's likely we can't anticipate all of the !CONFIG_MMU concerns.
> 
> 
> On Tue, Sep 14, 2004 at 07:59:46AM +0200, Roger Luethi wrote:
> 
>>>The presumed wrong assumptions underlying broken tools of the future
>>>are not a good base for designing a new interface. My interest is in
>>>making it easy to write correct applications (or in fixing broken apps
>>>that won't work, say, on !CONFIG_MMU systems).

Reality for non-mmu targets is that most apps just won't be fixed
for them, so we try real hard to make the world look like it is
just like any other linux architecture.

I think !CONFIG_MMU case can be cleaned up to make it almost identical
to the CONFIG_MMU case, and reporting sensible values for just about
all fields.

Regards
Greg


------------------------------------------------------------------------
Greg Ungerer  --  Chief Software Dude       EMAIL:     gerg@snapgear.com
SnapGear -- a CyberGuard Company            PHONE:       +61 7 3435 2888
825 Stanley St,                             FAX:         +61 7 3891 3630
Woolloongabba, QLD, 4102, Australia         WEB: http://www.SnapGear.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-14  7:10                   ` William Lee Irwin III
@ 2004-09-14  7:55                     ` Roger Luethi
  2004-09-14  8:01                       ` William Lee Irwin III
  0 siblings, 1 reply; 63+ messages in thread
From: Roger Luethi @ 2004-09-14  7:55 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml,
	Albert Cahalan, Paul Jackson, James Morris, Chris Wright

On Tue, 14 Sep 2004 00:10:58 -0700, William Lee Irwin III wrote:
> > - If the access control bits for a field are cleared, any process/user
> >   can get that field information for any process.
> > - If the access control bits are set to NPROC_PERM_USER, only root and
> >   the owner of a process can read the field for that process.
> > - For NPROC_PERM_ROOT, only root can ever read such a field.
> > I picked that design because it captures the essence of what /proc
> > does today.
> 
> The concern appears to be that the tools might interpret failed
> permission checks as indications of process nonexistence. I don't
> regard this as particularly pressing, as properly-written apps should
> check the specific value of errno (in particular to retry when EAGAIN
> is received in numerous contexts).

I would expect a tool to refrain from asking for fields with restricted
access if it needs a complete overview over existing processes. It can
always ask for restricted fields in a second request (the vast majority
of fields are world-readable anyway).

> > processes seems a good solution. In this special case, the current
> > nproc code returns 0, but that's probably not optimal. Currently,
> > my preferred solution would be to return ~(0).
> > I'm not convinced yet that making message formats more complex (adding
> > bitmaps or lists of applicaple fields or something) for one special
> > case is a better idea.
> 
> Distinguishing between EPERM, ENOSYS, ENOENT, etc. could probably be
> done if the fields are measured in units such that the top bit is never
> set for any feasible value, then a fully qualified error return could
> simply be returned as (unsigned long)(-err). I suspect VSZ may be
> problematic wrt. overflows even for 32-bit, not just for 31-bit.

Yeah, that makes me nervous. There are just too many ways this can go
wrong or be misinterpreted in user space. Currently, nproc does not
indicate the type of error at all, because a properly written user-space
app will either not hit an error or be able to figure out what the
problem was based on the available information. I suppose if we wanted
to change that (which doesn't sound unreasonable), the proper way would
be to return error flags with an error message (delivered via netlink).

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-14  7:55                     ` Roger Luethi
@ 2004-09-14  8:01                       ` William Lee Irwin III
  2004-09-14  9:27                         ` Roger Luethi
  0 siblings, 1 reply; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-14  8:01 UTC (permalink / raw)
  To: Roger Luethi
  Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml,
	Albert Cahalan, Paul Jackson, James Morris, Chris Wright

On Tue, 14 Sep 2004 00:10:58 -0700, William Lee Irwin III wrote:
>> The concern appears to be that the tools might interpret failed
>> permission checks as indications of process nonexistence. I don't
>> regard this as particularly pressing, as properly-written apps should
>> check the specific value of errno (in particular to retry when EAGAIN
>> is received in numerous contexts).

On Tue, Sep 14, 2004 at 09:55:08AM +0200, Roger Luethi wrote:
> I would expect a tool to refrain from asking for fields with restricted
> access if it needs a complete overview over existing processes. It can
> always ask for restricted fields in a second request (the vast majority
> of fields are world-readable anyway).

That expectation can't be entirely relied upon, as the restrictions may
not be predictable.


On Tue, 14 Sep 2004 00:10:58 -0700, William Lee Irwin III wrote:
>> Distinguishing between EPERM, ENOSYS, ENOENT, etc. could probably be
>> done if the fields are measured in units such that the top bit is never
>> set for any feasible value, then a fully qualified error return could
>> simply be returned as (unsigned long)(-err). I suspect VSZ may be
>> problematic wrt. overflows even for 32-bit, not just for 31-bit.

On Tue, Sep 14, 2004 at 09:55:08AM +0200, Roger Luethi wrote:
> Yeah, that makes me nervous. There are just too many ways this can go
> wrong or be misinterpreted in user space. Currently, nproc does not
> indicate the type of error at all, because a properly written user-space
> app will either not hit an error or be able to figure out what the
> problem was based on the available information. I suppose if we wanted
> to change that (which doesn't sound unreasonable), the proper way would
> be to return error flags with an error message (delivered via netlink).

This kind of error reporting is better still, as the fields then won't
be polluted with invalid data under any circumstance (assuming the code
can report subsets of the fields or some such, which I presume to be
the case given that avoiding reporting potentially computationally
expensive fields was one of the original motivators of the patch).


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-14  7:47                   ` Greg Ungerer
@ 2004-09-14  8:27                     ` Roger Luethi
  0 siblings, 0 replies; 63+ messages in thread
From: Roger Luethi @ 2004-09-14  8:27 UTC (permalink / raw)
  To: Greg Ungerer
  Cc: William Lee Irwin III, Albert Cahalan, Andrew Morton OSDL,
	linux-kernel mailing list, Paul Jackson

On Tue, 14 Sep 2004 17:47:52 +1000, Greg Ungerer wrote:
> Yeah, the !CONFIG_MMU code behind this is probably a little stale.
> The thinking has mostly been to keep things as much the same as
> possible, even if the fields didn't have a sensible meaning in
> non-mmu space.

With nproc, tool authors won't need to write any special-casing code
for non-MMU. All they need to handle is the possibility that a field
they ask for does not exist. (Of course it doesn't hurt if they know
how to deal with non-MMU specific fields if any exist)

> >On Tue, Sep 14, 2004 at 07:59:46AM +0200, Roger Luethi wrote:
> >
> >>>I agree with you that those specific fields should be offered for
> >>>!CONFIG_MMU. However, if for some reason they cannot carry a value
> >>>that fits the field description, they should not be offered at all. The
> >>>ambiguity of having 0 mean either "0" or "this field is not available"
> >>>is bad. Trying to read a specific field _can_ fail, and applications
> >>>had better handle that case (it's still trivial compared to having to
> >>>parse different /proc file layouts depending on the configuration).
> 
> In at least one case this is true now, as you mention for the
> VmXxx fields. But looking at these now I think we could actually
> implement most of them in a sensible way for the no-mmu case.
> Size, Exe, Lib, Stk, etc  all apply with their conventional
> meanings.

It seems we all agree on that.

What I'd object to is offering fields like Size, Exe, etc. and filling
them with values that are wrong (e.g. returning always 0 for Exe). In
such a case, the field is simply not offered and asking for it an
error.

That's not a problem we can solve for tool authors: Allowing them to
distinguish between N/A and 0 is a property of the interface, and using
that interface means knowing how to deal with that distinction.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-14  8:01                       ` William Lee Irwin III
@ 2004-09-14  9:27                         ` Roger Luethi
  2004-09-14 15:37                           ` William Lee Irwin III
  0 siblings, 1 reply; 63+ messages in thread
From: Roger Luethi @ 2004-09-14  9:27 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml,
	Albert Cahalan, Paul Jackson, James Morris, Chris Wright

On Tue, 14 Sep 2004 01:01:32 -0700, William Lee Irwin III wrote:
> On Tue, 14 Sep 2004 00:10:58 -0700, William Lee Irwin III wrote:
> >> The concern appears to be that the tools might interpret failed
> >> permission checks as indications of process nonexistence. I don't
> >> regard this as particularly pressing, as properly-written apps should
> >> check the specific value of errno (in particular to retry when EAGAIN
> >> is received in numerous contexts).
> 
> On Tue, Sep 14, 2004 at 09:55:08AM +0200, Roger Luethi wrote:
> > I would expect a tool to refrain from asking for fields with restricted
> > access if it needs a complete overview over existing processes. It can
> > always ask for restricted fields in a second request (the vast majority
> > of fields are world-readable anyway).
> 
> That expectation can't be entirely relied upon, as the restrictions may
> not be predictable.

They should be. For the simple design I described the access restrictions
are part of the field ID, so a tool can deduce the exact type of access
restrictions even if it doesn't know the field. There's plenty of space
left for additional access control flags in the field ID.

If it gets much more complex, the application (let alone the kernel)
has to have some knowledge of the security model anyway, so we could have
simple operations that allow a tool to discover how access restrictions
apply to the supported fields.

> > problem was based on the available information. I suppose if we wanted
> > to change that (which doesn't sound unreasonable), the proper way would
> > be to return error flags with an error message (delivered via netlink).
> 
> This kind of error reporting is better still, as the fields then won't
> be polluted with invalid data under any circumstance (assuming the code
> can report subsets of the fields or some such, which I presume to be
> the case given that avoiding reporting potentially computationally
> expensive fields was one of the original motivators of the patch).

It cannot easily, and I don't think it wants to. The reason it's hard to
just reply with a subset is that the kernel does not send any description
of the reply content other than the serial number of the request --
it's up to the tool to know what it asked for. So if you remove a field,
you'd have to let user-space know which field you removed. Sending only
the allowed subset makes handling on both sides more complicated --
the kernel needs to build different kinds of messages in answer to one
request, and user-space tool need to be able to parse that.

The way the interface works now, though, is that a tool can rely on
the content of the reply to match the request. This makes the common
case both easy to write and fast.

Let me break it down once again:

- If a tool asks for a field the kernel doesn't know about, that's a
  fatal error. An error message is returned, nothing else (this can be
  discovered before any other reply is delivered).

- If a tool specifically asks for a process which doesn't exist,
  nothing is returned. We could return an error indicating that. Might
  be a good idea.

- If a tool asks for a field it doesn't have permission to read, it usally
  does have permission to read that field for some tasks (e.g. same owner),
  but not for others. So for some replies to one request, all requested
  fields will contain meaningful values.  What about the replies that
  describe the tasks where the tool must not read at least some of the
  requested values? I chose to simply skip those tasks.

  We could also send an error message ("some tasks omitted") or send a
  complete reply with the restricted fields zeroed and a special flag set
  ("some fields in this reply zeroed due to access control").

I'm really afraid of over-engineering something here, though. The fields
requested by tools like ps and top by default are all world readable
in /proc. I showed that solutions fit right in should we ever need
access control for real-world applications. For now, I'd rather not
extend the interface significantly unless the current semantics are
clearly insufficient.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-14  9:27                         ` Roger Luethi
@ 2004-09-14 15:37                           ` William Lee Irwin III
  2004-09-14 16:01                             ` Roger Luethi
  0 siblings, 1 reply; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-14 15:37 UTC (permalink / raw)
  To: Roger Luethi
  Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml,
	Albert Cahalan, Paul Jackson, James Morris, Chris Wright

On Tue, 14 Sep 2004 01:01:32 -0700, William Lee Irwin III wrote:
>> That expectation can't be entirely relied upon, as the restrictions may
>> not be predictable.

On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> They should be. For the simple design I described the access restrictions
> are part of the field ID, so a tool can deduce the exact type of access
> restrictions even if it doesn't know the field. There's plenty of space
> left for additional access control flags in the field ID.

No, in general races of the form "permissions were altered after I
checked them" can happen.


On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> If it gets much more complex, the application (let alone the kernel)
> has to have some knowledge of the security model anyway, so we could have
> simple operations that allow a tool to discover how access restrictions
> apply to the supported fields.

Checking that system calls succeeded is a minimum requirement at all
times. Misinterpreting error returns is the app's fault.


On Tue, 14 Sep 2004 01:01:32 -0700, William Lee Irwin III wrote:
>> This kind of error reporting is better still, as the fields then won't
>> be polluted with invalid data under any circumstance (assuming the code
>> can report subsets of the fields or some such, which I presume to be
>> the case given that avoiding reporting potentially computationally
>> expensive fields was one of the original motivators of the patch).

On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> It cannot easily, and I don't think it wants to. The reason it's hard to
> just reply with a subset is that the kernel does not send any description
> of the reply content other than the serial number of the request --
> it's up to the tool to know what it asked for. So if you remove a field,
> you'd have to let user-space know which field you removed. Sending only
> the allowed subset makes handling on both sides more complicated --
> the kernel needs to build different kinds of messages in answer to one
> request, and user-space tool need to be able to parse that.

Irritating. That must mean you can't ask for specific fields.


On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> The way the interface works now, though, is that a tool can rely on
> the content of the reply to match the request. This makes the common
> case both easy to write and fast.
> Let me break it down once again:
> - If a tool asks for a field the kernel doesn't know about, that's a
>   fatal error. An error message is returned, nothing else (this can be
>   discovered before any other reply is delivered).

If you can't ask for specific fields you're dead anyway.


On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> - If a tool specifically asks for a process which doesn't exist,
>   nothing is returned. We could return an error indicating that. Might
>   be a good idea.

ESRCH and ENOENT sound good.


On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> - If a tool asks for a field it doesn't have permission to read, it usally
>   does have permission to read that field for some tasks (e.g. same owner),
>   but not for others. So for some replies to one request, all requested
>   fields will contain meaningful values.  What about the replies that
>   describe the tasks where the tool must not read at least some of the
>   requested values? I chose to simply skip those tasks.

This is the bit about being dead already if you can't request subsets
of fields and/or one field at a time.


On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
>   We could also send an error message ("some tasks omitted") or send a
>   complete reply with the restricted fields zeroed and a special flag set
>   ("some fields in this reply zeroed due to access control").
> I'm really afraid of over-engineering something here, though. The fields
> requested by tools like ps and top by default are all world readable
> in /proc. I showed that solutions fit right in should we ever need
> access control for real-world applications. For now, I'd rather not
> extend the interface significantly unless the current semantics are
> clearly insufficient.

Well, "return this set of fields" means there's only one type of
request necessary, and userspace merely iterates through the subsets
obtained by striking out fields to which accesses caused errors until
either the set is empty or the call succeeds. One field at a time at
all times also means there's only one type of request necessary. So I
don't see overengineering happening here, merely that "either all
succeed or all fail" is a semantic that creates hardships for userspace;
both the alternatives are simple.


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-14 15:37                           ` William Lee Irwin III
@ 2004-09-14 16:01                             ` Roger Luethi
  2004-09-14 16:37                               ` William Lee Irwin III
  0 siblings, 1 reply; 63+ messages in thread
From: Roger Luethi @ 2004-09-14 16:01 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml,
	Albert Cahalan, Paul Jackson, James Morris, Chris Wright

On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote:
> On Tue, 14 Sep 2004 01:01:32 -0700, William Lee Irwin III wrote:
> >> That expectation can't be entirely relied upon, as the restrictions may
> >> not be predictable.
> 
> On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> > They should be. For the simple design I described the access restrictions
> > are part of the field ID, so a tool can deduce the exact type of access
> > restrictions even if it doesn't know the field. There's plenty of space
> > left for additional access control flags in the field ID.
> 
> No, in general races of the form "permissions were altered after I
> checked them" can happen.

Can you make an example? Some scenario where this would be important?

> On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> > If it gets much more complex, the application (let alone the kernel)
> > has to have some knowledge of the security model anyway, so we could have
> > simple operations that allow a tool to discover how access restrictions
> > apply to the supported fields.
> 
> Checking that system calls succeeded is a minimum requirement at all
> times. Misinterpreting error returns is the app's fault.

It's async. You can't rely on return values. They'd have to be in
netlink messages.

> On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> > It cannot easily, and I don't think it wants to. The reason it's hard to
> > just reply with a subset is that the kernel does not send any description
> > of the reply content other than the serial number of the request --
> > it's up to the tool to know what it asked for. So if you remove a field,
> > you'd have to let user-space know which field you removed. Sending only
> > the allowed subset makes handling on both sides more complicated --
> > the kernel needs to build different kinds of messages in answer to one
> > request, and user-space tool need to be able to parse that.
> 
> Irritating. That must mean you can't ask for specific fields.

How so? For process fields, the request block is one u32 indicating the
number of field IDs to follow, then a bunch of u32 containing field IDs.
Any subset of field IDs, in any order of the tool's choosing.

The kernel replies with one message per process, each message containing
all the fields the tool requested, in the same order.

> On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> >   We could also send an error message ("some tasks omitted") or send a
> >   complete reply with the restricted fields zeroed and a special flag set
> >   ("some fields in this reply zeroed due to access control").
> > I'm really afraid of over-engineering something here, though. The fields
> > requested by tools like ps and top by default are all world readable
> > in /proc. I showed that solutions fit right in should we ever need
> > access control for real-world applications. For now, I'd rather not
> > extend the interface significantly unless the current semantics are
> > clearly insufficient.
> 
> Well, "return this set of fields" means there's only one type of
> request necessary, and userspace merely iterates through the subsets
> obtained by striking out fields to which accesses caused errors until
> either the set is empty or the call succeeds. One field at a time at
> all times also means there's only one type of request necessary. So I

One field at a time at all times is unnecessarily slow.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-14 16:01                             ` Roger Luethi
@ 2004-09-14 16:37                               ` William Lee Irwin III
  2004-09-14 17:15                                 ` Roger Luethi
  2004-09-14 18:37                                 ` Chris Wright
  0 siblings, 2 replies; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-14 16:37 UTC (permalink / raw)
  To: Roger Luethi
  Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml,
	Albert Cahalan, Paul Jackson, James Morris, Chris Wright

On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote:
>> No, in general races of the form "permissions were altered after I
>> checked them" can happen.

On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote:
> Can you make an example? Some scenario where this would be important?

Not particularly. It largely means poorly-coded apps may report gibberish.


On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote:
>> Checking that system calls succeeded is a minimum requirement at all
>> times. Misinterpreting error returns is the app's fault.

On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote:
> It's async. You can't rely on return values. They'd have to be in
> netlink messages.

That's fine. Do these error messages specify which field access(es)
caused the error?


On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote:
>> Irritating. That must mean you can't ask for specific fields.

On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote:
> How so? For process fields, the request block is one u32 indicating the
> number of field IDs to follow, then a bunch of u32 containing field IDs.
> Any subset of field IDs, in any order of the tool's choosing.
> The kernel replies with one message per process, each message containing
> all the fields the tool requested, in the same order.

Then assuming the error messages indicate which field access(es) caused
the error(s), you're already done; userspace must merely retry the
request with the offending fields cast out. Otherwise, you're still
done: userspace can merely retry the field accesses one at a time
(though it's nicer to say which ones caused the errors).


On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote:
>> Well, "return this set of fields" means there's only one type of
>> request necessary, and userspace merely iterates through the subsets
>> obtained by striking out fields to which accesses caused errors until
>> either the set is empty or the call succeeds. One field at a time at
>> all times also means there's only one type of request necessary. So I

On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote:
> One field at a time at all times is unnecessarily slow.

Yes, that was the "slower and stupider than thou" option. You've
already vectorized field access requests, of which I heartily approve.


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-14 16:37                               ` William Lee Irwin III
@ 2004-09-14 17:15                                 ` Roger Luethi
  2004-09-14 17:43                                   ` William Lee Irwin III
  2004-09-14 18:37                                 ` Chris Wright
  1 sibling, 1 reply; 63+ messages in thread
From: Roger Luethi @ 2004-09-14 17:15 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml,
	Albert Cahalan, Paul Jackson, James Morris, Chris Wright

On Tue, 14 Sep 2004 09:37:12 -0700, William Lee Irwin III wrote:
> On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote:
> >> No, in general races of the form "permissions were altered after I
> >> checked them" can happen.
> 
> On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote:
> > Can you make an example? Some scenario where this would be important?
> 
> Not particularly. It largely means poorly-coded apps may report gibberish.

If we are still talking about the same thing here, gibberish is a rather
strong word. In the design I proposed access control affects the subset
of tasks returned as a result -- the tool would still display meaningful
information for the tasks it got replies for.

Anyway, if the access restrictions are hard-coded into the field ID,
then it's only the credentials that can change, and I can't see a race
there at the moment.

> On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote:
> >> Checking that system calls succeeded is a minimum requirement at all
> >> times. Misinterpreting error returns is the app's fault.
> 
> On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote:
> > It's async. You can't rely on return values. They'd have to be in
> > netlink messages.
> 
> That's fine. Do these error messages specify which field access(es)
> caused the error?

They don't, because the access control I had in my dev tree silently
skipped tasks containing fields the process had no permission to read.
IOW, access control works as an implicit task selector. And security
wise that's clean because the kernel does not reveal any information
about other processes to the querying task (not even evidence of their
existence).

> Then assuming the error messages indicate which field access(es) caused
> the error(s), you're already done; userspace must merely retry the
> request with the offending fields cast out. Otherwise, you're still
> done: userspace can merely retry the field accesses one at a time
> (though it's nicer to say which ones caused the errors).

Agreed on every point.

The question I am pondering is: Does nproc need access control right now?
It's more work in kernel and user space and adds new opportunities to
introduce bugs. The merits seem rather dubious right now, considering
that all the fields used by current process info tools (files
/proc/pid{cmdline, stat, statm, status, wchan}) are world readable.
So my preference is to wait with access control until we know where
and how it is necessary.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-14 17:15                                 ` Roger Luethi
@ 2004-09-14 17:43                                   ` William Lee Irwin III
  2004-09-14 18:45                                     ` Roger Luethi
  0 siblings, 1 reply; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-14 17:43 UTC (permalink / raw)
  To: Roger Luethi
  Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml,
	Albert Cahalan, Paul Jackson, James Morris, Chris Wright

On Tue, 14 Sep 2004 09:37:12 -0700, William Lee Irwin III wrote:
>> Not particularly. It largely means poorly-coded apps may report gibberish.

On Tue, Sep 14, 2004 at 07:15:25PM +0200, Roger Luethi wrote:
> If we are still talking about the same thing here, gibberish is a rather
> strong word. In the design I proposed access control affects the subset
> of tasks returned as a result -- the tool would still display meaningful
> information for the tasks it got replies for.

That sounds bizarre. I'd expect some kind of reply, even if merely an
error. I suppose "no reply" could be interpreted as "ESRCH", though
this means distinguishing between "some field caused an error" and
"the thing is dead" means the app has to fall back to requesting fields
one at a time.


On Tue, Sep 14, 2004 at 07:15:25PM +0200, Roger Luethi wrote:
> Anyway, if the access restrictions are hard-coded into the field ID,
> then it's only the credentials that can change, and I can't see a race
> there at the moment.

The race is in the app, not the kernel, so there's nothing to fix in
the kernel apart from distinctions between ESRCH and EPERM in error
reporting (otherwise the app is helpless to resolve the ambiguity).


On Tue, 14 Sep 2004 09:37:12 -0700, William Lee Irwin III wrote:
>> That's fine. Do these error messages specify which field access(es)
>> caused the error?

On Tue, Sep 14, 2004 at 07:15:25PM +0200, Roger Luethi wrote:
> They don't, because the access control I had in my dev tree silently
> skipped tasks containing fields the process had no permission to read.
> IOW, access control works as an implicit task selector. And security
> wise that's clean because the kernel does not reveal any information
> about other processes to the querying task (not even evidence of their
> existence).

If all errors are handled with "no reply", userspace loses some
efficiency, as it's forced to retry field accesses one at a time and
wait for timeouts on each of them for a dead/inaccessible task.


On Tue, 14 Sep 2004 09:37:12 -0700, William Lee Irwin III wrote:
>> Then assuming the error messages indicate which field access(es) caused
>> the error(s), you're already done; userspace must merely retry the
>> request with the offending fields cast out. Otherwise, you're still
>> done: userspace can merely retry the field accesses one at a time
>> (though it's nicer to say which ones caused the errors).

On Tue, Sep 14, 2004 at 07:15:25PM +0200, Roger Luethi wrote:
> Agreed on every point.
> The question I am pondering is: Does nproc need access control right now?
> It's more work in kernel and user space and adds new opportunities to
> introduce bugs. The merits seem rather dubious right now, considering
> that all the fields used by current process info tools (files
> /proc/pid{cmdline, stat, statm, status, wchan}) are world readable.
> So my preference is to wait with access control until we know where
> and how it is necessary.

This I can't answer.


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-14 16:37                               ` William Lee Irwin III
  2004-09-14 17:15                                 ` Roger Luethi
@ 2004-09-14 18:37                                 ` Chris Wright
  2004-09-14 18:55                                   ` Roger Luethi
  1 sibling, 1 reply; 63+ messages in thread
From: Chris Wright @ 2004-09-14 18:37 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Roger Luethi, Albert Cahalan, Stephen Smalley,
	Andrew Morton OSDL, lkml, Paul Jackson, James Morris,
	Chris Wright

* William Lee Irwin III (wli@holomorphy.com) wrote:
> On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote:
> >> No, in general races of the form "permissions were altered after I
> >> checked them" can happen.
> 
> On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote:
> > Can you make an example? Some scenario where this would be important?
> 
> Not particularly. It largely means poorly-coded apps may report gibberish.

Canonical example is access(2) followed by open(2), not really relevant
in this case.  However, exec setuid root app...when do you check, and
when to you fill in data to send back to user?  For /proc, this type of
check happens often (see things like may_ptrace_attach and
task_dumpable in fs/proc/base.c).

thanks,
-chris
-- 
Linux Security Modules     http://lsm.immunix.org     http://lsm.bkbits.net

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-14 17:43                                   ` William Lee Irwin III
@ 2004-09-14 18:45                                     ` Roger Luethi
  2004-09-14 19:07                                       ` William Lee Irwin III
  0 siblings, 1 reply; 63+ messages in thread
From: Roger Luethi @ 2004-09-14 18:45 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml,
	Albert Cahalan, Paul Jackson, James Morris, Chris Wright

On Tue, 14 Sep 2004 10:43:25 -0700, William Lee Irwin III wrote:
> On Tue, 14 Sep 2004 09:37:12 -0700, William Lee Irwin III wrote:
> >> Not particularly. It largely means poorly-coded apps may report gibberish.
> 
> On Tue, Sep 14, 2004 at 07:15:25PM +0200, Roger Luethi wrote:
> > If we are still talking about the same thing here, gibberish is a rather
> > strong word. In the design I proposed access control affects the subset
> > of tasks returned as a result -- the tool would still display meaningful
> > information for the tasks it got replies for.
> 
> That sounds bizarre. I'd expect some kind of reply, even if merely an
> error. I suppose "no reply" could be interpreted as "ESRCH", though
> this means distinguishing between "some field caused an error" and
> "the thing is dead" means the app has to fall back to requesting fields
> one at a time.

I suppose you are thinking of a request that lists a number of PIDs along
with a number of field IDs. In that case yes, I agree that it makes sense
to provide some explicit feedback to the tool once we add access control
(before that, there is no ambiguity: a missing answer means ESRCH).

The most common request, though, won't provide a list of pids, it will
only provide a list of field IDs and select all processes in the system
(NPROC_SELECT_ALL). There is no ambiguity here, either: The tool didn't
ask for any specific process to begin with, ESRCH doesn't make sense
here. And for a system that looks anything like /proc does today,
fields that are capable of triggering EPERM are few and far between,
certainly not something you are hitting unexpectedly in the fast path
of a process monitoring tool.

Thanks, by the way, for all the feedback that helped me realize that
I have so far failed to explain the design well enough. I will try to
work on that.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-14 18:37                                 ` Chris Wright
@ 2004-09-14 18:55                                   ` Roger Luethi
  2004-09-14 19:05                                     ` Chris Wright
  0 siblings, 1 reply; 63+ messages in thread
From: Roger Luethi @ 2004-09-14 18:55 UTC (permalink / raw)
  To: Chris Wright
  Cc: William Lee Irwin III, Albert Cahalan, Stephen Smalley,
	Andrew Morton OSDL, lkml, Paul Jackson, James Morris

On Tue, 14 Sep 2004 11:37:36 -0700, Chris Wright wrote:
> * William Lee Irwin III (wli@holomorphy.com) wrote:
> > On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote:
> > >> No, in general races of the form "permissions were altered after I
> > >> checked them" can happen.
> > 
> > On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote:
> > > Can you make an example? Some scenario where this would be important?
> > 
> > Not particularly. It largely means poorly-coded apps may report gibberish.
> 
> Canonical example is access(2) followed by open(2), not really relevant
> in this case.  However, exec setuid root app...when do you check, and
> when to you fill in data to send back to user?  For /proc, this type of
> check happens often (see things like may_ptrace_attach and
> task_dumpable in fs/proc/base.c).

For nproc, the procedure looks like this: A tool send(2)s a request,
credentials are attached to skb. Based on said credentials, the kernel
is free to provide (netlink_unicast to originating socket) or withhold
information. In this regard, nproc works like other netlink interfaces.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-14 18:55                                   ` Roger Luethi
@ 2004-09-14 19:05                                     ` Chris Wright
  2004-09-14 21:12                                       ` Roger Luethi
  0 siblings, 1 reply; 63+ messages in thread
From: Chris Wright @ 2004-09-14 19:05 UTC (permalink / raw)
  To: Chris Wright, William Lee Irwin III, Albert Cahalan,
	Stephen Smalley, Andrew Morton OSDL, lkml, Paul Jackson,
	James Morris

* Roger Luethi (rl@hellgate.ch) wrote:
> On Tue, 14 Sep 2004 11:37:36 -0700, Chris Wright wrote:
> > Canonical example is access(2) followed by open(2), not really relevant
> > in this case.  However, exec setuid root app...when do you check, and
> > when to you fill in data to send back to user?  For /proc, this type of
> > check happens often (see things like may_ptrace_attach and
> > task_dumpable in fs/proc/base.c).
> 
> For nproc, the procedure looks like this: A tool send(2)s a request,
> credentials are attached to skb. Based on said credentials, the kernel
> is free to provide (netlink_unicast to originating socket) or withhold
> information. In this regard, nproc works like other netlink interfaces.

Understood.  Question is, if the request is for data that's associated
with a task that is in the middle of an execve(setuid_root_app), does
the credential-check/skb-fill for response happen atomically w.r.t. said
execve?  IOW, is it possible to pass credential check, then fill data
that's become sensitive since the check happened?

thanks,
-chris
-- 
Linux Security Modules     http://lsm.immunix.org     http://lsm.bkbits.net

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-14 18:45                                     ` Roger Luethi
@ 2004-09-14 19:07                                       ` William Lee Irwin III
  2004-09-14 19:31                                         ` Roger Luethi
  2004-09-15 11:44                                         ` Roger Luethi
  0 siblings, 2 replies; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-14 19:07 UTC (permalink / raw)
  To: Roger Luethi
  Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml,
	Albert Cahalan, Paul Jackson, James Morris, Chris Wright

On Tue, Sep 14, 2004 at 08:45:18PM +0200, Roger Luethi wrote:
> I suppose you are thinking of a request that lists a number of PIDs along
> with a number of field IDs. In that case yes, I agree that it makes sense
> to provide some explicit feedback to the tool once we add access control
> (before that, there is no ambiguity: a missing answer means ESRCH).
> The most common request, though, won't provide a list of pids, it will
> only provide a list of field IDs and select all processes in the system
> (NPROC_SELECT_ALL). There is no ambiguity here, either: The tool didn't
> ask for any specific process to begin with, ESRCH doesn't make sense
> here. And for a system that looks anything like /proc does today,
> fields that are capable of triggering EPERM are few and far between,
> certainly not something you are hitting unexpectedly in the fast path
> of a process monitoring tool.

Okay, so what kinds of errors are returned in this case, if any, or
(worst case) are the offending tasks completely silently dropped?


On Tue, Sep 14, 2004 at 08:45:18PM +0200, Roger Luethi wrote:
> Thanks, by the way, for all the feedback that helped me realize that
> I have so far failed to explain the design well enough. I will try to
> work on that.

Thanks; while I could in principle expend more effort to understand the
netlink code, it's likely swifter to be given such commentary.


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-14 19:07                                       ` William Lee Irwin III
@ 2004-09-14 19:31                                         ` Roger Luethi
  2004-09-14 19:36                                           ` William Lee Irwin III
  2004-09-15 11:44                                         ` Roger Luethi
  1 sibling, 1 reply; 63+ messages in thread
From: Roger Luethi @ 2004-09-14 19:31 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml,
	Albert Cahalan, Paul Jackson, James Morris, Chris Wright

On Tue, 14 Sep 2004 12:07:47 -0700, William Lee Irwin III wrote:
> On Tue, Sep 14, 2004 at 08:45:18PM +0200, Roger Luethi wrote:
> > I suppose you are thinking of a request that lists a number of PIDs along
> > with a number of field IDs. In that case yes, I agree that it makes sense
> > to provide some explicit feedback to the tool once we add access control
> > (before that, there is no ambiguity: a missing answer means ESRCH).
> > The most common request, though, won't provide a list of pids, it will
> > only provide a list of field IDs and select all processes in the system
> > (NPROC_SELECT_ALL). There is no ambiguity here, either: The tool didn't
> > ask for any specific process to begin with, ESRCH doesn't make sense
> > here. And for a system that looks anything like /proc does today,
> > fields that are capable of triggering EPERM are few and far between,
> > certainly not something you are hitting unexpectedly in the fast path
> > of a process monitoring tool.
> 
> Okay, so what kinds of errors are returned in this case, if any, or
> (worst case) are the offending tasks completely silently dropped?

In published code: No access control whatsoever. In dev tree: Silently
dropped. Possible: Any kind of error and additional information that
makes sense (we have netlink messages as a transport, after all).

That said, I don't think dropping tasks silently is a "worst case"
in this scenario. Whatever your error report is going to be, it will
boil down to saying "some tasks that may or may not live by the time
you read this have been skipped because some fields that you knew had
access restrictions prevented providing the information in those cases,
and I must be cautious about not revealing any sensitive information
to you so sorry I can't be more helpful". What's a tool going to do
with that? If it cares to get a complete snapshot, it can simply send
two requests: One with and one without restricted fields.

So the tool would, say, request PID/VmSize in the first message and
environ in the second message. Since only the owner can read the
environment, the second request would yield answers only for a subset
of the total process table.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-14 19:31                                         ` Roger Luethi
@ 2004-09-14 19:36                                           ` William Lee Irwin III
  2004-09-14 19:50                                             ` Roger Luethi
  0 siblings, 1 reply; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-14 19:36 UTC (permalink / raw)
  To: Roger Luethi
  Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml,
	Albert Cahalan, Paul Jackson, James Morris, Chris Wright

On Tue, 14 Sep 2004 12:07:47 -0700, William Lee Irwin III wrote:
>> Okay, so what kinds of errors are returned in this case, if any, or
>> (worst case) are the offending tasks completely silently dropped?

On Tue, Sep 14, 2004 at 09:31:39PM +0200, Roger Luethi wrote:
> In published code: No access control whatsoever. In dev tree: Silently
> dropped. Possible: Any kind of error and additional information that
> makes sense (we have netlink messages as a transport, after all).

I'm not sure what to make of this.


On Tue, Sep 14, 2004 at 09:31:39PM +0200, Roger Luethi wrote:
> That said, I don't think dropping tasks silently is a "worst case"
> in this scenario. Whatever your error report is going to be, it will
> boil down to saying "some tasks that may or may not live by the time
> you read this have been skipped because some fields that you knew had
> access restrictions prevented providing the information in those cases,
> and I must be cautious about not revealing any sensitive information
> to you so sorry I can't be more helpful". What's a tool going to do
> with that? If it cares to get a complete snapshot, it can simply send
> two requests: One with and one without restricted fields.
> So the tool would, say, request PID/VmSize in the first message and
> environ in the second message. Since only the owner can read the
> environment, the second request would yield answers only for a subset
> of the total process table.

This sounds safe enough, though it's unclear how to predict what fields
may be restricted. I suppose one doesn't try and requests one field at
a time for all tasks in this model of interaction with userspace.


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-14 19:36                                           ` William Lee Irwin III
@ 2004-09-14 19:50                                             ` Roger Luethi
  0 siblings, 0 replies; 63+ messages in thread
From: Roger Luethi @ 2004-09-14 19:50 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml,
	Albert Cahalan, Paul Jackson, James Morris, Chris Wright

On Tue, 14 Sep 2004 12:36:26 -0700, William Lee Irwin III wrote:
> On Tue, Sep 14, 2004 at 09:31:39PM +0200, Roger Luethi wrote:
> > In published code: No access control whatsoever. In dev tree: Silently
> > dropped. Possible: Any kind of error and additional information that
> > makes sense (we have netlink messages as a transport, after all).
> 
> I'm not sure what to make of this.

I was just trying to say that anything is possible (there are no
limitations inherent to the design), but I prefer it the way it is now.
I don't feel strongly about it should something different turn out to
be the preferred method of tool authors.

> This sounds safe enough, though it's unclear how to predict what fields
> may be restricted. I suppose one doesn't try and requests one field at

Simple: The fact that a field is subject to access restrictions is part
of the field ID. You can check that nproc.h contains this:

/* Access control (unused) */
#define NPROC_PERM_MASK		0x00300000
#define NPROC_PERM_USER		0x00100000
#define NPROC_PERM_ROOT		0x00200000

So even if a tool were to discover a new, previously unknown field offered
by the kernel, it could immediately tell that access restrictions apply and
what type they are (in case you wonder, there's extra space in reserve to
cover additional types of restrictions, including some catch-all thing (say
NPROC_PERM_COMPLEX_WHICH_MEANS_YOU_HAD_BETTER_KNOW_WHAT_YOU'RE_DOING)). So
nproc can cover everything /proc does today and is ready to go way beyond
that -- should that ever be deemed a good thing.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-14 19:05                                     ` Chris Wright
@ 2004-09-14 21:12                                       ` Roger Luethi
  0 siblings, 0 replies; 63+ messages in thread
From: Roger Luethi @ 2004-09-14 21:12 UTC (permalink / raw)
  To: Chris Wright
  Cc: William Lee Irwin III, Albert Cahalan, Stephen Smalley,
	Andrew Morton OSDL, lkml, Paul Jackson, James Morris

On Tue, 14 Sep 2004 12:05:09 -0700, Chris Wright wrote:
> Understood.  Question is, if the request is for data that's associated
> with a task that is in the middle of an execve(setuid_root_app), does
> the credential-check/skb-fill for response happen atomically w.r.t. said
> execve?  IOW, is it possible to pass credential check, then fill data
> that's become sensitive since the check happened?

It shouldn't be once we implement access control. I don't pretend to know
what the best way is to prevent that. Checking several times just shrinks
the race window, so I suppose we'd have to lock the source data structures
down prior to checking credentials and copying data.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-14 19:07                                       ` William Lee Irwin III
  2004-09-14 19:31                                         ` Roger Luethi
@ 2004-09-15 11:44                                         ` Roger Luethi
  2004-09-15 20:02                                           ` Roger Luethi
  1 sibling, 1 reply; 63+ messages in thread
From: Roger Luethi @ 2004-09-15 11:44 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml,
	Albert Cahalan, Paul Jackson, James Morris, Chris Wright

On Tue, 14 Sep 2004 12:07:47 -0700, William Lee Irwin III wrote:
> Thanks; while I could in principle expend more effort to understand the
> netlink code, it's likely swifter to be given such commentary.

This message aims at showing how nproc works for user space. If you need
additional or a different kind of documentation, let me know.

Roger

Field ID
========
In order to extract a specific value from the proc filesystem, a tool
combines the file path and some method to determine the appropriate
offset into that file (depending on the file based on keyword,
white-space separated column, etc.). At this point, the tool applies
its knowledge of the specific field format to convert the string back
to what it stands for.

Nproc, on the other hand, uses field IDs to identify information.

Each field ID (32 bit) contains a number of sub fields:

bits
 0-15 Content ID. For instance, 0x117 is the virtual memory size of
                  a process.
20-21 Access control ID. Type of access control restrictions that apply
                         to this field. Currently unused.
24-26 Data type ID. Defines the return type which is one of u32,
                    unsigned long, u64, or string.
28-30 Scope ID. Defines the scope for which a field is valid. Scope
                can be process (e.g. VmSize) or global (e.g. MemFree).

The remaining bits are reserved for future use.

Some details on sub-fields:

Content ID (bits 0-15)
----------
Bits 8-15 are used to indicate the /proc file in which a field occurs and
0-7 to indicate the field within that file (where applicable). There's
no magic to that other than the fact that it makes easier for humans
to check nproc.h.

Content IDs are immutable and identical on all platforms. Thus,
the meaning of any content ID, once assigned, must never ever change!

Data type ID (24-26)
------------
It's no problem to define additional (even complex) data types should
the need arise. For numbers, the data type simply defines the size of
the container (32 bit, long, 64 bit).

For strings, the string itself is prepended with a u32 indicating the
length of the string.

Scope ID (28-30)
--------
The scope ID is just another piece of information for tools with
automatic field discovery (see example below).


Examples
========
A few examples of how the mechanisms are used:

Simple
------
A tool like vmstat(8) starts from a bunch of IDs for global fields it's
interested in. After opening the socket, it sends one NPROC_GET_GLOBAL
request containing said field IDs to the kernel. The kernel sends one
reply for vmstat to read: A va_list containing the result for each
requested field ID.

Unit conversion (if necessary) can typically be done in place. Format
string and buffer are directly passed to vprintf(3). Done.

Detecting obsolete fields
-------------------------
An NPROC_GET_FIELD_LIST request can be used at start-up to determine
the field IDs that are offered by the kernel. If an app requests an
obsolete field anyway (being optimistic is faster for the common case),
it will get an error message back and can determine the cause from there.

I don't expect this to happen more often than it has in the past
(disappearing fields suck), but it's a clean way to handle such an event.

Field autodiscovery
-------------------
A tool may be interested in printing all information available
about a set of processes it is monitoring. At start-up, it sends
NPROC_GET_FIELD_LIST and finds a new field it doesn't know about.

>From the field ID, the tool can deduce that the unknown field:
- is in process scope and thus interesting for its task. That's all it
  takes to add the new field to the NPROC_GET_PS request sent to the kernel
  (along with a list of monitored PIDs). If the reply for a PID is missing
  from the result, the PID has died.
- needs 32bits to store the result

With three label calls on the new field ID, the app determines that
the kernel suggests "VmShared" as a label, "%8u" for formatting,
and that the unit is "KiB". (This may sound like bloat or overkill,
but all these strings are already available via /proc for many fields,
just in a processed form that makes it impractical to get the individual
elements back.) The tool appends the format string for the new field to
its own format string and can now proceed like the tool in the first,
trivial example.

Dealing with strings
--------------------
Most strings are really static labels (e.g. the label for a field ID
or the symbol name for wchan). In those cases, it's up to user-space
to ask for a label and cache the result as necessary. There are some
cases, though, where the label is transient. At least one of them,
the process name, is important enough to justify strings in regular
(as opposed to label) replies. Otherwise, the process and its name
may be gone by the time a tool gets around to ask for it based on
a PID it received. As there are no unique task identifiers, there
are races possible and correct caching is hard if not impossible.

But how can we still get a valid va_list back? A library function
in user space takes care of that. For a given list of field IDs, it
replaces every string type field with a NOP (reply size: unsigned long)
and appends the string type field ID to the end of the list:

u32   u32    u32
PID | NAME | VMSIZE

becomes

u32   u32    u32     u32
PID | NOP | VMSIZE | NAME

Now it's trivial to fix the replies:


u32   unsigned long               u32    u32   string
1   | 0                         | 1340 |  16 | init
                                          ^-- space used for this string

becomes

u32   unsigned long               u32    u32   string
1   | <pointer to first string> | 1340 |  16 | init

Anticipating type changes
-------------------------
Some fields may grow in size (e.g. NPROC_PID may move from u32 to unsigned
long or u64). If a field is not available from the kernel, a smart tool can
check the list of field IDs for a field with with the same content ID but a
different data type and print that instead.



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-15 11:44                                         ` Roger Luethi
@ 2004-09-15 20:02                                           ` Roger Luethi
  2004-09-15 20:20                                             ` William Lee Irwin III
  0 siblings, 1 reply; 63+ messages in thread
From: Roger Luethi @ 2004-09-15 20:02 UTC (permalink / raw)
  To: William Lee Irwin III, Albert Cahalan, Stephen Smalley,
	Andrew Morton OSDL, lkml, Albert Cahalan, Paul Jackson,
	James Morris, Chris Wright

Here's another thing we haven't been able to do with /proc: Finding out
the relative cost of computing the elements we offer to user space.

I ran a test program against 2.6.9-rc2-bk1 + nproc to get:

Testing all process fields, best out of 10
FieldID    CPU (s)  Wall (s) Label
0x03000002 0.140000 0.202728 NOP
0x21000100 0.150000 0.210021 Name
0x22000105 0.120000 0.204886 PID
0x22000109 0.130000 0.205319 UID
0x22000117 0.140000 0.215275 VmSize
0x22000118 0.130000 0.214240 VmLock
0x22000119 0.120000 0.214870 VmRSS
0x22000120 0.160000 1.020574 VmData
0x22000121 0.140000 1.021185 VmStack
0x22000122 0.170000 1.021619 VmExe
0x22000123 0.170000 1.020045 VmLib
0x23000421 0.140000 0.220748 wchan

Ignore the absolute values (I requested each field individually for all
processes on my workstation, 1000 times). The cost of walking all vmas
for VmData & Co. is very visible.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-15 20:02                                           ` Roger Luethi
@ 2004-09-15 20:20                                             ` William Lee Irwin III
  2004-09-15 20:33                                               ` Roger Luethi
  2004-09-15 20:44                                               ` Roger Luethi
  0 siblings, 2 replies; 63+ messages in thread
From: William Lee Irwin III @ 2004-09-15 20:20 UTC (permalink / raw)
  To: Roger Luethi
  Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml,
	Albert Cahalan, Paul Jackson, James Morris, Chris Wright

On Wed, Sep 15, 2004 at 10:02:30PM +0200, Roger Luethi wrote:
> Here's another thing we haven't been able to do with /proc: Finding out
> the relative cost of computing the elements we offer to user space.
> I ran a test program against 2.6.9-rc2-bk1 + nproc to get:
> Testing all process fields, best out of 10
> FieldID    CPU (s)  Wall (s) Label
> 0x03000002 0.140000 0.202728 NOP
> 0x21000100 0.150000 0.210021 Name
> 0x22000105 0.120000 0.204886 PID
> 0x22000109 0.130000 0.205319 UID
> 0x22000117 0.140000 0.215275 VmSize
> 0x22000118 0.130000 0.214240 VmLock
> 0x22000119 0.120000 0.214870 VmRSS
> 0x22000120 0.160000 1.020574 VmData
> 0x22000121 0.140000 1.021185 VmStack
> 0x22000122 0.170000 1.021619 VmExe
> 0x22000123 0.170000 1.020045 VmLib
> 0x23000421 0.140000 0.220748 wchan
> Ignore the absolute values (I requested each field individually for all
> processes on my workstation, 1000 times). The cost of walking all vmas
> for VmData & Co. is very visible.

Try this again after applying my updates, which make it equivalent to the
algorithms used internally by fs/proc/task_mmu.c.


-- wli

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-15 20:20                                             ` William Lee Irwin III
@ 2004-09-15 20:33                                               ` Roger Luethi
  2004-09-15 20:44                                               ` Roger Luethi
  1 sibling, 0 replies; 63+ messages in thread
From: Roger Luethi @ 2004-09-15 20:33 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml,
	Albert Cahalan, Paul Jackson, James Morris, Chris Wright

On Wed, 15 Sep 2004 13:20:28 -0700, William Lee Irwin III wrote:
> Try this again after applying my updates, which make it equivalent to the
> algorithms used internally by fs/proc/task_mmu.c.

That doesn't sound very interesting. The results are predictable. The
point of my previous message was that we can easily identify expensive
fields.

Ah well, compiling patched kernel anyway.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
  2004-09-15 20:20                                             ` William Lee Irwin III
  2004-09-15 20:33                                               ` Roger Luethi
@ 2004-09-15 20:44                                               ` Roger Luethi
  1 sibling, 0 replies; 63+ messages in thread
From: Roger Luethi @ 2004-09-15 20:44 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml,
	Albert Cahalan, Paul Jackson, James Morris, Chris Wright

On Wed, 15 Sep 2004 13:20:28 -0700, William Lee Irwin III wrote:
> > Ignore the absolute values (I requested each field individually for all
> > processes on my workstation, 1000 times). The cost of walking all vmas
> > for VmData & Co. is very visible.
> 
> Try this again after applying my updates, which make it equivalent to the
> algorithms used internally by fs/proc/task_mmu.c.

Here you go:

Testing all process fields, best out of 10
FieldID    CPU (s)  Wall (s) Label
0x03000002 0.130000 0.208989 NOP
0x21000100 0.150000 0.222867 Name
0x22000105 0.140000 0.216126 PID
0x22000109 0.140000 0.218058 UID
0x22000117 0.140000 0.231467 VmSize
0x22000118 0.140000 0.227863 VmLock
0x22000119 0.140000 0.229867 VmRSS
0x22000120 0.140000 0.226822 VmData
0x22000121 0.140000 0.228589 VmStack
0x22000122 0.130000 0.229107 VmExe
0x22000123 0.140000 0.228584 VmLib
0x23000421 0.140000 0.230716 wchan

^ permalink raw reply	[flat|nested] 63+ messages in thread

* nproc: So?
  2004-09-08 18:40 [0/1][ANNOUNCE] nproc v2: netlink access to /proc information Roger Luethi
  2004-09-08 18:41 ` [1/1][PATCH] " Roger Luethi
@ 2004-09-16 21:43 ` Roger Luethi
  1 sibling, 0 replies; 63+ messages in thread
From: Roger Luethi @ 2004-09-16 21:43 UTC (permalink / raw)
  To: linux-kernel

I have received some constructive criticism and suggestions, but I didn't
see any comments on the desirability of nproc in mainline. Initially meant
to be a proof-of-concept, nproc has become an interface that is much
cleaner and faster than procfs can ever hope to be (it takes some reading
of procps or libgtop code to appreciate the complexity that is /proc file
parsing today), and every change in /proc files widens the gap. I presented
source code, benchmarks, and design documentation to substantiate my
claims; I can post the user-space code somewhere if there's interest.

So I'm wondering if everybody's waiting for me to answer some important
question I overlooked, or if there is a general sentiment that this
project is not worth pursuing.

Roger

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2004-09-16 21:43 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-09-08 18:40 [0/1][ANNOUNCE] nproc v2: netlink access to /proc information Roger Luethi
2004-09-08 18:41 ` [1/1][PATCH] " Roger Luethi
2004-09-09  0:35   ` William Lee Irwin III
2004-09-09  0:43     ` William Lee Irwin III
2004-09-09  1:15       ` William Lee Irwin III
2004-09-09  1:17         ` [1/2] rediff nproc v2 vs. 2.6.9-rc1-mm4 William Lee Irwin III
2004-09-09  1:21           ` [2/2] handle CONFIG_MMU=n and use new vm stats for CONFIG_MMU=y William Lee Irwin III
2004-09-09  1:22             ` William Lee Irwin III
2004-09-09  1:26             ` [3/2] round up text memory to the nearest page in fs/proc/task_mmu.c William Lee Irwin III
2004-09-09 18:43     ` [1/1][PATCH] nproc v2: netlink access to /proc information Roger Luethi
2004-09-09 18:49       ` William Lee Irwin III
2004-09-09 19:00         ` William Lee Irwin III
2004-09-09 19:02           ` [4/2] consolidate __task_mem() and __task_mem_cheap() William Lee Irwin III
2004-09-09 19:07             ` Roger Luethi
2004-09-09 19:15               ` [5/2] fix nommu VSZ reporting in consolidated task_mem() William Lee Irwin III
2004-09-09 19:11         ` [1/1][PATCH] nproc v2: netlink access to /proc information Roger Luethi
2004-09-09 19:23           ` William Lee Irwin III
2004-09-09 21:19             ` Roger Luethi
2004-09-10 15:30             ` Roger Luethi
2004-09-11 22:25           ` Albert Cahalan
2004-09-12  4:58             ` William Lee Irwin III
2004-09-14  5:59             ` Roger Luethi
2004-09-14  6:18               ` William Lee Irwin III
2004-09-14  6:23                 ` William Lee Irwin III
2004-09-14  7:47                   ` Greg Ungerer
2004-09-14  8:27                     ` Roger Luethi
2004-09-09 11:53   ` Stephen Smalley
2004-09-09 17:22     ` William Lee Irwin III
2004-09-09 17:53       ` Roger Luethi
2004-09-09 20:01         ` Stephen Smalley
2004-09-09 20:48           ` Chris Wright
2004-09-10 12:11             ` Stephen Smalley
2004-09-09 20:55           ` Roger Luethi
2004-09-09 21:05             ` Chris Wright
2004-09-09 21:25             ` Roger Luethi
2004-09-11 22:36               ` Albert Cahalan
2004-09-12  5:00                 ` William Lee Irwin III
2004-09-14  6:44                 ` Roger Luethi
2004-09-14  7:10                   ` William Lee Irwin III
2004-09-14  7:55                     ` Roger Luethi
2004-09-14  8:01                       ` William Lee Irwin III
2004-09-14  9:27                         ` Roger Luethi
2004-09-14 15:37                           ` William Lee Irwin III
2004-09-14 16:01                             ` Roger Luethi
2004-09-14 16:37                               ` William Lee Irwin III
2004-09-14 17:15                                 ` Roger Luethi
2004-09-14 17:43                                   ` William Lee Irwin III
2004-09-14 18:45                                     ` Roger Luethi
2004-09-14 19:07                                       ` William Lee Irwin III
2004-09-14 19:31                                         ` Roger Luethi
2004-09-14 19:36                                           ` William Lee Irwin III
2004-09-14 19:50                                             ` Roger Luethi
2004-09-15 11:44                                         ` Roger Luethi
2004-09-15 20:02                                           ` Roger Luethi
2004-09-15 20:20                                             ` William Lee Irwin III
2004-09-15 20:33                                               ` Roger Luethi
2004-09-15 20:44                                               ` Roger Luethi
2004-09-14 18:37                                 ` Chris Wright
2004-09-14 18:55                                   ` Roger Luethi
2004-09-14 19:05                                     ` Chris Wright
2004-09-14 21:12                                       ` Roger Luethi
2004-09-09 20:44         ` Chris Wright
2004-09-16 21:43 ` nproc: So? Roger Luethi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).