All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] s390: Hypervisor File System
@ 2006-04-28  9:22 Michael Holzheu
  2006-04-28  9:43 ` Jörn Engel
                   ` (4 more replies)
  0 siblings, 5 replies; 62+ messages in thread
From: Michael Holzheu @ 2006-04-28  9:22 UTC (permalink / raw)
  To: akpm; +Cc: schwidefsky, penberg, ioe-lkml, joern, linux-kernel

On zSeries machines there exists an interface which allows the operating
system  to retrieve LPAR hypervisor accounting data. For example, it is
possible to get usage data for physical and virtual cpus. In order to
provide this information to user space programs, I implemented a new
virtual Linux file system named 'hypfs' using the Linux 2.6 libfs
framework. The name 'hypfs' stands for 'Hypervisor Filesystem'. All the
accounting information is put into different virtual files which can be
accessed from user space. All data is represented as ASCII strings.

When the file system is mounted the accounting information is retrieved
and a file system tree is created with the attribute files containing
the cpu information. The content of the files remains unchanged until a
new update is made. An update can be triggered from user space through
writing 'something' into a special purpose update file.

We create the following directory structure:

<mount-point>/
        update
        cpus/
                <cpu-id>
                        type
                        mgmtime
                <cpu-id>
                        ...
        hyp/
                type
        systems/
                <lpar-name>
                        cpus/
                                <cpu-id>
                                        type
                                        mgmtime
                                        cputime
                                        onlinetime
                                <cpu-id>
                                        ...
                <lpar-name>
                        cpus/
                                ...

- update: File to trigger update
- cpus/: Directory for all physical cpus
- cpus/<cpu-id>/: Directory for one physical cpu.
- cpus/<cpu-id>/type: Type name of physical zSeries cpu.
- cpus/<cpu-id>/mgmtime: Physical-LPAR-management time in microseconds.
- hyp/: Directory for hypervisor information
- hyp/type: Typ of hypervisor (currently only 'LPAR Hypervisor')
- systems/: Directory for all LPARs
- systems/<lpar-name>/: Directory for one LPAR.
- systems/<lpar-name>/cpus/<cpu-id>/: Directory for the virtual cpus
- systems/<lpar-name>/cpus/<cpu-id>/type: Typ of cpu.
- systems/<lpar-name>/cpus/<cpu-id>/mgmtime:
Accumulated number of microseconds during which a physical
CPU was assigned to the logical cpu and the cpu time was 
consumed by the hypervisor and was not provided to
the LPAR (LPAR overhead).

- systems/<lpar-name>/cpus/<cpu-id>/cputime:
Accumulated number of microseconds during which a physical CPU
was assigned to the logical cpu and the cpu time was consumed
by the LPAR.

- systems/<lpar-name>/cpus/<cpu-id>/onlinetime:
Accumulated number of microseconds during which the logical CPU
has been online.

As mount point for the filesystem /sys/hypervisor is created.

The update process is triggered when writing 'something' into the
'update' file at the top level hypfs directory. You can do this e.g.
with 'echo 1 > update'. During the update the whole directory structure
is deleted and built up again.

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Ingo Oeser <ioe-lkml@rameria.de>
Cc: Jörn Engel <joern@wohnheim.fh-wedel.de>

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Michael Holzheu <holzheu@de.ibm.com>

---

 arch/s390/Kconfig            |    7
 arch/s390/Makefile           |    2
 arch/s390/hypfs/Makefile     |    7
 arch/s390/hypfs/hypfs.h      |   30 ++
 arch/s390/hypfs/hypfs_diag.c |  628 +++++++++++++++++++++++++++++++++++++++++++
 arch/s390/hypfs/hypfs_diag.h |   16 +
 arch/s390/hypfs/inode.c      |  462 +++++++++++++++++++++++++++++++
 7 files changed, 1151 insertions(+), 1 deletion(-)

diff -urpN linux-2.6.16/arch/s390/Kconfig linux-2.6.16-hypfs/arch/s390/Kconfig
--- linux-2.6.16/arch/s390/Kconfig	2006-03-20 06:53:29.000000000 +0100
+++ linux-2.6.16-hypfs/arch/s390/Kconfig	2006-04-28 08:58:46.000000000 +0200
@@ -442,6 +442,13 @@ config NO_IDLE_HZ_INIT
 	  The HZ timer is switched off in idle by default. That means the
 	  HZ timer is already disabled at boot time.
 
+config HYPFS_FS
+	bool "s390 hypervisor file system support"
+	default y
+	help
+	  This is a virtual file system intended to provide accounting
+	  information in a s390 hypervisor environment.
+
 config KEXEC
 	bool "kexec system call (EXPERIMENTAL)"
 	depends on EXPERIMENTAL
diff -urpN linux-2.6.16/arch/s390/Makefile linux-2.6.16-hypfs/arch/s390/Makefile
--- linux-2.6.16/arch/s390/Makefile	2006-03-20 06:53:29.000000000 +0100
+++ linux-2.6.16-hypfs/arch/s390/Makefile	2006-04-28 08:59:10.000000000 +0200
@@ -77,7 +77,7 @@ LDFLAGS_vmlinux := -e start
 head-y		:= arch/$(ARCH)/kernel/head.o arch/$(ARCH)/kernel/init_task.o
 
 core-y		+= arch/$(ARCH)/mm/ arch/$(ARCH)/kernel/ arch/$(ARCH)/crypto/ \
-		   arch/$(ARCH)/appldata/
+		   arch/$(ARCH)/appldata/ arch/$(ARCH)/hypfs/
 libs-y		+= arch/$(ARCH)/lib/
 drivers-y	+= drivers/s390/
 drivers-$(CONFIG_MATHEMU) += arch/$(ARCH)/math-emu/
diff -urpN linux-2.6.16/arch/s390/hypfs/Makefile linux-2.6.16-hypfs/arch/s390/hypfs/Makefile
--- linux-2.6.16/arch/s390/hypfs/Makefile	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.16-hypfs/arch/s390/hypfs/Makefile	2006-04-28 08:59:36.000000000 +0200
@@ -0,0 +1,7 @@
+#
+# Makefile for the linux hypfs filesystem routines.
+#
+
+obj-$(CONFIG_HYPFS_FS) += hypfs.o
+
+hypfs-objs := inode.o hypfs_diag.o
diff -urpN linux-2.6.16/arch/s390/hypfs/hypfs.h linux-2.6.16-hypfs/arch/s390/hypfs/hypfs.h
--- linux-2.6.16/arch/s390/hypfs/hypfs.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.16-hypfs/arch/s390/hypfs/hypfs.h	2006-04-28 08:59:36.000000000 +0200
@@ -0,0 +1,30 @@
+/*
+ *  fs/hypfs/hypfs.h
+ *    Hypervisor filesystem for Linux on s390.
+ *
+ *    Copyright (C) IBM Corp. 2006
+ *    Author(s): Michael Holzheu <holzheu@de.ibm.com>
+ */
+
+#ifndef _HYPFS_H_
+#define _HYPFS_H_
+
+#include <linux/fs.h>
+#include <linux/types.h>
+
+#define REG_FILE_MODE    0440
+#define UPDATE_FILE_MODE 0220
+#define DIR_MODE         0550
+
+extern struct dentry *hypfs_mkdir(struct super_block *sb, struct dentry *parent,
+				  const char *name);
+
+extern struct dentry *hypfs_create_u64(struct super_block *sb,
+				       struct dentry *dir, const char *name,
+				       __u64 value);
+
+extern struct dentry *hypfs_create_str(struct super_block *sb,
+				       struct dentry *dir, const char *name,
+				       char *string);
+
+#endif /* _HYPFS_H_ */
diff -urpN linux-2.6.16/arch/s390/hypfs/hypfs_diag.c linux-2.6.16-hypfs/arch/s390/hypfs/hypfs_diag.c
--- linux-2.6.16/arch/s390/hypfs/hypfs_diag.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.16-hypfs/arch/s390/hypfs/hypfs_diag.c	2006-04-28 08:59:36.000000000 +0200
@@ -0,0 +1,628 @@
+/*
+ *  fs/hypfs/hypfs_diag.c
+ *    Hypervisor filesystem for Linux on s390. Diag 204 and 224
+ *    implementation.
+ *
+ *    Copyright (C) IBM Corp. 2006
+ *    Author(s): Michael Holzheu <holzheu@de.ibm.com>
+ */
+
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/string.h>
+#include <asm/ebcdic.h>
+#include "hypfs.h"
+
+#define LPAR_NAME_LEN 8		/* lpar name len in diag 204 data */
+#define TMP_SIZE 64		/* size of temporary buffers */
+
+/* diag 204 subcodes */
+enum diag204_sc {
+	SUBC_STIB4 = 4,
+	SUBC_RSI = 5,
+	SUBC_STIB6 = 6,
+	SUBC_STIB7 = 7
+};
+
+/* The two available diag 204 data formats */
+enum diag204_format {
+	INFO_SIMPLE = 0,
+	INFO_EXT = 0x00010000
+};
+
+/* bit is set in flags, when physical cpu info is included in diag 204 data */
+#define LPAR_PHYS_FLG  0x80
+
+static char *diag224_cpu_names;			/* diag 224 name table */
+static enum diag204_sc diag204_store_sc;	/* used subcode for store */
+static enum diag204_format diag204_info_type;	/* used diag 204 data format */
+
+/*
+ * DIAG 204 data structures and member access functions.
+ *
+ * Since we have two different diag 204 data formats for old and new s390 
+ * machines, we do not access the structs directly, but use getter functions for 
+ * each struct member instead. This should make the code more readable.
+ */
+
+/* Time information block */
+
+struct info_blk_hdr {
+	__u8  npar;
+	__u8  flags;
+	__u16 tslice;
+	__u16 phys_cpus;
+	__u16 this_part;
+	__u64 curtod;
+} __attribute__ ((packed));
+
+struct x_info_blk_hdr {
+	__u8  npar;
+	__u8  flags;
+	__u16 tslice;
+	__u16 phys_cpus;
+	__u16 this_part;
+	__u64 curtod1;
+	__u64 curtod2;
+	char reserved[40];
+} __attribute__ ((packed));
+
+static inline int info_blk_hdr__size(enum diag204_format type)
+{
+	if (type == INFO_SIMPLE)
+		return sizeof(struct info_blk_hdr);
+	else /* INFO_EXT */
+		return sizeof(struct x_info_blk_hdr);
+}
+
+static inline __u8 info_blk_hdr__npar(enum diag204_format type, void *hdr)
+{
+	if (type == INFO_SIMPLE)
+		return ((struct info_blk_hdr *)hdr)->npar;
+	else /* INFO_EXT */
+		return ((struct x_info_blk_hdr *)hdr)->npar;
+}
+
+static inline __u8 info_blk_hdr__flags(enum diag204_format type, void *hdr)
+{
+	if (type == INFO_SIMPLE)
+		return ((struct info_blk_hdr *)hdr)->flags;
+	else /* INFO_EXT */
+		return ((struct x_info_blk_hdr *)hdr)->flags;
+}
+
+static inline __u16 info_blk_hdr__pcpus(enum diag204_format type, void *hdr)
+{
+	if (type == INFO_SIMPLE)
+		return ((struct info_blk_hdr *)hdr)->phys_cpus;
+	else /* INFO_EXT */
+		return ((struct x_info_blk_hdr *)hdr)->phys_cpus;
+}
+
+/* Partition header */
+
+struct part_hdr {
+	__u8 pn;
+	__u8 cpus;
+	char reserved[6];
+	char part_name[LPAR_NAME_LEN];
+} __attribute__ ((packed));
+
+struct x_part_hdr {
+	__u8  pn;
+	__u8  cpus;
+	__u8  rcpus;
+	__u8  pflag;
+	__u32 mlu;
+	char  part_name[LPAR_NAME_LEN];
+	char  lpc_name[8];
+	char  os_name[8];
+	__u64 online_cs;
+	__u64 online_es;
+	__u8  upid;
+	char  reserved1[3];
+	__u32 group_mlu;
+	char  group_name[8];
+	char  reserved2[32];
+} __attribute__ ((packed));
+
+static inline int part_hdr__size(enum diag204_format type)
+{
+	if (type == INFO_SIMPLE)
+		return sizeof(struct part_hdr);
+	else /* INFO_EXT */
+		return sizeof(struct x_part_hdr);
+}
+
+static inline __u8 part_hdr__cpus(enum diag204_format type, void *hdr)
+{
+	if (type == INFO_SIMPLE)
+		return ((struct part_hdr *)hdr)->cpus;
+	else /* INFO_EXT */
+		return ((struct x_part_hdr *)hdr)->cpus;
+}
+
+static inline void part_hdr__part_name(enum diag204_format type, void *hdr,
+				       char *name)
+{
+	if (type == INFO_SIMPLE)
+		memcpy(name, ((struct part_hdr *)hdr)->part_name,
+		       LPAR_NAME_LEN);
+	else /* INFO_EXT */
+		memcpy(name, ((struct x_part_hdr *)hdr)->part_name,
+		       LPAR_NAME_LEN);
+	EBCASC(name, LPAR_NAME_LEN);
+	name[LPAR_NAME_LEN] = 0;
+	strstrip(name);
+}
+
+struct cpu_info {
+	__u16 cpu_addr;
+	char  reserved1[2];
+	__u8  ctidx;
+	__u8  cflag;
+	__u16 weight;
+	__u64 acc_time;
+	__u64 lp_time;
+} __attribute__ ((packed));
+
+struct x_cpu_info {
+	__u16 cpu_addr;
+	char  reserved1[2];
+	__u8  ctidx;
+	__u8  cflag;
+	__u16 weight;
+	__u64 acc_time;
+	__u64 lp_time;
+	__u16 min_weight;
+	__u16 cur_weight;
+	__u16 max_weight;
+	char  reseved2[2];
+	__u64 online_time;
+	__u64 wait_time;
+	__u32 pma_weight;
+	__u32 polar_weight;
+	char  reserved3[40];
+} __attribute__ ((packed));
+
+/* CPU info block */
+
+static inline int cpu_info__size(enum diag204_format type)
+{
+	if (type == INFO_SIMPLE)
+		return sizeof(struct cpu_info);
+	else /* INFO_EXT */
+		return sizeof(struct x_cpu_info);
+}
+
+static inline __u8 cpu_info__ctidx(enum diag204_format type, void *hdr)
+{
+	if (type == INFO_SIMPLE)
+		return ((struct cpu_info *)hdr)->ctidx;
+	else /* INFO_EXT */
+		return ((struct x_cpu_info *)hdr)->ctidx;
+}
+
+static inline __u16 cpu_info__cpu_addr(enum diag204_format type, void *hdr)
+{
+	if (type == INFO_SIMPLE)
+		return ((struct cpu_info *)hdr)->cpu_addr;
+	else /* INFO_EXT */
+		return ((struct x_cpu_info *)hdr)->cpu_addr;
+}
+
+static inline __u64 cpu_info__acc_time(enum diag204_format type, void *hdr)
+{
+	if (type == INFO_SIMPLE)
+		return ((struct cpu_info *)hdr)->acc_time;
+	else /* INFO_EXT */
+		return ((struct x_cpu_info *)hdr)->acc_time;
+}
+
+static inline __u64 cpu_info__lp_time(enum diag204_format type, void *hdr)
+{
+	if (type == INFO_SIMPLE)
+		return ((struct cpu_info *)hdr)->lp_time;
+	else /* INFO_EXT */
+		return ((struct x_cpu_info *)hdr)->lp_time;
+}
+
+static inline __u64 cpu_info__online_time(enum diag204_format type, void *hdr)
+{
+	if (type == INFO_SIMPLE)
+		return 0;	/* online_time not available in simple info */
+	else /* INFO_EXT */
+		return ((struct x_cpu_info *)hdr)->online_time;
+}
+
+/* Physical header */
+
+struct phys_hdr {
+	char reserved1[1];
+	__u8 cpus;
+	char reserved2[6];
+	char mgm_name[8];
+} __attribute__ ((packed));
+
+struct x_phys_hdr {
+	char reserved1[1];
+	__u8 cpus;
+	char reserved2[6];
+	char mgm_name[8];
+	char reserved3[80];
+} __attribute__ ((packed));
+
+static inline int phys_hdr__size(enum diag204_format type)
+{
+	if (type == INFO_SIMPLE)
+		return sizeof(struct phys_hdr);
+	else /* INFO_EXT */
+		return sizeof(struct x_phys_hdr);
+}
+
+static inline __u8 phys_hdr__cpus(enum diag204_format type, void *hdr)
+{
+	if (type == INFO_SIMPLE)
+		return ((struct phys_hdr *)hdr)->cpus;
+	else /* INFO_EXT */
+		return ((struct x_phys_hdr *)hdr)->cpus;
+}
+
+/* Physical CPU info block */
+
+struct phys_cpu {
+	__u16 cpu_addr;
+	char  reserved1[2];
+	__u8  ctidx;
+	char  reserved2[3];
+	__u64 mgm_time;
+	char  reserved3[8];
+} __attribute__ ((packed));
+
+struct x_phys_cpu {
+	__u16 cpu_addr;
+	char  reserved1[2];
+	__u8  ctidx;
+	char  reserved2[3];
+	__u64 mgm_time;
+	char  reserved3[80];
+} __attribute__ ((packed));
+
+static inline int phys_cpu__size(enum diag204_format type)
+{
+	if (type == INFO_SIMPLE)
+		return sizeof(struct phys_cpu);
+	else /* INFO_EXT */
+		return sizeof(struct x_phys_cpu);
+}
+
+static inline __u16 phys_cpu__cpu_addr(enum diag204_format type, void *hdr)
+{
+	if (type == INFO_SIMPLE)
+		return ((struct phys_cpu *)hdr)->cpu_addr;
+	else /* INFO_EXT */
+		return ((struct x_phys_cpu *)hdr)->cpu_addr;
+}
+
+static inline __u64 phys_cpu__mgm_time(enum diag204_format type, void *hdr)
+{
+	if (type == INFO_SIMPLE)
+		return ((struct phys_cpu *)hdr)->mgm_time;
+	else /* INFO_EXT */
+		return ((struct x_phys_cpu *)hdr)->mgm_time;
+}
+
+static inline __u64 phys_cpu__ctidx(enum diag204_format type, void *hdr)
+{
+	if (type == INFO_SIMPLE)
+		return ((struct phys_cpu *)hdr)->ctidx;
+	else /* INFO_EXT */
+		return ((struct x_phys_cpu *)hdr)->ctidx;
+}
+
+/* Diagnose 204 functions */
+
+static inline int diag204(unsigned long subcode, unsigned long size, void *addr)
+{
+	register unsigned long _subcode asm("0") = subcode;
+	register unsigned long _size asm("1") = size;
+
+	asm volatile ("   diag    %2,%0,0x204\n"
+		      "0: \n" ".section __ex_table,\"a\"\n"
+#ifndef __s390x__
+		      "    .align 4\n"
+		      "    .long  0b,0b\n"
+#else
+		      "    .align 8\n"
+		      "    .quad  0b,0b\n"
+#endif
+		      ".previous":"+d" (_subcode), "+d"(_size)
+		      :"d"(addr)
+		      :"memory");
+	if (_subcode)
+		return -1;
+	else
+		return _size;
+}
+
+/* 
+ * diag204_probe() has to find out, which type of diagnose 204 implementation
+ * we have on our machine. Currently there are three possible scanarios:
+ *   - subcode 4   + simple data format (only one page)
+ *   - subcode 4-6 + extended data format
+ *   - subcode 4-7 + extended data format
+ *
+ * Subcode 5 is used to retrieve the size of the data, provided by subcodes
+ * 6 and 7. Subcode 7 basically has the same function as subcode 6. In addition
+ * to subcode 6 it provides also information about secondary cpus.
+ * In order to get as much information as possible, we first try
+ * subcode 7, then 6 and if both fail, we use subcode 4.
+ */
+
+static int diag204_probe(void)
+{
+	void *buf;
+	int pages, rc;
+
+	pages = diag204(SUBC_RSI | INFO_EXT, 0, 0);
+	if (pages > 0) {
+		buf = kmalloc(pages * PAGE_SIZE, GFP_KERNEL);
+		if (!buf) {
+			rc = -ENOMEM;
+			goto err_out;
+		}
+		if (diag204(SUBC_STIB7 | INFO_EXT, pages, buf) >= 0) {
+			diag204_store_sc = SUBC_STIB7;
+			diag204_info_type = INFO_EXT;
+			goto out;
+		}
+		if (diag204(SUBC_STIB6 | INFO_EXT, pages, buf) >= 0) {
+			diag204_store_sc = SUBC_STIB7;
+			diag204_info_type = INFO_EXT;
+			goto out;
+		}
+		kfree(buf);
+	}
+	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buf) {
+		rc = -ENOMEM;
+		goto err_out;
+	}
+	if (diag204(SUBC_STIB4 | INFO_SIMPLE, pages, buf) >= 0) {
+		diag204_store_sc = SUBC_STIB4;
+		diag204_info_type = INFO_SIMPLE;
+		goto out;
+	} else {
+		rc = -ENOSYS;
+		goto err_out;
+	}
+      out:
+	rc = 0;
+      err_out:
+	kfree(buf);
+	return rc;
+}
+
+static void *diag204_store(void)
+{
+	void *buf;
+	int pages;
+
+	if (diag204_store_sc == SUBC_STIB4)
+		pages = 1;
+	else
+		pages = diag204(SUBC_RSI | diag204_info_type, 0, 0);
+
+	if (pages < 0)
+		return ERR_PTR(-ENOSYS);
+
+	buf = kmalloc(pages * PAGE_SIZE, GFP_KERNEL);
+	if (!buf)
+		return ERR_PTR(-ENOMEM);
+
+	if (diag204(diag204_store_sc | diag204_info_type, pages, buf) < 0) {
+		kfree(buf);
+		return ERR_PTR(-ENOSYS);
+	}
+	return buf;
+}
+
+/* Diagnose 224 functions */
+
+static inline void diag224(void *ptr)
+{
+	asm volatile("   diag    %0,%1,0x224\n"
+		     : :"d" (0), "d"(ptr) : "memory");
+}
+
+static inline int diag224_get_name_table(void)
+{
+	/* memory must be below 2GB */
+	diag224_cpu_names = kmalloc(PAGE_SIZE, GFP_KERNEL | GFP_DMA);
+	if (!diag224_cpu_names)
+		return -ENOMEM;
+	diag224(diag224_cpu_names);
+	EBCASC(diag224_cpu_names + 16, (*diag224_cpu_names + 1) * 16);
+	return 0;
+}
+
+static inline void diag224_delete_name_table(void)
+{
+	kfree(diag224_cpu_names);
+}
+
+static int diag224_idx2name(int index, char *name)
+{
+	memcpy(name, diag224_cpu_names + ((index + 1) * 16), 16);
+	name[16] = 0;
+	strstrip(name);
+	return 0;
+}
+
+int diag_init(void)
+{
+	int rc;
+
+	if (diag204_probe()) {
+		printk(KERN_ERR "hypfs: diag 204 not working.");
+		return -ENODATA;
+	}
+	rc = diag224_get_name_table();
+	if (rc) {
+		diag224_delete_name_table();
+		printk(KERN_ERR "hypfs: could not get name table.\n");
+	}
+	return rc;
+}
+
+void diag_exit(void)
+{
+	diag224_delete_name_table();
+}
+
+/*
+ * Functions to create the directory structure
+ * *******************************************
+ */
+
+static int hypfs_create_cpu_files(struct super_block *sb,
+				  struct dentry *cpus_dir, void *cpu_info)
+{
+	struct dentry *cpu_dir;
+	char buffer[TMP_SIZE];
+	void *rc;
+
+	snprintf(buffer, TMP_SIZE, "%d", cpu_info__cpu_addr(diag204_info_type,
+							    cpu_info));
+	cpu_dir = hypfs_mkdir(sb, cpus_dir, buffer);
+	rc = hypfs_create_u64(sb, cpu_dir, "mgmtime",
+			      cpu_info__acc_time(diag204_info_type, cpu_info) -
+			      cpu_info__lp_time(diag204_info_type, cpu_info));
+	if (IS_ERR(rc))
+		return PTR_ERR(rc);
+	rc = hypfs_create_u64(sb, cpu_dir, "cputime",
+			      cpu_info__lp_time(diag204_info_type, cpu_info));
+	if (IS_ERR(rc))
+		return PTR_ERR(rc);
+	if (diag204_info_type == INFO_EXT) {
+		rc = hypfs_create_u64(sb, cpu_dir, "onlinetime",
+				      cpu_info__online_time(diag204_info_type,
+							    cpu_info));
+		if (IS_ERR(rc))
+			return PTR_ERR(rc);
+	}
+	diag224_idx2name(cpu_info__ctidx(diag204_info_type, cpu_info), buffer);
+	rc = hypfs_create_str(sb, cpu_dir, "type", buffer);
+	if (IS_ERR(rc))
+		return PTR_ERR(rc);
+	return 0;
+}
+
+static void *hypfs_create_lpar_files(struct super_block *sb,
+				     struct dentry *systems_dir, void *part_hdr)
+{
+	struct dentry *cpus_dir;
+	struct dentry *lpar_dir;
+	char lpar_name[LPAR_NAME_LEN + 1];
+	void *cpu_info;
+	int i;
+
+	part_hdr__part_name(diag204_info_type, part_hdr, lpar_name);
+	lpar_name[LPAR_NAME_LEN] = 0;
+	lpar_dir = hypfs_mkdir(sb, systems_dir, lpar_name);
+	if (IS_ERR(lpar_dir))
+		return lpar_dir;
+	cpus_dir = hypfs_mkdir(sb, lpar_dir, "cpus");
+	if (IS_ERR(cpus_dir))
+		return cpus_dir;
+	cpu_info = part_hdr + part_hdr__size(diag204_info_type);
+	for (i = 0; i < part_hdr__cpus(diag204_info_type, part_hdr); i++) {
+		int rc;
+		rc = hypfs_create_cpu_files(sb, cpus_dir, cpu_info);
+		if (rc)
+			return ERR_PTR(rc);
+		cpu_info += cpu_info__size(diag204_info_type);
+	}
+	return cpu_info;
+}
+
+static int hypfs_create_phys_cpu_files(struct super_block *sb,
+				       struct dentry *cpus_dir, void *cpu_info)
+{
+	struct dentry *cpu_dir;
+	char buffer[TMP_SIZE];
+	void *rc;
+
+	snprintf(buffer, TMP_SIZE, "%i", phys_cpu__cpu_addr(diag204_info_type,
+							    cpu_info));
+	cpu_dir = hypfs_mkdir(sb, cpus_dir, buffer);
+	if (IS_ERR(cpu_dir))
+		return PTR_ERR(cpu_dir);
+	rc = hypfs_create_u64(sb, cpu_dir, "mgmtime",
+			      phys_cpu__mgm_time(diag204_info_type, cpu_info));
+	if (IS_ERR(rc))
+		return PTR_ERR(rc);
+	diag224_idx2name(phys_cpu__ctidx(diag204_info_type, cpu_info), buffer);
+	rc = hypfs_create_str(sb, cpu_dir, "type", buffer);
+	if (IS_ERR(rc))
+		return PTR_ERR(rc);
+	return 0;
+}
+
+static void *hypfs_create_phys_files(struct super_block *sb,
+				     struct dentry *parent_dir, void *phys_hdr)
+{
+	int i;
+	void *cpu_info;
+	struct dentry *cpus_dir;
+
+	cpus_dir = hypfs_mkdir(sb, parent_dir, "cpus");
+	if (IS_ERR(cpus_dir))
+		return cpus_dir;
+	cpu_info = phys_hdr + phys_hdr__size(diag204_info_type);
+	for (i = 0; i < phys_hdr__cpus(diag204_info_type, phys_hdr); i++) {
+		int rc;
+		rc = hypfs_create_phys_cpu_files(sb, cpus_dir, cpu_info);
+		if (rc)
+			return ERR_PTR(rc);
+		cpu_info += phys_cpu__size(diag204_info_type);
+	}
+	return cpu_info;
+}
+
+int diag_create_files(struct super_block *sb, struct dentry *root)
+{
+	struct dentry *systems_dir, *hyp_dir;
+	void *time_hdr, *part_hdr;
+	int i;
+	void *buffer, *rc;
+
+	buffer = diag204_store();
+	if (IS_ERR(buffer))
+		return PTR_ERR(buffer);
+
+	systems_dir = hypfs_mkdir(sb, root, "systems");
+	if (IS_ERR(systems_dir))
+		return PTR_ERR(systems_dir);
+	time_hdr = (struct x_info_blk_hdr *)buffer;
+	part_hdr = time_hdr + info_blk_hdr__size(diag204_info_type);
+	for (i = 0; i < info_blk_hdr__npar(diag204_info_type, time_hdr); i++) {
+		part_hdr = hypfs_create_lpar_files(sb, systems_dir, part_hdr);
+		if (IS_ERR(part_hdr))
+			return PTR_ERR(part_hdr);
+	}
+	if (info_blk_hdr__flags(diag204_info_type, time_hdr) & LPAR_PHYS_FLG) {
+		void *rc;
+		rc = hypfs_create_phys_files(sb, root, part_hdr);
+		if (IS_ERR(rc))
+			return PTR_ERR(rc);
+	}
+	hyp_dir = hypfs_mkdir(sb, root, "hyp");
+	if (IS_ERR(hyp_dir))
+		return PTR_ERR(hyp_dir);
+	rc = hypfs_create_str(sb, hyp_dir, "type", "LPAR Hypervisor");
+	if (IS_ERR(rc))
+		return PTR_ERR(rc);
+	kfree(buffer);
+	return 0;
+}
diff -urpN linux-2.6.16/arch/s390/hypfs/hypfs_diag.h linux-2.6.16-hypfs/arch/s390/hypfs/hypfs_diag.h
--- linux-2.6.16/arch/s390/hypfs/hypfs_diag.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.16-hypfs/arch/s390/hypfs/hypfs_diag.h	2006-04-28 08:59:36.000000000 +0200
@@ -0,0 +1,16 @@
+/*
+ *  fs/hypfs/hypfs_diag.h
+ *    Hypervisor filesystem for Linux on s390.
+ *
+ *    Copyright (C) IBM Corp. 2006
+ *    Author(s): Michael Holzheu <holzheu@de.ibm.com>
+ */
+
+#ifndef _HYPFS_DIAG_H_
+#define _HYPFS_DIAG_H_
+
+extern int diag_init(void);
+extern void diag_exit(void);
+extern int diag_create_files(struct super_block *sb, struct dentry *root);
+
+#endif /* _HYPFS_DIAG_H_ */
diff -urpN linux-2.6.16/arch/s390/hypfs/inode.c linux-2.6.16-hypfs/arch/s390/hypfs/inode.c
--- linux-2.6.16/arch/s390/hypfs/inode.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.16-hypfs/arch/s390/hypfs/inode.c	2006-04-28 08:59:36.000000000 +0200
@@ -0,0 +1,462 @@
+/*
+ *  fs/hypfs/inode.c
+ *    Hypervisor filesystem for Linux on s390.
+ *
+ *    Copyright (C) IBM Corp. 2006
+ *    Author(s): Michael Holzheu <holzheu@de.ibm.com>
+ */
+
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/vfs.h>
+#include <linux/pagemap.h>
+#include <linux/gfp.h>
+#include <linux/time.h>
+#include <linux/parser.h>
+#include <linux/sysfs.h>
+#include <linux/module.h>
+#include <asm/ebcdic.h>
+#include "hypfs.h"
+#include "hypfs_diag.h"
+
+#define HYPFS_MAGIC 0x687970	/* ASCII 'hyp' */
+#define TMP_SIZE 64		/* size of temporary buffers */
+
+static struct dentry *hypfs_create_update_file(struct super_block *sb,
+					       struct dentry *dir);
+
+struct hypfs_sb_info {
+	int uid;		/* uid used for files and dirs */
+	int gid;		/* gid used for files and dirs */
+};
+
+static struct dentry *update_file_dentry;
+static struct file_operations hypfs_file_ops;
+static struct file_system_type hypfs_type;
+static struct super_operations hypfs_s_ops;
+static time_t last_update_time = 0;	/* update time in seconds since 1970 */
+static DEFINE_MUTEX(hypfs_lock);
+
+/* start of list of all dentries, which have to be deleted on update */
+static struct dentry *hypfs_last_dentry;
+
+static void hypfs_update_update(void)
+{
+	struct inode* inode = update_file_dentry->d_inode;
+	
+	last_update_time = get_seconds();
+	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+}
+
+/* directory tree removal functions */
+
+static void hypfs_add_dentry(struct dentry *dentry)
+{
+	dentry->d_fsdata = hypfs_last_dentry;
+	hypfs_last_dentry = dentry;
+}
+
+static void hypfs_delete_tree(struct dentry *root)
+{
+	while (hypfs_last_dentry) {
+		struct dentry *parent, *next_dentry;
+
+		parent = hypfs_last_dentry->d_parent;
+		if (S_ISDIR(hypfs_last_dentry->d_inode->i_mode))
+			simple_rmdir(parent->d_inode, hypfs_last_dentry);
+		else
+			simple_unlink(parent->d_inode, hypfs_last_dentry);
+		d_delete(hypfs_last_dentry);
+		next_dentry = hypfs_last_dentry->d_fsdata;
+		dput(hypfs_last_dentry);
+		hypfs_last_dentry = next_dentry;
+	}
+}
+
+static struct inode *hypfs_make_inode(struct super_block *sb, int mode)
+{
+	struct inode *ret = new_inode(sb);
+
+	if (ret) {
+		struct hypfs_sb_info *hypfs_info = sb->s_fs_info;
+		ret->i_mode = mode;
+		ret->i_uid = hypfs_info->uid;
+		ret->i_gid = hypfs_info->gid;
+		ret->i_blksize = PAGE_CACHE_SIZE;
+		ret->i_blocks = 0;
+		ret->i_atime = ret->i_mtime = ret->i_ctime = CURRENT_TIME;
+		if (mode & S_IFDIR)
+			ret->i_nlink = 2;
+		else
+			ret->i_nlink = 1;
+	}
+	return ret;
+}
+
+static void hypfs_drop_inode(struct inode *inode)
+{
+	kfree(inode->u.generic_ip);
+	generic_delete_inode(inode);
+}
+
+static int hypfs_open(struct inode *inode, struct file *filp)
+{
+	char *data = (char *)filp->f_dentry->d_inode->u.generic_ip;
+
+	if (filp->f_mode & FMODE_WRITE) {
+		if (!(inode->i_mode & S_IWUGO))
+			return -EACCES;
+	}
+	if (filp->f_mode & FMODE_READ) {
+		if (!(inode->i_mode & S_IRUGO))
+			return -EACCES;
+	}
+	mutex_lock(&hypfs_lock);
+	filp->private_data = kmalloc(strlen(data) + 1, GFP_KERNEL);
+	if (!filp->private_data) {
+		mutex_unlock(&hypfs_lock);
+		return -ENOMEM;
+	}
+	strcpy(filp->private_data, data);
+	mutex_unlock(&hypfs_lock);
+	return 0;
+}
+
+static ssize_t hypfs_aio_read(struct kiocb *iocb, __user char *buf,
+			      size_t count, loff_t offset)
+{
+	char *data;
+	size_t len;
+	struct file *filp = iocb->ki_filp;
+
+	data = (char *)filp->private_data;
+	len = strlen(data);
+	if (offset > len) {
+		count = 0;
+		goto out;
+	}
+	if (count > len - offset)
+		count = len - offset;
+	if (copy_to_user(buf, data + offset, count)) {
+		count = -EFAULT;
+		goto out;
+	}
+	iocb->ki_pos += count;
+	file_accessed(filp);
+      out:
+	return count;
+}
+static ssize_t hypfs_aio_write(struct kiocb *iocb, const char __user *buf,
+			       size_t count, loff_t pos)
+{
+	int rc;
+	struct super_block *sb;
+
+	mutex_lock(&hypfs_lock);
+	sb = iocb->ki_filp->f_dentry->d_inode->i_sb;
+	if (last_update_time == get_seconds()) {
+		rc = -EBUSY;
+		goto out;
+	}
+	hypfs_delete_tree(sb->s_root);
+	rc = diag_create_files(sb, sb->s_root);
+	if (rc) {
+		printk(KERN_ERR "hypfs: Update failed\n");
+		hypfs_delete_tree(sb->s_root);
+		goto out;
+	}
+	hypfs_update_update();
+	rc = count;
+      out:
+	mutex_unlock(&hypfs_lock);
+	return rc;
+}
+
+static int hypfs_release(struct inode *inode, struct file *filp)
+{
+	kfree(filp->private_data);
+	return 0;
+}
+
+enum { opt_uid, opt_gid, opt_err };
+
+static match_table_t hypfs_tokens = {
+	{opt_uid, "uid=%u"},
+	{opt_gid, "gid=%u"},
+	{opt_err, NULL}
+};
+
+static int hypfs_parse_options(char *options, struct super_block *sb)
+{
+	char *str;
+	substring_t args[MAX_OPT_ARGS];
+
+	if (!options)
+		return 0;
+	while ((str = strsep(&options, ",")) != NULL) {
+		int token, option;
+		struct hypfs_sb_info *hypfs_info = sb->s_fs_info;
+
+		if (!*str)
+			continue;
+		token = match_token(str, hypfs_tokens, args);
+		switch (token) {
+		case opt_uid:
+			if (match_int(&args[0], &option))
+				return -EINVAL;
+			hypfs_info->uid = option;
+			break;
+		case opt_gid:
+			if (match_int(&args[0], &option))
+				return -EINVAL;
+			hypfs_info->gid = option;
+			break;
+		case opt_err:
+		default:
+			printk(KERN_ERR "hypfs: Unrecognized mount option "
+			       "\"%s\" or missing value\n", str);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+static int hypfs_fill_super(struct super_block *sb, void *data, int silent)
+{
+	struct inode *root_inode;
+	int rc = 0;
+	struct hypfs_sb_info *sbi;
+
+	sbi = kmalloc(sizeof(struct hypfs_sb_info), GFP_KERNEL);
+	if (!sbi)
+		return -ENOMEM;
+	sbi->uid = current->uid;
+	sbi->gid = current->gid;
+	sb->s_fs_info = sbi;
+	sb->s_blocksize = PAGE_CACHE_SIZE;
+	sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
+	sb->s_magic = HYPFS_MAGIC;
+	sb->s_op = &hypfs_s_ops;
+	if (hypfs_parse_options(data, sb)) {
+		rc = -EINVAL;
+		goto err_alloc;
+	}
+	root_inode = hypfs_make_inode(sb, S_IFDIR | 0755);
+	if (!root_inode) {
+		rc = -ENOMEM;
+		goto err_alloc;
+	}
+	root_inode->i_op = &simple_dir_inode_operations;
+	root_inode->i_fop = &simple_dir_operations;
+	sb->s_root = d_alloc_root(root_inode);
+	if (!sb->s_root) {
+		rc = -ENOMEM;
+		goto err_inode;
+	}
+	rc = diag_create_files(sb, sb->s_root);
+	if (rc)
+		goto err_tree;
+	update_file_dentry = hypfs_create_update_file(sb, sb->s_root);
+	if (IS_ERR(update_file_dentry)) {
+		rc = PTR_ERR(update_file_dentry);
+		goto err_tree;
+	}
+	hypfs_update_update();
+	return 0;
+
+      err_tree:
+	hypfs_delete_tree(sb->s_root);
+	dput(sb->s_root);
+      err_inode:
+	iput(root_inode);
+      err_alloc:
+	kfree(sbi);
+	return rc;
+}
+
+static void hypfs_put_super(struct super_block *sb)
+{
+	kfree(sb->s_fs_info);
+}
+
+static struct super_block *hypfs_get_super(struct file_system_type *fst,
+					   int flags, const char *devname,
+					   void *data)
+{
+	return get_sb_single(fst, flags, data, hypfs_fill_super);
+}
+
+static struct dentry *hypfs_create_file(struct super_block *sb,
+					struct dentry *parent, const char *name,
+					char *data, mode_t mode)
+{
+	struct dentry *dentry;
+	struct inode *inode;
+	struct qstr qname;
+
+	qname.name = name;
+	qname.len = strlen(name);
+	qname.hash = full_name_hash(name, qname.len);
+	dentry = lookup_one_len(name, parent, strlen(name));
+	if (IS_ERR(dentry))
+		return ERR_PTR(-ENOMEM);
+	inode = hypfs_make_inode(sb, mode);
+	if (!inode) {
+		dput(dentry);
+		return ERR_PTR(-ENOMEM);
+	}
+	if (mode & S_IFREG) {
+		inode->i_fop = &hypfs_file_ops;
+		if(data)
+			inode->i_size = strlen(data);
+		else
+			inode->i_size = 0;
+	} else if (mode & S_IFDIR) {
+		inode->i_op = &simple_dir_inode_operations;
+		inode->i_fop = &simple_dir_operations;
+		parent->d_inode->i_nlink++;
+	} else
+		BUG();
+	inode->u.generic_ip = data;
+	d_instantiate(dentry, inode);
+	return dentry;
+}
+
+struct dentry *hypfs_mkdir(struct super_block *sb, struct dentry *parent,
+			   const char *name)
+{
+	struct dentry *dentry;
+
+	dentry = hypfs_create_file(sb, parent, name, NULL, S_IFDIR | DIR_MODE);
+	if (IS_ERR(dentry))
+		return dentry;
+	hypfs_add_dentry(dentry);
+	parent->d_inode->i_nlink++;
+	return dentry;
+}
+
+static struct dentry *hypfs_create_update_file(struct super_block *sb,
+					       struct dentry *dir)
+{
+	struct dentry *dentry;
+
+	dentry = hypfs_create_file(sb, dir, "update", NULL,
+				   S_IFREG | UPDATE_FILE_MODE);
+	if (!dentry)
+		return ERR_PTR(-ENOMEM);
+	/*
+	 * We do not put the update file on the 'delete' list with
+	 * hypfs_add_dentry(), since it should not be removed when the tree
+	 * is updated.
+	 */
+	return dentry;
+}
+
+struct dentry *hypfs_create_u64(struct super_block *sb, struct dentry *dir,
+				const char *name, __u64 value)
+{
+	char *buffer;
+	char tmp[TMP_SIZE];
+	struct dentry *dentry;
+
+	snprintf(tmp, TMP_SIZE, "%lld\n", (unsigned long long int)value);
+	buffer = kmalloc(strlen(tmp) + 1, GFP_KERNEL);
+	if (!buffer)
+		return ERR_PTR(-ENOMEM);
+	strcpy(buffer, tmp);
+	dentry =
+	    hypfs_create_file(sb, dir, name, buffer, S_IFREG | REG_FILE_MODE);
+	if (!dentry) {
+		kfree(buffer);
+		return ERR_PTR(-ENOMEM);
+	}
+	hypfs_add_dentry(dentry);
+	return dentry;
+}
+
+struct dentry *hypfs_create_str(struct super_block *sb, struct dentry *dir,
+				const char *name, char *string)
+{
+	char *buffer;
+	struct dentry *dentry;
+
+	buffer = kmalloc(strlen(string) + 2, GFP_KERNEL);
+	if (!buffer)
+		return ERR_PTR(-ENOMEM);
+	sprintf(buffer, "%s\n", string);
+	dentry =
+	    hypfs_create_file(sb, dir, name, buffer, S_IFREG | REG_FILE_MODE);
+	if (!dentry) {
+		kfree(buffer);
+		return ERR_PTR(-ENOMEM);
+	}
+	hypfs_add_dentry(dentry);
+	return dentry;
+}
+
+static struct file_operations hypfs_file_ops = {
+	.open		= hypfs_open,
+	.release	= hypfs_release,
+	.read		= do_sync_read,
+	.write		= do_sync_write,
+	.aio_read	= hypfs_aio_read,
+	.aio_write	= hypfs_aio_write,
+};
+
+static struct file_system_type hypfs_type = {
+	.owner		= THIS_MODULE,
+	.name		= "hypfs",
+	.get_sb		= hypfs_get_super,
+	.kill_sb	= kill_litter_super
+};
+
+static struct super_operations hypfs_s_ops = {
+	.statfs		= simple_statfs,
+	.drop_inode	= hypfs_drop_inode,
+	.put_super	= hypfs_put_super
+};
+
+static decl_subsys(hypervisor, NULL, NULL);
+
+static int __init hypfs_init(void)
+{
+	int rc;
+
+	if (MACHINE_IS_VM)
+		return -ENODATA;
+	if (diag_init()) {
+		rc = -ENODATA;
+		goto err_msg;
+	}
+	rc = subsystem_register(&hypervisor_subsys);
+	if (rc)
+		goto err_diag;
+	rc = register_filesystem(&hypfs_type);
+	if (rc)
+		goto err_sysfs;
+	return 0;
+
+      err_sysfs:
+	subsystem_unregister(&hypervisor_subsys);
+      err_diag:
+	diag_exit();
+      err_msg:
+	printk(KERN_ERR "hypfs: Initialization failed with rc = %i.\n", rc);
+	return rc;
+}
+
+static void __exit hypfs_exit(void)
+{
+	diag_exit();
+	unregister_filesystem(&hypfs_type);
+	subsystem_unregister(&hypervisor_subsys);
+}
+
+module_init(hypfs_init)
+module_exit(hypfs_exit)
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Michael Holzheu <holzheu@de.ibm.com>");
+MODULE_DESCRIPTION("s390 Hypervisor Filesystem");

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-28  9:22 [PATCH] s390: Hypervisor File System Michael Holzheu
@ 2006-04-28  9:43 ` Jörn Engel
  2006-04-28 11:53   ` Michael Holzheu
  2006-04-28  9:56 ` Andrew Morton
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 62+ messages in thread
From: Jörn Engel @ 2006-04-28  9:43 UTC (permalink / raw)
  To: Michael Holzheu; +Cc: akpm, schwidefsky, penberg, ioe-lkml, linux-kernel

On Fri, 28 April 2006 11:22:25 +0200, Michael Holzheu wrote:
> +
> +static inline int info_blk_hdr__size(enum diag204_format type)
> +{
> +	if (type == INFO_SIMPLE)
> +		return sizeof(struct info_blk_hdr);
> +	else /* INFO_EXT */
> +		return sizeof(struct x_info_blk_hdr);
> +}
> +
> +static inline __u8 info_blk_hdr__npar(enum diag204_format type, void *hdr)
> +{
> +	if (type == INFO_SIMPLE)
> +		return ((struct info_blk_hdr *)hdr)->npar;
> +	else /* INFO_EXT */
> +		return ((struct x_info_blk_hdr *)hdr)->npar;
> +}
> +
> +static inline __u8 info_blk_hdr__flags(enum diag204_format type, void *hdr)
> +{
> +	if (type == INFO_SIMPLE)
> +		return ((struct info_blk_hdr *)hdr)->flags;
> +	else /* INFO_EXT */
> +		return ((struct x_info_blk_hdr *)hdr)->flags;
> +}
> +
> +static inline __u16 info_blk_hdr__pcpus(enum diag204_format type, void *hdr)
> +{
> +	if (type == INFO_SIMPLE)
> +		return ((struct info_blk_hdr *)hdr)->phys_cpus;
> +	else /* INFO_EXT */
> +		return ((struct x_info_blk_hdr *)hdr)->phys_cpus;
> +}

Hmm.  Another possible approach would be to create a small struct with
a couple of functions, have one function each for type==INFO_SIMPLE
and type==INFO_EXT and fill the struct either with one set of
functions or the other.

Whether that would make more sense than your current approach, I
cannot judge.  Iirc, function calls are still fairly expensive on
s390, so maybe not.

Jörn

-- 
Courage is not the absence of fear, but rather the judgement that
something else is more important than fear.
-- Ambrose Redmoon

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-28  9:22 [PATCH] s390: Hypervisor File System Michael Holzheu
  2006-04-28  9:43 ` Jörn Engel
@ 2006-04-28  9:56 ` Andrew Morton
  2006-04-28 17:36   ` Michael Holzheu
  2006-04-28 10:36 ` Pekka J Enberg
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 62+ messages in thread
From: Andrew Morton @ 2006-04-28  9:56 UTC (permalink / raw)
  To: Michael Holzheu; +Cc: schwidefsky, penberg, ioe-lkml, joern, linux-kernel

Michael Holzheu <holzheu@de.ibm.com> wrote:
>
> On zSeries machines there exists an interface which allows the operating
> system  to retrieve LPAR hypervisor accounting data. For example, it is
> possible to get usage data for physical and virtual cpus. In order to
> provide this information to user space programs, I implemented a new
> virtual Linux file system named 'hypfs' using the Linux 2.6 libfs
> framework. The name 'hypfs' stands for 'Hypervisor Filesystem'. All the
> accounting information is put into different virtual files which can be
> accessed from user space. All data is represented as ASCII strings.
> 
> When the file system is mounted the accounting information is retrieved
> and a file system tree is created with the attribute files containing
> the cpu information. The content of the files remains unchanged until a
> new update is made. An update can be triggered from user space through
> writing 'something' into a special purpose update file.

Looks sane.  A few comments, mainly minor, but a few bugs:


> +config HYPFS_FS
> +	bool "s390 hypervisor file system support"
> +	default y
> +	help
> +	  This is a virtual file system intended to provide accounting
> +	  information in a s390 hypervisor environment.
> +

"in an s390"?

> +static inline int diag204(unsigned long subcode, unsigned long size, void *addr)

There's a lot of inlining going on.  Is it excessive?

> ...
>
> +static int diag204_probe(void)
> +{
> +	void *buf;
> +	int pages, rc;
> +
> +	pages = diag204(SUBC_RSI | INFO_EXT, 0, 0);
> +	if (pages > 0) {

In other places, you return an error if (pages < 0), but not here.

> +		buf = kmalloc(pages * PAGE_SIZE, GFP_KERNEL);
> +		if (!buf) {
> +			rc = -ENOMEM;
> +			goto err_out;
> +		}
> +		if (diag204(SUBC_STIB7 | INFO_EXT, pages, buf) >= 0) {
> +			diag204_store_sc = SUBC_STIB7;
> +			diag204_info_type = INFO_EXT;
> +			goto out;
> +		}
> +		if (diag204(SUBC_STIB6 | INFO_EXT, pages, buf) >= 0) {
> +			diag204_store_sc = SUBC_STIB7;
> +			diag204_info_type = INFO_EXT;
> +			goto out;
> +		}
> +		kfree(buf);
> +	}
> +	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
> +	if (!buf) {
> +		rc = -ENOMEM;
> +		goto err_out;
> +	}
> +	if (diag204(SUBC_STIB4 | INFO_SIMPLE, pages, buf) >= 0) {
> +		diag204_store_sc = SUBC_STIB4;
> +		diag204_info_type = INFO_SIMPLE;
> +		goto out;
> +	} else {
> +		rc = -ENOSYS;
> +		goto err_out;
> +	}
> +      out:
> +	rc = 0;
> +      err_out:
> +	kfree(buf);
> +	return rc;
> +}

The indenting of the labels is a bit unconventional.

> +static void *diag204_store(void)
> +{
> +	void *buf;
> +	int pages;
> +
> +	if (diag204_store_sc == SUBC_STIB4)
> +		pages = 1;
> +	else
> +		pages = diag204(SUBC_RSI | diag204_info_type, 0, 0);
> +
> +	if (pages < 0)
> +		return ERR_PTR(-ENOSYS);
> +
> +	buf = kmalloc(pages * PAGE_SIZE, GFP_KERNEL);
> +	if (!buf)
> +		return ERR_PTR(-ENOMEM);
> +
> +	if (diag204(diag204_store_sc | diag204_info_type, pages, buf) < 0) {
> +		kfree(buf);
> +		return ERR_PTR(-ENOSYS);
> +	}
> +	return buf;
> +}

One wonders if

	pages = diag204(...);
	return kmalloc(pages * PAGE_SIZE, gfp_flags);

would be a useful helper function.

> +	asm volatile("   diag    %0,%1,0x224\n"
> +		     : :"d" (0), "d"(ptr) : "memory");
> +}
> +
> +static inline int diag224_get_name_table(void)
> +{
> +	/* memory must be below 2GB */
> +	diag224_cpu_names = kmalloc(PAGE_SIZE, GFP_KERNEL | GFP_DMA);
> +	if (!diag224_cpu_names)
> +		return -ENOMEM;
> +	diag224(diag224_cpu_names);
> +	EBCASC(diag224_cpu_names + 16, (*diag224_cpu_names + 1) * 16);
> +	return 0;
> +}
> +
> +static inline void diag224_delete_name_table(void)
> +{
> +	kfree(diag224_cpu_names);
> +}
> +
> +static int diag224_idx2name(int index, char *name)
> +{
> +	memcpy(name, diag224_cpu_names + ((index + 1) * 16), 16);
> +	name[16] = 0;

Should this be "15"?   I guess not...

> +	strstrip(name);
> +	return 0;
> +}
> +
> +int diag_init(void)
> +{
> +	int rc;
> +
> +	if (diag204_probe()) {
> +		printk(KERN_ERR "hypfs: diag 204 not working.");
> +		return -ENODATA;
> +	}
> +	rc = diag224_get_name_table();
> +	if (rc) {
> +		diag224_delete_name_table();
> +		printk(KERN_ERR "hypfs: could not get name table.\n");
> +	}
> +	return rc;
> +}

This can be __init

> +void diag_exit(void)
> +{
> +	diag224_delete_name_table();
> +}

And __exit.

> +int diag_create_files(struct super_block *sb, struct dentry *root)
> +{

Perhaps as a globally-visible function this should be called
hypfs_diag_create_files().  Ditto diag_init() and diag_exit().

> +	struct dentry *systems_dir, *hyp_dir;
> +	void *time_hdr, *part_hdr;
> +	int i;
> +	void *buffer, *rc;
> +
> +	buffer = diag204_store();
> +	if (IS_ERR(buffer))
> +		return PTR_ERR(buffer);
> +
> +	systems_dir = hypfs_mkdir(sb, root, "systems");
> +	if (IS_ERR(systems_dir))
> +		return PTR_ERR(systems_dir);
> +	time_hdr = (struct x_info_blk_hdr *)buffer;
> +	part_hdr = time_hdr + info_blk_hdr__size(diag204_info_type);
> +	for (i = 0; i < info_blk_hdr__npar(diag204_info_type, time_hdr); i++) {
> +		part_hdr = hypfs_create_lpar_files(sb, systems_dir, part_hdr);
> +		if (IS_ERR(part_hdr))
> +			return PTR_ERR(part_hdr);
> +	}
> +	if (info_blk_hdr__flags(diag204_info_type, time_hdr) & LPAR_PHYS_FLG) {
> +		void *rc;
> +		rc = hypfs_create_phys_files(sb, root, part_hdr);
> +		if (IS_ERR(rc))
> +			return PTR_ERR(rc);
> +	}
> +	hyp_dir = hypfs_mkdir(sb, root, "hyp");
> +	if (IS_ERR(hyp_dir))
> +		return PTR_ERR(hyp_dir);
> +	rc = hypfs_create_str(sb, hyp_dir, "type", "LPAR Hypervisor");
> +	if (IS_ERR(rc))
> +		return PTR_ERR(rc);
> +	kfree(buffer);
> +	return 0;
> +}

This leaks `buffer' on the error paths, yes?

> +/*
> + *  fs/hypfs/inode.c
> + *    Hypervisor filesystem for Linux on s390.
> + *
> + *    Copyright (C) IBM Corp. 2006
> + *    Author(s): Michael Holzheu <holzheu@de.ibm.com>
> + */
> +
> +#include <linux/types.h>
> +#include <linux/errno.h>
> +#include <linux/fs.h>
> +#include <linux/namei.h>
> +#include <linux/vfs.h>
> +#include <linux/pagemap.h>
> +#include <linux/gfp.h>
> +#include <linux/time.h>
> +#include <linux/parser.h>
> +#include <linux/sysfs.h>
> +#include <linux/module.h>
> +#include <asm/ebcdic.h>
> +#include "hypfs.h"
> +#include "hypfs_diag.h"
> +
> +#define HYPFS_MAGIC 0x687970	/* ASCII 'hyp' */
> +#define TMP_SIZE 64		/* size of temporary buffers */
> +
> +static struct dentry *hypfs_create_update_file(struct super_block *sb,
> +					       struct dentry *dir);
> +
> +struct hypfs_sb_info {
> +	int uid;		/* uid used for files and dirs */
> +	int gid;		/* gid used for files and dirs */
> +};

uid_t?

> +
> +static int hypfs_open(struct inode *inode, struct file *filp)
> +{
> +	char *data = (char *)filp->f_dentry->d_inode->u.generic_ip;

Unneeded typecast.

> +	if (filp->f_mode & FMODE_WRITE) {
> +		if (!(inode->i_mode & S_IWUGO))
> +			return -EACCES;
> +	}
> +	if (filp->f_mode & FMODE_READ) {
> +		if (!(inode->i_mode & S_IRUGO))
> +			return -EACCES;
> +	}

Is the standard VFS permission checking not appropriate?

(A comment should be added here).

> +	mutex_lock(&hypfs_lock);
> +	filp->private_data = kmalloc(strlen(data) + 1, GFP_KERNEL);
> +	if (!filp->private_data) {
> +		mutex_unlock(&hypfs_lock);
> +		return -ENOMEM;
> +	}
> +	strcpy(filp->private_data, data);

kstrdup()

> +	mutex_unlock(&hypfs_lock);
> +	return 0;
> +}
> +
> +static ssize_t hypfs_aio_read(struct kiocb *iocb, __user char *buf,
> +			      size_t count, loff_t offset)
> +{
> +	char *data;
> +	size_t len;
> +	struct file *filp = iocb->ki_filp;
> +
> +	data = (char *)filp->private_data;

Unneeded cast.

> +	len = strlen(data);
> +	if (offset > len) {
> +		count = 0;
> +		goto out;
> +	}
> +	if (count > len - offset)
> +		count = len - offset;
> +	if (copy_to_user(buf, data + offset, count)) {
> +		count = -EFAULT;
> +		goto out;
> +	}
> +	iocb->ki_pos += count;
> +	file_accessed(filp);
> +      out:
> +	return count;
> +}
> +static ssize_t hypfs_aio_write(struct kiocb *iocb, const char __user *buf,
> +			       size_t count, loff_t pos)
> +{
> +	int rc;
> +	struct super_block *sb;
> +
> +	mutex_lock(&hypfs_lock);
> +	sb = iocb->ki_filp->f_dentry->d_inode->i_sb;
> +	if (last_update_time == get_seconds()) {
> +		rc = -EBUSY;
> +		goto out;
> +	}

Why's it looking at the time in this manner?

(Whatever the reason, a comment should be added here).

> +static struct dentry *hypfs_create_file(struct super_block *sb,
> +					struct dentry *parent, const char *name,
> +					char *data, mode_t mode)
> +{
> +	struct dentry *dentry;
> +	struct inode *inode;
> +	struct qstr qname;
> +
> +	qname.name = name;
> +	qname.len = strlen(name);
> +	qname.hash = full_name_hash(name, qname.len);
> +	dentry = lookup_one_len(name, parent, strlen(name));
> +	if (IS_ERR(dentry))
> +		return ERR_PTR(-ENOMEM);

	return dentry;

> +	inode = hypfs_make_inode(sb, mode);
> +	if (!inode) {
> +		dput(dentry);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +	if (mode & S_IFREG) {
> +		inode->i_fop = &hypfs_file_ops;
> +		if(data)
> +			inode->i_size = strlen(data);
> +		else
> +			inode->i_size = 0;
> +	} else if (mode & S_IFDIR) {
> +		inode->i_op = &simple_dir_inode_operations;
> +		inode->i_fop = &simple_dir_operations;
> +		parent->d_inode->i_nlink++;
> +	} else
> +		BUG();
> +	inode->u.generic_ip = data;
> +	d_instantiate(dentry, inode);
> +	return dentry;
> +}
>
> +static struct dentry *hypfs_create_update_file(struct super_block *sb,
> +					       struct dentry *dir)
> +{
> +	struct dentry *dentry;
> +
> +	dentry = hypfs_create_file(sb, dir, "update", NULL,
> +				   S_IFREG | UPDATE_FILE_MODE);
> +	if (!dentry)
> +		return ERR_PTR(-ENOMEM);

But hypfs_create_file() will return ERR_PTR(-ENOMEM), never NULL.

> +	/*
> +	 * We do not put the update file on the 'delete' list with
> +	 * hypfs_add_dentry(), since it should not be removed when the tree
> +	 * is updated.
> +	 */
> +	return dentry;
> +}
> +
> +struct dentry *hypfs_create_u64(struct super_block *sb, struct dentry *dir,
> +				const char *name, __u64 value)
> +{
> +	char *buffer;
> +	char tmp[TMP_SIZE];
> +	struct dentry *dentry;
> +
> +	snprintf(tmp, TMP_SIZE, "%lld\n", (unsigned long long int)value);
> +	buffer = kmalloc(strlen(tmp) + 1, GFP_KERNEL);
> +	if (!buffer)
> +		return ERR_PTR(-ENOMEM);
> +	strcpy(buffer, tmp);

kstrdup()

> +	dentry =
> +	    hypfs_create_file(sb, dir, name, buffer, S_IFREG | REG_FILE_MODE);
> +	if (!dentry) {

IS_ERR()

> +		kfree(buffer);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +	hypfs_add_dentry(dentry);
> +	return dentry;
> +}
> +
> +struct dentry *hypfs_create_str(struct super_block *sb, struct dentry *dir,
> +				const char *name, char *string)
> +{
> +	char *buffer;
> +	struct dentry *dentry;
> +
> +	buffer = kmalloc(strlen(string) + 2, GFP_KERNEL);
> +	if (!buffer)
> +		return ERR_PTR(-ENOMEM);
> +	sprintf(buffer, "%s\n", string);
> +	dentry =
> +	    hypfs_create_file(sb, dir, name, buffer, S_IFREG | REG_FILE_MODE);
> +	if (!dentry) {

IS_ERR()

(Better check the whole patch for this)

> +		kfree(buffer);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +	hypfs_add_dentry(dentry);
> +	return dentry;
> +}


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-28  9:22 [PATCH] s390: Hypervisor File System Michael Holzheu
  2006-04-28  9:43 ` Jörn Engel
  2006-04-28  9:56 ` Andrew Morton
@ 2006-04-28 10:36 ` Pekka J Enberg
  2006-04-28 13:14   ` Michael Holzheu
  2006-04-29  6:44 ` Andrew Morton
  2006-04-29  7:53 ` Greg KH
  4 siblings, 1 reply; 62+ messages in thread
From: Pekka J Enberg @ 2006-04-28 10:36 UTC (permalink / raw)
  To: Michael Holzheu; +Cc: akpm, schwidefsky, ioe-lkml, joern, linux-kernel

Hi Michael,

On Fri, 28 Apr 2006, Michael Holzheu wrote:
> diff -urpN linux-2.6.16/arch/s390/hypfs/inode.c linux-2.6.16-hypfs/arch/s390/hypfs/inode.c
> --- linux-2.6.16/arch/s390/hypfs/inode.c	1970-01-01 01:00:00.000000000 +0100
> +++ linux-2.6.16-hypfs/arch/s390/hypfs/inode.c	2006-04-28 08:59:36.000000000 +0200

[snip]

> +static struct dentry *update_file_dentry;
> +static struct file_operations hypfs_file_ops;
> +static struct file_system_type hypfs_type;
> +static struct super_operations hypfs_s_ops;
> +static time_t last_update_time = 0;	/* update time in seconds since 1970 */
> +static DEFINE_MUTEX(hypfs_lock);

These should be per-superblock and not global, right?

				Pekka

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-28  9:43 ` Jörn Engel
@ 2006-04-28 11:53   ` Michael Holzheu
  2006-04-28 15:48     ` Jörn Engel
  0 siblings, 1 reply; 62+ messages in thread
From: Michael Holzheu @ 2006-04-28 11:53 UTC (permalink / raw)
  To: Jörn Engel; +Cc: akpm, ioe-lkml, linux-kernel, mschwid2, penberg

Hi Jörn,

Jörn Engel <joern@wohnheim.fh-wedel.de> wrote on 04/28/2006 11:43:20 AM:
> On Fri, 28 April 2006 11:22:25 +0200, Michael Holzheu wrote:
> > +
> > +static inline int info_blk_hdr__size(enum diag204_format type)
> > +{
> > +   if (type == INFO_SIMPLE)
> > +      return sizeof(struct info_blk_hdr);
> > +   else /* INFO_EXT */
> > +      return sizeof(struct x_info_blk_hdr);
> > +}
> > +
> > +static inline __u8 info_blk_hdr__npar(enum diag204_format type, void
*hdr)
> > +{
> > +   if (type == INFO_SIMPLE)
> > +      return ((struct info_blk_hdr *)hdr)->npar;
> > +   else /* INFO_EXT */
> > +      return ((struct x_info_blk_hdr *)hdr)->npar;
> > +}
> > +
> > +static inline __u8 info_blk_hdr__flags(enum diag204_format type, void
*hdr)
> > +{
> > +   if (type == INFO_SIMPLE)
> > +      return ((struct info_blk_hdr *)hdr)->flags;
> > +   else /* INFO_EXT */
> > +      return ((struct x_info_blk_hdr *)hdr)->flags;
> > +}
> > +
> > +static inline __u16 info_blk_hdr__pcpus(enum diag204_format type,void
*hdr)
> > +{
> > +   if (type == INFO_SIMPLE)
> > +      return ((struct info_blk_hdr *)hdr)->phys_cpus;
> > +   else /* INFO_EXT */
> > +      return ((struct x_info_blk_hdr *)hdr)->phys_cpus;
> > +}
>
> Hmm.  Another possible approach would be to create a small struct with
> a couple of functions, have one function each for type==INFO_SIMPLE
> and type==INFO_EXT and fill the struct either with one set of
> functions or the other.
>
> Whether that would make more sense than your current approach, I
> cannot judge.  Iirc, function calls are still fairly expensive on
> s390, so maybe not.
>

We discussed that! If we would have more than two data formats
we would have taken your approach. But for only two, I think
it is ok to do it my way!

Michael


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-28 10:36 ` Pekka J Enberg
@ 2006-04-28 13:14   ` Michael Holzheu
  0 siblings, 0 replies; 62+ messages in thread
From: Michael Holzheu @ 2006-04-28 13:14 UTC (permalink / raw)
  To: Pekka J Enberg; +Cc: akpm, ioe-lkml, joern, linux-kernel, mschwid2

Pekka J Enberg <penberg@cs.Helsinki.FI> wrote on 04/28/2006 12:36:59 PM:
>
> > +static struct dentry *update_file_dentry;
> > +static struct file_operations hypfs_file_ops;
> > +static struct file_system_type hypfs_type;
> > +static struct super_operations hypfs_s_ops;
> > +static time_t last_update_time = 0;   /* update time in seconds
> since 1970 */
> > +static DEFINE_MUTEX(hypfs_lock);
>
> These should be per-superblock and not global, right?

Yes! update_file_dentry, last_update_time and hypfs_lock
should be per-superblock. I will change this!

hypfs_file_ops, hpyfs_type and hypfs_s_ops should remain
global, since this will be the same for all super blocks.

Michael


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-28 11:53   ` Michael Holzheu
@ 2006-04-28 15:48     ` Jörn Engel
  0 siblings, 0 replies; 62+ messages in thread
From: Jörn Engel @ 2006-04-28 15:48 UTC (permalink / raw)
  To: Michael Holzheu; +Cc: akpm, ioe-lkml, linux-kernel, mschwid2, penberg

On Fri, 28 April 2006 13:53:50 +0200, Michael Holzheu wrote:
> 
> We discussed that! If we would have more than two data formats
> we would have taken your approach. But for only two, I think
> it is ok to do it my way!

Sounds fair.  Sorry I missed the previous discussion.

Jörn

-- 
All art is but imitation of nature.
-- Lucius Annaeus Seneca

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-28  9:56 ` Andrew Morton
@ 2006-04-28 17:36   ` Michael Holzheu
  2006-04-28 17:43     ` Jörn Engel
  2006-04-28 19:44     ` Andrew Morton
  0 siblings, 2 replies; 62+ messages in thread
From: Michael Holzheu @ 2006-04-28 17:36 UTC (permalink / raw)
  To: Andrew Morton; +Cc: ioe-lkml, joern, linux-kernel, mschwid2, penberg

Hi Andrew!

Thanks for the review!

Andrew Morton <akpm@osdl.org> wrote on 04/28/2006 11:56:21 AM:
> Michael Holzheu <holzheu@de.ibm.com> wrote:

[snip]

>
> > +config HYPFS_FS
> > +   bool "s390 hypervisor file system support"
> > +   default y
> > +   help
> > +     This is a virtual file system intended to provide accounting
> > +     information in a s390 hypervisor environment.
> > +
>
> "in an s390"?

Yes!

> > +static inline int diag204(unsigned long subcode, unsigned long
> size, void *addr)
>
> There's a lot of inlining going on.  Is it excessive?

I will remove the diag inlines. The code is not performance critical
anyway, since diag 204 itself is soo slow. But I would like to keep
the inlines for the getter functions.

> > ...
> >
> > +static int diag204_probe(void)
> > +{
> > +   void *buf;
> > +   int pages, rc;
> > +
> > +   pages = diag204(SUBC_RSI | INFO_EXT, 0, 0);
> > +   if (pages > 0) {
>
> In other places, you return an error if (pages < 0), but not here.

I make an additional comment to document the code
here!

... and you are right! In diag204_store() we should check
not for pages < 0, but for pages <= 0! The other spots
are ok.

[snip]

> > +   }
> > +      out:
> > +   rc = 0;
> > +      err_out:
> > +   kfree(buf);
> > +   return rc;
> > +}
>
> The indenting of the labels is a bit unconventional.

Sorry, that was lindent...

> > +static void *diag204_store(void)
> > +{
> > +   void *buf;
> > +   int pages;
> > +
> > +   if (diag204_store_sc == SUBC_STIB4)
> > +      pages = 1;
> > +   else
> > +      pages = diag204(SUBC_RSI | diag204_info_type, 0, 0);
> > +
> > +   if (pages < 0)
> > +      return ERR_PTR(-ENOSYS);
> > +
> > +   buf = kmalloc(pages * PAGE_SIZE, GFP_KERNEL);
> > +   if (!buf)
> > +      return ERR_PTR(-ENOMEM);
> > +
> > +   if (diag204(diag204_store_sc | diag204_info_type, pages, buf) < 0)
{
> > +      kfree(buf);
> > +      return ERR_PTR(-ENOSYS);
> > +   }
> > +   return buf;
> > +}
>
> One wonders if
>
>    pages = diag204(...);
>    return kmalloc(pages * PAGE_SIZE, gfp_flags);
>
> would be a useful helper function.

Good idea! I implemented a diag204_get_buffer() function, but
I am not sure, if it is better now. At least it is not worse!

> > +   asm volatile("   diag    %0,%1,0x224\n"
> > +           : :"d" (0), "d"(ptr) : "memory");
> > +}
> > +
> > +static inline int diag224_get_name_table(void)
> > +{
> > +   /* memory must be below 2GB */
> > +   diag224_cpu_names = kmalloc(PAGE_SIZE, GFP_KERNEL | GFP_DMA);
> > +   if (!diag224_cpu_names)
> > +      return -ENOMEM;
> > +   diag224(diag224_cpu_names);
> > +   EBCASC(diag224_cpu_names + 16, (*diag224_cpu_names + 1) * 16);
> > +   return 0;
> > +}
> > +
> > +static inline void diag224_delete_name_table(void)
> > +{
> > +   kfree(diag224_cpu_names);
> > +}
> > +
> > +static int diag224_idx2name(int index, char *name)
> > +{
> > +   memcpy(name, diag224_cpu_names + ((index + 1) * 16), 16);
> > +   name[16] = 0;
>
> Should this be "15"?   I guess not...

No bug, our strings here have 16 characters and are not
0 terminated.

> > +   strstrip(name);
> > +   return 0;
> > +}
> > +
> > +int diag_init(void)
> > +{
> > +   int rc;
> > +
> > +   if (diag204_probe()) {
> > +      printk(KERN_ERR "hypfs: diag 204 not working.");
> > +      return -ENODATA;
> > +   }
> > +   rc = diag224_get_name_table();
> > +   if (rc) {
> > +      diag224_delete_name_table();
> > +      printk(KERN_ERR "hypfs: could not get name table.\n");
> > +   }
> > +   return rc;
> > +}
>
> This can be __init

Yes!

> > +void diag_exit(void)
> > +{
> > +   diag224_delete_name_table();
> > +}
>
> And __exit.

Yes!

> > +int diag_create_files(struct super_block *sb, struct dentry *root)
> > +{
>
> Perhaps as a globally-visible function this should be called
> hypfs_diag_create_files().  Ditto diag_init() and diag_exit().

Ok, I renamed the functions.

> > +   struct dentry *systems_dir, *hyp_dir;
> > +   void *time_hdr, *part_hdr;
> > +   int i;
> > +   void *buffer, *rc;
> > +
> > +   buffer = diag204_store();
> > +   if (IS_ERR(buffer))
> > +      return PTR_ERR(buffer);
> > +
> > +   systems_dir = hypfs_mkdir(sb, root, "systems");
> > +   if (IS_ERR(systems_dir))
> > +      return PTR_ERR(systems_dir);
> > +   time_hdr = (struct x_info_blk_hdr *)buffer;
> > +   part_hdr = time_hdr + info_blk_hdr__size(diag204_info_type);
> > +   for (i = 0; i < info_blk_hdr__npar(diag204_info_type, time_hdr);
i++) {
> > +      part_hdr = hypfs_create_lpar_files(sb, systems_dir, part_hdr);
> > +      if (IS_ERR(part_hdr))
> > +         return PTR_ERR(part_hdr);
> > +   }
> > +   if (info_blk_hdr__flags(diag204_info_type, time_hdr) &
LPAR_PHYS_FLG) {
> > +      void *rc;
> > +      rc = hypfs_create_phys_files(sb, root, part_hdr);
> > +      if (IS_ERR(rc))
> > +         return PTR_ERR(rc);
> > +   }
> > +   hyp_dir = hypfs_mkdir(sb, root, "hyp");
> > +   if (IS_ERR(hyp_dir))
> > +      return PTR_ERR(hyp_dir);
> > +   rc = hypfs_create_str(sb, hyp_dir, "type", "LPAR Hypervisor");
> > +   if (IS_ERR(rc))
> > +      return PTR_ERR(rc);
> > +   kfree(buffer);
> > +   return 0;
> > +}
>
> This leaks `buffer' on the error paths, yes?

Right! I fixed this!

> > +/*
> > + *  fs/hypfs/inode.c
> > + *    Hypervisor filesystem for Linux on s390.
> > + *
> > + *    Copyright (C) IBM Corp. 2006
> > + *    Author(s): Michael Holzheu <holzheu@de.ibm.com>
> > + */
> > +
> > +#include <linux/types.h>
> > +#include <linux/errno.h>
> > +#include <linux/fs.h>
> > +#include <linux/namei.h>
> > +#include <linux/vfs.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/gfp.h>
> > +#include <linux/time.h>
> > +#include <linux/parser.h>
> > +#include <linux/sysfs.h>
> > +#include <linux/module.h>
> > +#include <asm/ebcdic.h>
> > +#include "hypfs.h"
> > +#include "hypfs_diag.h"
> > +
> > +#define HYPFS_MAGIC 0x687970   /* ASCII 'hyp' */
> > +#define TMP_SIZE 64      /* size of temporary buffers */
> > +
> > +static struct dentry *hypfs_create_update_file(struct super_block *sb,
> > +                      struct dentry *dir);
> > +
> > +struct hypfs_sb_info {
> > +   int uid;      /* uid used for files and dirs */
> > +   int gid;      /* gid used for files and dirs */
> > +};
>
> uid_t?

Good idea!

> > +
> > +static int hypfs_open(struct inode *inode, struct file *filp)
> > +{
> > +   char *data = (char *)filp->f_dentry->d_inode->u.generic_ip;
>
> Unneeded typecast.

Yes

> > +   if (filp->f_mode & FMODE_WRITE) {
> > +      if (!(inode->i_mode & S_IWUGO))
> > +         return -EACCES;
> > +   }
> > +   if (filp->f_mode & FMODE_READ) {
> > +      if (!(inode->i_mode & S_IRUGO))
> > +         return -EACCES;
> > +   }
>
> Is the standard VFS permission checking not appropriate?
>
> (A comment should be added here).

You mean using .permission in the inode operations
and using the generic_permission() function?

Currently I do not have own inode operations (and
I don't want to have them ...)

> > +   mutex_lock(&hypfs_lock);
> > +   filp->private_data = kmalloc(strlen(data) + 1, GFP_KERNEL);
> > +   if (!filp->private_data) {
> > +      mutex_unlock(&hypfs_lock);
> > +      return -ENOMEM;
> > +   }
> > +   strcpy(filp->private_data, data);
>
> kstrdup()

Yes!

> > +   mutex_unlock(&hypfs_lock);
> > +   return 0;
> > +}
> > +
> > +static ssize_t hypfs_aio_read(struct kiocb *iocb, __user char *buf,
> > +               size_t count, loff_t offset)
> > +{
> > +   char *data;
> > +   size_t len;
> > +   struct file *filp = iocb->ki_filp;
> > +
> > +   data = (char *)filp->private_data;
>
> Unneeded cast.

Yes!

> > +   len = strlen(data);
> > +   if (offset > len) {
> > +      count = 0;
> > +      goto out;
> > +   }
> > +   if (count > len - offset)
> > +      count = len - offset;
> > +   if (copy_to_user(buf, data + offset, count)) {
> > +      count = -EFAULT;
> > +      goto out;
> > +   }
> > +   iocb->ki_pos += count;
> > +   file_accessed(filp);
> > +      out:
> > +   return count;
> > +}
> > +static ssize_t hypfs_aio_write(struct kiocb *iocb, const char __user
*buf,
> > +                size_t count, loff_t pos)
> > +{
> > +   int rc;
> > +   struct super_block *sb;
> > +
> > +   mutex_lock(&hypfs_lock);
> > +   sb = iocb->ki_filp->f_dentry->d_inode->i_sb;
> > +   if (last_update_time == get_seconds()) {
> > +      rc = -EBUSY;
> > +      goto out;
> > +   }
>
> Why's it looking at the time in this manner?
>
> (Whatever the reason, a comment should be added here).

I added the following comment:

/*
 * Currently we only allow one update per second for two reasons:
 * 1. diag 204 is VERY expensive
 * 2. If several processes do updates in parallel and then read the
 *    hypfs data, the likelihood of collisions is reduced, if we restrict
 *    the minimum update interval. A collision occurs, if during the
 *    data gathering of one process another process triggers an update
 *    If the first process wants to ensure consistent data, it has
 *    to restart data collection in this case.
 */

[snip]

> > +static struct dentry *hypfs_create_update_file(struct super_block *sb,
> > +                      struct dentry *dir)
> > +{
> > +   struct dentry *dentry;
> > +
> > +   dentry = hypfs_create_file(sb, dir, "update", NULL,
> > +               S_IFREG | UPDATE_FILE_MODE);
> > +   if (!dentry)
> > +      return ERR_PTR(-ENOMEM);
>
> But hypfs_create_file() will return ERR_PTR(-ENOMEM), never NULL.

Fixed this!

> > +   /*
> > +    * We do not put the update file on the 'delete' list with
> > +    * hypfs_add_dentry(), since it should not be removed when the tree
> > +    * is updated.
> > +    */
> > +   return dentry;
> > +}
> > +
> > +struct dentry *hypfs_create_u64(struct super_block *sb, struct dentry
*dir,
> > +            const char *name, __u64 value)
> > +{
> > +   char *buffer;
> > +   char tmp[TMP_SIZE];
> > +   struct dentry *dentry;
> > +
> > +   snprintf(tmp, TMP_SIZE, "%lld\n", (unsigned long long int)value);
> > +   buffer = kmalloc(strlen(tmp) + 1, GFP_KERNEL);
> > +   if (!buffer)
> > +      return ERR_PTR(-ENOMEM);
> > +   strcpy(buffer, tmp);
>
> kstrdup()

Yes!

> > +   dentry =
> > +       hypfs_create_file(sb, dir, name, buffer, S_IFREG |
REG_FILE_MODE);
> > +   if (!dentry) {
>
> IS_ERR()

Yes!

> > +      kfree(buffer);
> > +      return ERR_PTR(-ENOMEM);
> > +   }
> > +   hypfs_add_dentry(dentry);
> > +   return dentry;
> > +}
> > +
> > +struct dentry *hypfs_create_str(struct super_block *sb, struct dentry
*dir,
> > +            const char *name, char *string)
> > +{
> > +   char *buffer;
> > +   struct dentry *dentry;
> > +
> > +   buffer = kmalloc(strlen(string) + 2, GFP_KERNEL);
> > +   if (!buffer)
> > +      return ERR_PTR(-ENOMEM);
> > +   sprintf(buffer, "%s\n", string);
> > +   dentry =
> > +       hypfs_create_file(sb, dir, name, buffer, S_IFREG |
REG_FILE_MODE);
> > +   if (!dentry) {
>
> IS_ERR()
> (Better check the whole patch for this)

Yes!

Thanks again for the review!

Michael


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-28 17:36   ` Michael Holzheu
@ 2006-04-28 17:43     ` Jörn Engel
  2006-05-02  8:06       ` Michael Holzheu
  2006-04-28 19:44     ` Andrew Morton
  1 sibling, 1 reply; 62+ messages in thread
From: Jörn Engel @ 2006-04-28 17:43 UTC (permalink / raw)
  To: Michael Holzheu; +Cc: Andrew Morton, ioe-lkml, linux-kernel, mschwid2, penberg

On Fri, 28 April 2006 19:36:30 +0200, Michael Holzheu wrote:
> Andrew Morton <akpm@osdl.org> wrote on 04/28/2006 11:56:21 AM:
> > Michael Holzheu <holzheu@de.ibm.com> wrote:
> 
> > > +static int diag224_idx2name(int index, char *name)
> > > +{
> > > +   memcpy(name, diag224_cpu_names + ((index + 1) * 16), 16);
> > > +   name[16] = 0;
> >
> > Should this be "15"?   I guess not...
> 
> No bug, our strings here have 16 characters and are not
> 0 terminated.

Hmm.  TMP_SIZE is defined to 64 and used for buffers allocated on the
stack.  It is not too excessive, but in this case 17 would definitely
be enough.  Not sure if it's worth going through.

Jörn

-- 
My second remark is that our intellectual powers are rather geared to
master static relations and that our powers to visualize processes
evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-28 17:36   ` Michael Holzheu
  2006-04-28 17:43     ` Jörn Engel
@ 2006-04-28 19:44     ` Andrew Morton
  1 sibling, 0 replies; 62+ messages in thread
From: Andrew Morton @ 2006-04-28 19:44 UTC (permalink / raw)
  To: Michael Holzheu; +Cc: ioe-lkml, joern, linux-kernel, mschwid2, penberg

Michael Holzheu <HOLZHEU@de.ibm.com> wrote:
>
> > > +   if (filp->f_mode & FMODE_WRITE) {
>  > > +      if (!(inode->i_mode & S_IWUGO))
>  > > +         return -EACCES;
>  > > +   }
>  > > +   if (filp->f_mode & FMODE_READ) {
>  > > +      if (!(inode->i_mode & S_IRUGO))
>  > > +         return -EACCES;
>  > > +   }
>  >
>  > Is the standard VFS permission checking not appropriate?
>  >
>  > (A comment should be added here).
> 
>  You mean using .permission in the inode operations
>  and using the generic_permission() function?
> 
>  Currently I do not have own inode operations (and
>  I don't want to have them ...)

The VFS-level open() code implements standard permission-checking so I
_think_ you don't need to do anything in here.  See how ramfs does it.

ramfs does have an inode_operations, for ->getattr() support.  So it can
return a correct number in stat->blocks.

sysfs implements inode_operations, so it can do stuff in ->setattr().

I don't think hypfs needs either of those, so you still shouldn't need a
file_inode_operations.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-28  9:22 [PATCH] s390: Hypervisor File System Michael Holzheu
                   ` (2 preceding siblings ...)
  2006-04-28 10:36 ` Pekka J Enberg
@ 2006-04-29  6:44 ` Andrew Morton
  2006-04-29  7:51   ` Greg KH
  2006-04-29  7:53 ` Greg KH
  4 siblings, 1 reply; 62+ messages in thread
From: Andrew Morton @ 2006-04-29  6:44 UTC (permalink / raw)
  To: Michael Holzheu; +Cc: schwidefsky, penberg, ioe-lkml, joern, linux-kernel

Michael Holzheu <holzheu@de.ibm.com> wrote:
>
>  As mount point for the filesystem /sys/hypervisor is created.

What does this mean, btw?  I don't see code there creating a new sysfs
directory, and userspace cannot do this.

Confused.

Also, "/sys/hypervisor" is probably insufficiently specific.  In a few
years time people will be asking "Which hypervisor?  We have eighteen of them!".


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-29  6:44 ` Andrew Morton
@ 2006-04-29  7:51   ` Greg KH
  2006-04-29  8:14     ` Andrew Morton
  0 siblings, 1 reply; 62+ messages in thread
From: Greg KH @ 2006-04-29  7:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michael Holzheu, schwidefsky, penberg, ioe-lkml, joern, linux-kernel

On Fri, Apr 28, 2006 at 11:44:41PM -0700, Andrew Morton wrote:
> Michael Holzheu <holzheu@de.ibm.com> wrote:
> >
> >  As mount point for the filesystem /sys/hypervisor is created.
> 
> What does this mean, btw?  I don't see code there creating a new sysfs
> directory, and userspace cannot do this.

The call to subsystem_register() does this.

> Also, "/sys/hypervisor" is probably insufficiently specific.  In a few
> years time people will be asking "Which hypervisor?  We have eighteen of them!".

I agree, the xen people are already clammering for some kind of sysfs
tree and wanted to create /sys/hypervisor/xen.  How about
/sys/hypervisor/s390?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-28  9:22 [PATCH] s390: Hypervisor File System Michael Holzheu
                   ` (3 preceding siblings ...)
  2006-04-29  6:44 ` Andrew Morton
@ 2006-04-29  7:53 ` Greg KH
  2006-04-29  8:41   ` Kyle Moffett
                     ` (3 more replies)
  4 siblings, 4 replies; 62+ messages in thread
From: Greg KH @ 2006-04-29  7:53 UTC (permalink / raw)
  To: Michael Holzheu; +Cc: akpm, schwidefsky, penberg, ioe-lkml, joern, linux-kernel

On Fri, Apr 28, 2006 at 11:22:25AM +0200, Michael Holzheu wrote:
> On zSeries machines there exists an interface which allows the operating
> system  to retrieve LPAR hypervisor accounting data. For example, it is
> possible to get usage data for physical and virtual cpus. In order to
> provide this information to user space programs, I implemented a new
> virtual Linux file system named 'hypfs' using the Linux 2.6 libfs
> framework. The name 'hypfs' stands for 'Hypervisor Filesystem'. All the
> accounting information is put into different virtual files which can be
> accessed from user space. All data is represented as ASCII strings.
> 
> When the file system is mounted the accounting information is retrieved
> and a file system tree is created with the attribute files containing
> the cpu information. The content of the files remains unchanged until a
> new update is made. An update can be triggered from user space through
> writing 'something' into a special purpose update file.
> 
> We create the following directory structure:
> 
> <mount-point>/
>         update
>         cpus/
>                 <cpu-id>
>                         type
>                         mgmtime
>                 <cpu-id>
>                         ...
>         hyp/
>                 type
>         systems/
>                 <lpar-name>
>                         cpus/
>                                 <cpu-id>
>                                         type
>                                         mgmtime
>                                         cputime
>                                         onlinetime
>                                 <cpu-id>
>                                         ...
>                 <lpar-name>
>                         cpus/
>                                 ...
> 
> - update: File to trigger update
> - cpus/: Directory for all physical cpus
> - cpus/<cpu-id>/: Directory for one physical cpu.
> - cpus/<cpu-id>/type: Type name of physical zSeries cpu.
> - cpus/<cpu-id>/mgmtime: Physical-LPAR-management time in microseconds.
> - hyp/: Directory for hypervisor information
> - hyp/type: Typ of hypervisor (currently only 'LPAR Hypervisor')
> - systems/: Directory for all LPARs
> - systems/<lpar-name>/: Directory for one LPAR.
> - systems/<lpar-name>/cpus/<cpu-id>/: Directory for the virtual cpus
> - systems/<lpar-name>/cpus/<cpu-id>/type: Typ of cpu.
> - systems/<lpar-name>/cpus/<cpu-id>/mgmtime:
> Accumulated number of microseconds during which a physical
> CPU was assigned to the logical cpu and the cpu time was 
> consumed by the hypervisor and was not provided to
> the LPAR (LPAR overhead).
> 
> - systems/<lpar-name>/cpus/<cpu-id>/cputime:
> Accumulated number of microseconds during which a physical CPU
> was assigned to the logical cpu and the cpu time was consumed
> by the LPAR.
> 
> - systems/<lpar-name>/cpus/<cpu-id>/onlinetime:
> Accumulated number of microseconds during which the logical CPU
> has been online.
> 
> As mount point for the filesystem /sys/hypervisor is created.
> 
> The update process is triggered when writing 'something' into the
> 'update' file at the top level hypfs directory. You can do this e.g.
> with 'echo 1 > update'. During the update the whole directory structure
> is deleted and built up again.

This sounds a lot like configfs.  Why not use that instead?

Is there a reason that sysfs can't be used for a lot of these things
too?

We already have the different cpus in sysfs, why put things in a
different location than that?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-29  7:51   ` Greg KH
@ 2006-04-29  8:14     ` Andrew Morton
  2006-05-03  8:48       ` Michael Holzheu
  0 siblings, 1 reply; 62+ messages in thread
From: Andrew Morton @ 2006-04-29  8:14 UTC (permalink / raw)
  To: Greg KH; +Cc: holzheu, schwidefsky, penberg, ioe-lkml, joern, linux-kernel

Greg KH <greg@kroah.com> wrote:
>
>  On Fri, Apr 28, 2006 at 11:44:41PM -0700, Andrew Morton wrote:
>  > Michael Holzheu <holzheu@de.ibm.com> wrote:
>  > >
>  > >  As mount point for the filesystem /sys/hypervisor is created.
>  > 
>  > What does this mean, btw?  I don't see code there creating a new sysfs
>  > directory, and userspace cannot do this.
> 
>  The call to subsystem_register() does this.

Ah.  It's using "hypfs".  Michael was tricking us.

>  > Also, "/sys/hypervisor" is probably insufficiently specific.  In a few
>  > years time people will be asking "Which hypervisor?  We have eighteen of them!".
> 
>  I agree, the xen people are already clammering for some kind of sysfs
>  tree and wanted to create /sys/hypervisor/xen.  How about
>  /sys/hypervisor/s390?

Yes, something like that.  Even "hypfs" is possibly too generic.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-29  7:53 ` Greg KH
@ 2006-04-29  8:41   ` Kyle Moffett
  2006-04-29 21:55     ` Greg KH
  2006-05-03  9:33     ` Michael Holzheu
  2006-05-02 10:12   ` Michael Holzheu
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 62+ messages in thread
From: Kyle Moffett @ 2006-04-29  8:41 UTC (permalink / raw)
  To: Greg KH
  Cc: Michael Holzheu, akpm, schwidefsky, penberg, ioe-lkml, joern,
	linux-kernel

On Apr 29, 2006, at 03:53:11, Greg KH wrote:
>> The update process is triggered when writing 'something' into the  
>> 'update' file at the top level hypfs directory. You can do this  
>> e.g. with 'echo 1 > update'. During the update the whole directory  
>> structure is deleted and built up again.
>
> This sounds a lot like configfs.  Why not use that instead?
>
> Is there a reason that sysfs can't be used for a lot of these  
> things too?
>
> We already have the different cpus in sysfs, why put things in a  
> different location than that?

It sounds like a lot of things need some kind of shell-scriptable  
transaction interface for sysfs files.  You don't want to have more  
than one value per file, but reading or writing of some values must  
be done together for consistency reasons.  Is there any way to  
implement something like this?  This would work for the framebuffer  
people and solve the needs of a lot of the people who still want  
ioctls or some other atomic-multivalued transfer that would otherwise  
be a great sysfs candidate.

> # Start a transaction
> exec 3</sys/hypervisor/s390/transaction
>
> # Add another reference to the transaction
> exec 4<&3
>
> # List CPUs and info
> cd /sys/hypervisor/s390/cpus
> for i in *; do
> 	echo "CPU $i: type=`cat type` mgmtime=`cat mgmtime`"
> done
>
> # Transaction does *not* end here
> exec 3<&-
>
> # List CPUs again a different way
> ls /sys/hypervisor/s390/cpus
>
> # Transaction does end here
> exec 4<&-

A process that doesn't care about the transaction (Like a user  
inspecting from a shell), could just cd in and cat random files, and  
it would run a transaction for each file opened, and end the  
transaction when the file is closed.  Obviously that's inefficient,  
but it's obviously correct and works.  Otherwise, from opening the  
transaction filehandle for the first time till closing the last open  
FD to it, the process would get a consistent view of the data, read  
when the "transaction" file is first opened opened.  Perhaps any  
modifications would be synced and written at the close of that  
filehandle.

I don't know if that's something doable given the FD model, but if it  
is, then that's trivial to use for the admin, in shell scripts, perl  
scripts, C programs, etc.  It also works with the one-value-per-file  
sysfs model.

Cheers,
Kyle Moffett


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-29  8:41   ` Kyle Moffett
@ 2006-04-29 21:55     ` Greg KH
  2006-04-30  5:18       ` Kyle Moffett
  2006-05-03  9:33     ` Michael Holzheu
  1 sibling, 1 reply; 62+ messages in thread
From: Greg KH @ 2006-04-29 21:55 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Michael Holzheu, akpm, schwidefsky, penberg, ioe-lkml, joern,
	linux-kernel

On Sat, Apr 29, 2006 at 04:41:05AM -0400, Kyle Moffett wrote:
> On Apr 29, 2006, at 03:53:11, Greg KH wrote:
> >>The update process is triggered when writing 'something' into the  
> >>'update' file at the top level hypfs directory. You can do this  
> >>e.g. with 'echo 1 > update'. During the update the whole directory  
> >>structure is deleted and built up again.
> >
> >This sounds a lot like configfs.  Why not use that instead?
> >
> >Is there a reason that sysfs can't be used for a lot of these  
> >things too?
> >
> >We already have the different cpus in sysfs, why put things in a  
> >different location than that?
> 
> It sounds like a lot of things need some kind of shell-scriptable  
> transaction interface for sysfs files.  You don't want to have more  
> than one value per file, but reading or writing of some values must  
> be done together for consistency reasons.  Is there any way to  
> implement something like this?  This would work for the framebuffer  
> people and solve the needs of a lot of the people who still want  
> ioctls or some other atomic-multivalued transfer that would otherwise  
> be a great sysfs candidate.

relayfs is for that.  You can now put relayfs files in any ram based
file system (procfs, ramfs, sysfs, debugfs, etc.)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-29 21:55     ` Greg KH
@ 2006-04-30  5:18       ` Kyle Moffett
  2006-05-01 20:38         ` Greg KH
  0 siblings, 1 reply; 62+ messages in thread
From: Kyle Moffett @ 2006-04-30  5:18 UTC (permalink / raw)
  To: Greg KH
  Cc: Michael Holzheu, akpm, schwidefsky, penberg, ioe-lkml, joern,
	linux-kernel

On Apr 29, 2006, at 17:55:01, Greg KH wrote:
> relayfs is for that.  You can now put relayfs files in any ram  
> based file system (procfs, ramfs, sysfs, debugfs, etc.)

But you can't twiddle relayfs with echo and cat; it's more suited to  
high-bandwidth transfers than anything else, no?  The idea here would  
be to be able to interact with the files in /sys the same way you  
always do, but provide a sort of consistency system whereby a program  
_or_ sysadmin can attach its view of the /sys/hypervisor directory  
tree to a particular snapshot of the system.  As far as I can tell  
(although I'd be happy to be proven wrong), there is no trivial way  
to manually access or shellscript relayfs files, the way you can  
"cat /sys/devices/<path-to-device>/dev".

Cheers,
Kyle Moffett

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-30  5:18       ` Kyle Moffett
@ 2006-05-01 20:38         ` Greg KH
  2006-05-01 23:29           ` Kyle Moffett
  0 siblings, 1 reply; 62+ messages in thread
From: Greg KH @ 2006-05-01 20:38 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Michael Holzheu, akpm, schwidefsky, penberg, ioe-lkml, joern,
	linux-kernel

On Sun, Apr 30, 2006 at 01:18:46AM -0400, Kyle Moffett wrote:
> On Apr 29, 2006, at 17:55:01, Greg KH wrote:
> >relayfs is for that.  You can now put relayfs files in any ram  
> >based file system (procfs, ramfs, sysfs, debugfs, etc.)
> 
> But you can't twiddle relayfs with echo and cat; it's more suited to  
> high-bandwidth transfers than anything else, no?

Yes.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-01 20:38         ` Greg KH
@ 2006-05-01 23:29           ` Kyle Moffett
  2006-05-02  4:00             ` Greg KH
  0 siblings, 1 reply; 62+ messages in thread
From: Kyle Moffett @ 2006-05-01 23:29 UTC (permalink / raw)
  To: Greg KH
  Cc: Michael Holzheu, akpm, schwidefsky, penberg, ioe-lkml, joern,
	linux-kernel

On May 1, 2006, at 16:38:15, Greg KH wrote:
> On Sun, Apr 30, 2006 at 01:18:46AM -0400, Kyle Moffett wrote:
>> On Apr 29, 2006, at 17:55:01, Greg KH wrote:
>>> relayfs is for that.  You can now put relayfs files in any ram  
>>> based file system (procfs, ramfs, sysfs, debugfs, etc.)
>>
>> But you can't twiddle relayfs with echo and cat; it's more suited  
>> to high-bandwidth transfers than anything else, no?
>
> Yes.

So my question stands:  What is the _recommended_ way to handle  
simple data types in low-bandwidth/frequency multiple-valued  
transactions to hardware?  Examples include reading/modifying  
framebuffer settings (currently done through IOCTLS), s390 current  
state (up for discussion), etc.  In these cases there needs to be an  
atomic snapshot or write of multiple values at the same time.  Given  
the situation it would be _nice_ to use sysfs so the admin can do it  
by hand; makes things shell scriptable and reduces the number of  
binary compatibility issues.

Cheers,
Kyle Moffett

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-01 23:29           ` Kyle Moffett
@ 2006-05-02  4:00             ` Greg KH
  2006-05-02  5:23               ` Kay Sievers
  2006-05-02  8:48               ` Kyle Moffett
  0 siblings, 2 replies; 62+ messages in thread
From: Greg KH @ 2006-05-02  4:00 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Michael Holzheu, akpm, schwidefsky, penberg, ioe-lkml, joern,
	linux-kernel

On Mon, May 01, 2006 at 07:29:23PM -0400, Kyle Moffett wrote:
> On May 1, 2006, at 16:38:15, Greg KH wrote:
> >On Sun, Apr 30, 2006 at 01:18:46AM -0400, Kyle Moffett wrote:
> >>On Apr 29, 2006, at 17:55:01, Greg KH wrote:
> >>>relayfs is for that.  You can now put relayfs files in any ram  
> >>>based file system (procfs, ramfs, sysfs, debugfs, etc.)
> >>
> >>But you can't twiddle relayfs with echo and cat; it's more suited  
> >>to high-bandwidth transfers than anything else, no?
> >
> >Yes.
> 
> So my question stands:  What is the _recommended_ way to handle  
> simple data types in low-bandwidth/frequency multiple-valued  
> transactions to hardware?  Examples include reading/modifying  
> framebuffer settings (currently done through IOCTLS), s390 current  
> state (up for discussion), etc.  In these cases there needs to be an  
> atomic snapshot or write of multiple values at the same time.  Given  
> the situation it would be _nice_ to use sysfs so the admin can do it  
> by hand; makes things shell scriptable and reduces the number of  
> binary compatibility issues.

I really don't know of a way to use sysfs for this currently, and hence,
am not complaining too much about the different /proc files that have
this kind of information in it at the moment.

If you or someone else wants to come up with some kind of solution for
it, I'm sure that many people would be very happy to see it.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-02  4:00             ` Greg KH
@ 2006-05-02  5:23               ` Kay Sievers
  2006-05-02  5:37                 ` Greg KH
  2006-05-02  8:48               ` Kyle Moffett
  1 sibling, 1 reply; 62+ messages in thread
From: Kay Sievers @ 2006-05-02  5:23 UTC (permalink / raw)
  To: Greg KH
  Cc: Kyle Moffett, Michael Holzheu, akpm, schwidefsky, penberg,
	ioe-lkml, joern, linux-kernel

On Mon, May 01, 2006 at 09:00:53PM -0700, Greg KH wrote:
> On Mon, May 01, 2006 at 07:29:23PM -0400, Kyle Moffett wrote:
> > On May 1, 2006, at 16:38:15, Greg KH wrote:
> > >On Sun, Apr 30, 2006 at 01:18:46AM -0400, Kyle Moffett wrote:
> > >>On Apr 29, 2006, at 17:55:01, Greg KH wrote:
> > >>>relayfs is for that.  You can now put relayfs files in any ram  
> > >>>based file system (procfs, ramfs, sysfs, debugfs, etc.)
> > >>
> > >>But you can't twiddle relayfs with echo and cat; it's more suited  
> > >>to high-bandwidth transfers than anything else, no?
> > >
> > >Yes.
> > 
> > So my question stands:  What is the _recommended_ way to handle  
> > simple data types in low-bandwidth/frequency multiple-valued  
> > transactions to hardware?  Examples include reading/modifying  
> > framebuffer settings (currently done through IOCTLS), s390 current  
> > state (up for discussion), etc.  In these cases there needs to be an  
> > atomic snapshot or write of multiple values at the same time.  Given  
> > the situation it would be _nice_ to use sysfs so the admin can do it  
> > by hand; makes things shell scriptable and reduces the number of  
> > binary compatibility issues.
> 
> I really don't know of a way to use sysfs for this currently, and hence,
> am not complaining too much about the different /proc files that have
> this kind of information in it at the moment.
> 
> If you or someone else wants to come up with some kind of solution for
> it, I'm sure that many people would be very happy to see it.

If the count of values handled in a transaction is not to high and it
makes sense to group these values logically, why not just create an
attribute group for every transaction, which creates dummy attributes
to fill the values in, and use an "action" file in that group, that
commits all the values at once to whatever target? That should fit into
the ioctl use pattern, right?

Kay

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-02  5:23               ` Kay Sievers
@ 2006-05-02  5:37                 ` Greg KH
  2006-05-02 11:46                   ` Kay Sievers
  0 siblings, 1 reply; 62+ messages in thread
From: Greg KH @ 2006-05-02  5:37 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Kyle Moffett, Michael Holzheu, akpm, schwidefsky, penberg,
	ioe-lkml, joern, linux-kernel

On Tue, May 02, 2006 at 07:23:41AM +0200, Kay Sievers wrote:
> If the count of values handled in a transaction is not to high and it
> makes sense to group these values logically, why not just create an
> attribute group for every transaction, which creates dummy attributes
> to fill the values in, and use an "action" file in that group, that
> commits all the values at once to whatever target? That should fit into
> the ioctl use pattern, right?

That's what configfs can handle easier.  I think the issue is getting
stuff from the kernel in one atomic snapshot (all the different file
values from the same point in time.)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-28 17:43     ` Jörn Engel
@ 2006-05-02  8:06       ` Michael Holzheu
  0 siblings, 0 replies; 62+ messages in thread
From: Michael Holzheu @ 2006-05-02  8:06 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Andrew Morton, ioe-lkml, linux-kernel, mschwid2, penberg

Hi Joern,

Jörn Engel <joern@wohnheim.fh-wedel.de> wrote on 04/28/2006 07:43:02 PM:
> On Fri, 28 April 2006 19:36:30 +0200, Michael Holzheu wrote:
> > Andrew Morton <akpm@osdl.org> wrote on 04/28/2006 11:56:21 AM:
> > > Michael Holzheu <holzheu@de.ibm.com> wrote:
> >
> > > > +static int diag224_idx2name(int index, char *name)
> > > > +{
> > > > +   memcpy(name, diag224_cpu_names + ((index + 1) * 16), 16);
> > > > +   name[16] = 0;
> > >
> > > Should this be "15"?   I guess not...
> >
> > No bug, our strings here have 16 characters and are not
> > 0 terminated.
>
> Hmm.  TMP_SIZE is defined to 64 and used for buffers allocated on the
> stack.  It is not too excessive, but in this case 17 would definitely
> be enough.  Not sure if it's worth going through.
>
> Jörn

Since I use the buffers with size TMP_SIZE also for other purposes
in the same functions, where the cpu types are accessed,
I think, it is not useful to define a second buffer with size 17.

But I think it is better to have a define for the buffer size
of cpu names. something like:

 #define LPAR_NAME_LEN 8                /* lpar name len in diag 204 data
*/
+#define CPU_NAME_LEN 16                /* type name len of a cpu in
diag224 name table */
 #define TMP_SIZE 64            /* size of temporary buffers */

 static int diag224_idx2name(int index, char *name)
 {
-       memcpy(name, diag224_cpu_names + ((index + 1) * 16), 16);
-       name[16] = 0;
+       memcpy(name, diag224_cpu_names + ((index + 1) * CPU_NAME_LEN),
+               CPU_NAME_LEN);
+       name[CPU_NAME_LEN] = 0;
        strstrip(name);
        return 0;
 }

Michael


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-02  4:00             ` Greg KH
  2006-05-02  5:23               ` Kay Sievers
@ 2006-05-02  8:48               ` Kyle Moffett
  2006-05-02 21:30                 ` Greg KH
  1 sibling, 1 reply; 62+ messages in thread
From: Kyle Moffett @ 2006-05-02  8:48 UTC (permalink / raw)
  To: Greg KH
  Cc: Michael Holzheu, akpm, schwidefsky, penberg, ioe-lkml, joern,
	linux-kernel

On May 2, 2006, at 00:00:53, Greg KH wrote:
> On Mon, May 01, 2006 at 07:29:23PM -0400, Kyle Moffett wrote:
>> So my question stands:  What is the _recommended_ way to handle  
>> simple data types in low-bandwidth/frequency multiple-valued  
>> transactions to hardware?  Examples include reading/modifying  
>> framebuffer settings (currently done through IOCTLS), s390 current  
>> state (up for discussion), etc.  In these cases there needs to be  
>> an atomic snapshot or write of multiple values at the same time.   
>> Given the situation it would be _nice_ to use sysfs so the admin  
>> can do it by hand; makes things shell scriptable and reduces the  
>> number of binary compatibility issues.
>
> I really don't know of a way to use sysfs for this currently, and  
> hence, am not complaining too much about the different /proc files  
> that have this kind of information in it at the moment.
>
> If you or someone else wants to come up with some kind of solution  
> for it, I'm sure that many people would be very happy to see it.

Hmm, ok; I'll see what I can come up with.  Would anybody object to  
this kind of API (as in my previous email) that uses an open fd as a  
transaction "handle"?

Example script:
> ## Associate this process with an atomic snapshot
> ## of the /sys/hypervisor/s390 filesystem tree.
> exec 3>/sys/hypervisor/s390/transaction
>
> ## Read data from /sys/hypervisor/s390 without
> ## worrying about atomicity; as that's guaranteed
> ## by the open FD 3.
> ls /sys/hypervisor/s390/cpus
> cat /sys/hypervisor/s390/some_data_file
>
> ## Create another reference in this process to the
> ## _same_ atomic snapshot
> exec 4>&3
>
> ## Does *not* close out the atomic snapshot
> exec 3>&-
>
> ## Yet another ref; still the _same_ snapshot
> exec 6>/sys/hypervisor/s390/transaction
> exec 4>&-
>
> ## Regardless of what has changed in the meantime,
> ## our filesystem tree still looks the same
> ls /sys/hypervisor/s390/cpus
>
> ## Write out values
> echo some_state >/sys/hypervisor/s390/statefile
>
> ## Decide we don't like the changes and abort
> echo reset >&6
>
> ## Release the last copy of the snapshot and
> ## commit modified values
> exec 6>&-


This would allow usages like the following:
> exec 3>/sys/hypervisor/s390/transaction
> /bin/s390_change_hypervisor_state
> ## Look at new state; decide if we like it or not
> if [ -z "$I_LIKE_THE_STATE" ]; then
> 	echo reset >&3
> fi
> exec 3>&-


For actually implementing this; I'm considering a design which hangs  
a transaction off of a "struct file" such that fork() and clone()  
preserve the same transaction.  When a new process obtains an FD with  
the given transaction it would add that process' current pointer to a  
hash-table referencing the transaction data structure so that the open 
() call could look up the transaction for a given task in the hash  
table and use the data specified in the transaction.  When a  
transaction is opened it would read the data atomically from the  
hardware or in-kernel data structures and store an "initial" copy as  
well as a "current" copy in per-transaction memory.  As a user could  
theoretically pin NPROC * size_of_transaction_data * 2 of kernel  
memory, transaction files should have fairly strict file modes or  
some sort of resource-accounting semantic.  On a "reset" operation  
the "initial" copy would be used to overwrite the "current" copy  
again, and a changed bit would be unset.  Changes would result in the  
changed bit being set.  When the transaction is closed, if the  
changed bit is set then the data would be committed atomically, then  
all the memory would be freed and the transaction removed from the  
hash table.

Anything that sounds broken/fishy/"No that's impossible because..."  
in there?  I appreciate your input; if this sounds feasable I'll try  
to hack up a patch.

Cheers,
Kyle Moffett


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-29  7:53 ` Greg KH
  2006-04-29  8:41   ` Kyle Moffett
@ 2006-05-02 10:12   ` Michael Holzheu
  2006-05-02 13:00   ` Michael Holzheu
  2006-05-03  8:45   ` Michael Holzheu
  3 siblings, 0 replies; 62+ messages in thread
From: Michael Holzheu @ 2006-05-02 10:12 UTC (permalink / raw)
  To: Greg KH; +Cc: akpm, ioe-lkml, joern, linux-kernel, mschwid2, penberg

Greg KH <greg@kroah.com> wrote on 04/29/2006 09:53:11 AM:
> On Fri, Apr 28, 2006 at 11:22:25AM +0200, Michael Holzheu wrote:
> > On zSeries machines there exists an interface which allows the
operating
> > system  to retrieve LPAR hypervisor accounting data. For example, it is
> > possible to get usage data for physical and virtual cpus. In order to
> > provide this information to user space programs, I implemented a new
> > virtual Linux file system named 'hypfs' using the Linux 2.6 libfs
> > framework. The name 'hypfs' stands for 'Hypervisor Filesystem'. All the
> > accounting information is put into different virtual files which can be
> > accessed from user space. All data is represented as ASCII strings.
> >
> > When the file system is mounted the accounting information is retrieved
> > and a file system tree is created with the attribute files containing
> > the cpu information. The content of the files remains unchanged until a
> > new update is made. An update can be triggered from user space through
> > writing 'something' into a special purpose update file.
> >
> > We create the following directory structure:
> >
> > <mount-point>/
> >         update
> >         cpus/
> >                 <cpu-id>
> >                         type
> >                         mgmtime
> >                 <cpu-id>
> >                         ...
> >         hyp/
> >                 type
> >         systems/
> >                 <lpar-name>
> >                         cpus/
> >                                 <cpu-id>
> >                                         type
> >                                         mgmtime
> >                                         cputime
> >                                         onlinetime
> >                                 <cpu-id>
> >                                         ...
> >                 <lpar-name>
> >                         cpus/
> >                                 ...
> >
> > - update: File to trigger update
> > - cpus/: Directory for all physical cpus
> > - cpus/<cpu-id>/: Directory for one physical cpu.
> > - cpus/<cpu-id>/type: Type name of physical zSeries cpu.
> > - cpus/<cpu-id>/mgmtime: Physical-LPAR-management time in microseconds.
> > - hyp/: Directory for hypervisor information
> > - hyp/type: Typ of hypervisor (currently only 'LPAR Hypervisor')
> > - systems/: Directory for all LPARs
> > - systems/<lpar-name>/: Directory for one LPAR.
> > - systems/<lpar-name>/cpus/<cpu-id>/: Directory for the virtual cpus
> > - systems/<lpar-name>/cpus/<cpu-id>/type: Typ of cpu.
> > - systems/<lpar-name>/cpus/<cpu-id>/mgmtime:
> > Accumulated number of microseconds during which a physical
> > CPU was assigned to the logical cpu and the cpu time was
> > consumed by the hypervisor and was not provided to
> > the LPAR (LPAR overhead).
> >
> > - systems/<lpar-name>/cpus/<cpu-id>/cputime:
> > Accumulated number of microseconds during which a physical CPU
> > was assigned to the logical cpu and the cpu time was consumed
> > by the LPAR.
> >
> > - systems/<lpar-name>/cpus/<cpu-id>/onlinetime:
> > Accumulated number of microseconds during which the logical CPU
> > has been online.
> >
> > As mount point for the filesystem /sys/hypervisor is created.
> >
> > The update process is triggered when writing 'something' into the
> > 'update' file at the top level hypfs directory. You can do this e.g.
> > with 'echo 1 > update'. During the update the whole directory structure
> > is deleted and built up again.
>
> This sounds a lot like configfs.  Why not use that instead?

Right! I read the documentation of configfs and looks like
configfs is exactly what I need.

To get consistend data the user can do the following:

1. mkdir /config/s390-hypervisor/tmpdir

This creates our directory tree with a snapshot of
the hypervisor data.

2. cd tmpdir and read all data

3. rmdir /config/s390-hypervisor/tmpdir

It looks a bit odd to get hypervisor accounting data
from /config, but who cares ...

Michael


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-02  5:37                 ` Greg KH
@ 2006-05-02 11:46                   ` Kay Sievers
  2006-05-02 21:28                     ` Greg KH
  0 siblings, 1 reply; 62+ messages in thread
From: Kay Sievers @ 2006-05-02 11:46 UTC (permalink / raw)
  To: Greg KH
  Cc: Kyle Moffett, Michael Holzheu, akpm, schwidefsky, penberg,
	ioe-lkml, joern, linux-kernel

On Mon, May 01, 2006 at 10:37:03PM -0700, Greg KH wrote:
> On Tue, May 02, 2006 at 07:23:41AM +0200, Kay Sievers wrote:
> > If the count of values handled in a transaction is not to high and it
> > makes sense to group these values logically, why not just create an
> > attribute group for every transaction, which creates dummy attributes
> > to fill the values in, and use an "action" file in that group, that
> > commits all the values at once to whatever target? That should fit into
> > the ioctl use pattern, right?
> 
> That's what configfs can handle easier.  I think the issue is getting
> stuff from the kernel in one atomic snapshot (all the different file
> values from the same point in time.)

Sure, but just like an ioctl, the kernel could return the values after
writing to the "action" file in the dummy attributes. That would be
something like a snapshot, right?

Kay

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-29  7:53 ` Greg KH
  2006-04-29  8:41   ` Kyle Moffett
  2006-05-02 10:12   ` Michael Holzheu
@ 2006-05-02 13:00   ` Michael Holzheu
  2006-05-03  8:45   ` Michael Holzheu
  3 siblings, 0 replies; 62+ messages in thread
From: Michael Holzheu @ 2006-05-02 13:00 UTC (permalink / raw)
  To: Greg KH, joel.becker
  Cc: akpm, ioe-lkml, joern, linux-kernel, mschwid2, penberg

Hi Greg and Joel,

Greg KH <greg@kroah.com> wrote on 04/29/2006 09:53:11 AM:
> On Fri, Apr 28, 2006 at 11:22:25AM +0200, Michael Holzheu wrote:
> > On zSeries machines there exists an interface which allows the
operating
> > system  to retrieve LPAR hypervisor accounting data. For example, it is
> > possible to get usage data for physical and virtual cpus. In order to
> > provide this information to user space programs, I implemented a new
> > virtual Linux file system named 'hypfs' using the Linux 2.6 libfs
> > framework. The name 'hypfs' stands for 'Hypervisor Filesystem'. All the
> > accounting information is put into different virtual files which can be
> > accessed from user space. All data is represented as ASCII strings.
> >
> > When the file system is mounted the accounting information is retrieved
> > and a file system tree is created with the attribute files containing
> > the cpu information. The content of the files remains unchanged until a
> > new update is made. An update can be triggered from user space through
> > writing 'something' into a special purpose update file.
> >
> > We create the following directory structure:
> >
> > <mount-point>/
> >         update
> >         cpus/
> >                 <cpu-id>
> >                         type
> >                         mgmtime
> >                 <cpu-id>
> >                         ...
> >         hyp/
> >                 type
> >         systems/
> >                 <lpar-name>
> >                         cpus/
> >                                 <cpu-id>
> >                                         type
> >                                         mgmtime
> >                                         cputime
> >                                         onlinetime
> >                                 <cpu-id>
> >                                         ...
> >                 <lpar-name>
> >                         cpus/
> >                                 ...
> >
> > - update: File to trigger update
> > - cpus/: Directory for all physical cpus
> > - cpus/<cpu-id>/: Directory for one physical cpu.
> > - cpus/<cpu-id>/type: Type name of physical zSeries cpu.
> > - cpus/<cpu-id>/mgmtime: Physical-LPAR-management time in microseconds.
> > - hyp/: Directory for hypervisor information
> > - hyp/type: Typ of hypervisor (currently only 'LPAR Hypervisor')
> > - systems/: Directory for all LPARs
> > - systems/<lpar-name>/: Directory for one LPAR.
> > - systems/<lpar-name>/cpus/<cpu-id>/: Directory for the virtual cpus
> > - systems/<lpar-name>/cpus/<cpu-id>/type: Typ of cpu.
> > - systems/<lpar-name>/cpus/<cpu-id>/mgmtime:
> > Accumulated number of microseconds during which a physical
> > CPU was assigned to the logical cpu and the cpu time was
> > consumed by the hypervisor and was not provided to
> > the LPAR (LPAR overhead).
> >
> > - systems/<lpar-name>/cpus/<cpu-id>/cputime:
> > Accumulated number of microseconds during which a physical CPU
> > was assigned to the logical cpu and the cpu time was consumed
> > by the LPAR.
> >
> > - systems/<lpar-name>/cpus/<cpu-id>/onlinetime:
> > Accumulated number of microseconds during which the logical CPU
> > has been online.
> >
> > As mount point for the filesystem /sys/hypervisor is created.
> >
> > The update process is triggered when writing 'something' into the
> > 'update' file at the top level hypfs directory. You can do this e.g.
> > with 'echo 1 > update'. During the update the whole directory structure
> > is deleted and built up again.
>
> This sounds a lot like configfs.  Why not use that instead?
>

After having a deeper look into configfs, I do not understand how
to create a directory tree. Is it somehow possible to create a
directory tree as a result of mkdir? Maybe I missed here
something...

Michael


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-02 11:46                   ` Kay Sievers
@ 2006-05-02 21:28                     ` Greg KH
  2006-05-02 21:33                       ` Kay Sievers
  0 siblings, 1 reply; 62+ messages in thread
From: Greg KH @ 2006-05-02 21:28 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Kyle Moffett, Michael Holzheu, akpm, schwidefsky, penberg,
	ioe-lkml, joern, linux-kernel

On Tue, May 02, 2006 at 01:46:03PM +0200, Kay Sievers wrote:
> On Mon, May 01, 2006 at 10:37:03PM -0700, Greg KH wrote:
> > On Tue, May 02, 2006 at 07:23:41AM +0200, Kay Sievers wrote:
> > > If the count of values handled in a transaction is not to high and it
> > > makes sense to group these values logically, why not just create an
> > > attribute group for every transaction, which creates dummy attributes
> > > to fill the values in, and use an "action" file in that group, that
> > > commits all the values at once to whatever target? That should fit into
> > > the ioctl use pattern, right?
> > 
> > That's what configfs can handle easier.  I think the issue is getting
> > stuff from the kernel in one atomic snapshot (all the different file
> > values from the same point in time.)
> 
> Sure, but just like an ioctl, the kernel could return the values after
> writing to the "action" file in the dummy attributes. That would be
> something like a snapshot, right?

Yes, but where would the buffer be to return the data to on a write?  In
the data that the user passed to write?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-02  8:48               ` Kyle Moffett
@ 2006-05-02 21:30                 ` Greg KH
  2006-05-02 21:49                   ` Kay Sievers
  0 siblings, 1 reply; 62+ messages in thread
From: Greg KH @ 2006-05-02 21:30 UTC (permalink / raw)
  To: Kyle Moffett, Kay Sievers
  Cc: Michael Holzheu, akpm, schwidefsky, penberg, ioe-lkml, joern,
	linux-kernel

On Tue, May 02, 2006 at 04:48:42AM -0400, Kyle Moffett wrote:
> On May 2, 2006, at 00:00:53, Greg KH wrote:
> >On Mon, May 01, 2006 at 07:29:23PM -0400, Kyle Moffett wrote:
> >>So my question stands:  What is the _recommended_ way to handle  
> >>simple data types in low-bandwidth/frequency multiple-valued  
> >>transactions to hardware?  Examples include reading/modifying  
> >>framebuffer settings (currently done through IOCTLS), s390 current  
> >>state (up for discussion), etc.  In these cases there needs to be  
> >>an atomic snapshot or write of multiple values at the same time.   
> >>Given the situation it would be _nice_ to use sysfs so the admin  
> >>can do it by hand; makes things shell scriptable and reduces the  
> >>number of binary compatibility issues.
> >
> >I really don't know of a way to use sysfs for this currently, and  
> >hence, am not complaining too much about the different /proc files  
> >that have this kind of information in it at the moment.
> >
> >If you or someone else wants to come up with some kind of solution  
> >for it, I'm sure that many people would be very happy to see it.
> 
> Hmm, ok; I'll see what I can come up with.  Would anybody object to  
> this kind of API (as in my previous email) that uses an open fd as a  
> transaction "handle"?

No, I think Kay played around with something like using the open fd of
the directory as such a lock (or was he using flock on it, I can't
remember now...)

> Example script:
> >## Associate this process with an atomic snapshot
> >## of the /sys/hypervisor/s390 filesystem tree.
> >exec 3>/sys/hypervisor/s390/transaction
> >
> >## Read data from /sys/hypervisor/s390 without
> >## worrying about atomicity; as that's guaranteed
> >## by the open FD 3.
> >ls /sys/hypervisor/s390/cpus
> >cat /sys/hypervisor/s390/some_data_file
> >
> >## Create another reference in this process to the
> >## _same_ atomic snapshot
> >exec 4>&3
> >
> >## Does *not* close out the atomic snapshot
> >exec 3>&-
> >
> >## Yet another ref; still the _same_ snapshot
> >exec 6>/sys/hypervisor/s390/transaction
> >exec 4>&-
> >
> >## Regardless of what has changed in the meantime,
> >## our filesystem tree still looks the same
> >ls /sys/hypervisor/s390/cpus
> >
> >## Write out values
> >echo some_state >/sys/hypervisor/s390/statefile
> >
> >## Decide we don't like the changes and abort
> >echo reset >&6
> >
> >## Release the last copy of the snapshot and
> >## commit modified values
> >exec 6>&-
> 
> 
> This would allow usages like the following:
> >exec 3>/sys/hypervisor/s390/transaction
> >/bin/s390_change_hypervisor_state
> >## Look at new state; decide if we like it or not
> >if [ -z "$I_LIKE_THE_STATE" ]; then
> >	echo reset >&3
> >fi
> >exec 3>&-
> 
> 
> For actually implementing this; I'm considering a design which hangs  
> a transaction off of a "struct file" such that fork() and clone()  
> preserve the same transaction.  When a new process obtains an FD with  
> the given transaction it would add that process' current pointer to a  
> hash-table referencing the transaction data structure so that the open 
> () call could look up the transaction for a given task in the hash  
> table and use the data specified in the transaction.  When a  
> transaction is opened it would read the data atomically from the  
> hardware or in-kernel data structures and store an "initial" copy as  
> well as a "current" copy in per-transaction memory.  As a user could  
> theoretically pin NPROC * size_of_transaction_data * 2 of kernel  
> memory, transaction files should have fairly strict file modes or  
> some sort of resource-accounting semantic.  On a "reset" operation  
> the "initial" copy would be used to overwrite the "current" copy  
> again, and a changed bit would be unset.  Changes would result in the  
> changed bit being set.  When the transaction is closed, if the  
> changed bit is set then the data would be committed atomically, then  
> all the memory would be freed and the transaction removed from the  
> hash table.
> 
> Anything that sounds broken/fishy/"No that's impossible because..."  
> in there?  I appreciate your input; if this sounds feasable I'll try  
> to hack up a patch.

Sounds a bit complex.  Try looking at flock and see if you can pass that
info back to the sysfs attribute owners.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-02 21:28                     ` Greg KH
@ 2006-05-02 21:33                       ` Kay Sievers
  2006-05-02 21:54                         ` Greg KH
  0 siblings, 1 reply; 62+ messages in thread
From: Kay Sievers @ 2006-05-02 21:33 UTC (permalink / raw)
  To: Greg KH
  Cc: Kyle Moffett, Michael Holzheu, akpm, schwidefsky, penberg,
	ioe-lkml, joern, linux-kernel

On Tue, May 02, 2006 at 02:28:45PM -0700, Greg KH wrote:
> On Tue, May 02, 2006 at 01:46:03PM +0200, Kay Sievers wrote:
> > On Mon, May 01, 2006 at 10:37:03PM -0700, Greg KH wrote:
> > > On Tue, May 02, 2006 at 07:23:41AM +0200, Kay Sievers wrote:
> > > > If the count of values handled in a transaction is not to high and it
> > > > makes sense to group these values logically, why not just create an
> > > > attribute group for every transaction, which creates dummy attributes
> > > > to fill the values in, and use an "action" file in that group, that
> > > > commits all the values at once to whatever target? That should fit into
> > > > the ioctl use pattern, right?
> > > 
> > > That's what configfs can handle easier.  I think the issue is getting
> > > stuff from the kernel in one atomic snapshot (all the different file
> > > values from the same point in time.)
> > 
> > Sure, but just like an ioctl, the kernel could return the values after
> > writing to the "action" file in the dummy attributes. That would be
> > something like a snapshot, right?
> 
> Yes, but where would the buffer be to return the data to on a write?  In
> the data that the user passed to write?

In the "dummy attribute", allocated by the device instance.

Kay

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-02 21:30                 ` Greg KH
@ 2006-05-02 21:49                   ` Kay Sievers
  2006-05-02 23:18                     ` Kyle Moffett
  0 siblings, 1 reply; 62+ messages in thread
From: Kay Sievers @ 2006-05-02 21:49 UTC (permalink / raw)
  To: Greg KH
  Cc: Kyle Moffett, Michael Holzheu, akpm, schwidefsky, penberg,
	ioe-lkml, joern, linux-kernel

On Tue, May 02, 2006 at 02:30:43PM -0700, Greg KH wrote:
> On Tue, May 02, 2006 at 04:48:42AM -0400, Kyle Moffett wrote:
> > On May 2, 2006, at 00:00:53, Greg KH wrote:
> > >On Mon, May 01, 2006 at 07:29:23PM -0400, Kyle Moffett wrote:
> > >>So my question stands:  What is the _recommended_ way to handle  
> > >>simple data types in low-bandwidth/frequency multiple-valued  
> > >>transactions to hardware?  Examples include reading/modifying  
> > >>framebuffer settings (currently done through IOCTLS), s390 current  
> > >>state (up for discussion), etc.  In these cases there needs to be  
> > >>an atomic snapshot or write of multiple values at the same time.   
> > >>Given the situation it would be _nice_ to use sysfs so the admin  
> > >>can do it by hand; makes things shell scriptable and reduces the  
> > >>number of binary compatibility issues.
> > >
> > >I really don't know of a way to use sysfs for this currently, and  
> > >hence, am not complaining too much about the different /proc files  
> > >that have this kind of information in it at the moment.
> > >
> > >If you or someone else wants to come up with some kind of solution  
> > >for it, I'm sure that many people would be very happy to see it.
> > 
> > Hmm, ok; I'll see what I can come up with.  Would anybody object to  
> > this kind of API (as in my previous email) that uses an open fd as a  
> > transaction "handle"?
> 
> No, I think Kay played around with something like using the open fd of
> the directory as such a lock (or was he using flock on it, I can't
> remember now...)

If you can assume that processes accessing the values are cooperative,
it already works without any changes:

  $ time flock /sys/class/firmware echo 1 > /sys/class/firmware/timeout
  real    0m0.005s

  $ flock /sys/class/firmware sleep 5&
  [1] 6468

  $ time flock /sys/class/firmware echo 1 > /sys/class/firmware/timeout
  real    0m3.558s

Kay

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-02 21:33                       ` Kay Sievers
@ 2006-05-02 21:54                         ` Greg KH
  0 siblings, 0 replies; 62+ messages in thread
From: Greg KH @ 2006-05-02 21:54 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Kyle Moffett, Michael Holzheu, akpm, schwidefsky, penberg,
	ioe-lkml, joern, linux-kernel

On Tue, May 02, 2006 at 11:33:52PM +0200, Kay Sievers wrote:
> On Tue, May 02, 2006 at 02:28:45PM -0700, Greg KH wrote:
> > On Tue, May 02, 2006 at 01:46:03PM +0200, Kay Sievers wrote:
> > > On Mon, May 01, 2006 at 10:37:03PM -0700, Greg KH wrote:
> > > > On Tue, May 02, 2006 at 07:23:41AM +0200, Kay Sievers wrote:
> > > > > If the count of values handled in a transaction is not to high and it
> > > > > makes sense to group these values logically, why not just create an
> > > > > attribute group for every transaction, which creates dummy attributes
> > > > > to fill the values in, and use an "action" file in that group, that
> > > > > commits all the values at once to whatever target? That should fit into
> > > > > the ioctl use pattern, right?
> > > > 
> > > > That's what configfs can handle easier.  I think the issue is getting
> > > > stuff from the kernel in one atomic snapshot (all the different file
> > > > values from the same point in time.)
> > > 
> > > Sure, but just like an ioctl, the kernel could return the values after
> > > writing to the "action" file in the dummy attributes. That would be
> > > something like a snapshot, right?
> > 
> > Yes, but where would the buffer be to return the data to on a write?  In
> > the data that the user passed to write?
> 
> In the "dummy attribute", allocated by the device instance.

Ok, I'm totally confused and don't understand anymore.  Care to walk
this through again as to how it would work?

sorry,

greg k-h

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-02 21:49                   ` Kay Sievers
@ 2006-05-02 23:18                     ` Kyle Moffett
  0 siblings, 0 replies; 62+ messages in thread
From: Kyle Moffett @ 2006-05-02 23:18 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Greg KH, Michael Holzheu, akpm, schwidefsky, penberg, ioe-lkml,
	joern, linux-kernel

On May 2, 2006, at 17:49:08, Kay Sievers wrote:
> If you can assume that processes accessing the values are  
> cooperative, it already works without any changes:
>
>   $ time flock /sys/class/firmware echo 1 > /sys/class/firmware/ 
> timeout
>   real    0m0.005s
>
>   $ flock /sys/class/firmware sleep 5&
>   [1] 6468
>
>   $ time flock /sys/class/firmware echo 1 > /sys/class/firmware/ 
> timeout
>   real    0m3.558s

But that doesn't solve the problem for framebuffer devices or for the  
s390 code.  Such transactions have one or more of the following  
properties:

(1)  A read operation is _expensive_ or adds unacceptable latencies  
and should be done as rarely as possible
(2)  The data must be all written to hardware simultaneously by the  
kernel; a partial update does not make sense and would cause  
undesired operation from the hardware.

The idea with the transactions would be to create a kernel-memory  
buffer-layer of sorts on top of the underlying sysfs tree to cache  
the read data and collect writes for an atomic commit.  I'll see if I  
can make something work.

Cheers,
Kyle Moffett

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-29  7:53 ` Greg KH
                     ` (2 preceding siblings ...)
  2006-05-02 13:00   ` Michael Holzheu
@ 2006-05-03  8:45   ` Michael Holzheu
  3 siblings, 0 replies; 62+ messages in thread
From: Michael Holzheu @ 2006-05-03  8:45 UTC (permalink / raw)
  To: Greg KH; +Cc: akpm, ioe-lkml, joern, linux-kernel, mschwid2, penberg

Greg KH <greg@kroah.com> wrote on 04/29/2006 09:53:11 AM:
> On Fri, Apr 28, 2006 at 11:22:25AM +0200, Michael Holzheu wrote:
> > On zSeries machines there exists an interface which allows the
operating
> > system  to retrieve LPAR hypervisor accounting data. For example, it is
> > possible to get usage data for physical and virtual cpus. In order to
> > provide this information to user space programs, I implemented a new
> > virtual Linux file system named 'hypfs' using the Linux 2.6 libfs
> > framework. The name 'hypfs' stands for 'Hypervisor Filesystem'. All the
> > accounting information is put into different virtual files which can be
> > accessed from user space. All data is represented as ASCII strings.
> >
> > When the file system is mounted the accounting information is retrieved
> > and a file system tree is created with the attribute files containing
> > the cpu information. The content of the files remains unchanged until a
> > new update is made. An update can be triggered from user space through
> > writing 'something' into a special purpose update file.
> >
> > We create the following directory structure:
> >
> > <mount-point>/
> >         update
> >         cpus/
> >                 <cpu-id>
> >                         type
> >                         mgmtime
> >                 <cpu-id>
> >                         ...
> >         hyp/
> >                 type
> >         systems/
> >                 <lpar-name>
> >                         cpus/
> >                                 <cpu-id>
> >                                         type
> >                                         mgmtime
> >                                         cputime
> >                                         onlinetime
> >                                 <cpu-id>
> >                                         ...
> >                 <lpar-name>
> >                         cpus/
> >                                 ...
> >
> > - update: File to trigger update
> > - cpus/: Directory for all physical cpus
> > - cpus/<cpu-id>/: Directory for one physical cpu.
> > - cpus/<cpu-id>/type: Type name of physical zSeries cpu.
> > - cpus/<cpu-id>/mgmtime: Physical-LPAR-management time in microseconds.
> > - hyp/: Directory for hypervisor information
> > - hyp/type: Typ of hypervisor (currently only 'LPAR Hypervisor')
> > - systems/: Directory for all LPARs
> > - systems/<lpar-name>/: Directory for one LPAR.
> > - systems/<lpar-name>/cpus/<cpu-id>/: Directory for the virtual cpus
> > - systems/<lpar-name>/cpus/<cpu-id>/type: Typ of cpu.
> > - systems/<lpar-name>/cpus/<cpu-id>/mgmtime:
> > Accumulated number of microseconds during which a physical
> > CPU was assigned to the logical cpu and the cpu time was
> > consumed by the hypervisor and was not provided to
> > the LPAR (LPAR overhead).
> >
> > - systems/<lpar-name>/cpus/<cpu-id>/cputime:
> > Accumulated number of microseconds during which a physical CPU
> > was assigned to the logical cpu and the cpu time was consumed
> > by the LPAR.
> >
> > - systems/<lpar-name>/cpus/<cpu-id>/onlinetime:
> > Accumulated number of microseconds during which the logical CPU
> > has been online.
> >
> > As mount point for the filesystem /sys/hypervisor is created.
> >
> > The update process is triggered when writing 'something' into the
> > 'update' file at the top level hypfs directory. You can do this e.g.
> > with 'echo 1 > update'. During the update the whole directory structure
> > is deleted and built up again.
>
> This sounds a lot like configfs.  Why not use that instead?
>

I read more about configfs. There seems to be no way to
create our complete directory tree, when issuing the mkdir.

Therefore I would like to stay with my current implementation
of the Hypervisor Filesystem.

Michael


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-29  8:14     ` Andrew Morton
@ 2006-05-03  8:48       ` Michael Holzheu
  2006-05-03 22:10         ` Greg KH
  0 siblings, 1 reply; 62+ messages in thread
From: Michael Holzheu @ 2006-05-03  8:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Greg KH, ioe-lkml, joern, linux-kernel, mschwid2, penberg

Hi Andrew,

Andrew Morton <akpm@osdl.org> wrote on 04/29/2006 10:14:23 AM:

[snip]

> >  > Also, "/sys/hypervisor" is probably insufficiently specific.  In a
few
> >  > years time people will be asking "Which hypervisor?  We have
> eighteen of them!".
> >
> >  I agree, the xen people are already clammering for some kind of sysfs
> >  tree and wanted to create /sys/hypervisor/xen.  How about
> >  /sys/hypervisor/s390?

Fine with me! Then I will create /sys/hypervisor/s390. Should I
create /sys/hypervisor in the hpyfs code or should it be
created somewhere else?

> Yes, something like that.  Even "hypfs" is possibly too generic.

What about s390-hypfs?

Michael


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-29  8:41   ` Kyle Moffett
  2006-04-29 21:55     ` Greg KH
@ 2006-05-03  9:33     ` Michael Holzheu
  2006-05-03  9:42       ` Pekka J Enberg
  2006-05-03 10:01       ` Jörn Engel
  1 sibling, 2 replies; 62+ messages in thread
From: Michael Holzheu @ 2006-05-03  9:33 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: akpm, Greg KH, ioe-lkml, joern, linux-kernel, mschwid2, penberg

Kyle Moffett <mrmacman_g4@mac.com> wrote on 04/29/2006 10:41:05 AM:

|snip]

> It sounds like a lot of things need some kind of shell-scriptable
> transaction interface for sysfs files.  You don't want to have more
> than one value per file, but reading or writing of some values must
> be done together for consistency reasons.  Is there any way to
> implement something like this?  This would work for the framebuffer
> people and solve the needs of a lot of the people who still want
> ioctls or some other atomic-multivalued transfer that would otherwise
> be a great sysfs candidate.
>

Martin told me that you will probably kill me for the following,
but I neverless would like to post my suggestion:

All the complicated mechanisms with filesystem trees
to obtain consistent data and transaction functionality
could be avoided, if we would use single files, which
contain all the data. When opening the file, the snapshot
is created and attached to the struct file.

As far as I know, common sense is to avoid that, because
it is ugly and error prone to parse the files in user space.

But If we would provide a standard tag language (I don't
say XML here) for all sort of kernel data, which should
be exported to user space, one standard parser could
be used to obtain the data. This is not error prone since
you can always use the standard parser to access the
data. It is still human readable, which is an advantage
compared with binary files. You can either read
the file directly or use the parser to get the tree structure
in a better readable way. E.g in our s390 hypfs example
you could call:

standardparser < /sys/hypervisor/s390/accountigdata

which prints a nice tree to stdout. Or you could use the
parser in shell scripts to obtain specific attributes.
E.g. you could call:

standardparser < .../acountingdata systems.lparxy.cpu0.time

which prints e.g. "32432515" to stdout.

This approach would solve our atomicity problem and
also the problem of parsing complicated sysfs files.

Putting the tree into one file using a tag language is from
a logical point of view exactly the same as creating a
directory tree within a filesystem.

Regarding performance in kernel space it is not more effort
to create the tree in one file than creating the tree
within a filesystem. It is even less effort, since you do not
need hundreds of system calls (open, read, close) to
obtain the data from user space.

If you think, that this topic has already been discussed
too often on this mailing list and it is not worth discussing
this further, please just ignore this posting!

Michael


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-03  9:33     ` Michael Holzheu
@ 2006-05-03  9:42       ` Pekka J Enberg
  2006-05-03 12:11         ` Michael Holzheu
  2006-05-03 10:01       ` Jörn Engel
  1 sibling, 1 reply; 62+ messages in thread
From: Pekka J Enberg @ 2006-05-03  9:42 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Kyle Moffett, akpm, Greg KH, ioe-lkml, joern, linux-kernel, mschwid2

On Wed, 3 May 2006, Michael Holzheu wrote:
> All the complicated mechanisms with filesystem trees
> to obtain consistent data and transaction functionality
> could be avoided, if we would use single files, which
> contain all the data. When opening the file, the snapshot
> is created and attached to the struct file.

If we're going to add new infrastructure, I'd vote for adding snapshotting 
capability to filesystems. We need it for stuff like LogFS anyway.

				Pekka

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-03  9:33     ` Michael Holzheu
  2006-05-03  9:42       ` Pekka J Enberg
@ 2006-05-03 10:01       ` Jörn Engel
  1 sibling, 0 replies; 62+ messages in thread
From: Jörn Engel @ 2006-05-03 10:01 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Kyle Moffett, akpm, Greg KH, ioe-lkml, linux-kernel, mschwid2, penberg

On Wed, 3 May 2006 11:33:01 +0200, Michael Holzheu wrote:
> 
> All the complicated mechanisms with filesystem trees
> to obtain consistent data and transaction functionality
> could be avoided, if we would use single files, which
> contain all the data. When opening the file, the snapshot
> is created and attached to the struct file.

s/single file/single entity/ and this may be useful.  Your filesystem
exports a directory tree, which is nice and easily parsable.  The
problem is that it is a single resource for everyone.  If different
users could have their own views of this filesystem, each with a
private snapshot, many problems would be solved.

Spufs might have something similar already.  Istr something about
returning a directory fd and then using openat(2) and friends.

Jörn

-- 
To my face you have the audacity to advise me to become a thief - the worst
kind of thief that is conceivable, a thief of spiritual things, a thief of
ideas! It is insufferable, intolerable!
-- M. Binet in Scarabouche

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-03  9:42       ` Pekka J Enberg
@ 2006-05-03 12:11         ` Michael Holzheu
  2006-05-03 12:33           ` Jörn Engel
  0 siblings, 1 reply; 62+ messages in thread
From: Michael Holzheu @ 2006-05-03 12:11 UTC (permalink / raw)
  To: Pekka J Enberg
  Cc: akpm, Greg KH, ioe-lkml, joern, linux-kernel, Kyle Moffett, mschwid2


Pekka J Enberg <penberg@cs.Helsinki.FI> wrote on 05/03/2006 11:42:01 AM:
> On Wed, 3 May 2006, Michael Holzheu wrote:
> > All the complicated mechanisms with filesystem trees
> > to obtain consistent data and transaction functionality
> > could be avoided, if we would use single files, which
> > contain all the data. When opening the file, the snapshot
> > is created and attached to the struct file.
>
> If we're going to add new infrastructure, I'd vote for adding
snapshotting
> capability to filesystems. We need it for stuff like LogFS anyway.
>

Maybe we need that, too. But I think the advantage of the
one file solution moves the complexity from the kernel
to userspace.

Michael


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-03 12:11         ` Michael Holzheu
@ 2006-05-03 12:33           ` Jörn Engel
  2006-05-03 12:51             ` Michael Holzheu
  0 siblings, 1 reply; 62+ messages in thread
From: Jörn Engel @ 2006-05-03 12:33 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Pekka J Enberg, akpm, Greg KH, ioe-lkml, linux-kernel,
	Kyle Moffett, mschwid2

On Wed, 3 May 2006 14:11:36 +0200, Michael Holzheu wrote:
> 
> Maybe we need that, too. But I think the advantage of the
> one file solution moves the complexity from the kernel
> to userspace.

Now might be a time to come back to Martin's prediction. ;)

Having a weird format in some file does _not_ move complexity from the
kernel.  It may make the userspace more complex, granted.  But once
you try to change something, you need to keep the ABI stable.  And
part of the ABI is you file format.

Applications will depend on some arcane detail of your format.  They
will depend on exactly five spaces in "foo     bar".  It does not even
matter if you documented "any amount of whitespace".  The application
knows that it was five spaces and doesn't care.  And once you change
it, the blame will be on you, because you broke existing userspace.

If that does not make the kernel complex, I don't know what does.

Jörn

-- 
It does not matter how slowly you go, so long as you do not stop.
-- Confucius

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-03 12:33           ` Jörn Engel
@ 2006-05-03 12:51             ` Michael Holzheu
  2006-05-03 13:00               ` Jörn Engel
  0 siblings, 1 reply; 62+ messages in thread
From: Michael Holzheu @ 2006-05-03 12:51 UTC (permalink / raw)
  To: Jörn Engel
  Cc: akpm, Greg KH, ioe-lkml, linux-kernel, Kyle Moffett, mschwid2,
	Pekka J Enberg

Hi Jörn,

Jörn Engel <joern@wohnheim.fh-wedel.de> wrote on 05/03/2006 02:33:39 PM:
> On Wed, 3 May 2006 14:11:36 +0200, Michael Holzheu wrote:
> >
> > Maybe we need that, too. But I think the advantage of the
> > one file solution moves the complexity from the kernel
> > to userspace.
>
> Now might be a time to come back to Martin's prediction. ;)
>
> Having a weird format in some file does _not_ move complexity from the
> kernel.  It may make the userspace more complex, granted.  But once
> you try to change something, you need to keep the ABI stable.  And
> part of the ABI is you file format.

This is also true, if you use a filesystem tree. The tree structure and
the content of the files are part of your ABI. There is no difference
between a standard tag based file (all kernel files should use the
same format of course!) and a filesystem tree.

> Applications will depend on some arcane detail of your format.  They
> will depend on exactly five spaces in "foo     bar".  It does not even
> matter if you documented "any amount of whitespace".  The application
> knows that it was five spaces and doesn't care.  And once you change
> it, the blame will be on you, because you broke existing userspace.

Again, logically there is no difference between the two solutions. It does
not matter, if you have one file with:

<cpu>
    <0>
       <onlinetime = 4711>
    <\0>
<\cpu>

... or whatever the standard kernel format will be
... and a filesystem tree with:

+cpu
   + 0
      + onlinetime


If you implement a standard userspace parser, you can access the
attributes in one file as easyly as the attributes in a filesystem tree.

Michael


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-03 12:51             ` Michael Holzheu
@ 2006-05-03 13:00               ` Jörn Engel
  2006-05-03 13:18                 ` Michael Holzheu
  0 siblings, 1 reply; 62+ messages in thread
From: Jörn Engel @ 2006-05-03 13:00 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: akpm, Greg KH, ioe-lkml, linux-kernel, Kyle Moffett, mschwid2,
	Pekka J Enberg

On Wed, 3 May 2006 14:51:53 +0200, Michael Holzheu wrote:
> Jörn Engel <joern@wohnheim.fh-wedel.de> wrote on 05/03/2006 02:33:39 PM:
> 
> > Applications will depend on some arcane detail of your format.  They
> > will depend on exactly five spaces in "foo     bar".  It does not even
> > matter if you documented "any amount of whitespace".  The application
> > knows that it was five spaces and doesn't care.  And once you change
> > it, the blame will be on you, because you broke existing userspace.
> 
> Again, logically there is no difference between the two solutions. It does
> not matter, if you have one file with:
> 
> <cpu>
>     <0>
>        <onlinetime = 4711>
>     <\0>
> <\cpu>

Userspace can make your life hell by depending on indentation via 4
spaces.  The problem is that you don't necessarily know that it does
until you managed to change indentation.

In a filesystem tree, it is fairly hard to make assumptions that are
later broken.  It is by no means impossible, agreed.  But the
"indentation" doesn't exist anymore.  A file is part of a subdirectory
or it isn't.  Opening tags without matching closing tags don't exist
either.  List goes on.

In the end, both formats can get abused in ways you'd never foresee.
But the directory tree considerably raises the barrier.

Jörn

-- 
He who knows that enough is enough will always have enough.
-- Lao Tsu

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-03 13:00               ` Jörn Engel
@ 2006-05-03 13:18                 ` Michael Holzheu
  2006-05-03 13:22                   ` Jörn Engel
  2006-05-03 15:54                   ` Valdis.Kletnieks
  0 siblings, 2 replies; 62+ messages in thread
From: Michael Holzheu @ 2006-05-03 13:18 UTC (permalink / raw)
  To: Jörn Engel
  Cc: akpm, Greg KH, ioe-lkml, linux-kernel, Kyle Moffett, mschwid2,
	Pekka J Enberg

Hi Jörn,

Jörn Engel <joern@wohnheim.fh-wedel.de> wrote on 05/03/2006 03:00:43 PM:
> On Wed, 3 May 2006 14:51:53 +0200, Michael Holzheu wrote:
> > Jörn Engel <joern@wohnheim.fh-wedel.de> wrote on 05/03/2006 02:33:39
PM:
> >
> > > Applications will depend on some arcane detail of your format.  They
> > > will depend on exactly five spaces in "foo     bar".  It does not
even
> > > matter if you documented "any amount of whitespace".  The application
> > > knows that it was five spaces and doesn't care.  And once you change
> > > it, the blame will be on you, because you broke existing userspace.
> >
> > Again, logically there is no difference between the two solutions. It
does
> > not matter, if you have one file with:
> >
> > <cpu>
> >     <0>
> >        <onlinetime = 4711>
> >     <\0>
> > <\cpu>
>
> Userspace can make your life hell by depending on indentation via 4
> spaces.  The problem is that you don't necessarily know that it does
> until you managed to change indentation.

Of course! But the convention must be, that If userspace wants to
access the data, it has to use our standard linux
parser. If it accesses the data directly, this is broken.
This ensures, that whitespaces do not matter at all! And as
I said before, if you use the parser, you don't have any
difference compared to the filesystem solution from a logical
perspective.

Michael


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-03 13:18                 ` Michael Holzheu
@ 2006-05-03 13:22                   ` Jörn Engel
  2006-05-03 13:38                     ` Michael Holzheu
  2006-05-03 15:54                   ` Valdis.Kletnieks
  1 sibling, 1 reply; 62+ messages in thread
From: Jörn Engel @ 2006-05-03 13:22 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: akpm, Greg KH, ioe-lkml, linux-kernel, Kyle Moffett, mschwid2,
	Pekka J Enberg

On Wed, 3 May 2006 15:18:41 +0200, Michael Holzheu wrote:
> 
> Of course! But the convention must be, that If userspace wants to
> access the data, it has to use our standard linux
> parser. If it accesses the data directly, this is broken.
> This ensures, that whitespaces do not matter at all! And as
> I said before, if you use the parser, you don't have any
> difference compared to the filesystem solution from a logical
> perspective.

o People are not forced to follow the convention.  If they don't and
  you break an existing application, you get the blame.
o Now you have a dependency on the standard parser, which is in
  userspace.  Any bug in any version of the standard parser and...

Jörn

-- 
There's nothing better for promoting creativity in a medium than
making an audience feel "Hmm ­ I could do better than that!"
-- Douglas Adams in a slashdot interview

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-03 13:22                   ` Jörn Engel
@ 2006-05-03 13:38                     ` Michael Holzheu
  2006-05-03 14:17                       ` Martin Schwidefsky
  0 siblings, 1 reply; 62+ messages in thread
From: Michael Holzheu @ 2006-05-03 13:38 UTC (permalink / raw)
  To: Jörn Engel
  Cc: akpm, Greg KH, ioe-lkml, linux-kernel, Kyle Moffett, mschwid2,
	Pekka J Enberg

Hi Jörn,

Jörn Engel <joern@wohnheim.fh-wedel.de> wrote on 05/03/2006 03:22:39 PM:
> On Wed, 3 May 2006 15:18:41 +0200, Michael Holzheu wrote:
> >
> o People are not forced to follow the convention.  If they don't and
>   you break an existing application, you get the blame.

Sure, but this is really just a matter of definition. The kernel defines
the ABI, right?. User space has to follow the rules. If they break the
rules
that's badluck for the userspace tools. Currently you can also
get kernel information directly from /dev/mem. If an application
does that, nobody would say, that we are not allowed to change
kernel data structures because of that user space application.

> o Now you have a dependency on the standard parser, which is in
>   userspace.  Any bug in any version of the standard parser and...

At least this parser should be well tested, if everybody uses it.

But maybe I am completely wrong here ....

Michael


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-03 13:38                     ` Michael Holzheu
@ 2006-05-03 14:17                       ` Martin Schwidefsky
  2006-05-03 14:23                         ` Michael Holzheu
  0 siblings, 1 reply; 62+ messages in thread
From: Martin Schwidefsky @ 2006-05-03 14:17 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Jörn Engel, akpm, Greg KH, ioe-lkml, linux-kernel,
	Kyle Moffett, mschwid2, Pekka J Enberg

On Wed, 2006-05-03 at 15:38 +0200, Michael Holzheu wrote:
> > On Wed, 3 May 2006 15:18:41 +0200, Michael Holzheu wrote:
> > >
> > o People are not forced to follow the convention.  If they don't and
> >   you break an existing application, you get the blame.
> 
> Sure, but this is really just a matter of definition. The kernel defines
> the ABI, right?. User space has to follow the rules. If they break the
> rules
> that's badluck for the userspace tools. Currently you can also
> get kernel information directly from /dev/mem. If an application
> does that, nobody would say, that we are not allowed to change
> kernel data structures because of that user space application.

The kernel defines the ABI, but what IS the ABI? Is a single space that
the current implementation delivers an indication that there needs to be
a single space between two values, or could there be an arbitrary number
of spaces and tabs? You certainly can't conclude that from the code.
For /proc files people tended to use sscanf to read lines from the
output. Is the format of a line fixed, or can it be extended by
additional fields? Does certain fields have to start at a specific
offset or not? How long can the different fields get? And so on.

> > o Now you have a dependency on the standard parser, which is in
> >   userspace.  Any bug in any version of the standard parser and...
> 
> At least this parser should be well tested, if everybody uses it.

And the user space then uses the parser only? Is now the parser
interface the "ABI" or the kernel interface that is in turn used by the
parser? And what happens if somebody comes up with a "better" parser
that does things subtly different? 

In short: keep the kernel interface as simple as you possibly can. That
is why the single value approach has been invented. A text file that
needs to get parsed is certainly not simple.

-- 
blue skies,
  Martin.

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

"Reality continues to ruin my life." - Calvin.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-03 14:17                       ` Martin Schwidefsky
@ 2006-05-03 14:23                         ` Michael Holzheu
  2006-05-03 14:58                           ` Martin Schwidefsky
  0 siblings, 1 reply; 62+ messages in thread
From: Michael Holzheu @ 2006-05-03 14:23 UTC (permalink / raw)
  To: mschwid2
  Cc: akpm, Greg KH, ioe-lkml, Jörn Engel, linux-kernel,
	Kyle Moffett, mschwid2, Pekka J Enberg

mschwid2@de.ltcfwd.linux.ibm.com wrote on 05/03/2006 04:17:40 PM:
> And the user space then uses the parser only? Is now the parser
> interface the "ABI" or the kernel interface that is in turn used by the
> parser? And what happens if somebody comes up with a "better" parser
> that does things subtly different?

The ABI is not defined by the Parser. You have to specify the
tag language, which is part of the ABI. Any parser, which is comliant
to the specification of the tag language can be used.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-03 14:23                         ` Michael Holzheu
@ 2006-05-03 14:58                           ` Martin Schwidefsky
  2006-05-03 15:22                             ` Michael Holzheu
  0 siblings, 1 reply; 62+ messages in thread
From: Martin Schwidefsky @ 2006-05-03 14:58 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: mschwid2, akpm, Greg KH, ioe-lkml, Jörn Engel, linux-kernel,
	Kyle Moffett, Pekka J Enberg

On Wed, 2006-05-03 at 16:23 +0200, Michael Holzheu wrote:
> mschwid2@de.ltcfwd.linux.ibm.com wrote on 05/03/2006 04:17:40 PM:
> > And the user space then uses the parser only? Is now the parser
> > interface the "ABI" or the kernel interface that is in turn used by the
> > parser? And what happens if somebody comes up with a "better" parser
> > that does things subtly different?
> 
> The ABI is not defined by the Parser. You have to specify the
> tag language, which is part of the ABI. Any parser, which is comliant
> to the specification of the tag language can be used.

Optimist.

-- 
blue skies,
  Martin.

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

"Reality continues to ruin my life." - Calvin.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-03 14:58                           ` Martin Schwidefsky
@ 2006-05-03 15:22                             ` Michael Holzheu
  0 siblings, 0 replies; 62+ messages in thread
From: Michael Holzheu @ 2006-05-03 15:22 UTC (permalink / raw)
  To: mschwid2
  Cc: akpm, Greg KH, ioe-lkml, Jörn Engel, linux-kernel,
	Kyle Moffett, mschwid2, Pekka J Enberg

mschwid2@de.ltcfwd.linux.ibm.com wrote on 05/03/2006 04:58:06 PM:
> On Wed, 2006-05-03 at 16:23 +0200, Michael Holzheu wrote:
> > mschwid2@de.ltcfwd.linux.ibm.com wrote on 05/03/2006 04:17:40 PM:
> > > And the user space then uses the parser only? Is now the parser
> > > interface the "ABI" or the kernel interface that is in turn used by
the
> > > parser? And what happens if somebody comes up with a "better" parser
> > > that does things subtly different?
> >
> > The ABI is not defined by the Parser. You have to specify the
> > tag language, which is part of the ABI. Any parser, which is comliant
> > to the specification of the tag language can be used.
>
> Optimist.

One very last comment: I think for our problem to
ensure consitency of hypervisor data, when an application
always wants to get the complete set of information,
the "one file" solution with a fully specified ASCII
tag language format looks for me to be the easiest way
to implement our solution. And I think, if one decides to
use one file to provide all the information, it is better to
have a standard data format than always invent new
formats. And to use a standard ASCII format is
in my eyes also better than to have a binary interface.

If everybody says, that it is in principal not a good idea
to use one file, than a snapshot mechanism
for filesystems is probably go good method to provide
consitency. And this is definitely not my decission...

Michael


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-03 13:18                 ` Michael Holzheu
  2006-05-03 13:22                   ` Jörn Engel
@ 2006-05-03 15:54                   ` Valdis.Kletnieks
  1 sibling, 0 replies; 62+ messages in thread
From: Valdis.Kletnieks @ 2006-05-03 15:54 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Jörn Engel, akpm, Greg KH, ioe-lkml, linux-kernel,
	Kyle Moffett, mschwid2, Pekka J Enberg

[-- Attachment #1: Type: text/plain, Size: 506 bytes --]

On Wed, 03 May 2006 15:18:41 +0200, Michael Holzheu said:
> Of course! But the convention must be, that If userspace wants to
> access the data, it has to use our standard linux
> parser. If it accesses the data directly, this is broken.

Yet another case of Eternal Optimism flying in the face of Reality... ;)

a) you can't *force* the use of your parser.
b) this creates a userspace dependency that can get messy if the parser is buggy
or requires modification to deal with a kernel change.


[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-03  8:48       ` Michael Holzheu
@ 2006-05-03 22:10         ` Greg KH
  2006-05-04 10:22           ` Michael Holzheu
  0 siblings, 1 reply; 62+ messages in thread
From: Greg KH @ 2006-05-03 22:10 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Andrew Morton, ioe-lkml, joern, linux-kernel, mschwid2, penberg

On Wed, May 03, 2006 at 10:48:19AM +0200, Michael Holzheu wrote:
> Hi Andrew,
> 
> Andrew Morton <akpm@osdl.org> wrote on 04/29/2006 10:14:23 AM:
> 
> [snip]
> 
> > >  > Also, "/sys/hypervisor" is probably insufficiently specific.  In a
> few
> > >  > years time people will be asking "Which hypervisor?  We have
> > eighteen of them!".
> > >
> > >  I agree, the xen people are already clammering for some kind of sysfs
> > >  tree and wanted to create /sys/hypervisor/xen.  How about
> > >  /sys/hypervisor/s390?
> 
> Fine with me! Then I will create /sys/hypervisor/s390. Should I
> create /sys/hypervisor in the hpyfs code or should it be
> created somewhere else?

Somewhere else is probably best.

drivers/base/hypervisor.c ?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-03 22:10         ` Greg KH
@ 2006-05-04 10:22           ` Michael Holzheu
  2006-05-04 14:42             ` Greg KH
  0 siblings, 1 reply; 62+ messages in thread
From: Michael Holzheu @ 2006-05-04 10:22 UTC (permalink / raw)
  To: Greg KH; +Cc: Andrew Morton, ioe-lkml, joern, linux-kernel, mschwid2, penberg

Greg KH <greg@kroah.com> wrote on 05/04/2006 12:10:37 AM:

> > Fine with me! Then I will create /sys/hypervisor/s390. Should I
> > create /sys/hypervisor in the hpyfs code or should it be
> > created somewhere else?
>
> Somewhere else is probably best.
>
> drivers/base/hypervisor.c ?
>

We could do that, but then we have to create two new files
hypervisor.c and hypervisor.h just for one new mountpoint
in sysfs.

I would suggest do do it like /sys/kernel and put the code
into kernel/ksysfs.c and include/linux/kobject.h

Michael


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-04 10:22           ` Michael Holzheu
@ 2006-05-04 14:42             ` Greg KH
  2006-05-04 15:01               ` Michael Holzheu
  0 siblings, 1 reply; 62+ messages in thread
From: Greg KH @ 2006-05-04 14:42 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Andrew Morton, ioe-lkml, joern, linux-kernel, mschwid2, penberg

On Thu, May 04, 2006 at 12:22:42PM +0200, Michael Holzheu wrote:
> Greg KH <greg@kroah.com> wrote on 05/04/2006 12:10:37 AM:
> 
> > > Fine with me! Then I will create /sys/hypervisor/s390. Should I
> > > create /sys/hypervisor in the hpyfs code or should it be
> > > created somewhere else?
> >
> > Somewhere else is probably best.
> >
> > drivers/base/hypervisor.c ?
> >
> 
> We could do that, but then we have to create two new files
> hypervisor.c and hypervisor.h just for one new mountpoint
> in sysfs.
> 
> I would suggest do do it like /sys/kernel and put the code
> into kernel/ksysfs.c and include/linux/kobject.h

No, if you do that then every kernel gets that mount point, when almost
no one really wants it :)

If you leave it as a separate file, then the build system can just
include the file as needed.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-04 14:42             ` Greg KH
@ 2006-05-04 15:01               ` Michael Holzheu
  2006-05-04 15:34                 ` Greg KH
  0 siblings, 1 reply; 62+ messages in thread
From: Michael Holzheu @ 2006-05-04 15:01 UTC (permalink / raw)
  To: Greg KH; +Cc: Andrew Morton, ioe-lkml, joern, linux-kernel, mschwid2, penberg

Hi Greg,

Greg KH <greg@kroah.com> wrote on 05/04/2006 04:42:59 PM:
> > I would suggest do do it like /sys/kernel and put the code
> > into kernel/ksysfs.c and include/linux/kobject.h
>
> No, if you do that then every kernel gets that mount point, when almost
> no one really wants it :)
>
> If you leave it as a separate file, then the build system can just
> include the file as needed.
>

So you want a new config option CONFIG_HYPERVISOR?

When no one except for us wants it, wouldn't it be best
then to create /sys/hypervisor first in the hypfs code?

If someone else needs it in the future, we still can move
it common code.

Michael


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-04 15:01               ` Michael Holzheu
@ 2006-05-04 15:34                 ` Greg KH
  0 siblings, 0 replies; 62+ messages in thread
From: Greg KH @ 2006-05-04 15:34 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Andrew Morton, ioe-lkml, joern, linux-kernel, mschwid2, penberg

On Thu, May 04, 2006 at 05:01:07PM +0200, Michael Holzheu wrote:
> Hi Greg,
> 
> Greg KH <greg@kroah.com> wrote on 05/04/2006 04:42:59 PM:
> > > I would suggest do do it like /sys/kernel and put the code
> > > into kernel/ksysfs.c and include/linux/kobject.h
> >
> > No, if you do that then every kernel gets that mount point, when almost
> > no one really wants it :)
> >
> > If you leave it as a separate file, then the build system can just
> > include the file as needed.
> >
> 
> So you want a new config option CONFIG_HYPERVISOR?

Sure.  But don't make it a user selectable config option, but rather,
one your S390 option sets.

That way the Xen and other groups can also set it when they need it.

> When no one except for us wants it, wouldn't it be best
> then to create /sys/hypervisor first in the hypfs code?
> 
> If someone else needs it in the future, we still can move
> it common code.

The Xen people need it too.  Now who knows when their code will ever hit
mainline...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-08 12:24 Michael Holzheu
@ 2006-05-09  5:01 ` Greg KH
  0 siblings, 0 replies; 62+ messages in thread
From: Greg KH @ 2006-05-09  5:01 UTC (permalink / raw)
  To: Michael Holzheu; +Cc: akpm, ioe-lkml, joern, linux-kernel, mschwid2, penberg

On Mon, May 08, 2006 at 02:24:16PM +0200, Michael Holzheu wrote:
> Hi Greg,
> 
> Greg KH <greg@kroah.com> wrote on 05/05/2006 11:14:47 PM:
> > > I added a invisible config option CONFIG_SYS_HYPERVISOR. If this
> > > option is enabled, /sys/hypervisor is created. CONFIG_S390_HYPFS
> > > enables this option automatically using "select".
> > > 
> > > This the following patch acceptable for you?
> 
> Ok, I modified the patch according to your input. So
> hopefully that's the final one... ok?

It is so close...

> diff -urpN linux-2.6.16/drivers/base/base.h linux-2.6.16-hypervisor/drivers/base/base.h
> --- linux-2.6.16/drivers/base/base.h	2006-03-20 06:53:29.000000000 +0100
> +++ linux-2.6.16-hypervisor/drivers/base/base.h	2006-05-08 14:13:18.000000000 +0200
> @@ -5,6 +5,11 @@ extern int devices_init(void);
>  extern int buses_init(void);
>  extern int classes_init(void);
>  extern int firmware_init(void);
> +#ifdef CONFIG_SYS_HYPERVISOR
> +extern int hypervisor_init(void);
> +#else
> +#define hypervisor_init() do { } while(0)
> +#endif

Hm, that second case there doesn't return anything incase we actually
try to check the return value of hypervisor_init()...

Other than that, looks good.  Care to fix this and send it again with a
real changelog and Signed-off-by: line, and I'll be glad to add it to my
tree.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
@ 2006-05-08 12:24 Michael Holzheu
  2006-05-09  5:01 ` Greg KH
  0 siblings, 1 reply; 62+ messages in thread
From: Michael Holzheu @ 2006-05-08 12:24 UTC (permalink / raw)
  To: greg; +Cc: akpm, ioe-lkml, joern, linux-kernel, mschwid2, penberg

Hi Greg,

Greg KH <greg@kroah.com> wrote on 05/05/2006 11:14:47 PM:
> > I added a invisible config option CONFIG_SYS_HYPERVISOR. If this
> > option is enabled, /sys/hypervisor is created. CONFIG_S390_HYPFS
> > enables this option automatically using "select".
> > 
> > This the following patch acceptable for you?

Ok, I modified the patch according to your input. So
hopefully that's the final one... ok?

---

diff -urpN linux-2.6.16/drivers/base/Kconfig linux-2.6.16-hypervisor/drivers/base/Kconfig
--- linux-2.6.16/drivers/base/Kconfig	2006-03-20 06:53:29.000000000 +0100
+++ linux-2.6.16-hypervisor/drivers/base/Kconfig	2006-05-08 13:22:52.000000000 +0200
@@ -38,3 +38,7 @@ config DEBUG_DRIVER
 	  If you are unsure about this, say N here.
 
 endmenu
+
+config SYS_HYPERVISOR
+	bool
+	default n
diff -urpN linux-2.6.16/drivers/base/Makefile linux-2.6.16-hypervisor/drivers/base/Makefile
--- linux-2.6.16/drivers/base/Makefile	2006-03-20 06:53:29.000000000 +0100
+++ linux-2.6.16-hypervisor/drivers/base/Makefile	2006-05-08 13:22:35.000000000 +0200
@@ -9,6 +9,7 @@ obj-$(CONFIG_FW_LOADER)	+= firmware_clas
 obj-$(CONFIG_NUMA)	+= node.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory.o
 obj-$(CONFIG_SMP)	+= topology.o
+obj-$(CONFIG_SYS_HYPERVISOR) += hypervisor.o
 
 ifeq ($(CONFIG_DEBUG_DRIVER),y)
 EXTRA_CFLAGS += -DDEBUG
diff -urpN linux-2.6.16/drivers/base/base.h linux-2.6.16-hypervisor/drivers/base/base.h
--- linux-2.6.16/drivers/base/base.h	2006-03-20 06:53:29.000000000 +0100
+++ linux-2.6.16-hypervisor/drivers/base/base.h	2006-05-08 14:13:18.000000000 +0200
@@ -5,6 +5,11 @@ extern int devices_init(void);
 extern int buses_init(void);
 extern int classes_init(void);
 extern int firmware_init(void);
+#ifdef CONFIG_SYS_HYPERVISOR
+extern int hypervisor_init(void);
+#else
+#define hypervisor_init() do { } while(0)
+#endif
 extern int platform_bus_init(void);
 extern int system_bus_init(void);
 extern int cpu_dev_init(void);
diff -urpN linux-2.6.16/drivers/base/hypervisor.c linux-2.6.16-hypervisor/drivers/base/hypervisor.c
--- linux-2.6.16/drivers/base/hypervisor.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.16-hypervisor/drivers/base/hypervisor.c	2006-05-08 13:22:45.000000000 +0200
@@ -0,0 +1,19 @@
+/*
+ * hypervisor.c - /sys/hypervisor subsystem.
+ *
+ * This file is released under the GPLv2
+ *
+ */
+
+#include <linux/kobject.h>
+#include <linux/device.h>
+
+#include "base.h"
+
+decl_subsys(hypervisor, NULL, NULL);
+EXPORT_SYMBOL_GPL(hypervisor_subsys);
+
+int __init hypervisor_init(void)
+{
+	return subsystem_register(&hypervisor_subsys);
+}
diff -urpN linux-2.6.16/drivers/base/init.c linux-2.6.16-hypervisor/drivers/base/init.c
--- linux-2.6.16/drivers/base/init.c	2006-03-20 06:53:29.000000000 +0100
+++ linux-2.6.16-hypervisor/drivers/base/init.c	2006-05-08 13:22:39.000000000 +0200
@@ -27,6 +27,7 @@ void __init driver_init(void)
 	buses_init();
 	classes_init();
 	firmware_init();
+	hypervisor_init();
 
 	/* These are also core pieces, but must come after the
 	 * core core pieces.
diff -urpN linux-2.6.16/include/linux/kobject.h linux-2.6.16-hypervisor/include/linux/kobject.h
--- linux-2.6.16/include/linux/kobject.h	2006-03-20 06:53:29.000000000 +0100
+++ linux-2.6.16-hypervisor/include/linux/kobject.h	2006-05-08 13:22:21.000000000 +0200
@@ -186,6 +186,8 @@ struct subsystem _varname##_subsys = { \
 
 /* The global /sys/kernel/ subsystem for people to chain off of */
 extern struct subsystem kernel_subsys;
+/* The global /sys/hypervisor/ subsystem  */
+extern struct subsystem hypervisor_subsys;
 
 /**
  * Helpers for setting the kset of registered objects.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-05-05 13:22 Michael Holzheu
@ 2006-05-05 21:14 ` Greg KH
  0 siblings, 0 replies; 62+ messages in thread
From: Greg KH @ 2006-05-05 21:14 UTC (permalink / raw)
  To: Michael Holzheu; +Cc: akpm, ioe-lkml, joern, linux-kernel, mschwid2, penberg

On Fri, May 05, 2006 at 03:22:49PM +0200, Michael Holzheu wrote:
> Greg KH <greg@kroah.com> wrote on 05/04/2006 05:34:11 PM:
> > > So you want a new config option CONFIG_HYPERVISOR?
> > 
> > Sure.  But don't make it a user selectable config option, but rather,
> > one your S390 option sets.
> > 
> > That way the Xen and other groups can also set it when they need it.
> > 
> 
> [snip]
> > 
> > The Xen people need it too.  Now who knows when their code will ever hit
> > mainline...
> 
> I added a invisible config option CONFIG_SYS_HYPERVISOR. If this
> option is enabled, /sys/hypervisor is created. CONFIG_S390_HYPFS
> enables this option automatically using "select".
> 
> This the following patch acceptable for you?
> 
> ---
> 
>  drivers/base/Kconfig      |    4 ++++
>  drivers/base/Makefile     |    2 +-
>  drivers/base/hypervisor.c |   30 ++++++++++++++++++++++++++++++
>  drivers/base/init.c       |    1 +
>  include/linux/kobject.h   |    4 ++++
>  5 files changed, 40 insertions(+), 1 deletion(-)
> 
> diff -urpN linux-2.6.16/drivers/base/Kconfig linux-2.6.16-hypervisor/drivers/base/Kconfig
> --- linux-2.6.16/drivers/base/Kconfig	2006-03-20 06:53:29.000000000 +0100
> +++ linux-2.6.16-hypervisor/drivers/base/Kconfig	2006-05-05 15:13:10.000000000 +0200
> @@ -38,3 +38,7 @@ config DEBUG_DRIVER
>  	  If you are unsure about this, say N here.
>  
>  endmenu
> +
> +config SYS_HYPERVISOR
> +	bool
> +	default n
> diff -urpN linux-2.6.16/drivers/base/Makefile linux-2.6.16-hypervisor/drivers/base/Makefile
> --- linux-2.6.16/drivers/base/Makefile	2006-03-20 06:53:29.000000000 +0100
> +++ linux-2.6.16-hypervisor/drivers/base/Makefile	2006-05-05 15:12:48.000000000 +0200
> @@ -3,7 +3,7 @@
>  obj-y			:= core.o sys.o bus.o dd.o \
>  			   driver.o class.o platform.o \
>  			   cpu.o firmware.o init.o map.o dmapool.o \
> -			   attribute_container.o transport_class.o
> +			   attribute_container.o transport_class.o hypervisor.o
>  obj-y			+= power/
>  obj-$(CONFIG_FW_LOADER)	+= firmware_class.o
>  obj-$(CONFIG_NUMA)	+= node.o

This should be:
  obj-$(CONFIG_HYPERVISOR) += hypervisor.o
don't always load it in.

> diff -urpN linux-2.6.16/drivers/base/hypervisor.c linux-2.6.16-hypervisor/drivers/base/hypervisor.c
> --- linux-2.6.16/drivers/base/hypervisor.c	1970-01-01 01:00:00.000000000 +0100
> +++ linux-2.6.16-hypervisor/drivers/base/hypervisor.c	2006-05-05 15:12:57.000000000 +0200
> @@ -0,0 +1,30 @@
> +/*
> + * hypervisor.c - /sys/hypervisor subsystem.
> + *
> + * This file is released under the GPLv2
> + *
> + */
> +
> +#include <linux/kobject.h>
> +#include <linux/device.h>
> +
> +#include "base.h"
> +
> +#ifdef CONFIG_SYS_HYPERVISOR
> +
> +decl_subsys(hypervisor, NULL, NULL);
> +EXPORT_SYMBOL_GPL(hypervisor_subsys);
> +
> +int __init hypervisor_init(void)
> +{
> +	return subsystem_register(&hypervisor_subsys);
> +}
> +
> +#else
> +
> +int __init hypervisor_init(void)
> +{
> +	return -1;
> +}
> +
> +#endif /* CONFIG_SYS_HYPERVISOR */

No ifdef needed here then.

> diff -urpN linux-2.6.16/drivers/base/init.c linux-2.6.16-hypervisor/drivers/base/init.c
> --- linux-2.6.16/drivers/base/init.c	2006-03-20 06:53:29.000000000 +0100
> +++ linux-2.6.16-hypervisor/drivers/base/init.c	2006-05-05 15:12:40.000000000 +0200
> @@ -27,6 +27,7 @@ void __init driver_init(void)
>  	buses_init();
>  	classes_init();
>  	firmware_init();
> +	hypervisor_init();

But we need the ifdef in a header file to make this compile properly.

>  
>  	/* These are also core pieces, but must come after the
>  	 * core core pieces.
> diff -urpN linux-2.6.16/include/linux/kobject.h linux-2.6.16-hypervisor/include/linux/kobject.h
> --- linux-2.6.16/include/linux/kobject.h	2006-03-20 06:53:29.000000000 +0100
> +++ linux-2.6.16-hypervisor/include/linux/kobject.h	2006-05-05 15:13:35.000000000 +0200
> @@ -186,6 +186,10 @@ struct subsystem _varname##_subsys = { \
>  
>  /* The global /sys/kernel/ subsystem for people to chain off of */
>  extern struct subsystem kernel_subsys;
> +#ifdef CONFIG_SYS_HYPERVISOR
> +/* The global /sys/hypervisor/ subsystem  */
> +extern struct subsystem hypervisor_subsys;
> +#endif /* CONFIG_SYS_HYPERVISOR */

No ifdef here should be needed, a link error will happen if you try to
access it and it's not linked in properly.

So, it's almost there :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
@ 2006-05-05 13:22 Michael Holzheu
  2006-05-05 21:14 ` Greg KH
  0 siblings, 1 reply; 62+ messages in thread
From: Michael Holzheu @ 2006-05-05 13:22 UTC (permalink / raw)
  To: greg; +Cc: akpm, ioe-lkml, joern, linux-kernel, mschwid2, penberg

Greg KH <greg@kroah.com> wrote on 05/04/2006 05:34:11 PM:
> > So you want a new config option CONFIG_HYPERVISOR?
> 
> Sure.  But don't make it a user selectable config option, but rather,
> one your S390 option sets.
> 
> That way the Xen and other groups can also set it when they need it.
> 

[snip]
> 
> The Xen people need it too.  Now who knows when their code will ever hit
> mainline...

I added a invisible config option CONFIG_SYS_HYPERVISOR. If this
option is enabled, /sys/hypervisor is created. CONFIG_S390_HYPFS
enables this option automatically using "select".

This the following patch acceptable for you?

---

 drivers/base/Kconfig      |    4 ++++
 drivers/base/Makefile     |    2 +-
 drivers/base/hypervisor.c |   30 ++++++++++++++++++++++++++++++
 drivers/base/init.c       |    1 +
 include/linux/kobject.h   |    4 ++++
 5 files changed, 40 insertions(+), 1 deletion(-)

diff -urpN linux-2.6.16/drivers/base/Kconfig linux-2.6.16-hypervisor/drivers/base/Kconfig
--- linux-2.6.16/drivers/base/Kconfig	2006-03-20 06:53:29.000000000 +0100
+++ linux-2.6.16-hypervisor/drivers/base/Kconfig	2006-05-05 15:13:10.000000000 +0200
@@ -38,3 +38,7 @@ config DEBUG_DRIVER
 	  If you are unsure about this, say N here.
 
 endmenu
+
+config SYS_HYPERVISOR
+	bool
+	default n
diff -urpN linux-2.6.16/drivers/base/Makefile linux-2.6.16-hypervisor/drivers/base/Makefile
--- linux-2.6.16/drivers/base/Makefile	2006-03-20 06:53:29.000000000 +0100
+++ linux-2.6.16-hypervisor/drivers/base/Makefile	2006-05-05 15:12:48.000000000 +0200
@@ -3,7 +3,7 @@
 obj-y			:= core.o sys.o bus.o dd.o \
 			   driver.o class.o platform.o \
 			   cpu.o firmware.o init.o map.o dmapool.o \
-			   attribute_container.o transport_class.o
+			   attribute_container.o transport_class.o hypervisor.o
 obj-y			+= power/
 obj-$(CONFIG_FW_LOADER)	+= firmware_class.o
 obj-$(CONFIG_NUMA)	+= node.o
diff -urpN linux-2.6.16/drivers/base/hypervisor.c linux-2.6.16-hypervisor/drivers/base/hypervisor.c
--- linux-2.6.16/drivers/base/hypervisor.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.16-hypervisor/drivers/base/hypervisor.c	2006-05-05 15:12:57.000000000 +0200
@@ -0,0 +1,30 @@
+/*
+ * hypervisor.c - /sys/hypervisor subsystem.
+ *
+ * This file is released under the GPLv2
+ *
+ */
+
+#include <linux/kobject.h>
+#include <linux/device.h>
+
+#include "base.h"
+
+#ifdef CONFIG_SYS_HYPERVISOR
+
+decl_subsys(hypervisor, NULL, NULL);
+EXPORT_SYMBOL_GPL(hypervisor_subsys);
+
+int __init hypervisor_init(void)
+{
+	return subsystem_register(&hypervisor_subsys);
+}
+
+#else
+
+int __init hypervisor_init(void)
+{
+	return -1;
+}
+
+#endif /* CONFIG_SYS_HYPERVISOR */
diff -urpN linux-2.6.16/drivers/base/init.c linux-2.6.16-hypervisor/drivers/base/init.c
--- linux-2.6.16/drivers/base/init.c	2006-03-20 06:53:29.000000000 +0100
+++ linux-2.6.16-hypervisor/drivers/base/init.c	2006-05-05 15:12:40.000000000 +0200
@@ -27,6 +27,7 @@ void __init driver_init(void)
 	buses_init();
 	classes_init();
 	firmware_init();
+	hypervisor_init();
 
 	/* These are also core pieces, but must come after the
 	 * core core pieces.
diff -urpN linux-2.6.16/include/linux/kobject.h linux-2.6.16-hypervisor/include/linux/kobject.h
--- linux-2.6.16/include/linux/kobject.h	2006-03-20 06:53:29.000000000 +0100
+++ linux-2.6.16-hypervisor/include/linux/kobject.h	2006-05-05 15:13:35.000000000 +0200
@@ -186,6 +186,10 @@ struct subsystem _varname##_subsys = { \
 
 /* The global /sys/kernel/ subsystem for people to chain off of */
 extern struct subsystem kernel_subsys;
+#ifdef CONFIG_SYS_HYPERVISOR
+/* The global /sys/hypervisor/ subsystem  */
+extern struct subsystem hypervisor_subsys;
+#endif /* CONFIG_SYS_HYPERVISOR */
 
 /**
  * Helpers for setting the kset of registered objects.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-28 17:47 ` Jörn Engel
@ 2006-05-02  7:25   ` Michael Holzheu
  0 siblings, 0 replies; 62+ messages in thread
From: Michael Holzheu @ 2006-05-02  7:25 UTC (permalink / raw)
  To: Jörn Engel; +Cc: akpm, ioe-lkml, linux-kernel, mschwid2, penberg

Jörn Engel <joern@wohnheim.fh-wedel.de> wrote on 04/28/2006 07:47:54 PM:

> On Fri, 28 April 2006 19:37:08 +0200, Michael Holzheu wrote:
> >
> > +static void *diag204_get_buffer(enum diag204_format fmt, int *pages)
> > +{
> > +   void *buf;
> > +
> > +   if (fmt == INFO_SIMPLE)
> > +      *pages = 1;
> > +   else
> > +      *pages = diag204(SUBC_RSI | fmt, 0, 0);
> > +
> > +   if (*pages <= 0)
> > +      return ERR_PTR(-ENOSYS);
>
> Is -ENOSYS the right thing here?  I thought it was for stuff not
> implemented by Linux.  If the hardware or some hypervisor would return
> -ENOSYS, it would be -EIO from Linux' perspective.  But I may be
> wrong.

The only case "diag204(SUBC_RSI | fmt, 0, 0)" can fail is,
if it is not implemented by the hardware. The means
ENOSYS (Function not implemented), doen't it?

Michael


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
  2006-04-28 17:37 Michael Holzheu
@ 2006-04-28 17:47 ` Jörn Engel
  2006-05-02  7:25   ` Michael Holzheu
  0 siblings, 1 reply; 62+ messages in thread
From: Jörn Engel @ 2006-04-28 17:47 UTC (permalink / raw)
  To: Michael Holzheu; +Cc: akpm, ioe-lkml, linux-kernel, mschwid2, penberg

On Fri, 28 April 2006 19:37:08 +0200, Michael Holzheu wrote:
>  
> +static void *diag204_get_buffer(enum diag204_format fmt, int *pages)
> +{
> +	void *buf;
> +
> +	if (fmt == INFO_SIMPLE)
> +		*pages = 1;
> +	else
> +		*pages = diag204(SUBC_RSI | fmt, 0, 0);
> +
> +	if (*pages <= 0)
> +		return ERR_PTR(-ENOSYS);

Is -ENOSYS the right thing here?  I thought it was for stuff not
implemented by Linux.  If the hardware or some hypervisor would return
-ENOSYS, it would be -EIO from Linux' perspective.  But I may be
wrong.

Jörn

-- 
Everything should be made as simple as possible, but not simpler.
-- Albert Einstein

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] s390: Hypervisor File System
@ 2006-04-28 17:37 Michael Holzheu
  2006-04-28 17:47 ` Jörn Engel
  0 siblings, 1 reply; 62+ messages in thread
From: Michael Holzheu @ 2006-04-28 17:37 UTC (permalink / raw)
  To: akpm; +Cc: ioe-lkml, joern, linux-kernel, mschwid2, penberg

Hi Andrew,

Here the patch with the bugfixes for your and Pekka's comments:

- move some globals into the superblock
- remove some inlines
- fix some error handling functions
- implement diag204_get_buffer() helper function
- "hypfs" prefix for global functions
- use uid_t and gid_t
- use kstrdup()
- Cosmetics

Signed-off-by: Michael Holzheu <holzheu@de.ibm.com>

---

 arch/s390/Kconfig            |    2
 arch/s390/hypfs/hypfs_diag.c |  127 +++++++++++++++++++++++++++----------------
 arch/s390/hypfs/hypfs_diag.h |    6 +-
 arch/s390/hypfs/inode.c      |  107 ++++++++++++++++++++----------------
 4 files changed, 145 insertions(+), 97 deletions(-)

diff -urpN linux-2.6.16-hypfs-2006-04-28/arch/s390/Kconfig linux-2.6.16-hypfs-2006-04-28-FIXED/arch/s390/Kconfig
--- linux-2.6.16-hypfs-2006-04-28/arch/s390/Kconfig	2006-04-28 19:16:56.000000000 +0200
+++ linux-2.6.16-hypfs-2006-04-28-FIXED/arch/s390/Kconfig	2006-04-28 19:17:37.000000000 +0200
@@ -447,7 +447,7 @@ config HYPFS_FS
 	default y
 	help
 	  This is a virtual file system intended to provide accounting
-	  information in a s390 hypervisor environment.
+	  information in an s390 hypervisor environment.
 
 config KEXEC
 	bool "kexec system call (EXPERIMENTAL)"
diff -urpN linux-2.6.16-hypfs-2006-04-28/arch/s390/hypfs/hypfs_diag.c linux-2.6.16-hypfs-2006-04-28-FIXED/arch/s390/hypfs/hypfs_diag.c
--- linux-2.6.16-hypfs-2006-04-28/arch/s390/hypfs/hypfs_diag.c	2006-04-28 19:16:56.000000000 +0200
+++ linux-2.6.16-hypfs-2006-04-28-FIXED/arch/s390/hypfs/hypfs_diag.c	2006-04-28 19:17:51.000000000 +0200
@@ -322,7 +322,7 @@ static inline __u64 phys_cpu__ctidx(enum
 
 /* Diagnose 204 functions */
 
-static inline int diag204(unsigned long subcode, unsigned long size, void *addr)
+static int diag204(unsigned long subcode, unsigned long size, void *addr)
 {
 	register unsigned long _subcode asm("0") = subcode;
 	register unsigned long _size asm("1") = size;
@@ -345,6 +345,25 @@ static inline int diag204(unsigned long 
 		return _size;
 }
 
+static void *diag204_get_buffer(enum diag204_format fmt, int *pages)
+{
+	void *buf;
+
+	if (fmt == INFO_SIMPLE)
+		*pages = 1;
+	else
+		*pages = diag204(SUBC_RSI | fmt, 0, 0);
+
+	if (*pages <= 0)
+		return ERR_PTR(-ENOSYS);
+
+	buf = kmalloc(*pages * PAGE_SIZE, GFP_KERNEL);
+	if (!buf)
+		return ERR_PTR(-ENOMEM);
+
+	return buf;
+}
+
 /* 
  * diag204_probe() has to find out, which type of diagnose 204 implementation
  * we have on our machine. Currently there are three possible scanarios:
@@ -364,13 +383,14 @@ static int diag204_probe(void)
 	void *buf;
 	int pages, rc;
 
-	pages = diag204(SUBC_RSI | INFO_EXT, 0, 0);
-	if (pages > 0) {
-		buf = kmalloc(pages * PAGE_SIZE, GFP_KERNEL);
-		if (!buf) {
-			rc = -ENOMEM;
-			goto err_out;
-		}
+	/*
+	 * We first try to get the number of pages needed to store the extended
+	 * data format. If this already fails, our machine does not support
+	 * subcode 6 and 7. We then continue checking for subcode 4.
+	 */
+
+	buf = diag204_get_buffer(INFO_EXT, &pages);
+	if (!IS_ERR(buf)) {
 		if (diag204(SUBC_STIB7 | INFO_EXT, pages, buf) >= 0) {
 			diag204_store_sc = SUBC_STIB7;
 			diag204_info_type = INFO_EXT;
@@ -382,10 +402,19 @@ static int diag204_probe(void)
 			goto out;
 		}
 		kfree(buf);
+	} else if (PTR_ERR(buf) != -ENOSYS) {
+		/* other error than 'not implemented'. Give up! */
+		return PTR_ERR(buf);
 	}
-	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
-	if (!buf) {
-		rc = -ENOMEM;
+
+	/*
+	 * subcode 6 and 7 did not work. Now try subcode 4 with the simple
+	 * data format.
+	 */
+
+	buf = diag204_get_buffer(INFO_SIMPLE, &pages);
+	if (IS_ERR(buf)) {
+		rc = PTR_ERR(buf);
 		goto err_out;
 	}
 	if (diag204(SUBC_STIB4 | INFO_SIMPLE, pages, buf) >= 0) {
@@ -396,9 +425,9 @@ static int diag204_probe(void)
 		rc = -ENOSYS;
 		goto err_out;
 	}
-      out:
+out:
 	rc = 0;
-      err_out:
+err_out:
 	kfree(buf);
 	return rc;
 }
@@ -408,34 +437,26 @@ static void *diag204_store(void)
 	void *buf;
 	int pages;
 
-	if (diag204_store_sc == SUBC_STIB4)
-		pages = 1;
-	else
-		pages = diag204(SUBC_RSI | diag204_info_type, 0, 0);
-
-	if (pages < 0)
-		return ERR_PTR(-ENOSYS);
-
-	buf = kmalloc(pages * PAGE_SIZE, GFP_KERNEL);
-	if (!buf)
-		return ERR_PTR(-ENOMEM);
-
+	buf = diag204_get_buffer(diag204_info_type, &pages);
+	if (IS_ERR(buf))
+		goto out;
 	if (diag204(diag204_store_sc | diag204_info_type, pages, buf) < 0) {
 		kfree(buf);
 		return ERR_PTR(-ENOSYS);
 	}
+out:
 	return buf;
 }
 
 /* Diagnose 224 functions */
 
-static inline void diag224(void *ptr)
+static void diag224(void *ptr)
 {
 	asm volatile("   diag    %0,%1,0x224\n"
 		     : :"d" (0), "d"(ptr) : "memory");
 }
 
-static inline int diag224_get_name_table(void)
+static int diag224_get_name_table(void)
 {
 	/* memory must be below 2GB */
 	diag224_cpu_names = kmalloc(PAGE_SIZE, GFP_KERNEL | GFP_DMA);
@@ -446,7 +467,7 @@ static inline int diag224_get_name_table
 	return 0;
 }
 
-static inline void diag224_delete_name_table(void)
+static void diag224_delete_name_table(void)
 {
 	kfree(diag224_cpu_names);
 }
@@ -459,7 +480,7 @@ static int diag224_idx2name(int index, c
 	return 0;
 }
 
-int diag_init(void)
+__init int hypfs_diag_init(void)
 {
 	int rc;
 
@@ -475,7 +496,7 @@ int diag_init(void)
 	return rc;
 }
 
-void diag_exit(void)
+__exit void hypfs_diag_exit(void)
 {
 	diag224_delete_name_table();
 }
@@ -590,39 +611,51 @@ static void *hypfs_create_phys_files(str
 	return cpu_info;
 }
 
-int diag_create_files(struct super_block *sb, struct dentry *root)
+int hypfs_diag_create_files(struct super_block *sb, struct dentry *root)
 {
 	struct dentry *systems_dir, *hyp_dir;
 	void *time_hdr, *part_hdr;
-	int i;
-	void *buffer, *rc;
+	int i, rc;
+	void *buffer, *ptr;
 
 	buffer = diag204_store();
 	if (IS_ERR(buffer))
 		return PTR_ERR(buffer);
 
 	systems_dir = hypfs_mkdir(sb, root, "systems");
-	if (IS_ERR(systems_dir))
-		return PTR_ERR(systems_dir);
+	if (IS_ERR(systems_dir)) {
+		rc = PTR_ERR(systems_dir);
+		goto err_out;
+	}
 	time_hdr = (struct x_info_blk_hdr *)buffer;
 	part_hdr = time_hdr + info_blk_hdr__size(diag204_info_type);
 	for (i = 0; i < info_blk_hdr__npar(diag204_info_type, time_hdr); i++) {
 		part_hdr = hypfs_create_lpar_files(sb, systems_dir, part_hdr);
-		if (IS_ERR(part_hdr))
-			return PTR_ERR(part_hdr);
+		if (IS_ERR(part_hdr)) {
+			rc = PTR_ERR(part_hdr);
+			goto err_out;
+		}
 	}
 	if (info_blk_hdr__flags(diag204_info_type, time_hdr) & LPAR_PHYS_FLG) {
-		void *rc;
-		rc = hypfs_create_phys_files(sb, root, part_hdr);
-		if (IS_ERR(rc))
-			return PTR_ERR(rc);
+		ptr = hypfs_create_phys_files(sb, root, part_hdr);
+		if (IS_ERR(ptr)) {
+			rc = PTR_ERR(ptr);
+			goto err_out;
+		}
 	}
 	hyp_dir = hypfs_mkdir(sb, root, "hyp");
-	if (IS_ERR(hyp_dir))
-		return PTR_ERR(hyp_dir);
-	rc = hypfs_create_str(sb, hyp_dir, "type", "LPAR Hypervisor");
-	if (IS_ERR(rc))
-		return PTR_ERR(rc);
+	if (IS_ERR(hyp_dir)) {
+		rc = PTR_ERR(hyp_dir);
+		goto err_out;
+	}
+	ptr = hypfs_create_str(sb, hyp_dir, "type", "LPAR Hypervisor");
+	if (IS_ERR(ptr)) {
+		rc = PTR_ERR(ptr);
+		goto err_out;
+	}
+	rc = 0;
+
+err_out:
 	kfree(buffer);
-	return 0;
+	return rc;
 }
diff -urpN linux-2.6.16-hypfs-2006-04-28/arch/s390/hypfs/hypfs_diag.h linux-2.6.16-hypfs-2006-04-28-FIXED/arch/s390/hypfs/hypfs_diag.h
--- linux-2.6.16-hypfs-2006-04-28/arch/s390/hypfs/hypfs_diag.h	2006-04-28 19:16:56.000000000 +0200
+++ linux-2.6.16-hypfs-2006-04-28-FIXED/arch/s390/hypfs/hypfs_diag.h	2006-04-28 19:17:51.000000000 +0200
@@ -9,8 +9,8 @@
 #ifndef _HYPFS_DIAG_H_
 #define _HYPFS_DIAG_H_
 
-extern int diag_init(void);
-extern void diag_exit(void);
-extern int diag_create_files(struct super_block *sb, struct dentry *root);
+extern int hypfs_diag_init(void);
+extern void hypfs_diag_exit(void);
+extern int hypfs_diag_create_files(struct super_block *sb, struct dentry *root);
 
 #endif /* _HYPFS_DIAG_H_ */
diff -urpN linux-2.6.16-hypfs-2006-04-28/arch/s390/hypfs/inode.c linux-2.6.16-hypfs-2006-04-28-FIXED/arch/s390/hypfs/inode.c
--- linux-2.6.16-hypfs-2006-04-28/arch/s390/hypfs/inode.c	2006-04-28 19:16:56.000000000 +0200
+++ linux-2.6.16-hypfs-2006-04-28-FIXED/arch/s390/hypfs/inode.c	2006-04-28 19:17:51.000000000 +0200
@@ -28,25 +28,26 @@ static struct dentry *hypfs_create_updat
 					       struct dentry *dir);
 
 struct hypfs_sb_info {
-	int uid;		/* uid used for files and dirs */
-	int gid;		/* gid used for files and dirs */
+	uid_t uid;			/* uid used for files and dirs */
+	gid_t gid;			/* gid used for files and dirs */
+	struct dentry *update_file;	/* file to trigger update */
+	time_t last_update;		/* last update time in secs since 1970 */
+	struct mutex lock;		/* lock to protect update process */
 };
 
-static struct dentry *update_file_dentry;
 static struct file_operations hypfs_file_ops;
 static struct file_system_type hypfs_type;
 static struct super_operations hypfs_s_ops;
-static time_t last_update_time = 0;	/* update time in seconds since 1970 */
-static DEFINE_MUTEX(hypfs_lock);
 
 /* start of list of all dentries, which have to be deleted on update */
 static struct dentry *hypfs_last_dentry;
 
-static void hypfs_update_update(void)
+static void hypfs_update_update(struct super_block *sb)
 {
-	struct inode* inode = update_file_dentry->d_inode;
+	struct hypfs_sb_info *sb_info = sb->s_fs_info;
+	struct inode *inode = sb_info->update_file->d_inode;
 	
-	last_update_time = get_seconds();
+	sb_info->last_update = get_seconds();
 	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 }
 
@@ -103,7 +104,8 @@ static void hypfs_drop_inode(struct inod
 
 static int hypfs_open(struct inode *inode, struct file *filp)
 {
-	char *data = (char *)filp->f_dentry->d_inode->u.generic_ip;
+	char *data = filp->f_dentry->d_inode->u.generic_ip;
+	struct hypfs_sb_info *fs_info;
 
 	if (filp->f_mode & FMODE_WRITE) {
 		if (!(inode->i_mode & S_IWUGO))
@@ -113,14 +115,17 @@ static int hypfs_open(struct inode *inod
 		if (!(inode->i_mode & S_IRUGO))
 			return -EACCES;
 	}
-	mutex_lock(&hypfs_lock);
-	filp->private_data = kmalloc(strlen(data) + 1, GFP_KERNEL);
-	if (!filp->private_data) {
-		mutex_unlock(&hypfs_lock);
-		return -ENOMEM;
+
+	fs_info = inode->i_sb->s_fs_info;
+	if(data) {
+		mutex_lock(&fs_info->lock);
+		filp->private_data = kstrdup(data, GFP_KERNEL);
+		if (!filp->private_data) {
+			mutex_unlock(&fs_info->lock);
+			return -ENOMEM;
+		}
+		mutex_unlock(&fs_info->lock);
 	}
-	strcpy(filp->private_data, data);
-	mutex_unlock(&hypfs_lock);
 	return 0;
 }
 
@@ -131,7 +136,7 @@ static ssize_t hypfs_aio_read(struct kio
 	size_t len;
 	struct file *filp = iocb->ki_filp;
 
-	data = (char *)filp->private_data;
+	data = filp->private_data;
 	len = strlen(data);
 	if (offset > len) {
 		count = 0;
@@ -145,7 +150,7 @@ static ssize_t hypfs_aio_read(struct kio
 	}
 	iocb->ki_pos += count;
 	file_accessed(filp);
-      out:
+out:
 	return count;
 }
 static ssize_t hypfs_aio_write(struct kiocb *iocb, const char __user *buf,
@@ -153,24 +158,36 @@ static ssize_t hypfs_aio_write(struct ki
 {
 	int rc;
 	struct super_block *sb;
+	struct hypfs_sb_info *fs_info;
 
-	mutex_lock(&hypfs_lock);
 	sb = iocb->ki_filp->f_dentry->d_inode->i_sb;
-	if (last_update_time == get_seconds()) {
+	fs_info = sb->s_fs_info;
+	/*
+	 * Currently we only allow one update per second for two reasons:
+	 * 1. diag 204 is VERY expensive
+	 * 2. If several processes do updates in parallel and then read the
+	 *    hypfs data, the likelihood of collisions is reduced, if we restrict
+	 *    the minimum update interval. A collision occurs, if during the
+	 *    data gathering of one process another process triggers an update
+	 *    If the first process wants to ensure consistent data, it has
+	 *    to restart data collection in this case.
+	 */
+	mutex_lock(&fs_info->lock);
+	if (fs_info->last_update == get_seconds()) {
 		rc = -EBUSY;
 		goto out;
 	}
 	hypfs_delete_tree(sb->s_root);
-	rc = diag_create_files(sb, sb->s_root);
+	rc = hypfs_diag_create_files(sb, sb->s_root);
 	if (rc) {
 		printk(KERN_ERR "hypfs: Update failed\n");
 		hypfs_delete_tree(sb->s_root);
 		goto out;
 	}
-	hypfs_update_update();
+	hypfs_update_update(sb);
 	rc = count;
-      out:
-	mutex_unlock(&hypfs_lock);
+out:
+	mutex_unlock(&fs_info->lock);
 	return rc;
 }
 
@@ -229,9 +246,10 @@ static int hypfs_fill_super(struct super
 	int rc = 0;
 	struct hypfs_sb_info *sbi;
 
-	sbi = kmalloc(sizeof(struct hypfs_sb_info), GFP_KERNEL);
+	sbi = kzalloc(sizeof(struct hypfs_sb_info), GFP_KERNEL);
 	if (!sbi)
 		return -ENOMEM;
+	mutex_init(&sbi->lock);
 	sbi->uid = current->uid;
 	sbi->gid = current->gid;
 	sb->s_fs_info = sbi;
@@ -255,23 +273,23 @@ static int hypfs_fill_super(struct super
 		rc = -ENOMEM;
 		goto err_inode;
 	}
-	rc = diag_create_files(sb, sb->s_root);
+	rc = hypfs_diag_create_files(sb, sb->s_root);
 	if (rc)
 		goto err_tree;
-	update_file_dentry = hypfs_create_update_file(sb, sb->s_root);
-	if (IS_ERR(update_file_dentry)) {
-		rc = PTR_ERR(update_file_dentry);
+	sbi->update_file = hypfs_create_update_file(sb, sb->s_root);
+	if (IS_ERR(sbi->update_file)) {
+		rc = PTR_ERR(sbi->update_file);
 		goto err_tree;
 	}
-	hypfs_update_update();
+	hypfs_update_update(sb);
 	return 0;
 
-      err_tree:
+err_tree:
 	hypfs_delete_tree(sb->s_root);
 	dput(sb->s_root);
-      err_inode:
+err_inode:
 	iput(root_inode);
-      err_alloc:
+err_alloc:
 	kfree(sbi);
 	return rc;
 }
@@ -309,7 +327,7 @@ static struct dentry *hypfs_create_file(
 	}
 	if (mode & S_IFREG) {
 		inode->i_fop = &hypfs_file_ops;
-		if(data)
+		if (data)
 			inode->i_size = strlen(data);
 		else
 			inode->i_size = 0;
@@ -344,8 +362,6 @@ static struct dentry *hypfs_create_updat
 
 	dentry = hypfs_create_file(sb, dir, "update", NULL,
 				   S_IFREG | UPDATE_FILE_MODE);
-	if (!dentry)
-		return ERR_PTR(-ENOMEM);
 	/*
 	 * We do not put the update file on the 'delete' list with
 	 * hypfs_add_dentry(), since it should not be removed when the tree
@@ -362,13 +378,12 @@ struct dentry *hypfs_create_u64(struct s
 	struct dentry *dentry;
 
 	snprintf(tmp, TMP_SIZE, "%lld\n", (unsigned long long int)value);
-	buffer = kmalloc(strlen(tmp) + 1, GFP_KERNEL);
+	buffer = kstrdup(tmp, GFP_KERNEL);
 	if (!buffer)
 		return ERR_PTR(-ENOMEM);
-	strcpy(buffer, tmp);
 	dentry =
 	    hypfs_create_file(sb, dir, name, buffer, S_IFREG | REG_FILE_MODE);
-	if (!dentry) {
+	if (IS_ERR(dentry)) {
 		kfree(buffer);
 		return ERR_PTR(-ENOMEM);
 	}
@@ -388,7 +403,7 @@ struct dentry *hypfs_create_str(struct s
 	sprintf(buffer, "%s\n", string);
 	dentry =
 	    hypfs_create_file(sb, dir, name, buffer, S_IFREG | REG_FILE_MODE);
-	if (!dentry) {
+	if (IS_ERR(dentry)) {
 		kfree(buffer);
 		return ERR_PTR(-ENOMEM);
 	}
@@ -426,7 +441,7 @@ static int __init hypfs_init(void)
 
 	if (MACHINE_IS_VM)
 		return -ENODATA;
-	if (diag_init()) {
+	if (hypfs_diag_init()) {
 		rc = -ENODATA;
 		goto err_msg;
 	}
@@ -438,18 +453,18 @@ static int __init hypfs_init(void)
 		goto err_sysfs;
 	return 0;
 
-      err_sysfs:
+err_sysfs:
 	subsystem_unregister(&hypervisor_subsys);
-      err_diag:
-	diag_exit();
-      err_msg:
+err_diag:
+	hypfs_diag_exit();
+err_msg:
 	printk(KERN_ERR "hypfs: Initialization failed with rc = %i.\n", rc);
 	return rc;
 }
 
 static void __exit hypfs_exit(void)
 {
-	diag_exit();
+	hypfs_diag_exit();
 	unregister_filesystem(&hypfs_type);
 	subsystem_unregister(&hypervisor_subsys);
 }


^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2006-05-09  5:03 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-04-28  9:22 [PATCH] s390: Hypervisor File System Michael Holzheu
2006-04-28  9:43 ` Jörn Engel
2006-04-28 11:53   ` Michael Holzheu
2006-04-28 15:48     ` Jörn Engel
2006-04-28  9:56 ` Andrew Morton
2006-04-28 17:36   ` Michael Holzheu
2006-04-28 17:43     ` Jörn Engel
2006-05-02  8:06       ` Michael Holzheu
2006-04-28 19:44     ` Andrew Morton
2006-04-28 10:36 ` Pekka J Enberg
2006-04-28 13:14   ` Michael Holzheu
2006-04-29  6:44 ` Andrew Morton
2006-04-29  7:51   ` Greg KH
2006-04-29  8:14     ` Andrew Morton
2006-05-03  8:48       ` Michael Holzheu
2006-05-03 22:10         ` Greg KH
2006-05-04 10:22           ` Michael Holzheu
2006-05-04 14:42             ` Greg KH
2006-05-04 15:01               ` Michael Holzheu
2006-05-04 15:34                 ` Greg KH
2006-04-29  7:53 ` Greg KH
2006-04-29  8:41   ` Kyle Moffett
2006-04-29 21:55     ` Greg KH
2006-04-30  5:18       ` Kyle Moffett
2006-05-01 20:38         ` Greg KH
2006-05-01 23:29           ` Kyle Moffett
2006-05-02  4:00             ` Greg KH
2006-05-02  5:23               ` Kay Sievers
2006-05-02  5:37                 ` Greg KH
2006-05-02 11:46                   ` Kay Sievers
2006-05-02 21:28                     ` Greg KH
2006-05-02 21:33                       ` Kay Sievers
2006-05-02 21:54                         ` Greg KH
2006-05-02  8:48               ` Kyle Moffett
2006-05-02 21:30                 ` Greg KH
2006-05-02 21:49                   ` Kay Sievers
2006-05-02 23:18                     ` Kyle Moffett
2006-05-03  9:33     ` Michael Holzheu
2006-05-03  9:42       ` Pekka J Enberg
2006-05-03 12:11         ` Michael Holzheu
2006-05-03 12:33           ` Jörn Engel
2006-05-03 12:51             ` Michael Holzheu
2006-05-03 13:00               ` Jörn Engel
2006-05-03 13:18                 ` Michael Holzheu
2006-05-03 13:22                   ` Jörn Engel
2006-05-03 13:38                     ` Michael Holzheu
2006-05-03 14:17                       ` Martin Schwidefsky
2006-05-03 14:23                         ` Michael Holzheu
2006-05-03 14:58                           ` Martin Schwidefsky
2006-05-03 15:22                             ` Michael Holzheu
2006-05-03 15:54                   ` Valdis.Kletnieks
2006-05-03 10:01       ` Jörn Engel
2006-05-02 10:12   ` Michael Holzheu
2006-05-02 13:00   ` Michael Holzheu
2006-05-03  8:45   ` Michael Holzheu
2006-04-28 17:37 Michael Holzheu
2006-04-28 17:47 ` Jörn Engel
2006-05-02  7:25   ` Michael Holzheu
2006-05-05 13:22 Michael Holzheu
2006-05-05 21:14 ` Greg KH
2006-05-08 12:24 Michael Holzheu
2006-05-09  5:01 ` Greg KH

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.