* [PATCHSET 0/8 version 3] exofs
@ 2009-02-09 13:07 Boaz Harrosh
  2009-02-09 13:12 ` [PATCH 1/8] exofs: Kbuild, Headers and osd utils Boaz Harrosh
                   ` (7 more replies)
  0 siblings, 8 replies; 36+ messages in thread
From: Boaz Harrosh @ 2009-02-09 13:07 UTC (permalink / raw)
  To: Avishay Traeger, Jeff Garzik, Andrew Morton, linux-fsdevel, open-osd
  Cc: James Bottomley, linux-kernel

exofs is a file system that uses an OSD device as its backing store.

OSD is a new T10 command set that views storage devices not as a large/flat
array of sectors but as a container of objects, each having a length, quota,
time attributes and more. Each object is addressed by a 64-bit ID and is
contained in a partition that is itself addressed by a 64-bit ID. Each object
has attributes attached to it, which are an integral part of the object and provide metadata about
the object. The standard defines some common obligatory attributes, but user
attributes can be added as needed.
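
For readers new to OSD, the addressing model described above can be sketched
in plain user-space C. This is only an illustration under assumptions: the
sketch_* types and functions below are hypothetical and are not the open-osd
API. They only show that an object is named by a (partition, id) pair and
that attributes are looked up by a (page, attribute number) pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an OSD object address; not the open-osd API. */
struct sketch_obj_id {
	uint64_t partition;	/* 64-bit ID of the containing partition */
	uint64_t id;		/* 64-bit ID of the object itself */
};

/* Hypothetical model of one attribute attached to an object. */
struct sketch_attr {
	uint32_t page;		/* attribute page */
	uint32_t attr_id;	/* attribute number within the page */
	uint16_t len;		/* length of the value in bytes */
	const void *val;	/* attribute value */
};

/* An object is identified by the (partition, id) pair, not by id alone. */
static int sketch_obj_id_equal(const struct sketch_obj_id *a,
			       const struct sketch_obj_id *b)
{
	return a->partition == b->partition && a->id == b->id;
}

/* Linear search for an attribute by (page, attr_id); NULL if absent. */
static const struct sketch_attr *
sketch_find_attr(const struct sketch_attr *attrs, size_t count,
		 uint32_t page, uint32_t attr_id)
{
	assert(attrs != NULL || count == 0);
	for (size_t i = 0; i < count; i++)
		if (attrs[i].page == page && attrs[i].attr_id == attr_id)
			return &attrs[i];
	return NULL;
}
```

Two objects with the same object ID in different partitions compare as
different objects in this model.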

What's new since the last iteration:
Lots and lots of changes; the actual diff is big.

- mkexofs is removed from Kernel code
    Open-osd has a user-mode library now. It is the exact same API (and source) as
    the Kernel library. Commands are issued through (patched) bsg.ko. If anyone
    is interested, see the URLs below; patches will be sent to the open-osd mailing list.
    Hooray.

- Incorporate changes made to ext2 since exofs was first derived.
    Thanks to Andrew Morton, I've referred to the git-log of fs/ext2 and
    have incorporated all relevant changes into exofs. Lots of bugs
    and shortcomings were fixed this way.

- Made everything LE safe.
    All on-disk types now use proper endian-annotated types and are properly
    converted back and forth. exofs is an LE filesystem, just like ext2.

- Completely got rid of IBM API
    The previous version used the IBM API for OSD commands, and that API was implemented
    over open-osd in osd.c. The open-osd API is now used directly, and osd.c is reduced to
    a few exofs internal helpers.

- A sleep-forever bug fix, related to async object creation.

- Cleaned up some usage of types, naming conventions, code organization.
    Lots of cleanups, and reorganization.
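
To illustrate the "LE safe" item above: the point is that on-disk fields are
always stored least-significant byte first, whatever the host byte order. In
the Kernel this is done with helpers such as cpu_to_le64() and le64_to_cpu()
on __le64 fields. The sketch below is a user-space stand-in under
assumptions; the sketch_* helper names are hypothetical and appear nowhere in
exofs.

```c
#include <stddef.h>
#include <stdint.h>

/* Store a 64-bit value least-significant byte first (little-endian). */
static void sketch_put_le64(uint8_t *p, uint64_t v)
{
	for (size_t i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}

/* Load a 64-bit little-endian value, independent of host byte order. */
static uint64_t sketch_get_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (size_t i = 0; i < 8; i++)
		v |= (uint64_t)p[i] << (8 * i);
	return v;
}

/* Same idea for a 16-bit field such as the superblock magic. */
static void sketch_put_le16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)(v >> 8);
}
```

Because the helpers work byte by byte, the stored image is identical on
big-endian and little-endian hosts.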

And the regular information for newcomers:

Our intention with exofs is to make it exportable by the Linux
pNFS server, as a reference implementation of a pNFS-object-layout
server. A pNFS-objects client implementation is also in the works.
(See all about pNFS in Linux at:
http://wiki.linux-nfs.org/wiki/index.php/PNFS_prototype_design)

exofs was originally developed by Avishay Traeger <avishay@gmail.com>
from IBM. A very old version of it is hosted on sourceforge as the osdfs
project. The original code was based on ext2 from the 2.6.10 Kernel and ran
over IBM's old osd-initiator Linux driver.
Since then it has been picked up by us, open-osd, forward ported to the
current Kernel, and converted to run over the open-osd Kernel Library.

I have mechanically divided the code into parts, each introducing a
group of vfs function vectors, all tied together at the end into a full filesystem.
Each patch can be compiled, but the filesystem will only run once the last patch is applied.
This was done in the hope of making review easier.

Here is the list of patches
[PATCH 1/8] exofs: Kbuild, Headers and osd utils
[PATCH 2/8] exofs: file and file_inode operations
[PATCH 3/8] exofs: symlink_inode and fast_symlink_inode operations
[PATCH 4/8] exofs: address_space_operations
[PATCH 5/8] exofs: dir_inode and directory operations
[PATCH 6/8] exofs: super_operations and file_system_type
[PATCH 7/8] exofs: Documentation
[PATCH 8/8] fs: Add exofs to Kernel build

This patchset is also available on:
  git-clone git://git.open-osd.org/linux-open-osd.git linux-next
or on the web at:
  http://git.open-osd.org/gitweb.cgi?p=linux-open-osd.git;a=shortlog;h=refs/heads/linux-next

(Above tree is based on Linus 2.6.29-rc4)

If anyone wants to actually run this code and test it,
then please start reading at:
    http://open-osd.org
You will need to check out the out-of-tree git (below) for the user-mode utilities.
Also, the exofs.txt file in patch 7/8 should help.

If you want to review the user-mode library and supporting plumbing,
patches will be sent to this mailing-list:
http://www.open-osd.org/bin/view/Main/MailingList

And code is on this git-tree:
  git-clone git://git.open-osd.org/open-osd.git
or on the web at:
  http://git.open-osd.org/gitweb.cgi?p=open-osd.git;a=summary

Thank you all for the very useful comments, suggestions, and banging me on the
head about the user-mode API.

Boaz


* [PATCH 1/8] exofs: Kbuild, Headers and osd utils
  2009-02-09 13:07 [PATCHSET 0/8 version 3] exofs Boaz Harrosh
@ 2009-02-09 13:12 ` Boaz Harrosh
  2009-02-16  4:18   ` FUJITA Tomonori
  2009-02-09 13:18 ` [PATCH 2/8] exofs: file and file_inode operations Boaz Harrosh
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 36+ messages in thread
From: Boaz Harrosh @ 2009-02-09 13:12 UTC (permalink / raw)
  To: Avishay Traeger, Jeff Garzik, Andrew Morton, linux-fsdevel, open-osd
  Cc: linux-kernel, James Bottomley

This patch includes osd infrastructure that will be used later by
the file system.

Also the declarations of constants, on disk structures,
and prototypes.

And the Kbuild+Kconfig files needed to build the exofs module.
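
As a reading aid for the constants and on-disk structures in common.h, here is
a small user-space sketch of two rules they encode: "inode# = object ID -
EXOFS_OBJ_OFF", and the rounding done by EXOFS_DIR_REC_LEN. The constant
values are copied from common.h in this patch; the sketch_* names and the
fixed-width stand-in struct are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXOFS_OBJ_OFF	0x10000ULL	/* offset for objects */
#define EXOFS_ROOT_ID	0x10002ULL	/* object ID for root directory */
#define EXOFS_NAME_LEN	255
#define EXOFS_DIR_PAD	4
#define EXOFS_DIR_ROUND	(EXOFS_DIR_PAD - 1)

/* User-space stand-in for struct exofs_dir_entry (fixed-width types). */
struct sketch_dir_entry {
	uint64_t inode_no;		/* inode number           */
	uint16_t rec_len;		/* directory entry length */
	uint8_t  name_len;		/* name length            */
	uint8_t  file_type;		/* file type              */
	char     name[EXOFS_NAME_LEN];	/* file name              */
};

/* inode# = object ID - EXOFS_OBJ_OFF */
static uint64_t sketch_oid_to_ino(uint64_t oid)
{
	assert(oid >= EXOFS_OBJ_OFF);
	return oid - EXOFS_OBJ_OFF;
}

/* The inverse mapping, as used when building an osd_obj_id for an inode. */
static uint64_t sketch_ino_to_oid(uint64_t ino)
{
	return ino + EXOFS_OBJ_OFF;
}

/* Header plus name, rounded up to a multiple of EXOFS_DIR_PAD bytes. */
static size_t sketch_dir_rec_len(size_t name_len)
{
	assert(name_len <= EXOFS_NAME_LEN);
	return (name_len + offsetof(struct sketch_dir_entry, name) +
		EXOFS_DIR_ROUND) & ~(size_t)EXOFS_DIR_ROUND;
}
```

With these values the root directory object (0x10002) maps to inode number 2,
and a directory entry with a one-character name occupies 16 bytes: a 12-byte
header plus the name, rounded up to a multiple of 4.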

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 fs/exofs/Kbuild   |   30 +++++++
 fs/exofs/Kconfig  |   13 +++
 fs/exofs/common.h |  181 +++++++++++++++++++++++++++++++++++++++++
 fs/exofs/exofs.h  |  139 ++++++++++++++++++++++++++++++++
 fs/exofs/osd.c    |  230 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 593 insertions(+), 0 deletions(-)
 create mode 100644 fs/exofs/Kbuild
 create mode 100644 fs/exofs/Kconfig
 create mode 100644 fs/exofs/common.h
 create mode 100644 fs/exofs/exofs.h
 create mode 100644 fs/exofs/osd.c

diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
new file mode 100644
index 0000000..c806738
--- /dev/null
+++ b/fs/exofs/Kbuild
@@ -0,0 +1,30 @@
+#
+# Kbuild for the EXOFS module
+#
+# Copyright (C) 2008 Panasas Inc.  All rights reserved.
+#
+# Authors:
+#   Boaz Harrosh <bharrosh@panasas.com>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License version 2
+#
+# Kbuild - Gets included from the Kernels Makefile and build system
+#
+
+ifneq ($(OSD_INC),)
+# we are built out-of-tree Kconfigure everything as on
+
+CONFIG_EXOFS_FS=m
+ccflags-y += -DCONFIG_EXOFS_FS -DCONFIG_EXOFS_FS_MODULE
+# ccflags-y += -DCONFIG_EXOFS_DEBUG
+
+# if we are built out-of-tree and the hosting kernel has OSD headers
+# then "ccflags-y +=" will not pick the out-of-tree headers. Only by doing
+# this it will work. This might break in future kernels
+KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)
+
+endif
+
+exofs-objs := osd.o
+obj-$(CONFIG_EXOFS_FS) += exofs.o
diff --git a/fs/exofs/Kconfig b/fs/exofs/Kconfig
new file mode 100644
index 0000000..86194b2
--- /dev/null
+++ b/fs/exofs/Kconfig
@@ -0,0 +1,13 @@
+config EXOFS_FS
+	tristate "exofs: OSD based file system support"
+	depends on SCSI_OSD_ULD
+	help
+	  EXOFS is a file system that uses an OSD storage device,
+	  as its backing storage.
+
+# Debugging-related stuff
+config EXOFS_DEBUG
+	bool "Enable debugging"
+	depends on EXOFS_FS
+	help
+	  This option enables EXOFS debug prints.
diff --git a/fs/exofs/common.h b/fs/exofs/common.h
new file mode 100644
index 0000000..894487e
--- /dev/null
+++ b/fs/exofs/common.h
@@ -0,0 +1,181 @@
+/*
+ * common.h - Common definitions for both Kernel and user-mode utilities
+ *
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger (avishay@gmail.com) (avishay@il.ibm.com)
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <bharrosh@panasas.com>
+ *
+ * Copyrights for code taken from ext2:
+ *     Copyright (C) 1992, 1993, 1994, 1995
+ *     Remy Card (card@masi.ibp.fr)
+ *     Laboratoire MASI - Institut Blaise Pascal
+ *     Universite Pierre et Marie Curie (Paris VI)
+ *     from
+ *     linux/fs/minix/inode.c
+ *     Copyright (C) 1991, 1992  Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.  Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#ifndef __EXOFS_COM_H__
+#define __EXOFS_COM_H__
+
+#include <linux/types.h>
+
+#include <scsi/osd_attributes.h>
+#include <scsi/osd_initiator.h>
+#include <scsi/osd_sec.h>
+
+/****************************************************************************
+ * Object ID related defines
+ * NOTE: inode# = object ID - EXOFS_OBJ_OFF
+ ****************************************************************************/
+#define EXOFS_MIN_PID   0x10000	/* Smallest partition ID */
+#define EXOFS_OBJ_OFF	0x10000	/* offset for objects */
+#define EXOFS_SUPER_ID	0x10000	/* object ID for on-disk superblock */
+#define EXOFS_ROOT_ID	0x10002	/* object ID for root directory */
+
+/* exofs Application specific page/attribute */
+# define EXOFS_APAGE_FS_DATA	(OSD_APAGE_APP_DEFINED_FIRST + 3)
+# define EXOFS_ATTR_INODE_DATA	1
+
+/*
+ * The maximum number of files we can have is limited by the size of the
+ * inode number.  This is the largest object ID that the file system supports.
+ * Object IDs 0, 1, and 2 are always in use (see above defines).
+ */
+enum {
+	EXOFS_UINT64_MAX = (~0LL),
+	EXOFS_MAX_INO_ID = (sizeof(ino_t) * 8 == 64) ? EXOFS_UINT64_MAX :
+					(1LL << (sizeof(ino_t) * 8 - 1)),
+	EXOFS_MAX_ID	 = (EXOFS_MAX_INO_ID - 1 - EXOFS_OBJ_OFF),
+};
+
+/****************************************************************************
+ * Misc.
+ ****************************************************************************/
+#define EXOFS_BLKSHIFT	12
+#define EXOFS_BLKSIZE	(1UL << EXOFS_BLKSHIFT)
+
+/****************************************************************************
+ * superblock-related things
+ ****************************************************************************/
+#define EXOFS_SUPER_MAGIC	0x5DF5
+
+/*
+ * The file system control block - stored in an object's data (mainly, the one
+ * with ID EXOFS_SUPER_ID).  This is where the in-memory superblock is stored
+ * on disk.  Right now it just has a magic value, which is basically a sanity
+ * check on our ability to communicate with the object store.
+ */
+struct exofs_fscb {
+	__le64  s_nextid;	/* Highest object ID used */
+	__le32  s_numfiles;	/* Number of files on fs */
+	__le16  s_magic;	/* Magic signature */
+	__le16  s_newfs;	/* Non-zero if this is a new fs */
+};
+
+/****************************************************************************
+ * inode-related things
+ ****************************************************************************/
+#define EXOFS_IDATA		5
+
+/*
+ * The file control block - stored in an object's attributes.  This is where
+ * the in-memory inode is stored on disk.
+ */
+struct exofs_fcb {
+	__le64  i_size;			/* Size of the file */
+	__le16  i_mode;         	/* File mode */
+	__le16  i_links_count;  	/* Links count */
+	__le32  i_uid;          	/* Owner Uid */
+	__le32  i_gid;          	/* Group Id */
+	__le32  i_atime;        	/* Access time */
+	__le32  i_ctime;        	/* Creation time */
+	__le32  i_mtime;        	/* Modification time */
+	__le32  i_flags;        	/* File flags (unused for now)*/
+	__le32  i_generation;   	/* File version (for NFS) */
+	__le32  i_data[EXOFS_IDATA];	/* Short symlink names and device #s */
+};
+
+#define EXOFS_INO_ATTR_SIZE	sizeof(struct exofs_fcb)
+
+/* This is the Attribute the fcb is stored in */
+static const struct __weak osd_attr g_attr_inode_data = ATTR_DEF(
+	EXOFS_APAGE_FS_DATA,
+	EXOFS_ATTR_INODE_DATA,
+	EXOFS_INO_ATTR_SIZE);
+
+/****************************************************************************
+ * dentry-related things
+ ****************************************************************************/
+#define EXOFS_NAME_LEN	255
+
+/*
+ * The on-disk directory entry
+ */
+struct exofs_dir_entry {
+	__le64		inode_no;		/* inode number           */
+	__le16		rec_len;		/* directory entry length */
+	u8		name_len;		/* name length            */
+	u8		file_type;		/* umm...file type        */
+	char		name[EXOFS_NAME_LEN];	/* file name              */
+};
+
+enum {
+	EXOFS_FT_UNKNOWN,
+	EXOFS_FT_REG_FILE,
+	EXOFS_FT_DIR,
+	EXOFS_FT_CHRDEV,
+	EXOFS_FT_BLKDEV,
+	EXOFS_FT_FIFO,
+	EXOFS_FT_SOCK,
+	EXOFS_FT_SYMLINK,
+	EXOFS_FT_MAX
+};
+
+#define EXOFS_DIR_PAD			4
+#define EXOFS_DIR_ROUND			(EXOFS_DIR_PAD - 1)
+#define EXOFS_DIR_REC_LEN(name_len) \
+	(((name_len) + offsetof(struct exofs_dir_entry, name)  + \
+	  EXOFS_DIR_ROUND) & ~EXOFS_DIR_ROUND)
+
+/*************************
+ * function declarations *
+ *************************/
+/* osd.c                 */
+void exofs_make_credential(u8 cred_a[OSD_CAP_LEN],
+			   const struct osd_obj_id *obj);
+
+int exofs_check_ok(struct osd_request *or);
+int exofs_sync_op(struct osd_request *or, int timeout, u8 *cred);
+int exofs_async_op(struct osd_request *or,
+	osd_req_done_fn *async_done, void *caller_context, u8 *cred);
+
+int extract_attr_from_req(struct osd_request *or, struct osd_attr *attr);
+
+int osd_req_read_kern(struct osd_request *or,
+	const struct osd_obj_id *obj, u64 offset, void *buff, u64 len);
+
+int osd_req_write_kern(struct osd_request *or,
+	const struct osd_obj_id *obj, u64 offset, void *buff, u64 len);
+
+#endif /*ifndef __EXOFS_COM_H__*/
diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
new file mode 100644
index 0000000..0637237
--- /dev/null
+++ b/fs/exofs/exofs.h
@@ -0,0 +1,139 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger (avishay@gmail.com) (avishay@il.ibm.com)
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <bharrosh@panasas.com>
+ *
+ * Copyrights for code taken from ext2:
+ *     Copyright (C) 1992, 1993, 1994, 1995
+ *     Remy Card (card@masi.ibp.fr)
+ *     Laboratoire MASI - Institut Blaise Pascal
+ *     Universite Pierre et Marie Curie (Paris VI)
+ *     from
+ *     linux/fs/minix/inode.c
+ *     Copyright (C) 1991, 1992  Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.  Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include <linux/fs.h>
+#include <linux/time.h>
+#include "common.h"
+
+#ifndef __EXOFS_H__
+#define __EXOFS_H__
+
+#define EXOFS_ERR(fmt, a...) printk(KERN_ERR "exofs: " fmt, ##a)
+
+#ifdef CONFIG_EXOFS_DEBUG
+#define EXOFS_DBGMSG(fmt, a...) \
+	printk(KERN_NOTICE "exofs @%s:%d: " fmt, __func__, __LINE__, ##a)
+#else
+#define EXOFS_DBGMSG(fmt, a...) \
+	do {} while (0)
+#endif
+
+/* u64 has problems with printk this will cast it to unsigned long long */
+#define _LLU(x) (unsigned long long)(x)
+
+/*
+ * our extension to the in-memory superblock
+ */
+struct exofs_sb_info {
+	struct osd_dev	*s_dev;			/* returned by get_osd_dev    */
+	osd_id		s_pid;			/* partition ID of file system*/
+	int		s_timeout;		/* timeout for OSD operations */
+	uint64_t	s_nextid;		/* highest object ID used     */
+	uint32_t	s_numfiles;		/* number of files on fs      */
+	spinlock_t	s_next_gen_lock;	/* spinlock for gen # update  */
+	u32		s_next_generation;	/* next gen # to use          */
+	atomic_t	s_curr_pending;		/* number of pending commands */
+	uint8_t		s_cred[OSD_CAP_LEN];	/* all-powerful credential    */
+};
+
+/*
+ * our extension to the in-memory inode
+ */
+struct exofs_i_info {
+	unsigned long  i_flags;            /* various atomic flags            */
+	uint32_t       i_data[EXOFS_IDATA];/*short symlink names and device #s*/
+	uint32_t       i_dir_start_lookup; /* which page to start lookup      */
+	wait_queue_head_t i_wq;            /* wait queue for inode            */
+	uint64_t       i_commit_size;      /* the object's written length     */
+	uint8_t        i_cred[OSD_CAP_LEN];/* all-powerful credential         */
+	struct inode   vfs_inode;          /* normal in-memory inode          */
+};
+
+/*
+ * our inode flags
+ */
+#define OBJ_2BCREATED	0	/* object will be created soon*/
+#define OBJ_CREATED	1	/* object has been created on the osd*/
+
+static inline int obj_2bcreated(struct exofs_i_info *oi)
+{
+	return test_bit(OBJ_2BCREATED, &(oi->i_flags));
+}
+
+static inline void set_obj_2bcreated(struct exofs_i_info *oi)
+{
+	set_bit(OBJ_2BCREATED, &(oi->i_flags));
+}
+
+static inline int obj_created(struct exofs_i_info *oi)
+{
+	return test_bit(OBJ_CREATED, &(oi->i_flags));
+}
+
+static inline void set_obj_created(struct exofs_i_info *oi)
+{
+	set_bit(OBJ_CREATED, &(oi->i_flags));
+}
+
+int __exofs_wait_obj_created(struct exofs_i_info *oi);
+static inline int wait_obj_created(struct exofs_i_info *oi)
+{
+	if (likely(obj_created(oi)))
+		return 0;
+
+	return __exofs_wait_obj_created(oi);
+}
+
+/*
+ * get to our inode from the vfs inode
+ */
+static inline struct exofs_i_info *exofs_i(struct inode *inode)
+{
+	return container_of(inode, struct exofs_i_info, vfs_inode);
+}
+
+/*************************
+ * function declarations *
+ *************************/
+/* osd.c                 */
+int osd_req_read_pages(struct osd_request *or,
+	const struct osd_obj_id *, u64 offset, u64 length,
+	struct page **pages, int page_count);
+
+int osd_req_write_pages(struct osd_request *or,
+	const struct osd_obj_id *, u64 offset, u64 length,
+	struct page **pages, int page_count);
+
+#endif
diff --git a/fs/exofs/osd.c b/fs/exofs/osd.c
new file mode 100644
index 0000000..84c573d
--- /dev/null
+++ b/fs/exofs/osd.c
@@ -0,0 +1,230 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger (avishay@gmail.com) (avishay@il.ibm.com)
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <bharrosh@panasas.com>
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.  Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include <scsi/scsi_device.h>
+#include <scsi/osd_sense.h>
+
+#include "exofs.h"
+
+int exofs_check_ok(struct osd_request *or)
+{
+	struct osd_sense_info osi;
+	int ret = osd_req_decode_sense(or, &osi);
+
+	if (ret) { /* translate to Linux codes */
+		if (osi.additional_code == scsi_invalid_field_in_cdb) {
+			if (osi.cdb_field_offset == OSD_CFO_STARTING_BYTE)
+				ret = -EFAULT;
+			else if (osi.cdb_field_offset == OSD_CFO_OBJECT_ID)
+				ret = -ENOENT;
+			else
+				ret = -EINVAL;
+		} else if (osi.additional_code == osd_quota_error)
+			ret = -ENOSPC;
+		else
+			ret = -EIO;
+	}
+
+	return ret;
+}
+
+void exofs_make_credential(u8 cred_a[OSD_CAP_LEN], const struct osd_obj_id *obj)
+{
+	osd_sec_init_nosec_doall_caps(cred_a, obj, false, true);
+}
+
+/*
+ * Perform a synchronous OSD operation.
+ */
+int exofs_sync_op(struct osd_request *or, int timeout, uint8_t *credential)
+{
+	int ret;
+
+	or->timeout = timeout;
+	ret = osd_finalize_request(or, 0, credential, NULL);
+	if (ret) {
+		EXOFS_DBGMSG("Failed to osd_finalize_request() => %d\n", ret);
+		return ret;
+	}
+
+	ret = osd_execute_request(or);
+
+	if (ret)
+		EXOFS_DBGMSG("osd_execute_request() => %d\n", ret);
+	/* osd_req_decode_sense(or, ret); */
+	return ret;
+}
+
+/*
+ * Perform an asynchronous OSD operation.
+ */
+int exofs_async_op(struct osd_request *or, osd_req_done_fn *async_done,
+		   void *caller_context, u8 *cred)
+{
+	int ret;
+
+	ret = osd_finalize_request(or, 0, cred, NULL);
+	if (ret) {
+		EXOFS_DBGMSG("Failed to osd_finalize_request() => %d\n", ret);
+		return ret;
+	}
+
+	ret = osd_execute_request_async(or, async_done, caller_context);
+
+	if (ret)
+		EXOFS_DBGMSG("osd_execute_request_async() => %d\n", ret);
+	return ret;
+}
+
+int extract_attr_from_req(struct osd_request *or, struct osd_attr *attr)
+{
+	struct osd_attr cur_attr = {.attr_page = 0}; /* start with zeros */
+	void *iter = NULL;
+	int nelem;
+
+	do {
+		nelem = 1;
+		osd_req_decode_get_attr_list(or, &cur_attr, &nelem, &iter);
+		if ((cur_attr.attr_page == attr->attr_page) &&
+		    (cur_attr.attr_id == attr->attr_id)) {
+			attr->len = cur_attr.len;
+			attr->val_ptr = cur_attr.val_ptr;
+			return 0;
+		}
+	} while (iter);
+
+	return -EIO;
+}
+
+static void _osd_read(struct osd_request *or,
+	const struct osd_obj_id *obj, uint64_t offset, struct bio *bio)
+{
+	osd_req_read(or, obj, bio, offset);
+	EXOFS_DBGMSG("osd_req_read(p=%llX, ob=%llX, l=%llu, of=%llu)\n",
+		_LLU(obj->partition), _LLU(obj->id), _LLU(bio->bi_size),
+		_LLU(offset));
+}
+
+#ifdef __KERNEL__
+static struct bio *_bio_map_pages(struct request_queue *req_q,
+				  struct page **pages, unsigned page_count,
+				  size_t length, gfp_t gfp_mask)
+{
+	struct bio *bio;
+	int i;
+
+	bio = bio_alloc(gfp_mask, page_count);
+	if (!bio) {
+		EXOFS_DBGMSG("Failed to bio_alloc page_count=%d\n", page_count);
+		return NULL;
+	}
+
+	for (i = 0; i < page_count && length; i++) {
+		size_t use_len = min(length, PAGE_SIZE);
+
+		if (use_len !=
+		    bio_add_pc_page(req_q, bio, pages[i], use_len, 0)) {
+			EXOFS_ERR("Failed bio_add_pc_page req_q=%p pages[i]=%p "
+				  "use_len=%Zd page_count=%d length=%Zd\n",
+				  req_q, pages[i], use_len, page_count, length);
+			bio_put(bio);
+			return NULL;
+		}
+
+		length -= use_len;
+	}
+
+	WARN_ON(length);
+	return bio;
+}
+
+int osd_req_read_pages(struct osd_request *or,
+	const struct osd_obj_id *obj, u64 offset, u64 length,
+	struct page **pages, int page_count)
+{
+	struct request_queue *req_q = or->osd_dev->scsi_device->request_queue;
+	struct bio *bio = _bio_map_pages(req_q, pages, page_count, length,
+					 GFP_KERNEL);
+
+	if (!bio)
+		return -ENOMEM;
+
+	_osd_read(or, obj, offset, bio);
+	return 0;
+}
+#endif /* def __KERNEL__ */
+
+int osd_req_read_kern(struct osd_request *or,
+	const struct osd_obj_id *obj, u64 offset, void *buff, u64 len)
+{
+	struct request_queue *req_q = or->osd_dev->scsi_device->request_queue;
+	struct bio *bio = bio_map_kern(req_q, buff, len, GFP_KERNEL);
+
+	if (!bio)
+		return -ENOMEM;
+
+	_osd_read(or, obj, offset, bio);
+	return 0;
+}
+
+static void _osd_write(struct osd_request *or,
+	const struct osd_obj_id *obj, uint64_t offset, struct bio *bio)
+{
+	osd_req_write(or, obj, bio, offset);
+	EXOFS_DBGMSG("osd_req_write(p=%llX, ob=%llX, l=%llu, of=%llu)\n",
+		_LLU(obj->partition), _LLU(obj->id), _LLU(bio->bi_size),
+		_LLU(offset));
+}
+
+#ifdef __KERNEL__
+int osd_req_write_pages(struct osd_request *or,
+	const struct osd_obj_id *obj, u64 offset, u64 length,
+	struct page **pages, int page_count)
+{
+	struct request_queue *req_q = or->osd_dev->scsi_device->request_queue;
+	struct bio *bio = _bio_map_pages(req_q, pages, page_count, length,
+					 GFP_KERNEL);
+
+	if (!bio)
+		return -ENOMEM;
+
+	_osd_write(or, obj, offset, bio);
+	return 0;
+}
+#endif /* def __KERNEL__ */
+
+int osd_req_write_kern(struct osd_request *or,
+	const struct osd_obj_id *obj, u64 offset, void *buff, u64 len)
+{
+	struct request_queue *req_q = or->osd_dev->scsi_device->request_queue;
+	struct bio *bio = bio_map_kern(req_q, buff, len, GFP_KERNEL);
+
+	if (!bio)
+		return -ENOMEM;
+
+	_osd_write(or, obj, offset, bio);
+	return 0;
+}
-- 
1.6.0.1



* [PATCH 2/8] exofs: file and file_inode operations
  2009-02-09 13:07 [PATCHSET 0/8 version 3] exofs Boaz Harrosh
  2009-02-09 13:12 ` [PATCH 1/8] exofs: Kbuild, Headers and osd utils Boaz Harrosh
@ 2009-02-09 13:18 ` Boaz Harrosh
  2009-02-09 13:20 ` [PATCH 3/8] exofs: symlink_inode and fast_symlink_inode operations Boaz Harrosh
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2009-02-09 13:18 UTC (permalink / raw)
  To: Avishay Traeger, Jeff Garzik, Andrew Morton, linux-fsdevel, open-osd
  Cc: linux-kernel, James Bottomley

Implementation of the file_operations and inode_operations for
regular data files.

Most file_operations are generic vfs implementations except:
- exofs_truncate will truncate the OSD object as well
- The generic file_fsync is not good for non-bd devices, so open-code it
- The default for .flush in Linux is to do nothing, so call exofs_file_fsync
  on the file.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 fs/exofs/Kbuild  |    2 +-
 fs/exofs/exofs.h |   11 ++++
 fs/exofs/file.c  |   82 ++++++++++++++++++++++++++++++
 fs/exofs/inode.c |  148 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 242 insertions(+), 1 deletions(-)
 create mode 100644 fs/exofs/file.c
 create mode 100644 fs/exofs/inode.c

diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
index c806738..d8bd4d5 100644
--- a/fs/exofs/Kbuild
+++ b/fs/exofs/Kbuild
@@ -26,5 +26,5 @@ KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)
 
 endif
 
-exofs-objs := osd.o
+exofs-objs := osd.o inode.o file.o
 obj-$(CONFIG_EXOFS_FS) += exofs.o
diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
index 0637237..b6d9089 100644
--- a/fs/exofs/exofs.h
+++ b/fs/exofs/exofs.h
@@ -136,4 +136,15 @@ int osd_req_write_pages(struct osd_request *or,
 	const struct osd_obj_id *, u64 offset, u64 length,
 	struct page **pages, int page_count);
 
+/* inode.c               */
+void exofs_truncate(struct inode *inode);
+int exofs_setattr(struct dentry *, struct iattr *);
+
+/*********************
+ * operation vectors *
+ *********************/
+/* file.c            */
+extern const struct inode_operations exofs_file_inode_operations;
+extern const struct file_operations exofs_file_operations;
+
 #endif
diff --git a/fs/exofs/file.c b/fs/exofs/file.c
new file mode 100644
index 0000000..4738c3f
--- /dev/null
+++ b/fs/exofs/file.c
@@ -0,0 +1,82 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger (avishay@gmail.com) (avishay@il.ibm.com)
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <bharrosh@panasas.com>
+ *
+ * Copyrights for code taken from ext2:
+ *     Copyright (C) 1992, 1993, 1994, 1995
+ *     Remy Card (card@masi.ibp.fr)
+ *     Laboratoire MASI - Institut Blaise Pascal
+ *     Universite Pierre et Marie Curie (Paris VI)
+ *     from
+ *     linux/fs/minix/inode.c
+ *     Copyright (C) 1991, 1992  Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.  Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include <linux/buffer_head.h>
+
+#include "exofs.h"
+
+static int exofs_release_file(struct inode *inode, struct file *filp)
+{
+	return 0;
+}
+
+static int exofs_file_fsync(struct file *filp, struct dentry *dentry,
+			    int datasync)
+{
+	int ret1, ret2;
+	struct address_space *mapping = filp->f_mapping;
+
+	ret1 = filemap_write_and_wait(mapping);
+	ret2 = file_fsync(filp, dentry, datasync);
+
+	return ret1 ? ret1 : ret2;
+}
+
+static int exofs_flush(struct file *file, fl_owner_t id)
+{
+	exofs_file_fsync(file, file->f_path.dentry, 1);
+	/* TODO: Flush the OSD target */
+	return 0;
+}
+
+const struct file_operations exofs_file_operations = {
+	.llseek		= generic_file_llseek,
+	.read		= do_sync_read,
+	.write		= do_sync_write,
+	.aio_read	= generic_file_aio_read,
+	.aio_write	= generic_file_aio_write,
+	.mmap		= generic_file_mmap,
+	.open		= generic_file_open,
+	.release	= exofs_release_file,
+	.fsync		= exofs_file_fsync,
+	.flush		= exofs_flush,
+	.splice_read	= generic_file_splice_read,
+	.splice_write	= generic_file_splice_write,
+};
+
+const struct inode_operations exofs_file_inode_operations = {
+	.truncate	= exofs_truncate,
+	.setattr	= exofs_setattr,
+};
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
new file mode 100644
index 0000000..b0bda1e
--- /dev/null
+++ b/fs/exofs/inode.c
@@ -0,0 +1,148 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger (avishay@gmail.com) (avishay@il.ibm.com)
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <bharrosh@panasas.com>
+ *
+ * Copyrights for code taken from ext2:
+ *     Copyright (C) 1992, 1993, 1994, 1995
+ *     Remy Card (card@masi.ibp.fr)
+ *     Laboratoire MASI - Institut Blaise Pascal
+ *     Universite Pierre et Marie Curie (Paris VI)
+ *     from
+ *     linux/fs/minix/inode.c
+ *     Copyright (C) 1991, 1992  Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.  Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include <linux/writeback.h>
+#include <linux/buffer_head.h>
+
+#include "exofs.h"
+
+#ifdef CONFIG_EXOFS_DEBUG
+#  define EXOFS_DEBUG_OBJ_ISIZE 1
+#endif
+
+/******************************************************************************
+ * INODE OPERATIONS
+ *****************************************************************************/
+
+/*
+ * Test whether an inode is a fast symlink.
+ */
+static inline int exofs_inode_is_fast_symlink(struct inode *inode)
+{
+	struct exofs_i_info *oi = exofs_i(inode);
+
+	return S_ISLNK(inode->i_mode) && (oi->i_data[0] != 0);
+}
+
+/*
+ * get_block_t - Fill in a buffer_head
+ * An OSD takes care of block allocation so we just fake an allocation by
+ * putting in the inode's sector_t in the buffer_head.
+ * TODO: What about the case of create==0 and @iblock does not exist in the
+ * object?
+ */
+static int exofs_get_block(struct inode *inode, sector_t iblock,
+		    struct buffer_head *bh_result, int create)
+{
+	map_bh(bh_result, inode->i_sb, iblock);
+	return 0;
+}
+
+const struct osd_attr g_attr_logical_length = ATTR_DEF(
+	OSD_APAGE_OBJECT_INFORMATION, OSD_ATTR_OI_LOGICAL_LENGTH, 8);
+
+/*
+ * Truncate a file to the specified size - all we have to do is set the size
+ * attribute.  We make sure the object exists first.
+ */
+void exofs_truncate(struct inode *inode)
+{
+	struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
+	struct exofs_i_info *oi = exofs_i(inode);
+	struct osd_obj_id obj = {sbi->s_pid, inode->i_ino + EXOFS_OBJ_OFF};
+	struct osd_request *or;
+	struct osd_attr attr;
+	loff_t isize = i_size_read(inode);
+	__be64 newsize;
+	int ret;
+
+	if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode)
+	     || S_ISLNK(inode->i_mode)))
+		return;
+	if (exofs_inode_is_fast_symlink(inode))
+		return;
+	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+		return;
+	inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+
+	nobh_truncate_page(inode->i_mapping, isize, exofs_get_block);
+
+	or = osd_start_request(sbi->s_dev, GFP_KERNEL);
+	if (unlikely(!or)) {
+		EXOFS_ERR("ERROR: exofs_truncate: osd_start_request failed\n");
+		goto fail;
+	}
+
+	osd_req_set_attributes(or, &obj);
+
+	newsize = cpu_to_be64((u64)isize);
+	attr = g_attr_logical_length;
+	attr.val_ptr = &newsize;
+	osd_req_add_set_attr_list(or, &attr, 1);
+
+	/* if we are about to truncate an object, and it hasn't been
+	 * created yet, wait
+	 */
+	if (unlikely(wait_obj_created(oi)))
+		goto fail;
+
+	ret = exofs_sync_op(or, sbi->s_timeout, oi->i_cred);
+	osd_end_request(or);
+	if (ret)
+		goto fail;
+
+out:
+	mark_inode_dirty(inode);
+	return;
+fail:
+	make_bad_inode(inode);
+	goto out;
+}
+
+/*
+ * Set inode attributes - just call generic functions.
+ */
+int exofs_setattr(struct dentry *dentry, struct iattr *iattr)
+{
+	struct inode *inode = dentry->d_inode;
+	int error;
+
+	error = inode_change_ok(inode, iattr);
+	if (error)
+		return error;
+
+	error = inode_setattr(inode, iattr);
+	return error;
+}
-- 
1.6.0.1



* [PATCH 3/8] exofs: symlink_inode and fast_symlink_inode operations
  2009-02-09 13:07 [PATCHSET 0/8 version 3] exofs Boaz Harrosh
  2009-02-09 13:12 ` [PATCH 1/8] exofs: Kbuild, Headers and osd utils Boaz Harrosh
  2009-02-09 13:18 ` [PATCH 2/8] exofs: file and file_inode operations Boaz Harrosh
@ 2009-02-09 13:20 ` Boaz Harrosh
  2009-02-09 13:22 ` [PATCH 4/8] exofs: address_space_operations Boaz Harrosh
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2009-02-09 13:20 UTC (permalink / raw)
  To: Avishay Traeger, Jeff Garzik, Andrew Morton, linux-fsdevel, open-osd
  Cc: linux-kernel, James Bottomley

Generic implementation of symlink ops.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 fs/exofs/Kbuild    |    2 +-
 fs/exofs/exofs.h   |    4 +++
 fs/exofs/symlink.c |   57 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 62 insertions(+), 1 deletions(-)
 create mode 100644 fs/exofs/symlink.c

diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
index d8bd4d5..18c6158 100644
--- a/fs/exofs/Kbuild
+++ b/fs/exofs/Kbuild
@@ -26,5 +26,5 @@ KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)
 
 endif
 
-exofs-objs := osd.o inode.o file.o
+exofs-objs := osd.o inode.o file.o symlink.o
 obj-$(CONFIG_EXOFS_FS) += exofs.o
diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
index b6d9089..9470be3 100644
--- a/fs/exofs/exofs.h
+++ b/fs/exofs/exofs.h
@@ -147,4 +147,8 @@ int exofs_setattr(struct dentry *, struct iattr *);
 extern const struct inode_operations exofs_file_inode_operations;
 extern const struct file_operations exofs_file_operations;
 
+/* symlink.c         */
+extern const struct inode_operations exofs_symlink_inode_operations;
+extern const struct inode_operations exofs_fast_symlink_inode_operations;
+
 #endif
diff --git a/fs/exofs/symlink.c b/fs/exofs/symlink.c
new file mode 100644
index 0000000..36e2d7b
--- /dev/null
+++ b/fs/exofs/symlink.c
@@ -0,0 +1,57 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger (avishay@gmail.com) (avishay@il.ibm.com)
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <bharrosh@panasas.com>
+ *
+ * Copyrights for code taken from ext2:
+ *     Copyright (C) 1992, 1993, 1994, 1995
+ *     Remy Card (card@masi.ibp.fr)
+ *     Laboratoire MASI - Institut Blaise Pascal
+ *     Universite Pierre et Marie Curie (Paris VI)
+ *     from
+ *     linux/fs/minix/inode.c
+ *     Copyright (C) 1991, 1992  Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.  Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include <linux/namei.h>
+
+#include "exofs.h"
+
+static void *exofs_follow_link(struct dentry *dentry, struct nameidata *nd)
+{
+	struct exofs_i_info *oi = exofs_i(dentry->d_inode);
+
+	nd_set_link(nd, (char *)oi->i_data);
+	return NULL;
+}
+
+const struct inode_operations exofs_symlink_inode_operations = {
+	.readlink	= generic_readlink,
+	.follow_link	= page_follow_link_light,
+	.put_link	= page_put_link,
+};
+
+const struct inode_operations exofs_fast_symlink_inode_operations = {
+	.readlink	= generic_readlink,
+	.follow_link	= exofs_follow_link,
+};
-- 
1.6.0.1



* [PATCH 4/8] exofs: address_space_operations
  2009-02-09 13:07 [PATCHSET 0/8 version 3] exofs Boaz Harrosh
                   ` (2 preceding siblings ...)
  2009-02-09 13:20 ` [PATCH 3/8] exofs: symlink_inode and fast_symlink_inode operations Boaz Harrosh
@ 2009-02-09 13:22 ` Boaz Harrosh
  2009-02-09 13:24 ` [PATCH 5/8] exofs: dir_inode and directory operations Boaz Harrosh
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2009-02-09 13:22 UTC (permalink / raw)
  To: Avishay Traeger, Jeff Garzik, Andrew Morton, linux-fsdevel, open-osd
  Cc: linux-kernel, James Bottomley

OK, now we start to read from and write to osd-objects, page-by-page.
The page index, shifted left by PAGE_CACHE_SHIFT, is the byte offset into the object.
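
For illustration only, the offset and length arithmetic that __readpage_filler
does for a single page can be sketched in userspace roughly as follows. The
SKETCH_* names and the 4 KiB page size are assumptions made for this example;
the patch itself uses PAGE_CACHE_SHIFT and PAGE_CACHE_SIZE.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages. The kernel code uses PAGE_CACHE_SHIFT instead. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)

/* Byte offset inside the object at which page `index` starts. */
static uint64_t sketch_page_to_obj_offset(uint64_t index)
{
	return index << SKETCH_PAGE_SHIFT;
}

/*
 * Number of bytes of page `index` that are backed by object data when the
 * file is i_size bytes long. A result of 0 means the page lies past the end
 * of the file (or exactly at it), so it is simply zero-filled.
 */
static uint64_t sketch_page_io_len(uint64_t index, uint64_t i_size)
{
	uint64_t end_index = i_size >> SKETCH_PAGE_SHIFT;

	if (index < end_index)
		return SKETCH_PAGE_SIZE;	/* full page inside the file */
	if (index > end_index)
		return 0;			/* completely out of bounds */
	return i_size & (SKETCH_PAGE_SIZE - 1);	/* partial last page */
}
```

Under these assumptions, a 5000-byte file has a full first page and a second
page of which only the first 904 bytes are read from the object; the rest of
that page is zeroed.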

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 fs/exofs/exofs.h |    6 +
 fs/exofs/inode.c |  322 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 328 insertions(+), 0 deletions(-)

diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
index 9470be3..59163eb 100644
--- a/fs/exofs/exofs.h
+++ b/fs/exofs/exofs.h
@@ -139,6 +139,9 @@ int osd_req_write_pages(struct osd_request *or,
 /* inode.c               */
 void exofs_truncate(struct inode *inode);
 int exofs_setattr(struct dentry *, struct iattr *);
+int exofs_write_begin(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned flags,
+		struct page **pagep, void **fsdata);
 
 /*********************
  * operation vectors *
@@ -147,6 +150,9 @@ int exofs_setattr(struct dentry *, struct iattr *);
 extern const struct inode_operations exofs_file_inode_operations;
 extern const struct file_operations exofs_file_operations;
 
+/* inode.c           */
+extern const struct address_space_operations exofs_aops;
+
 /* symlink.c         */
 extern const struct inode_operations exofs_symlink_inode_operations;
 extern const struct inode_operations exofs_fast_symlink_inode_operations;
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index b0bda1e..f4979ea 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -42,6 +42,328 @@
 #  define EXOFS_DEBUG_OBJ_ISIZE 1
 #endif
 
+/*
+ * Callback for readpage
+ */
+static int __readpage_done(struct osd_request *or, void *p, int unlock)
+{
+	struct page *page = p;
+	struct inode *inode = page->mapping->host;
+	struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
+	int ret;
+
+	ret = exofs_check_ok(or);
+	osd_end_request(or);
+
+	EXOFS_DBGMSG("ret=>%d unlock=%d page=%p\n", ret, unlock, page);
+
+	if (ret == 0) {
+		/* Everything is OK */
+		SetPageUptodate(page);
+		if (PageError(page))
+			ClearPageError(page);
+	} else if (ret == -EFAULT) {
+		/* In this case we were trying to read something that wasn't on
+		 * disk yet - return a page full of zeroes.  This should be OK,
+		 * because the object should be empty (if there was a write
+		 * before this read, the read would be waiting with the page
+		 * locked). */
+		clear_highpage(page);
+
+		SetPageUptodate(page);
+		if (PageError(page))
+			ClearPageError(page);
+	} else /* Error */
+		SetPageError(page);
+
+	atomic_dec(&sbi->s_curr_pending);
+	if (unlock)
+		unlock_page(page);
+
+	return ret;
+}
+
+static void readpage_done(struct osd_request *or, void *p)
+{
+	__readpage_done(or, p, true);
+}
+
+/*
+ * Read a page from the OSD
+ */
+static int __readpage_filler(struct page *page, bool is_async)
+{
+	struct osd_request *or = NULL;
+	struct inode *inode = page->mapping->host;
+	struct exofs_i_info *oi = exofs_i(inode);
+	ino_t ino = inode->i_ino;
+	loff_t i_size = i_size_read(inode);
+	loff_t i_start = page->index << PAGE_CACHE_SHIFT;
+	pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
+	struct super_block *sb = inode->i_sb;
+	struct exofs_sb_info *sbi = sb->s_fs_info;
+	struct osd_obj_id obj = {sbi->s_pid, ino + EXOFS_OBJ_OFF};
+	uint64_t amount;
+	int ret = 0;
+
+	BUG_ON(!PageLocked(page));
+
+	if (PageUptodate(page))
+		goto unlock;
+
+	if (page->index < end_index)
+		amount = PAGE_CACHE_SIZE;
+	else
+		amount = i_size & (PAGE_CACHE_SIZE - 1);
+
+	/* this will be out of bounds, or doesn't exist yet */
+	if ((page->index >= end_index + 1) || !obj_created(oi) || !amount
+	    /*|| (i_start >= oi->i_commit_size)*/) {
+		clear_highpage(page);
+
+		SetPageUptodate(page);
+		if (PageError(page))
+			ClearPageError(page);
+		goto unlock;
+	}
+
+	if (amount != PAGE_CACHE_SIZE)
+		zero_user(page, amount, PAGE_CACHE_SIZE - amount);
+
+	or = osd_start_request(sbi->s_dev, GFP_KERNEL);
+	if (unlikely(!or)) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	ret = osd_req_read_pages(or, &obj, i_start, amount, &page, 1);
+	if (unlikely(ret))
+		goto err;
+
+	atomic_inc(&sbi->s_curr_pending);
+	if (is_async) {
+		ret = exofs_async_op(or, readpage_done, page, oi->i_cred);
+		if (unlikely(ret)) {
+			atomic_dec(&sbi->s_curr_pending);
+			goto err;
+		}
+	} else {
+		exofs_sync_op(or, sbi->s_timeout, oi->i_cred);
+		ret = __readpage_done(or, page, false);
+	}
+
+	EXOFS_DBGMSG("ret=>%d unlock=%d page=%p\n", ret, is_async, page);
+	return ret;
+
+err:
+	if (or)
+		osd_end_request(or);
+	SetPageError(page);
+	EXOFS_DBGMSG("@err\n");
+unlock:
+	if (is_async)
+		unlock_page(page);
+	EXOFS_DBGMSG("@unlock is_async=%d\n", is_async);
+	return ret;
+}
+
+static int readpage_filler(struct page *page)
+{
+	int ret = __readpage_filler(page, true);
+
+	return ret;
+}
+
+/*
+ * We don't need the file
+ */
+static int exofs_readpage(struct file *file, struct page *page)
+{
+	return readpage_filler(page);
+}
+
+/*
+ * We don't need the data
+ */
+static int readpage_strip(void *data, struct page *page)
+{
+	return readpage_filler(page);
+}
+
+/*
+ * read a bunch of pages - usually for readahead
+ */
+static int exofs_readpages(struct file *file, struct address_space *mapping,
+			   struct list_head *pages, unsigned nr_pages)
+{
+	return read_cache_pages(mapping, pages, readpage_strip, NULL);
+}
+
+/*
+ * Callback function when writepage finishes.  Check for errors, unlock, clean
+ * up, etc.
+ */
+static void writepage_done(struct osd_request *or, void *p)
+{
+	int ret;
+	struct page *page = p;
+	struct inode *inode = page->mapping->host;
+	struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
+
+	ret = exofs_check_ok(or);
+	osd_end_request(or);
+	atomic_dec(&sbi->s_curr_pending);
+
+	if (ret) {
+		if (ret == -ENOSPC)
+			set_bit(AS_ENOSPC, &page->mapping->flags);
+		else
+			set_bit(AS_EIO, &page->mapping->flags);
+
+		SetPageError(page);
+	}
+
+	end_page_writeback(page);
+	unlock_page(page);
+}
+
+/*
+ * Write a page to disk.  page->index gives us the page number.  The page is
+ * locked before this function is called.  We write asynchronously and then the
+ * callback function (writepage_done) is called.  We signify that the operation
+ * has completed by unlocking the page and calling end_page_writeback().
+ */
+static int exofs_writepage(struct page *page, struct writeback_control *wbc)
+{
+	struct inode *inode = page->mapping->host;
+	struct exofs_i_info *oi = exofs_i(inode);
+	struct osd_obj_id obj;
+	loff_t i_size = i_size_read(inode);
+	unsigned long end_index = i_size >> PAGE_CACHE_SHIFT;
+	unsigned offset = 0;
+	struct osd_request *or;
+	struct exofs_sb_info *sbi;
+	uint64_t start;
+	uint64_t len = PAGE_CACHE_SIZE;
+	int ret = 0;
+
+	BUG_ON(!PageLocked(page));
+
+	/* if the object has not been created, and we are not in sync mode,
+	 * just return.  otherwise, wait. */
+	if (!obj_created(oi)) {
+		BUG_ON(!obj_2bcreated(oi));
+
+		if (wbc->sync_mode == WB_SYNC_NONE) {
+			redirty_page_for_writepage(wbc, page);
+			unlock_page(page);
+			ret = 0;
+			goto out;
+		} else
+			wait_event(oi->i_wq, obj_created(oi));
+	}
+
+	/* in this case, the page is within the limits of the file */
+	if (page->index < end_index)
+		goto do_it;
+
+	offset = i_size & (PAGE_CACHE_SIZE - 1);
+	len = offset;
+
+	/*in this case, the page is outside the limits (truncate in progress)*/
+	if (page->index >= end_index + 1 || !offset) {
+		unlock_page(page);
+		goto out;
+	}
+
+do_it:
+	BUG_ON(PageWriteback(page));
+	set_page_writeback(page);
+	start = page->index << PAGE_CACHE_SHIFT;
+	sbi = inode->i_sb->s_fs_info;
+	oi->i_commit_size = min_t(uint64_t, oi->i_commit_size, len + start);
+
+	or = osd_start_request(sbi->s_dev, GFP_KERNEL);
+	if (unlikely(!or)) {
+		EXOFS_ERR("ERROR: writepage failed.\n");
+		ret = -ENOMEM;
+		goto fail;
+	}
+
+	obj.partition = sbi->s_pid;
+	obj.id = inode->i_ino + EXOFS_OBJ_OFF;
+	ret = osd_req_write_pages(or, &obj, start, len, &page, 1);
+	if (ret)
+		goto fail;
+
+	ret = exofs_async_op(or, writepage_done, page, oi->i_cred);
+	if (ret)
+		goto fail;
+
+	atomic_inc(&sbi->s_curr_pending);
+out:
+	return ret;
+fail:
+	if (or)
+		osd_end_request(or);
+	set_bit(AS_EIO, &page->mapping->flags);
+	end_page_writeback(page);
+	unlock_page(page);
+	goto out;
+}
+
+int exofs_write_begin(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned flags,
+		struct page **pagep, void **fsdata)
+{
+	int ret = 0;
+	struct page *page;
+
+	page = *pagep;
+	if (page == NULL) {
+		ret = simple_write_begin(file, mapping, pos, len, flags, pagep,
+					 fsdata);
+		if (ret) {
+			EXOFS_DBGMSG("simple_write_begin failed\n");
+			return ret;
+		}
+
+		page = *pagep;
+	}
+
+	 /* read modify write */
+	if (!PageUptodate(page) && (len != PAGE_CACHE_SIZE)) {
+		ret = __readpage_filler(page, false);
+		if (ret) {
+			/* SetPageError was done by __readpage_filler. OK? */
+			unlock_page(page);
+			EXOFS_DBGMSG("__readpage_filler failed\n");
+		}
+	}
+
+	return ret;
+}
+
+static int exofs_write_begin_export(struct file *file,
+		struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned flags,
+		struct page **pagep, void **fsdata)
+{
+	*pagep = NULL;
+
+	return exofs_write_begin(file, mapping, pos, len, flags, pagep,
+					fsdata);
+}
+
+const struct address_space_operations exofs_aops = {
+	.readpage	= exofs_readpage,
+	.readpages	= exofs_readpages,
+	.writepage	= exofs_writepage,
+	.write_begin	= exofs_write_begin_export,
+	.write_end	= simple_write_end,
+	.writepages	= generic_writepages,
+};
+
 /******************************************************************************
  * INODE OPERATIONS
  *****************************************************************************/
-- 
1.6.0.1



* [PATCH 5/8] exofs: dir_inode and directory operations
  2009-02-09 13:07 [PATCHSET 0/8 version 3] exofs Boaz Harrosh
                   ` (3 preceding siblings ...)
  2009-02-09 13:22 ` [PATCH 4/8] exofs: address_space_operations Boaz Harrosh
@ 2009-02-09 13:24 ` Boaz Harrosh
  2009-02-15 17:08   ` Evgeniy Polyakov
  2009-02-09 13:25 ` [PATCH 6/8] exofs: super_operations and file_system_type Boaz Harrosh
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 36+ messages in thread
From: Boaz Harrosh @ 2009-02-09 13:24 UTC (permalink / raw)
  To: Avishay Traeger, Jeff Garzik, Andrew Morton, linux-fsdevel, open-osd
  Cc: linux-kernel, James Bottomley

Implementation of directory and directory-inode operations.

* A directory is treated as a file, and essentially contains a list
  of <file name, inode #> pairs for files that are found in that
  directory. The object IDs correspond to the files' inode numbers
  (object ID = inode number + EXOFS_OBJ_OFF) and are allocated using
  a 64-bit incrementing global counter.
* Each file's control block (AKA on-disk inode) is stored in its
  object's attributes. This applies to both regular files and other
  types (directories, device files, symlinks, etc.).
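
As a rough illustration of the two points above, here is a userspace sketch of
how much space one <file name, inode #> record takes and how an inode number
maps to an object ID. The struct is a hypothetical stand-in: the field names
are the ones dir.c uses, but the field order, the 4-byte rounding helper and
the obj_off parameter (standing in for EXOFS_OBJ_OFF, whose value is not shown
here) are assumptions, not the definitions from exofs.h.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the on-disk directory entry. Field names follow
 * the ones used in dir.c; the exact layout is an assumption of this sketch.
 */
struct sketch_dir_entry {
	uint64_t inode_no;	/* little-endian on disk; 0 marks an unused entry */
	uint16_t rec_len;	/* little-endian on disk; bytes to the next entry */
	uint8_t  name_len;
	uint8_t  file_type;
	char     name[];	/* name_len bytes of file name */
};

/* Space needed for an entry with a name of name_len bytes, rounded up to 4. */
static unsigned sketch_dir_rec_len(unsigned name_len)
{
	unsigned header = (unsigned)offsetof(struct sketch_dir_entry, name);

	return (name_len + header + 3u) & ~3u;
}

/* Object ID backing an inode; obj_off stands in for EXOFS_OBJ_OFF. */
static uint64_t sketch_ino_to_obj_id(uint64_t ino, uint64_t obj_off)
{
	return ino + obj_off;
}
```

With this assumed layout a one-character name needs a 16-byte record, and
every record length is a multiple of 4, which is what the "rec_len & 3" check
in exofs_check_page expects.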

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 fs/exofs/Kbuild  |    2 +-
 fs/exofs/dir.c   |  643 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/exofs/exofs.h |   26 +++
 fs/exofs/inode.c |  272 +++++++++++++++++++++++
 fs/exofs/namei.c |  338 ++++++++++++++++++++++++++++
 5 files changed, 1280 insertions(+), 1 deletions(-)
 create mode 100644 fs/exofs/dir.c
 create mode 100644 fs/exofs/namei.c

diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
index 18c6158..61162c6 100644
--- a/fs/exofs/Kbuild
+++ b/fs/exofs/Kbuild
@@ -26,5 +26,5 @@ KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)
 
 endif
 
-exofs-objs := osd.o inode.o file.o symlink.o
+exofs-objs := osd.o inode.o file.o symlink.o namei.o dir.o
 obj-$(CONFIG_EXOFS_FS) += exofs.o
diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c
new file mode 100644
index 0000000..245fdee
--- /dev/null
+++ b/fs/exofs/dir.c
@@ -0,0 +1,643 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger (avishay@gmail.com) (avishay@il.ibm.com)
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <bharrosh@panasas.com>
+ *
+ * Copyrights for code taken from ext2:
+ *     Copyright (C) 1992, 1993, 1994, 1995
+ *     Remy Card (card@masi.ibp.fr)
+ *     Laboratoire MASI - Institut Blaise Pascal
+ *     Universite Pierre et Marie Curie (Paris VI)
+ *     from
+ *     linux/fs/minix/inode.c
+ *     Copyright (C) 1991, 1992  Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.  Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include "exofs.h"
+
+static inline unsigned exofs_chunk_size(struct inode *inode)
+{
+	return inode->i_sb->s_blocksize;
+}
+
+static inline void exofs_put_page(struct page *page)
+{
+	kunmap(page);
+	page_cache_release(page);
+}
+
+static inline unsigned long dir_pages(struct inode *inode)
+{
+	return (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+}
+
+static unsigned exofs_last_byte(struct inode *inode, unsigned long page_nr)
+{
+	unsigned last_byte = inode->i_size;
+
+	last_byte -= page_nr << PAGE_CACHE_SHIFT;
+	if (last_byte > PAGE_CACHE_SIZE)
+		last_byte = PAGE_CACHE_SIZE;
+	return last_byte;
+}
+
+static int exofs_commit_chunk(struct page *page, loff_t pos, unsigned len)
+{
+	struct address_space *mapping = page->mapping;
+	struct inode *dir = mapping->host;
+	int err = 0;
+
+	dir->i_version++;
+
+	if (!PageUptodate(page))
+		SetPageUptodate(page);
+
+	if (pos+len > dir->i_size) {
+		i_size_write(dir, pos+len);
+		mark_inode_dirty(dir);
+	}
+	set_page_dirty(page);
+
+	if (IS_DIRSYNC(dir))
+		err = write_one_page(page, 1);
+	else
+		unlock_page(page);
+
+	return err;
+}
+
+static void exofs_check_page(struct page *page)
+{
+	struct inode *dir = page->mapping->host;
+	unsigned chunk_size = exofs_chunk_size(dir);
+	char *kaddr = page_address(page);
+	unsigned offs, rec_len;
+	unsigned limit = PAGE_CACHE_SIZE;
+	struct exofs_dir_entry *p;
+	char *error;
+
+	/* if the page is the last one in the directory */
+	if ((dir->i_size >> PAGE_CACHE_SHIFT) == page->index) {
+		limit = dir->i_size & ~PAGE_CACHE_MASK;
+		if (limit & (chunk_size - 1))
+			goto Ebadsize;
+		if (!limit)
+			goto out;
+	}
+	for (offs = 0; offs <= limit - EXOFS_DIR_REC_LEN(1); offs += rec_len) {
+		p = (struct exofs_dir_entry *)(kaddr + offs);
+		rec_len = le16_to_cpu(p->rec_len);
+
+		if (rec_len < EXOFS_DIR_REC_LEN(1))
+			goto Eshort;
+		if (rec_len & 3)
+			goto Ealign;
+		if (rec_len < EXOFS_DIR_REC_LEN(p->name_len))
+			goto Enamelen;
+		if (((offs + rec_len - 1) ^ offs) & ~(chunk_size-1))
+			goto Espan;
+	}
+	if (offs != limit)
+		goto Eend;
+out:
+	SetPageChecked(page);
+	return;
+
+Ebadsize:
+	EXOFS_ERR("ERROR [exofs_check_page]: "
+		"size of directory #%lu is not a multiple of chunk size",
+		dir->i_ino
+	);
+	goto fail;
+Eshort:
+	error = "rec_len is smaller than minimal";
+	goto bad_entry;
+Ealign:
+	error = "unaligned directory entry";
+	goto bad_entry;
+Enamelen:
+	error = "rec_len is too small for name_len";
+	goto bad_entry;
+Espan:
+	error = "directory entry across blocks";
+	goto bad_entry;
+bad_entry:
+	EXOFS_ERR(
+		"ERROR [exofs_check_page]: bad entry in directory #%lu: %s - "
+		"offset=%lu, inode=%llu, rec_len=%d, name_len=%d",
+		dir->i_ino, error, (page->index<<PAGE_CACHE_SHIFT)+offs,
+		_LLU(le64_to_cpu(p->inode_no)),
+		rec_len, p->name_len);
+	goto fail;
+Eend:
+	p = (struct exofs_dir_entry *)(kaddr + offs);
+	EXOFS_ERR("ERROR [exofs_check_page]: "
+		"entry in directory #%lu spans the page boundary - "
+		"offset=%lu, inode=%llu",
+		dir->i_ino, (page->index<<PAGE_CACHE_SHIFT)+offs,
+		_LLU(le64_to_cpu(p->inode_no)));
+fail:
+	SetPageChecked(page);
+	SetPageError(page);
+}
+
+static struct page *exofs_get_page(struct inode *dir, unsigned long n)
+{
+	struct address_space *mapping = dir->i_mapping;
+	struct page *page = read_mapping_page(mapping, n, NULL);
+
+	if (!IS_ERR(page)) {
+		kmap(page);
+		if (!PageChecked(page))
+			exofs_check_page(page);
+		if (PageError(page))
+			goto fail;
+	}
+	return page;
+
+fail:
+	exofs_put_page(page);
+	return ERR_PTR(-EIO);
+}
+
+static inline int exofs_match(int len, const unsigned char *name,
+					struct exofs_dir_entry *de)
+{
+	if (len != de->name_len)
+		return 0;
+	if (!de->inode_no)
+		return 0;
+	return !memcmp(name, de->name, len);
+}
+
+static inline
+struct exofs_dir_entry *exofs_next_entry(struct exofs_dir_entry *p)
+{
+	return (struct exofs_dir_entry *)((char *)p + le16_to_cpu(p->rec_len));
+}
+
+static inline unsigned
+exofs_validate_entry(char *base, unsigned offset, unsigned mask)
+{
+	struct exofs_dir_entry *de = (struct exofs_dir_entry *)(base + offset);
+	struct exofs_dir_entry *p =
+			(struct exofs_dir_entry *)(base + (offset&mask));
+	while ((char *)p < (char *)de) {
+		if (p->rec_len == 0)
+			break;
+		p = exofs_next_entry(p);
+	}
+	return (char *)p - base;
+}
+
+static unsigned char exofs_filetype_table[EXOFS_FT_MAX] = {
+	[EXOFS_FT_UNKNOWN]	= DT_UNKNOWN,
+	[EXOFS_FT_REG_FILE]	= DT_REG,
+	[EXOFS_FT_DIR]		= DT_DIR,
+	[EXOFS_FT_CHRDEV]	= DT_CHR,
+	[EXOFS_FT_BLKDEV]	= DT_BLK,
+	[EXOFS_FT_FIFO]		= DT_FIFO,
+	[EXOFS_FT_SOCK]		= DT_SOCK,
+	[EXOFS_FT_SYMLINK]	= DT_LNK,
+};
+
+#define S_SHIFT 12
+static unsigned char exofs_type_by_mode[S_IFMT >> S_SHIFT] = {
+	[S_IFREG >> S_SHIFT]	= EXOFS_FT_REG_FILE,
+	[S_IFDIR >> S_SHIFT]	= EXOFS_FT_DIR,
+	[S_IFCHR >> S_SHIFT]	= EXOFS_FT_CHRDEV,
+	[S_IFBLK >> S_SHIFT]	= EXOFS_FT_BLKDEV,
+	[S_IFIFO >> S_SHIFT]	= EXOFS_FT_FIFO,
+	[S_IFSOCK >> S_SHIFT]	= EXOFS_FT_SOCK,
+	[S_IFLNK >> S_SHIFT]	= EXOFS_FT_SYMLINK,
+};
+
+static inline
+void exofs_set_de_type(struct exofs_dir_entry *de, struct inode *inode)
+{
+	mode_t mode = inode->i_mode;
+	de->file_type = exofs_type_by_mode[(mode & S_IFMT) >> S_SHIFT];
+}
+
+static int
+exofs_readdir(struct file *filp, void *dirent, filldir_t filldir)
+{
+	loff_t pos = filp->f_pos;
+	struct inode *inode = filp->f_path.dentry->d_inode;
+	unsigned int offset = pos & ~PAGE_CACHE_MASK;
+	unsigned long n = pos >> PAGE_CACHE_SHIFT;
+	unsigned long npages = dir_pages(inode);
+	unsigned chunk_mask = ~(exofs_chunk_size(inode)-1);
+	unsigned char *types = NULL;
+	int need_revalidate = (filp->f_version != inode->i_version);
+
+	if (pos > inode->i_size - EXOFS_DIR_REC_LEN(1))
+		return 0;
+
+	types = exofs_filetype_table;
+
+	for ( ; n < npages; n++, offset = 0) {
+		char *kaddr, *limit;
+		struct exofs_dir_entry *de;
+		struct page *page = exofs_get_page(inode, n);
+
+		if (IS_ERR(page)) {
+			EXOFS_ERR("ERROR: "
+				   "bad page in #%lu",
+				   inode->i_ino);
+			filp->f_pos += PAGE_CACHE_SIZE - offset;
+			return PTR_ERR(page);
+		}
+		kaddr = page_address(page);
+		if (unlikely(need_revalidate)) {
+			if (offset) {
+				offset = exofs_validate_entry(kaddr, offset,
+								chunk_mask);
+				filp->f_pos = (n<<PAGE_CACHE_SHIFT) + offset;
+			}
+			filp->f_version = inode->i_version;
+			need_revalidate = 0;
+		}
+		de = (struct exofs_dir_entry *)(kaddr + offset);
+		limit = kaddr + exofs_last_byte(inode, n) -
+							EXOFS_DIR_REC_LEN(1);
+		for (; (char *)de <= limit; de = exofs_next_entry(de)) {
+			if (de->rec_len == 0) {
+				EXOFS_ERR("ERROR: "
+					"zero-length directory entry");
+				exofs_put_page(page);
+				return -EIO;
+			}
+			if (de->inode_no) {
+				int over;
+				unsigned char d_type = DT_UNKNOWN;
+
+				if (types && de->file_type < EXOFS_FT_MAX)
+					d_type = types[de->file_type];
+
+				offset = (char *)de - kaddr;
+				over = filldir(dirent, de->name, de->name_len,
+						(n<<PAGE_CACHE_SHIFT) | offset,
+						le64_to_cpu(de->inode_no),
+						d_type);
+				if (over) {
+					exofs_put_page(page);
+					return 0;
+				}
+			}
+			filp->f_pos += le16_to_cpu(de->rec_len);
+		}
+		exofs_put_page(page);
+	}
+
+	return 0;
+}
+
+struct exofs_dir_entry *exofs_find_entry(struct inode *dir,
+			struct dentry *dentry, struct page **res_page)
+{
+	const unsigned char *name = dentry->d_name.name;
+	int namelen = dentry->d_name.len;
+	unsigned reclen = EXOFS_DIR_REC_LEN(namelen);
+	unsigned long start, n;
+	unsigned long npages = dir_pages(dir);
+	struct page *page = NULL;
+	struct exofs_i_info *oi = exofs_i(dir);
+	struct exofs_dir_entry *de;
+
+	if (npages == 0)
+		goto out;
+
+	*res_page = NULL;
+
+	start = oi->i_dir_start_lookup;
+	if (start >= npages)
+		start = 0;
+	n = start;
+	do {
+		char *kaddr;
+		page = exofs_get_page(dir, n);
+		if (!IS_ERR(page)) {
+			kaddr = page_address(page);
+			de = (struct exofs_dir_entry *) kaddr;
+			kaddr += exofs_last_byte(dir, n) - reclen;
+			while ((char *) de <= kaddr) {
+				if (de->rec_len == 0) {
+					EXOFS_ERR(
+						"ERROR: exofs_find_entry: "
+						"zero-length directory entry");
+					exofs_put_page(page);
+					goto out;
+				}
+				if (exofs_match(namelen, name, de))
+					goto found;
+				de = exofs_next_entry(de);
+			}
+			exofs_put_page(page);
+		}
+		if (++n >= npages)
+			n = 0;
+	} while (n != start);
+out:
+	return NULL;
+
+found:
+	*res_page = page;
+	oi->i_dir_start_lookup = n;
+	return de;
+}
+
+struct exofs_dir_entry *exofs_dotdot(struct inode *dir, struct page **p)
+{
+	struct page *page = exofs_get_page(dir, 0);
+	struct exofs_dir_entry *de = NULL;
+
+	if (!IS_ERR(page)) {
+		de = exofs_next_entry(
+				(struct exofs_dir_entry *)page_address(page));
+		*p = page;
+	}
+	return de;
+}
+
+ino_t exofs_inode_by_name(struct inode *dir, struct dentry *dentry)
+{
+	ino_t res = 0;
+	struct exofs_dir_entry *de;
+	struct page *page;
+
+	de = exofs_find_entry(dir, dentry, &page);
+	if (de) {
+		res = le64_to_cpu(de->inode_no);
+		exofs_put_page(page);
+	}
+	return res;
+}
+
+void exofs_set_link(struct inode *dir, struct exofs_dir_entry *de,
+			struct page *page, struct inode *inode)
+{
+	loff_t pos = page_offset(page) +
+			(char *) de - (char *) page_address(page);
+	unsigned len = le16_to_cpu(de->rec_len);
+	int err;
+
+	lock_page(page);
+	err = exofs_write_begin(NULL, page->mapping, pos, len,
+				AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
+	BUG_ON(err);
+	de->inode_no = cpu_to_le64(inode->i_ino);
+	exofs_set_de_type(de, inode);
+	err = exofs_commit_chunk(page, pos, len);
+	exofs_put_page(page);
+	dir->i_mtime = dir->i_ctime = CURRENT_TIME;
+	mark_inode_dirty(dir);
+}
+
+int exofs_add_link(struct dentry *dentry, struct inode *inode)
+{
+	struct inode *dir = dentry->d_parent->d_inode;
+	const unsigned char *name = dentry->d_name.name;
+	int namelen = dentry->d_name.len;
+	unsigned chunk_size = exofs_chunk_size(dir);
+	unsigned reclen = EXOFS_DIR_REC_LEN(namelen);
+	unsigned short rec_len, name_len;
+	struct page *page = NULL;
+	struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
+	struct exofs_dir_entry *de;
+	unsigned long npages = dir_pages(dir);
+	unsigned long n;
+	char *kaddr;
+	loff_t pos;
+	int err;
+
+	for (n = 0; n <= npages; n++) {
+		char *dir_end;
+
+		page = exofs_get_page(dir, n);
+		err = PTR_ERR(page);
+		if (IS_ERR(page))
+			goto out;
+		lock_page(page);
+		kaddr = page_address(page);
+		dir_end = kaddr + exofs_last_byte(dir, n);
+		de = (struct exofs_dir_entry *)kaddr;
+		kaddr += PAGE_CACHE_SIZE - reclen;
+		while ((char *)de <= kaddr) {
+			if ((char *)de == dir_end) {
+				name_len = 0;
+				rec_len = chunk_size;
+				de->rec_len = cpu_to_le16(chunk_size);
+				de->inode_no = 0;
+				goto got_it;
+			}
+			if (de->rec_len == 0) {
+				EXOFS_ERR("ERROR: exofs_add_link: "
+					"zero-length directory entry");
+				err = -EIO;
+				goto out_unlock;
+			}
+			err = -EEXIST;
+			if (exofs_match(namelen, name, de))
+				goto out_unlock;
+			name_len = EXOFS_DIR_REC_LEN(de->name_len);
+			rec_len = le16_to_cpu(de->rec_len);
+			if (!de->inode_no && rec_len >= reclen)
+				goto got_it;
+			if (rec_len >= name_len + reclen)
+				goto got_it;
+			de = (struct exofs_dir_entry *) ((char *) de + rec_len);
+		}
+		unlock_page(page);
+		exofs_put_page(page);
+	}
+	BUG();
+	return -EINVAL;
+
+got_it:
+	pos = page_offset(page) +
+		(char *)de - (char *)page_address(page);
+	err = exofs_write_begin(NULL, page->mapping, pos, rec_len, 0,
+							&page, NULL);
+	if (err)
+		goto out_unlock;
+	if (de->inode_no) {
+		struct exofs_dir_entry *de1 =
+			(struct exofs_dir_entry *)((char *)de + name_len);
+		de1->rec_len = cpu_to_le16(rec_len - name_len);
+		de->rec_len = cpu_to_le16(name_len);
+		de = de1;
+	}
+	de->name_len = namelen;
+	memcpy(de->name, name, namelen);
+	de->inode_no = cpu_to_le64(inode->i_ino);
+	exofs_set_de_type(de, inode);
+	err = exofs_commit_chunk(page, pos, rec_len);
+	dir->i_mtime = dir->i_ctime = CURRENT_TIME;
+	mark_inode_dirty(dir);
+	sbi->s_numfiles++;
+
+out_put:
+	exofs_put_page(page);
+out:
+	return err;
+out_unlock:
+	unlock_page(page);
+	goto out_put;
+}
+
+int exofs_delete_entry(struct exofs_dir_entry *dir, struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	struct inode *inode = mapping->host;
+	struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
+	char *kaddr = page_address(page);
+	unsigned from = ((char *)dir - kaddr) & ~(exofs_chunk_size(inode)-1);
+	unsigned to = ((char *)dir - kaddr) + le16_to_cpu(dir->rec_len);
+	loff_t pos;
+	struct exofs_dir_entry *pde = NULL;
+	struct exofs_dir_entry *de = (struct exofs_dir_entry *) (kaddr + from);
+	int err;
+
+	while ((char *)de < (char *)dir) {
+		if (de->rec_len == 0) {
+			EXOFS_ERR("ERROR: exofs_delete_entry: "
+				"zero-length directory entry\n");
+			err = -EIO;
+			goto out;
+		}
+		pde = de;
+		de = exofs_next_entry(de);
+	}
+	if (pde)
+		from = (char *)pde - (char *)page_address(page);
+	pos = page_offset(page) + from;
+	lock_page(page);
+	err = exofs_write_begin(NULL, page->mapping, pos, to - from, 0,
+							&page, NULL);
+	BUG_ON(err);
+	if (pde)
+		pde->rec_len = cpu_to_le16(to - from);
+	dir->inode_no = 0;
+	err = exofs_commit_chunk(page, pos, to - from);
+	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
+	mark_inode_dirty(inode);
+	sbi->s_numfiles--;
+out:
+	exofs_put_page(page);
+	return err;
+}
+
+int exofs_make_empty(struct inode *inode, struct inode *parent)
+{
+	struct address_space *mapping = inode->i_mapping;
+	struct page *page = grab_cache_page(mapping, 0);
+	unsigned chunk_size = exofs_chunk_size(inode);
+	struct exofs_dir_entry *de;
+	int err;
+	void *kaddr;
+
+	if (!page)
+		return -ENOMEM;
+
+	err = exofs_write_begin(NULL, page->mapping, 0, chunk_size, 0,
+							&page, NULL);
+	if (err) {
+		unlock_page(page);
+		goto fail;
+	}
+
+	kaddr = kmap_atomic(page, KM_USER0);
+	de = (struct exofs_dir_entry *)kaddr;
+	de->name_len = 1;
+	de->rec_len = cpu_to_le16(EXOFS_DIR_REC_LEN(1));
+	memcpy(de->name, ".\0\0", 4);
+	de->inode_no = cpu_to_le64(inode->i_ino);
+	exofs_set_de_type(de, inode);
+
+	de = (struct exofs_dir_entry *)(kaddr + EXOFS_DIR_REC_LEN(1));
+	de->name_len = 2;
+	de->rec_len = cpu_to_le16(chunk_size - EXOFS_DIR_REC_LEN(1));
+	de->inode_no = cpu_to_le64(parent->i_ino);
+	memcpy(de->name, "..\0", 4);
+	exofs_set_de_type(de, inode);
+	kunmap_atomic(kaddr, KM_USER0);
+	err = exofs_commit_chunk(page, 0, chunk_size);
+fail:
+	page_cache_release(page);
+	return err;
+}
+
+int exofs_empty_dir(struct inode *inode)
+{
+	struct page *page = NULL;
+	unsigned long i, npages = dir_pages(inode);
+
+	for (i = 0; i < npages; i++) {
+		char *kaddr;
+		struct exofs_dir_entry *de;
+		page = exofs_get_page(inode, i);
+
+		if (IS_ERR(page))
+			continue;
+
+		kaddr = page_address(page);
+		de = (struct exofs_dir_entry *)kaddr;
+		kaddr += exofs_last_byte(inode, i) - EXOFS_DIR_REC_LEN(1);
+
+		while ((char *)de <= kaddr) {
+			if (de->rec_len == 0) {
+				EXOFS_ERR("ERROR: exofs_empty_dir: "
+					  "zero-length directory entry "
+					  "kaddr=%p, de=%p\n", kaddr, de);
+				goto not_empty;
+			}
+			if (de->inode_no != 0) {
+				/* check for . and .. */
+				if (de->name[0] != '.')
+					goto not_empty;
+				if (de->name_len > 2)
+					goto not_empty;
+				if (de->name_len < 2) {
+					if (le64_to_cpu(de->inode_no) !=
+					    inode->i_ino)
+						goto not_empty;
+				} else if (de->name[1] != '.')
+					goto not_empty;
+			}
+			de = exofs_next_entry(de);
+		}
+		exofs_put_page(page);
+	}
+	return 1;
+
+not_empty:
+	exofs_put_page(page);
+	return 0;
+}
+
+const struct file_operations exofs_dir_operations = {
+	.llseek		= generic_file_llseek,
+	.read		= generic_read_dir,
+	.readdir	= exofs_readdir,
+};
diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
index 59163eb..4c859bd 100644
--- a/fs/exofs/exofs.h
+++ b/fs/exofs/exofs.h
@@ -124,6 +124,11 @@ static inline struct exofs_i_info *exofs_i(struct inode *inode)
 	return container_of(inode, struct exofs_i_info, vfs_inode);
 }
 
+/*
+ * Maximum count of links to a file
+ */
+#define EXOFS_LINK_MAX           32000
+
 /*************************
  * function declarations *
  *************************/
@@ -142,10 +147,27 @@ int exofs_setattr(struct dentry *, struct iattr *);
 int exofs_write_begin(struct file *file, struct address_space *mapping,
 		loff_t pos, unsigned len, unsigned flags,
 		struct page **pagep, void **fsdata);
+extern struct inode *exofs_iget(struct super_block *, unsigned long);
+struct inode *exofs_new_inode(struct inode *, int);
+
+/* dir.c:                */
+int exofs_add_link(struct dentry *, struct inode *);
+ino_t exofs_inode_by_name(struct inode *, struct dentry *);
+int exofs_delete_entry(struct exofs_dir_entry *, struct page *);
+int exofs_make_empty(struct inode *, struct inode *);
+struct exofs_dir_entry *exofs_find_entry(struct inode *, struct dentry *,
+					 struct page **);
+int exofs_empty_dir(struct inode *);
+struct exofs_dir_entry *exofs_dotdot(struct inode *, struct page **);
+void exofs_set_link(struct inode *, struct exofs_dir_entry *, struct page *,
+		    struct inode *);
 
 /*********************
  * operation vectors *
  *********************/
+/* dir.c:            */
+extern const struct file_operations exofs_dir_operations;
+
 /* file.c            */
 extern const struct inode_operations exofs_file_inode_operations;
 extern const struct file_operations exofs_file_operations;
@@ -153,6 +175,10 @@ extern const struct file_operations exofs_file_operations;
 /* inode.c           */
 extern const struct address_space_operations exofs_aops;
 
+/* namei.c           */
+extern const struct inode_operations exofs_dir_inode_operations;
+extern const struct inode_operations exofs_special_inode_operations;
+
 /* symlink.c         */
 extern const struct inode_operations exofs_symlink_inode_operations;
 extern const struct inode_operations exofs_fast_symlink_inode_operations;
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index f4979ea..8d3385c 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -468,3 +468,275 @@ int exofs_setattr(struct dentry *dentry, struct iattr *iattr)
 	error = inode_setattr(inode, iattr);
 	return error;
 }
+
+/*
+ * Read an inode from the OSD, and return it as is.  We also return the size
+ * attribute in the 'sanity' argument if we got compiled with debugging turned
+ * on.
+ */
+static int exofs_get_inode(struct super_block *sb, struct exofs_i_info *oi,
+		    struct exofs_fcb *inode, uint64_t *sanity)
+{
+	struct exofs_sb_info *sbi = sb->s_fs_info;
+	struct osd_request *or;
+	struct osd_attr attr;
+	struct osd_obj_id obj = {sbi->s_pid,
+				 oi->vfs_inode.i_ino + EXOFS_OBJ_OFF};
+	int ret;
+
+	exofs_make_credential(oi->i_cred, &obj);
+
+	or = osd_start_request(sbi->s_dev, GFP_KERNEL);
+	if (unlikely(!or)) {
+		EXOFS_ERR("exofs_get_inode: osd_start_request failed.\n");
+		return -ENOMEM;
+	}
+	osd_req_get_attributes(or, &obj);
+
+	/* we need the inode attribute */
+	osd_req_add_get_attr_list(or, &g_attr_inode_data, 1);
+
+#ifdef EXOFS_DEBUG_OBJ_ISIZE
+	/* we get the size attributes to do a sanity check */
+	osd_req_add_get_attr_list(or, &g_attr_logical_length, 1);
+#endif
+
+	ret = exofs_sync_op(or, sbi->s_timeout, oi->i_cred);
+	if (ret)
+		goto out;
+
+	attr = g_attr_inode_data;
+	ret = extract_attr_from_req(or, &attr);
+	if (ret) {
+		EXOFS_ERR("exofs_get_inode: extract_attr_from_req failed\n");
+		goto out;
+	}
+
+	WARN_ON(attr.len != EXOFS_INO_ATTR_SIZE);
+	memcpy(inode, attr.val_ptr, EXOFS_INO_ATTR_SIZE);
+
+#ifdef EXOFS_DEBUG_OBJ_ISIZE
+	attr = g_attr_logical_length;
+	ret = extract_attr_from_req(or, &attr);
+	if (ret) {
+		EXOFS_ERR("ERROR: extract attr from or failed\n");
+		goto out;
+	}
+	*sanity = get_unaligned_be64(attr.val_ptr);
+#endif
+
+out:
+	osd_end_request(or);
+	return ret;
+}
+
+/*
+ * Fill in an inode read from the OSD and set it up for use
+ */
+struct inode *exofs_iget(struct super_block *sb, unsigned long ino)
+{
+	struct exofs_i_info *oi;
+	struct exofs_fcb fcb;
+	struct inode *inode;
+	uint64_t sanity;
+	int ret;
+
+	inode = iget_locked(sb, ino);
+	if (!inode)
+		return ERR_PTR(-ENOMEM);
+	if (!(inode->i_state & I_NEW))
+		return inode;
+	oi = exofs_i(inode);
+
+	/* read the inode from the osd */
+	ret = exofs_get_inode(sb, oi, &fcb, &sanity);
+	if (ret)
+		goto bad_inode;
+
+	init_waitqueue_head(&oi->i_wq);
+	set_obj_created(oi);
+
+	/* copy stuff from on-disk struct to in-memory struct */
+	inode->i_mode = le16_to_cpu(fcb.i_mode);
+	inode->i_uid = le32_to_cpu(fcb.i_uid);
+	inode->i_gid = le32_to_cpu(fcb.i_gid);
+	inode->i_nlink = le16_to_cpu(fcb.i_links_count);
+	inode->i_ctime.tv_sec = (signed)le32_to_cpu(fcb.i_ctime);
+	inode->i_atime.tv_sec = (signed)le32_to_cpu(fcb.i_atime);
+	inode->i_mtime.tv_sec = (signed)le32_to_cpu(fcb.i_mtime);
+	inode->i_ctime.tv_nsec =
+		inode->i_atime.tv_nsec = inode->i_mtime.tv_nsec = 0;
+	oi->i_commit_size = le64_to_cpu(fcb.i_size);
+	i_size_write(inode, oi->i_commit_size);
+	inode->i_blkbits = EXOFS_BLKSHIFT;
+	inode->i_generation = le32_to_cpu(fcb.i_generation);
+
+#ifdef EXOFS_DEBUG_OBJ_ISIZE
+	if ((inode->i_size != sanity) &&
+		(!exofs_inode_is_fast_symlink(inode))) {
+		EXOFS_ERR("WARNING: Size of object from inode and "
+			  "attributes differ (%lld != %llu)\n",
+			  inode->i_size, _LLU(sanity));
+	}
+#endif
+
+	oi->i_dir_start_lookup = 0;
+
+	if ((inode->i_nlink == 0) && (inode->i_mode == 0)) {
+		ret = -ESTALE;
+		goto bad_inode;
+	}
+
+	if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode)) {
+		if (fcb.i_data[0])
+			inode->i_rdev =
+				old_decode_dev(le32_to_cpu(fcb.i_data[0]));
+		else
+			inode->i_rdev =
+				new_decode_dev(le32_to_cpu(fcb.i_data[1]));
+	} else {
+		memcpy(oi->i_data, fcb.i_data, sizeof(fcb.i_data));
+	}
+
+	if (S_ISREG(inode->i_mode)) {
+		inode->i_op = &exofs_file_inode_operations;
+		inode->i_fop = &exofs_file_operations;
+		inode->i_mapping->a_ops = &exofs_aops;
+	} else if (S_ISDIR(inode->i_mode)) {
+		inode->i_op = &exofs_dir_inode_operations;
+		inode->i_fop = &exofs_dir_operations;
+		inode->i_mapping->a_ops = &exofs_aops;
+	} else if (S_ISLNK(inode->i_mode)) {
+		if (exofs_inode_is_fast_symlink(inode))
+			inode->i_op = &exofs_fast_symlink_inode_operations;
+		else {
+			inode->i_op = &exofs_symlink_inode_operations;
+			inode->i_mapping->a_ops = &exofs_aops;
+		}
+	} else {
+		inode->i_op = &exofs_special_inode_operations;
+		if (fcb.i_data[0])
+			init_special_inode(inode, inode->i_mode,
+			   old_decode_dev(le32_to_cpu(fcb.i_data[0])));
+		else
+			init_special_inode(inode, inode->i_mode,
+			   new_decode_dev(le32_to_cpu(fcb.i_data[1])));
+	}
+
+	unlock_new_inode(inode);
+	return inode;
+
+bad_inode:
+	iget_failed(inode);
+	return ERR_PTR(ret);
+}
+
+int __exofs_wait_obj_created(struct exofs_i_info *oi)
+{
+	if (!obj_created(oi)) {
+		BUG_ON(!obj_2bcreated(oi));
+		wait_event(oi->i_wq, obj_created(oi));
+	}
+	return unlikely(is_bad_inode(&oi->vfs_inode)) ? -EIO : 0;
+}
+/*
+ * Callback function from exofs_new_inode().  The important thing is that we
+ * set the obj_created flag so that other methods know that the object exists on
+ * the OSD.
+ */
+static void create_done(struct osd_request *or, void *p)
+{
+	struct inode *inode = p;
+	struct exofs_i_info *oi = exofs_i(inode);
+	struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
+	int ret;
+
+	ret = exofs_check_ok(or);
+	osd_end_request(or);
+	atomic_dec(&sbi->s_curr_pending);
+
+	if (unlikely(ret)) {
+		EXOFS_ERR("object=0x%llx creation failed in pid=0x%llx\n",
+			  _LLU(inode->i_ino + EXOFS_OBJ_OFF), _LLU(sbi->s_pid));
+		make_bad_inode(inode);
+	} else
+		set_obj_created(oi);
+
+	atomic_dec(&inode->i_count);
+	wake_up(&oi->i_wq);
+}
+
+/*
+ * Set up a new inode and create an object for it on the OSD
+ */
+struct inode *exofs_new_inode(struct inode *dir, int mode)
+{
+	struct super_block *sb;
+	struct inode *inode;
+	struct exofs_i_info *oi;
+	struct exofs_sb_info *sbi;
+	struct osd_request *or;
+	struct osd_obj_id obj;
+	int ret;
+
+	sb = dir->i_sb;
+	inode = new_inode(sb);
+	if (!inode)
+		return ERR_PTR(-ENOMEM);
+
+	oi = exofs_i(inode);
+
+	init_waitqueue_head(&oi->i_wq);
+	set_obj_2bcreated(oi);
+
+	sbi = sb->s_fs_info;
+
+	sb->s_dirt = 1;
+	inode->i_uid = current->cred->fsuid;
+	if (dir->i_mode & S_ISGID) {
+		inode->i_gid = dir->i_gid;
+		if (S_ISDIR(mode))
+			mode |= S_ISGID;
+	} else {
+		inode->i_gid = current->cred->fsgid;
+	}
+	inode->i_mode = mode;
+
+	inode->i_ino = sbi->s_nextid++;
+	inode->i_blkbits = EXOFS_BLKSHIFT;
+	inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
+	oi->i_commit_size = inode->i_size = 0;
+	spin_lock(&sbi->s_next_gen_lock);
+	inode->i_generation = sbi->s_next_generation++;
+	spin_unlock(&sbi->s_next_gen_lock);
+	insert_inode_hash(inode);
+
+	mark_inode_dirty(inode);
+
+	obj.partition = sbi->s_pid;
+	obj.id = inode->i_ino + EXOFS_OBJ_OFF;
+	exofs_make_credential(oi->i_cred, &obj);
+
+	or = osd_start_request(sbi->s_dev, GFP_KERNEL);
+	if (unlikely(!or)) {
+		EXOFS_ERR("exofs_new_inode: osd_start_request failed\n");
+		return ERR_PTR(-ENOMEM);
+	}
+
+	osd_req_create_object(or, &obj);
+
+	/* increment the refcount so that the inode will still be around when we
+	 * reach the callback
+	 */
+	atomic_inc(&inode->i_count);
+
+	ret = exofs_async_op(or, create_done, inode, oi->i_cred);
+	if (ret) {
+		atomic_dec(&inode->i_count);
+		osd_end_request(or);
+		return ERR_PTR(-EIO);
+	}
+	atomic_inc(&sbi->s_curr_pending);
+
+	return inode;
+}
diff --git a/fs/exofs/namei.c b/fs/exofs/namei.c
new file mode 100644
index 0000000..4220d90
--- /dev/null
+++ b/fs/exofs/namei.c
@@ -0,0 +1,338 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger (avishay@gmail.com) (avishay@il.ibm.com)
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <bharrosh@panasas.com>
+ *
+ * Copyrights for code taken from ext2:
+ *     Copyright (C) 1992, 1993, 1994, 1995
+ *     Remy Card (card@masi.ibp.fr)
+ *     Laboratoire MASI - Institut Blaise Pascal
+ *     Universite Pierre et Marie Curie (Paris VI)
+ *     from
+ *     linux/fs/minix/inode.c
+ *     Copyright (C) 1991, 1992  Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.  Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include "exofs.h"
+
+static inline int exofs_add_nondir(struct dentry *dentry, struct inode *inode)
+{
+	int err = exofs_add_link(dentry, inode);
+	if (!err) {
+		d_instantiate(dentry, inode);
+		return 0;
+	}
+	inode_dec_link_count(inode);
+	iput(inode);
+	return err;
+}
+
+static struct dentry *exofs_lookup(struct inode *dir, struct dentry *dentry,
+				   struct nameidata *nd)
+{
+	struct inode *inode;
+	ino_t ino;
+
+	if (dentry->d_name.len > EXOFS_NAME_LEN)
+		return ERR_PTR(-ENAMETOOLONG);
+
+	ino = exofs_inode_by_name(dir, dentry);
+	inode = NULL;
+	if (ino) {
+		inode = exofs_iget(dir->i_sb, ino);
+		if (IS_ERR(inode))
+			return ERR_CAST(inode);
+	}
+	return d_splice_alias(inode, dentry);
+}
+
+static int exofs_create(struct inode *dir, struct dentry *dentry, int mode,
+			 struct nameidata *nd)
+{
+	struct inode *inode = exofs_new_inode(dir, mode);
+	int err = PTR_ERR(inode);
+	if (!IS_ERR(inode)) {
+		inode->i_op = &exofs_file_inode_operations;
+		inode->i_fop = &exofs_file_operations;
+		inode->i_mapping->a_ops = &exofs_aops;
+		mark_inode_dirty(inode);
+		err = exofs_add_nondir(dentry, inode);
+	}
+	return err;
+}
+
+static int exofs_mknod(struct inode *dir, struct dentry *dentry, int mode,
+		       dev_t rdev)
+{
+	struct inode *inode;
+	int err;
+
+	if (!new_valid_dev(rdev))
+		return -EINVAL;
+
+	inode = exofs_new_inode(dir, mode);
+	err = PTR_ERR(inode);
+	if (!IS_ERR(inode)) {
+		init_special_inode(inode, inode->i_mode, rdev);
+		mark_inode_dirty(inode);
+		err = exofs_add_nondir(dentry, inode);
+	}
+	return err;
+}
+
+static int exofs_symlink(struct inode *dir, struct dentry *dentry,
+			  const char *symname)
+{
+	struct super_block *sb = dir->i_sb;
+	int err = -ENAMETOOLONG;
+	unsigned l = strlen(symname)+1;
+	struct inode *inode;
+	struct exofs_i_info *oi;
+
+	if (l > sb->s_blocksize)
+		goto out;
+
+	inode = exofs_new_inode(dir, S_IFLNK | S_IRWXUGO);
+	err = PTR_ERR(inode);
+	if (IS_ERR(inode))
+		goto out;
+
+	oi = exofs_i(inode);
+	if (l > sizeof(oi->i_data)) {
+		/* slow symlink */
+		inode->i_op = &exofs_symlink_inode_operations;
+		inode->i_mapping->a_ops = &exofs_aops;
+		memset(oi->i_data, 0, sizeof(oi->i_data));
+
+		err = page_symlink(inode, symname, l);
+		if (err)
+			goto out_fail;
+	} else {
+		/* fast symlink */
+		inode->i_op = &exofs_fast_symlink_inode_operations;
+		memcpy(oi->i_data, symname, l);
+		inode->i_size = l-1;
+	}
+	mark_inode_dirty(inode);
+
+	err = exofs_add_nondir(dentry, inode);
+out:
+	return err;
+
+out_fail:
+	inode_dec_link_count(inode);
+	iput(inode);
+	goto out;
+}
+
+static int exofs_link(struct dentry *old_dentry, struct inode *dir,
+		struct dentry *dentry)
+{
+	struct inode *inode = old_dentry->d_inode;
+
+	if (inode->i_nlink >= EXOFS_LINK_MAX)
+		return -EMLINK;
+
+	inode->i_ctime = CURRENT_TIME;
+	inode_inc_link_count(inode);
+	atomic_inc(&inode->i_count);
+
+	return exofs_add_nondir(dentry, inode);
+}
+
+static int exofs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
+{
+	struct inode *inode;
+	int err = -EMLINK;
+
+	if (dir->i_nlink >= EXOFS_LINK_MAX)
+		goto out;
+
+	inode_inc_link_count(dir);
+
+	inode = exofs_new_inode(dir, S_IFDIR | mode);
+	err = PTR_ERR(inode);
+	if (IS_ERR(inode))
+		goto out_dir;
+
+	inode->i_op = &exofs_dir_inode_operations;
+	inode->i_fop = &exofs_dir_operations;
+	inode->i_mapping->a_ops = &exofs_aops;
+
+	inode_inc_link_count(inode);
+
+	err = exofs_make_empty(inode, dir);
+	if (err)
+		goto out_fail;
+
+	err = exofs_add_link(dentry, inode);
+	if (err)
+		goto out_fail;
+
+	d_instantiate(dentry, inode);
+out:
+	return err;
+
+out_fail:
+	inode_dec_link_count(inode);
+	inode_dec_link_count(inode);
+	iput(inode);
+out_dir:
+	inode_dec_link_count(dir);
+	goto out;
+}
+
+static int exofs_unlink(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = dentry->d_inode;
+	struct exofs_dir_entry *de;
+	struct page *page;
+	int err = -ENOENT;
+
+	de = exofs_find_entry(dir, dentry, &page);
+	if (!de)
+		goto out;
+
+	err = exofs_delete_entry(de, page);
+	if (err)
+		goto out;
+
+	inode->i_ctime = dir->i_ctime;
+	inode_dec_link_count(inode);
+	err = 0;
+out:
+	return err;
+}
+
+static int exofs_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = dentry->d_inode;
+	int err = -ENOTEMPTY;
+
+	if (exofs_empty_dir(inode)) {
+		err = exofs_unlink(dir, dentry);
+		if (!err) {
+			inode->i_size = 0;
+			inode_dec_link_count(inode);
+			inode_dec_link_count(dir);
+		}
+	}
+	return err;
+}
+
+static int exofs_rename(struct inode *old_dir, struct dentry *old_dentry,
+		struct inode *new_dir, struct dentry *new_dentry)
+{
+	struct inode *old_inode = old_dentry->d_inode;
+	struct inode *new_inode = new_dentry->d_inode;
+	struct page *dir_page = NULL;
+	struct exofs_dir_entry *dir_de = NULL;
+	struct page *old_page;
+	struct exofs_dir_entry *old_de;
+	int err = -ENOENT;
+
+	old_de = exofs_find_entry(old_dir, old_dentry, &old_page);
+	if (!old_de)
+		goto out;
+
+	if (S_ISDIR(old_inode->i_mode)) {
+		err = -EIO;
+		dir_de = exofs_dotdot(old_inode, &dir_page);
+		if (!dir_de)
+			goto out_old;
+	}
+
+	if (new_inode) {
+		struct page *new_page;
+		struct exofs_dir_entry *new_de;
+
+		err = -ENOTEMPTY;
+		if (dir_de && !exofs_empty_dir(new_inode))
+			goto out_dir;
+
+		err = -ENOENT;
+		new_de = exofs_find_entry(new_dir, new_dentry, &new_page);
+		if (!new_de)
+			goto out_dir;
+		inode_inc_link_count(old_inode);
+		exofs_set_link(new_dir, new_de, new_page, old_inode);
+		new_inode->i_ctime = CURRENT_TIME;
+		if (dir_de)
+			drop_nlink(new_inode);
+		inode_dec_link_count(new_inode);
+	} else {
+		if (dir_de) {
+			err = -EMLINK;
+			if (new_dir->i_nlink >= EXOFS_LINK_MAX)
+				goto out_dir;
+		}
+		inode_inc_link_count(old_inode);
+		err = exofs_add_link(new_dentry, old_inode);
+		if (err) {
+			inode_dec_link_count(old_inode);
+			goto out_dir;
+		}
+		if (dir_de)
+			inode_inc_link_count(new_dir);
+	}
+
+	old_inode->i_ctime = CURRENT_TIME;
+
+	exofs_delete_entry(old_de, old_page);
+	inode_dec_link_count(old_inode);
+
+	if (dir_de) {
+		exofs_set_link(old_inode, dir_de, dir_page, new_dir);
+		inode_dec_link_count(old_dir);
+	}
+	return 0;
+
+
+out_dir:
+	if (dir_de) {
+		kunmap(dir_page);
+		page_cache_release(dir_page);
+	}
+out_old:
+	kunmap(old_page);
+	page_cache_release(old_page);
+out:
+	return err;
+}
+
+const struct inode_operations exofs_dir_inode_operations = {
+	.create		= exofs_create,
+	.lookup		= exofs_lookup,
+	.link		= exofs_link,
+	.unlink		= exofs_unlink,
+	.symlink	= exofs_symlink,
+	.mkdir		= exofs_mkdir,
+	.rmdir		= exofs_rmdir,
+	.mknod		= exofs_mknod,
+	.rename		= exofs_rename,
+	.setattr	= exofs_setattr,
+};
+
+const struct inode_operations exofs_special_inode_operations = {
+	.setattr	= exofs_setattr,
+};
-- 
1.6.0.1



* [PATCH 6/8] exofs: super_operations and file_system_type
  2009-02-09 13:07 [PATCHSET 0/8 version 3] exofs Boaz Harrosh
                   ` (4 preceding siblings ...)
  2009-02-09 13:24 ` [PATCH 5/8] exofs: dir_inode and directory operations Boaz Harrosh
@ 2009-02-09 13:25 ` Boaz Harrosh
  2009-02-15 17:24   ` Evgeniy Polyakov
  2009-02-09 13:29 ` [PATCH 7/8] exofs: Documentation Boaz Harrosh
  2009-02-09 13:31 ` [PATCH 8/8] fs: Add exofs to Kernel build Boaz Harrosh
  7 siblings, 1 reply; 36+ messages in thread
From: Boaz Harrosh @ 2009-02-09 13:25 UTC (permalink / raw)
  To: Avishay Traeger, Jeff Garzik, Andrew Morton, linux-fsdevel, open-osd
  Cc: linux-kernel, James Bottomley

This patch ties all operation vectors into a file system superblock
and registers the exofs file_system_type at module load time.

* The file system control block (AKA on-disk superblock) resides in
  an object with a special ID (defined in common.h).
  Information included in the file system control block is used to
  fill the in-memory superblock structure at mount time. This object
  is created by mkexofs before the file system is used. It contains
  information such as:
	- The file system's magic number
	- The next inode number to be allocated
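
As a rough illustration of the mount-time step described above, here is a
minimal user-space sketch of decoding a little-endian control block into an
in-memory structure. It is not the code in super.c: the struct names, the
field layout, the demo_* helpers and the DEMO_MAGIC value are all
hypothetical, chosen only to show the two fields listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-disk control block: every field is little-endian. */
struct demo_fscb {
	uint8_t next_id[8];	/* next inode number to allocate (le64) */
	uint8_t magic[2];	/* file system magic number (le16) */
};

/* Hypothetical in-memory counterpart, in CPU byte order. */
struct demo_sb_info {
	uint64_t next_id;
	uint16_t magic;
};

#define DEMO_MAGIC 0xABCD	/* made-up value for this sketch */

/* Decode a little-endian integer of 'len' bytes (len <= 8). */
static uint64_t demo_le_to_cpu(const uint8_t *p, size_t len)
{
	uint64_t v = 0;

	/* Walk from the most significant (last) byte down to the first. */
	while (len--)
		v = (v << 8) | p[len];
	return v;
}

/* Fill 'sbi' from 'fscb'; returns 0 on success, -1 on a bad magic. */
static int demo_fill_super(const struct demo_fscb *fscb,
			   struct demo_sb_info *sbi)
{
	assert(fscb && sbi);

	sbi->magic = (uint16_t)demo_le_to_cpu(fscb->magic,
					      sizeof(fscb->magic));
	if (sbi->magic != DEMO_MAGIC)
		return -1;	/* not our file system, refuse to mount */
	sbi->next_id = demo_le_to_cpu(fscb->next_id, sizeof(fscb->next_id));
	return 0;
}
```

The magic is checked before any other field is trusted, so a foreign or
corrupted object is rejected before the next inode number is read. A caller
would pass the raw bytes read from the special object to demo_fill_super()
and fail the mount on a non-zero return.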

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 fs/exofs/Kbuild  |    2 +-
 fs/exofs/exofs.h |   22 +++
 fs/exofs/inode.c |  189 ++++++++++++++++++++
 fs/exofs/super.c |  520 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 732 insertions(+), 1 deletions(-)
 create mode 100644 fs/exofs/super.c

diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
index 61162c6..2268502 100644
--- a/fs/exofs/Kbuild
+++ b/fs/exofs/Kbuild
@@ -26,5 +26,5 @@ KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)
 
 endif
 
-exofs-objs := osd.o inode.o file.o symlink.o namei.o dir.o
+exofs-objs := osd.o inode.o file.o symlink.o namei.o dir.o super.o
 obj-$(CONFIG_EXOFS_FS) += exofs.o
diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
index 4c859bd..38a82a6 100644
--- a/fs/exofs/exofs.h
+++ b/fs/exofs/exofs.h
@@ -54,6 +54,15 @@
 #define _LLU(x) (unsigned long long)(x)
 
 /*
+ * struct to hold what we get from mount options
+ */
+struct exofs_mountopt {
+	const char *dev_name;
+	uint64_t pid;
+	int timeout;
+};
+
+/*
  * our extension to the in-memory superblock
  */
 struct exofs_sb_info {
@@ -125,6 +134,14 @@ static inline struct exofs_i_info *exofs_i(struct inode *inode)
 }
 
 /*
+ * ugly struct so that we can pass two arguments to update_inode's callback
+ */
+struct updatei_args {
+	struct exofs_sb_info	*sbi;
+	struct exofs_fcb	fcb;
+};
+
+/*
  * Maximum count of links to a file
  */
 #define EXOFS_LINK_MAX           32000
@@ -149,6 +166,8 @@ int exofs_write_begin(struct file *file, struct address_space *mapping,
 		struct page **pagep, void **fsdata);
 extern struct inode *exofs_iget(struct super_block *, unsigned long);
 struct inode *exofs_new_inode(struct inode *, int);
+extern int exofs_write_inode(struct inode *, int);
+extern void exofs_delete_inode(struct inode *);
 
 /* dir.c:                */
 int exofs_add_link(struct dentry *, struct inode *);
@@ -179,6 +198,9 @@ extern const struct address_space_operations exofs_aops;
 extern const struct inode_operations exofs_dir_inode_operations;
 extern const struct inode_operations exofs_special_inode_operations;
 
+/* super.c           */
+extern const struct super_operations exofs_sops;
+
 /* symlink.c         */
 extern const struct inode_operations exofs_symlink_inode_operations;
 extern const struct inode_operations exofs_fast_symlink_inode_operations;
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index 8d3385c..578560d 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -740,3 +740,192 @@ struct inode *exofs_new_inode(struct inode *dir, int mode)
 
 	return inode;
 }
+
+/*
+ * Callback function from exofs_update_inode().
+ */
+static void updatei_done(struct osd_request *or, void *p)
+{
+	struct updatei_args *args = p;
+
+	osd_end_request(or);
+
+	atomic_dec(&args->sbi->s_curr_pending);
+
+	kfree(args);
+}
+
+/*
+ * Write the inode to the OSD.  Just fill up the struct, and set the attribute
+ * synchronously or asynchronously depending on the do_sync flag.
+ */
+static int exofs_update_inode(struct inode *inode, int do_sync)
+{
+	struct exofs_i_info *oi = exofs_i(inode);
+	struct super_block *sb = inode->i_sb;
+	struct exofs_sb_info *sbi = sb->s_fs_info;
+	struct osd_obj_id obj = {sbi->s_pid, inode->i_ino + EXOFS_OBJ_OFF};
+	struct osd_request *or;
+	struct osd_attr attr;
+	struct exofs_fcb *fcb;
+	struct updatei_args *args;
+	int ret;
+
+// 	EXOFS_DBGMSG("obj=%llx do_sync=%d\n", _LLU(obj.id), do_sync);
+	args = kzalloc(sizeof(*args), GFP_KERNEL);
+	if (!args)
+		return -ENOMEM;
+
+	fcb = &args->fcb;
+
+	fcb->i_mode = cpu_to_le16(inode->i_mode);
+	fcb->i_uid = cpu_to_le32(inode->i_uid);
+	fcb->i_gid = cpu_to_le32(inode->i_gid);
+	fcb->i_links_count = cpu_to_le16(inode->i_nlink);
+	fcb->i_ctime = cpu_to_le32(inode->i_ctime.tv_sec);
+	fcb->i_atime = cpu_to_le32(inode->i_atime.tv_sec);
+	fcb->i_mtime = cpu_to_le32(inode->i_mtime.tv_sec);
+	oi->i_commit_size = i_size_read(inode);
+	fcb->i_size = cpu_to_le64(oi->i_commit_size);
+	fcb->i_generation = cpu_to_le32(inode->i_generation);
+
+	if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode)) {
+		if (old_valid_dev(inode->i_rdev)) {
+			fcb->i_data[0] =
+				cpu_to_le32(old_encode_dev(inode->i_rdev));
+			fcb->i_data[1] = 0;
+		} else {
+			fcb->i_data[0] = 0;
+			fcb->i_data[1] =
+				cpu_to_le32(new_encode_dev(inode->i_rdev));
+			fcb->i_data[2] = 0;
+		}
+	} else
+		memcpy(fcb->i_data, oi->i_data, sizeof(fcb->i_data));
+
+	or = osd_start_request(sbi->s_dev, GFP_KERNEL);
+	if (unlikely(!or)) {
+		EXOFS_ERR("exofs_update_inode: osd_start_request failed.\n");
+		ret = -ENOMEM;
+		goto free_args;
+	}
+
+	osd_req_set_attributes(or, &obj);
+
+	attr = g_attr_inode_data;
+	attr.val_ptr = fcb;
+	osd_req_add_set_attr_list(or, &attr, 1);
+
+	if (!obj_created(oi)) {
+		EXOFS_DBGMSG("!obj_created\n");
+		BUG_ON(!obj_2bcreated(oi));
+		wait_event(oi->i_wq, obj_created(oi));
+		EXOFS_DBGMSG("wait_event done\n");
+	}
+
+	if (do_sync) {
+		ret = exofs_sync_op(or, sbi->s_timeout, oi->i_cred);
+		osd_end_request(or);
+		goto free_args;
+	} else {
+		args->sbi = sbi;
+
+		ret = exofs_async_op(or, updatei_done, args, oi->i_cred);
+		if (ret) {
+			osd_end_request(or);
+			goto free_args;
+		}
+		atomic_inc(&sbi->s_curr_pending);
+		goto out; /* deallocation in updatei_done */
+	}
+
+free_args:
+	kfree(args);
+out:
+	EXOFS_DBGMSG("ret=>%d\n", ret);
+	return ret;
+}
+
+int exofs_write_inode(struct inode *inode, int wait)
+{
+	return exofs_update_inode(inode, wait);
+}
+
+int exofs_sync_inode(struct inode *inode)
+{
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_ALL,
+		.nr_to_write = 0,	/* sys_fsync did this */
+	};
+
+	return sync_inode(inode, &wbc);
+}
+
+/*
+ * Callback function from exofs_delete_inode() - don't have much cleaning up to
+ * do.
+ */
+static void delete_done(struct osd_request *or, void *p)
+{
+	struct exofs_sb_info *sbi;
+	osd_end_request(or);
+	sbi = p;
+	atomic_dec(&sbi->s_curr_pending);
+}
+
+/*
+ * Called when the refcount of an inode reaches zero.  We remove the object
+ * from the OSD here.  We make sure the object was created before we try and
+ * delete it.
+ */
+void exofs_delete_inode(struct inode *inode)
+{
+	struct exofs_i_info *oi = exofs_i(inode);
+	struct super_block *sb = inode->i_sb;
+	struct exofs_sb_info *sbi = sb->s_fs_info;
+	struct osd_obj_id obj = {sbi->s_pid, inode->i_ino + EXOFS_OBJ_OFF};
+	struct osd_request *or;
+	int ret;
+
+	truncate_inode_pages(&inode->i_data, 0);
+
+	if (is_bad_inode(inode))
+		goto no_delete;
+
+	mark_inode_dirty(inode);
+	exofs_update_inode(inode, inode_needs_sync(inode));
+
+	inode->i_size = 0;
+	if (inode->i_blocks)
+		exofs_truncate(inode);
+
+	clear_inode(inode);
+
+	or = osd_start_request(sbi->s_dev, GFP_KERNEL);
+	if (unlikely(!or)) {
+		EXOFS_ERR("exofs_delete_inode: osd_start_request failed\n");
+		return;
+	}
+
+	osd_req_remove_object(or, &obj);
+
+	/* if we are deleting an obj that hasn't been created yet, wait */
+	if (!obj_created(oi)) {
+		BUG_ON(!obj_2bcreated(oi));
+		wait_event(oi->i_wq, obj_created(oi));
+	}
+
+	ret = exofs_async_op(or, delete_done, sbi, oi->i_cred);
+	if (ret) {
+		EXOFS_ERR(
+		       "ERROR: @exofs_delete_inode exofs_async_op failed\n");
+		osd_end_request(or);
+		return;
+	}
+	atomic_inc(&sbi->s_curr_pending);
+
+	return;
+
+no_delete:
+	clear_inode(inode);
+}
diff --git a/fs/exofs/super.c b/fs/exofs/super.c
new file mode 100644
index 0000000..9153db2
--- /dev/null
+++ b/fs/exofs/super.c
@@ -0,0 +1,520 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger (avishay@gmail.com) (avishay@il.ibm.com)
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <bharrosh@panasas.com>
+ *
+ * Copyrights for code taken from ext2:
+ *     Copyright (C) 1992, 1993, 1994, 1995
+ *     Remy Card (card@masi.ibp.fr)
+ *     Laboratoire MASI - Institut Blaise Pascal
+ *     Universite Pierre et Marie Curie (Paris VI)
+ *     from
+ *     linux/fs/minix/inode.c
+ *     Copyright (C) 1991, 1992  Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.  Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include <linux/string.h>
+#include <linux/parser.h>
+#include <linux/vfs.h>
+#include <linux/random.h>
+
+#include "exofs.h"
+
+/******************************************************************************
+ * MOUNT OPTIONS
+ *****************************************************************************/
+
+/*
+ * exofs-specific mount-time options.
+ */
+enum { Opt_pid, Opt_to, Opt_mkfs, Opt_format, Opt_err };
+
+/*
+ * Our mount-time options.  These should ideally be 64-bit unsigned, but the
+ * kernel's parsing functions do not currently support that.  32-bit should be
+ * sufficient for most applications now.
+ */
+static match_table_t tokens = {
+	{Opt_pid, "pid=%u"},
+	{Opt_to, "to=%u"},
+	{Opt_err, NULL}
+};
+
+/*
+ * The main option parsing method.  Also makes sure that all of the mandatory
+ * mount options were set.
+ */
+static int parse_options(char *options, struct exofs_mountopt *opts)
+{
+	char *p;
+	substring_t args[MAX_OPT_ARGS];
+	int option;
+	bool s_pid = false;
+
+	EXOFS_DBGMSG("parse_options %s\n", options);
+	/* defaults */
+	memset(opts, 0, sizeof(*opts));
+	opts->timeout = BLK_DEFAULT_SG_TIMEOUT;
+
+	while ((p = strsep(&options, ",")) != NULL) {
+		int token;
+		char str[32];
+
+		if (!*p)
+			continue;
+
+		token = match_token(p, tokens, args);
+		switch (token) {
+		case Opt_pid:
+			if (0 == match_strlcpy(str, &args[0], sizeof(str)))
+				return -EINVAL;
+			opts->pid = simple_strtoull(str, NULL, 0);
+			if (opts->pid < EXOFS_MIN_PID) {
+				EXOFS_ERR("Partition ID must be >= %u",
+					  EXOFS_MIN_PID);
+				return -EINVAL;
+			}
+			s_pid = 1;
+			break;
+		case Opt_to:
+			if (match_int(&args[0], &option))
+				return -EINVAL;
+			if (option <= 0) {
+				EXOFS_ERR("Timeout must be > 0");
+				return -EINVAL;
+			}
+			opts->timeout = option * HZ;
+			break;
+		}
+	}
+
+	if (!s_pid) {
+		EXOFS_ERR("Need to specify the following options:\n");
+		EXOFS_ERR("    -o pid=pid_no_to_use\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/******************************************************************************
+ * INODE CACHE
+ *****************************************************************************/
+
+/*
+ * Our inode cache.  Isn't it pretty?
+ */
+static struct kmem_cache *exofs_inode_cachep;
+
+/*
+ * Allocate an inode in the cache
+ */
+static struct inode *exofs_alloc_inode(struct super_block *sb)
+{
+	struct exofs_i_info *oi;
+
+	oi = kmem_cache_alloc(exofs_inode_cachep, GFP_KERNEL);
+	if (!oi)
+		return NULL;
+
+	oi->vfs_inode.i_version = 1;
+	return &oi->vfs_inode;
+}
+
+/*
+ * Remove an inode from the cache
+ */
+static void exofs_destroy_inode(struct inode *inode)
+{
+	kmem_cache_free(exofs_inode_cachep, exofs_i(inode));
+}
+
+/*
+ * Initialize the inode
+ */
+static void exofs_init_once(void *foo)
+{
+	struct exofs_i_info *oi = foo;
+
+	inode_init_once(&oi->vfs_inode);
+}
+
+/*
+ * Create and initialize the inode cache
+ */
+static int init_inodecache(void)
+{
+	exofs_inode_cachep = kmem_cache_create("exofs_inode_cache",
+				sizeof(struct exofs_i_info), 0,
+				SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
+				exofs_init_once);
+	if (exofs_inode_cachep == NULL)
+		return -ENOMEM;
+	return 0;
+}
+
+/*
+ * Destroy the inode cache
+ */
+static void destroy_inodecache(void)
+{
+	kmem_cache_destroy(exofs_inode_cachep);
+}
+
+/******************************************************************************
+ * SUPERBLOCK FUNCTIONS
+ *****************************************************************************/
+
+/*
+ * Write the superblock to the OSD
+ */
+static void exofs_write_super(struct super_block *sb)
+{
+	struct exofs_sb_info *sbi;
+	struct exofs_fscb *fscb;
+	struct osd_request *or;
+	struct osd_obj_id obj;
+	int ret;
+
+	fscb = kzalloc(sizeof(struct exofs_fscb), GFP_KERNEL);
+	if (!fscb) {
+		EXOFS_ERR("exofs_write_super: memory allocation failed.\n");
+		return;
+	}
+
+	lock_kernel();
+	sbi = sb->s_fs_info;
+	fscb->s_nextid = cpu_to_le64(sbi->s_nextid);
+	fscb->s_numfiles = cpu_to_le32(sbi->s_numfiles);
+	fscb->s_magic = cpu_to_le16(sb->s_magic);
+	fscb->s_newfs = 0;
+
+	or = osd_start_request(sbi->s_dev, GFP_KERNEL);
+	if (unlikely(!or)) {
+		EXOFS_ERR("exofs_write_super: osd_start_request failed.\n");
+		goto out;
+	}
+
+	obj.partition = sbi->s_pid;
+	obj.id = EXOFS_SUPER_ID;
+	ret = osd_req_write_kern(or, &obj, 0, fscb, sizeof(*fscb));
+	if (unlikely(ret)) {
+		EXOFS_ERR("exofs_write_super: osd_req_write_kern failed.\n");
+		goto out;
+	}
+
+	ret = exofs_sync_op(or, sbi->s_timeout, sbi->s_cred);
+	if (unlikely(ret)) {
+		EXOFS_ERR("exofs_write_super: exofs_sync_op failed.\n");
+		goto out;
+	}
+	sb->s_dirt = 0;
+
+out:
+	if (or)
+		osd_end_request(or);
+	unlock_kernel();
+	kfree(fscb);
+}
+
+/*
+ * This function is called when the vfs is freeing the superblock.  We just
+ * need to free our own part.
+ */
+static void exofs_put_super(struct super_block *sb)
+{
+	int num_pend;
+	struct exofs_sb_info *sbi = sb->s_fs_info;
+
+	/* make sure there are no pending commands */
+	for (num_pend = atomic_read(&sbi->s_curr_pending); num_pend > 0;
+	     num_pend = atomic_read(&sbi->s_curr_pending)) {
+		wait_queue_head_t wq;
+		init_waitqueue_head(&wq);
+		wait_event_timeout(wq,
+				  (atomic_read(&sbi->s_curr_pending) == 0),
+				  msecs_to_jiffies(100));
+	}
+
+	osduld_put_device(sbi->s_dev);
+	kfree(sb->s_fs_info);
+	sb->s_fs_info = NULL;
+}
+
+/*
+ * Read the superblock from the OSD and fill in the fields
+ */
+static int exofs_fill_super(struct super_block *sb, void *data, int silent)
+{
+	struct inode *root;
+	struct exofs_mountopt *opts = data;
+	struct exofs_sb_info *sbi;	/*extended info                  */
+	struct exofs_fscb fscb;		/*on-disk superblock info        */
+	struct osd_request *or = NULL;
+	struct osd_obj_id obj;
+	int ret;
+
+	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
+	if (!sbi)
+		return -ENOMEM;
+	sb->s_fs_info = sbi;
+
+	/* use mount options to fill superblock */
+	sbi->s_dev = osduld_path_lookup(opts->dev_name);
+	if (IS_ERR(sbi->s_dev)) {
+		ret = PTR_ERR(sbi->s_dev);
+		sbi->s_dev = NULL;
+		goto free_sbi;
+	}
+
+	sbi->s_pid = opts->pid;
+	sbi->s_timeout = opts->timeout;
+
+	/* fill in some other data by hand */
+	memset(sb->s_id, 0, sizeof(sb->s_id));
+	strcpy(sb->s_id, "exofs");
+	sb->s_blocksize = EXOFS_BLKSIZE;
+	sb->s_blocksize_bits = EXOFS_BLKSHIFT;
+	atomic_set(&sbi->s_curr_pending, 0);
+	sb->s_bdev = NULL;
+	sb->s_dev = 0;
+
+	/* read data from on-disk superblock object */
+	obj.partition = sbi->s_pid;
+	obj.id = EXOFS_SUPER_ID;
+	exofs_make_credential(sbi->s_cred, &obj);
+
+	or = osd_start_request(sbi->s_dev, GFP_KERNEL);
+	if (unlikely(!or)) {
+		if (!silent)
+			EXOFS_ERR(
+			       "exofs_fill_super: osd_start_request failed.\n");
+		ret = -ENOMEM;
+		goto free_sbi;
+	}
+	ret = osd_req_read_kern(or, &obj, 0, &fscb, sizeof(fscb));
+	if (unlikely(ret)) {
+		if (!silent)
+			EXOFS_ERR(
+			       "exofs_fill_super: osd_req_read_kern failed.\n");
+		ret = -ENOMEM;
+		goto free_sbi;
+	}
+
+	ret = exofs_sync_op(or, sbi->s_timeout, sbi->s_cred);
+	if (unlikely(ret)) {
+		if (!silent)
+			EXOFS_ERR("exofs_fill_super: exofs_sync_op failed.\n");
+		ret = -EIO;
+		goto free_sbi;
+	}
+
+	sb->s_magic = le16_to_cpu(fscb.s_magic);
+	sbi->s_nextid = le64_to_cpu(fscb.s_nextid);
+	sbi->s_numfiles = le32_to_cpu(fscb.s_numfiles);
+
+	/* make sure what we read from the object store is correct */
+	if (sb->s_magic != EXOFS_SUPER_MAGIC) {
+		if (!silent)
+			EXOFS_ERR("ERROR: Bad magic value\n");
+		ret = -EINVAL;
+		goto free_sbi;
+	}
+
+	/* start generation numbers from a random point */
+	get_random_bytes(&sbi->s_next_generation, sizeof(u32));
+	spin_lock_init(&sbi->s_next_gen_lock);
+
+	/* set up operation vectors */
+	sb->s_op = &exofs_sops;
+	root = exofs_iget(sb, EXOFS_ROOT_ID - EXOFS_OBJ_OFF);
+	if (IS_ERR(root)) {
+		EXOFS_ERR("ERROR: exofs_iget failed\n");
+		ret = PTR_ERR(root);
+		goto free_sbi;
+	}
+	sb->s_root = d_alloc_root(root);
+	if (!sb->s_root) {
+		iput(root);
+		EXOFS_ERR("ERROR: get root inode failed\n");
+		ret = -ENOMEM;
+		goto free_sbi;
+	}
+
+	if (!S_ISDIR(root->i_mode)) {
+		dput(sb->s_root);
+		sb->s_root = NULL;
+		EXOFS_ERR("ERROR: corrupt root inode (mode = %hd)\n",
+		       root->i_mode);
+		ret = -EINVAL;
+		goto free_sbi;
+	}
+
+	ret = 0;
+out:
+	if (or)
+		osd_end_request(or);
+	return ret;
+
+free_sbi:
+	osduld_put_device(sbi->s_dev); /* NULL safe */
+	kfree(sbi);
+	goto out;
+}
+
+/*
+ * Set up the superblock (calls exofs_fill_super eventually)
+ */
+static int exofs_get_sb(struct file_system_type *type,
+			  int flags, const char *dev_name,
+			  void *data, struct vfsmount *mnt)
+{
+	struct exofs_mountopt opts;
+	int ret;
+
+	ret = parse_options(data, &opts);
+	if (ret)
+		return ret;
+
+	opts.dev_name = dev_name;
+	return get_sb_nodev(type, flags, &opts, exofs_fill_super, mnt);
+}
+
+/*
+ * Return information about the file system state in the buffer.  This is used
+ * by the 'df' command, for example.
+ */
+static int exofs_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+	struct super_block *sb = dentry->d_sb;
+	struct exofs_sb_info *sbi = sb->s_fs_info;
+	struct osd_obj_id obj = {sbi->s_pid, 0};
+	struct osd_attr attrs[] = {
+		ATTR_DEF(OSD_APAGE_PARTITION_QUOTAS,
+			OSD_ATTR_PQ_CAPACITY_QUOTA, sizeof(__be64)),
+		ATTR_DEF(OSD_APAGE_PARTITION_INFORMATION,
+			OSD_ATTR_PI_USED_CAPACITY, sizeof(__be64)),
+	};
+	uint64_t capacity = ~0;
+	uint64_t used = ~0;
+	struct osd_request *or;
+	uint8_t cred_a[OSD_CAP_LEN];
+	int ret;
+
+	/* get used/capacity attributes */
+	exofs_make_credential(cred_a, &obj);
+
+	or = osd_start_request(sbi->s_dev, GFP_KERNEL);
+	if (unlikely(!or)) {
+		EXOFS_DBGMSG("exofs_statfs: osd_start_request failed.\n");
+		return -ENOMEM;
+	}
+
+	osd_req_get_attributes(or, &obj);
+	osd_req_add_get_attr_list(or, attrs, ARRAY_SIZE(attrs));
+	ret = exofs_sync_op(or, sbi->s_timeout, cred_a);
+	if (unlikely(ret))
+		goto out;
+
+	ret = extract_attr_from_req(or, &attrs[0]);
+	if (likely(!ret))
+		capacity = get_unaligned_be64(attrs[0].val_ptr);
+	else
+		EXOFS_DBGMSG("exofs_statfs: get capacity failed.\n");
+
+	ret = extract_attr_from_req(or, &attrs[1]);
+	if (likely(!ret))
+		used = get_unaligned_be64(attrs[1].val_ptr);
+	else
+		EXOFS_DBGMSG("exofs_statfs: get used-space failed.\n");
+
+	/* fill in the stats buffer */
+	buf->f_type = EXOFS_SUPER_MAGIC;
+	buf->f_bsize = EXOFS_BLKSIZE;
+	buf->f_blocks = (capacity >> EXOFS_BLKSHIFT);
+	buf->f_bfree = ((capacity - used) >> EXOFS_BLKSHIFT);
+	buf->f_bavail = buf->f_bfree;
+	buf->f_files = sbi->s_numfiles;
+	buf->f_ffree = EXOFS_MAX_ID - sbi->s_numfiles;
+	buf->f_namelen = EXOFS_NAME_LEN;
+
+out:
+	osd_end_request(or);
+	return ret;
+}
+
+const struct super_operations exofs_sops = {
+	.alloc_inode    = exofs_alloc_inode,
+	.destroy_inode  = exofs_destroy_inode,
+	.write_inode    = exofs_write_inode,
+	.delete_inode   = exofs_delete_inode,
+	.put_super      = exofs_put_super,
+	.write_super    = exofs_write_super,
+	.statfs         = exofs_statfs,
+};
+
+/******************************************************************************
+ * INSMOD/RMMOD
+ *****************************************************************************/
+
+/*
+ * struct that describes this file system
+ */
+static struct file_system_type exofs_type = {
+	.owner          = THIS_MODULE,
+	.name           = "exofs",
+	.get_sb         = exofs_get_sb,
+	.kill_sb        = generic_shutdown_super,
+};
+
+static int __init init_exofs(void)
+{
+	int err;
+
+	err = init_inodecache();
+	if (err)
+		goto out;
+
+	err = register_filesystem(&exofs_type);
+	if (err)
+		goto out_d;
+
+	return 0;
+out_d:
+	destroy_inodecache();
+out:
+	return err;
+}
+
+static void __exit exit_exofs(void)
+{
+	unregister_filesystem(&exofs_type);
+	destroy_inodecache();
+}
+
+MODULE_AUTHOR("Avishay Traeger <avishay@gmail.com>");
+MODULE_DESCRIPTION("exofs");
+MODULE_LICENSE("GPL");
+
+module_init(init_exofs)
+module_exit(exit_exofs)
-- 
1.6.0.1
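
For anyone who wants to try the option syntax outside the kernel, here is a
minimal user-space sketch of what parse_options() in the patch above does:
split on commas, accept pid= and to=, and insist that pid= was given. It is
not the kernel code. parse_mount_opts() and struct mount_opts are hypothetical
names, min_pid stands in for EXOFS_MIN_PID (whose value is not shown in this
patch), and strtoull()/strtol() replace match_token()/match_int().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical user-space stand-in for struct exofs_mountopt. */
struct mount_opts {
	unsigned long long pid;     /* partition ID, mandatory */
	unsigned long timeout_sec;  /* per-command timeout in seconds */
};

/*
 * Parse a comma-separated option string such as "pid=65536,to=30".
 * "options" is modified in place. min_pid stands in for EXOFS_MIN_PID
 * and default_to for the default timeout in seconds (both assumptions).
 * Returns 0 on success or -EINVAL on a bad or missing option.
 */
int parse_mount_opts(char *options, unsigned long long min_pid,
		     unsigned long default_to, struct mount_opts *opts)
{
	int have_pid = 0;
	char *p = options;

	assert(options && opts);
	opts->pid = 0;
	opts->timeout_sec = default_to;

	while (p && *p) {
		char *next = strchr(p, ',');
		char *end;

		if (next)
			*next++ = '\0';	/* terminate this token */

		if (strncmp(p, "pid=", 4) == 0) {
			opts->pid = strtoull(p + 4, &end, 0);
			if (end == p + 4 || *end || opts->pid < min_pid)
				return -EINVAL;
			have_pid = 1;
		} else if (strncmp(p, "to=", 3) == 0) {
			long to = strtol(p + 3, &end, 10);

			if (end == p + 3 || *end || to <= 0)
				return -EINVAL;
			opts->timeout_sec = (unsigned long)to;
		}
		/* unknown or empty tokens are skipped, as in the patch */
		p = next;
	}

	return have_pid ? 0 : -EINVAL;
}
```

Calling it on a writable copy of "pid=0x10000,to=30" with min_pid 65536 fills
in pid 65536 and a 30 second timeout; a string without pid= is rejected, which
mirrors the s_pid check at the end of parse_options().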


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 7/8] exofs: Documentation
  2009-02-09 13:07 [PATCHSET 0/8 version 3] exofs Boaz Harrosh
                   ` (5 preceding siblings ...)
  2009-02-09 13:25 ` [PATCH 6/8] exofs: super_operations and file_system_type Boaz Harrosh
@ 2009-02-09 13:29 ` Boaz Harrosh
  2009-02-09 13:31 ` [PATCH 8/8] fs: Add exofs to Kernel build Boaz Harrosh
  7 siblings, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2009-02-09 13:29 UTC (permalink / raw)
  To: Avishay Traeger, Jeff Garzik, Andrew Morton, linux-fsdevel, open-osd
  Cc: linux-kernel, James Bottomley

Added some documentation in exofs.txt, as well as a BUGS file.

For further reading, operation instructions, example scripts
and up-to-date information and code, please see:
http://open-osd.org

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 Documentation/filesystems/exofs.txt |  176 +++++++++++++++++++++++++++++++++++
 fs/exofs/BUGS                       |    3 +
 2 files changed, 179 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/exofs.txt
 create mode 100644 fs/exofs/BUGS

diff --git a/Documentation/filesystems/exofs.txt b/Documentation/filesystems/exofs.txt
new file mode 100644
index 0000000..0ced74c
--- /dev/null
+++ b/Documentation/filesystems/exofs.txt
@@ -0,0 +1,176 @@
+===============================================================================
+WHAT IS EXOFS?
+===============================================================================
+
+exofs is a file system that uses an OSD and exports the API of a normal Linux
+file system. Users access exofs like any other local file system, and exofs
+will in turn issue commands to the local OSD initiator.
+
+OSD is a new T10 command set that views storage devices not as a large/flat
+array of sectors but as a container of objects, each having a length, quota,
+time attributes and more. Each object is addressed by a 64bit ID, and is
+contained in a 64bit ID partition. Each object has associated attributes
+attached to it, which are an integral part of the object and provide metadata
+about the object. The standard defines some common obligatory attributes, but
+user attributes can be added as needed.
+
+===============================================================================
+ENVIRONMENT
+===============================================================================
+
+To use this file system, you need to have an object store to run it on.  You
+may download a target from:
+http://open-osd.org
+
+See Documentation/scsi/osd.txt for how to set up a working OSD environment.
+
+===============================================================================
+USAGE
+===============================================================================
+
+1. Download and compile exofs and open-osd initiator:
+  You need an external kernel source tree or kernel headers from your
+  distribution (anything based on 2.6.26 or later).
+
+  a. download open-osd including exofs source using:
+     [parent-directory]$ git clone git://git.open-osd.org/open-osd.git
+
+  b. Build the library module like this:
+     [parent-directory]$ make -C KSRC=$(KER_DIR) open-osd
+
+     This will build both the open-osd initiator as well as the exofs kernel
+     module. Use whatever parameters you compiled your Kernel with and
+     $(KER_DIR) above pointing to the Kernel you compile against. See the file
+     open-osd/top-level-Makefile for an example.
+
+2. Get the OSD initiator and target set up properly, and log in to the target.
+  See Documentation/scsi/osd.txt for further instructions. Also see ./do-osd
+  for an example script that performs all these steps.
+
+3. Insmod the exofs.ko module:
+   [exofs]$ insmod exofs.ko
+
+4. Make sure the directory where you want to mount exists. If not, create it.
+   (For example, mkdir /mnt/exofs)
+
+5. At first run you will need to invoke the mkfs.exofs application
+
+   As an example, this will create the file system on:
+   /dev/osd0 partition ID 65536
+
+   mkfs.exofs --pid=65536 --format /dev/osd0
+
+   The --format option is optional. If it is not specified, no OSD_FORMAT is
+   performed and a clean file system is created in the specified pid,
+   in the available space of the target. (Use --format=size_in_meg to limit
+   the total LUN space available.)
+
+   If the pid already exists, it will be deleted and a new one created in its
+   place. Be careful.
+
+   An exofs lives inside a single OSD partition. You can create multiple exofs
+   filesystems on the same device using multiple pids.
+
+   (run mkfs.exofs without any parameters for usage help message)
+
+6. Mount the file system.
+
+   For example, to mount /dev/osd0, partition ID 0x10000 on /mnt/exofs:
+
+	mount -t exofs -o pid=65536 /dev/osd0 /mnt/exofs/
+
+7. For reference (See do-exofs example script):
+	do-exofs start - an example of how to perform the above steps.
+	do-exofs stop -  an example of how to unmount the file system.
+	do-exofs format - an example of how to format and mkfs a new exofs.
+
+8. Extra compilation flags (uncomment in fs/exofs/Kbuild):
+	CONFIG_EXOFS_DEBUG - for debug messages and extra checks.
+
+===============================================================================
+exofs mount options
+===============================================================================
+Similar to any mount command:
+	mount -t exofs -o exofs_options /dev/osdX mount_exofs_directory
+
+Where:
+    -t exofs: specifies the exofs file system
+
+    /dev/osdX: X is a decimal number. /dev/osdX was created after a successful
+               login into an OSD target.
+
+    mount_exofs_directory: The directory to mount the file system on
+
+    exofs specific options: Options are separated by commas (,)
+		pid=<integer> - The partition number to mount/create as
+                                container of the filesystem.
+                                This option is mandatory
+                to=<integer>  - Timeout in seconds for a single command;
+                                default is 60 (60 * HZ) [for debugging only]
+
+===============================================================================
+DESIGN
+===============================================================================
+
+* The file system control block (AKA on-disk superblock) resides in an object
+  with a special ID (defined in common.h).
+  Information included in the file system control block is used to fill the
+  in-memory superblock structure at mount time. This object is created by
+  mkexofs.c before the file system is used. It contains information such as:
+	- The file system's magic number
+	- The next inode number to be allocated
+
+* Each file resides in its own object and contains the data (and it will be
+  possible to extend the file over multiple objects, though this has not been
+  implemented yet).
+
+* A directory is treated as a file, and essentially contains a list of <file
+  name, inode #> pairs for files that are found in that directory. The object
+  IDs correspond to the files' inode numbers and will be allocated according to
+  a bitmap (stored in a separate object). Currently they are allocated using a
+  counter.
+
+* Each file's control block (AKA on-disk inode) is stored in its object's
+  attributes. This applies to both regular files and other types (directories,
+  device files, symlinks, etc.).
+
+* Credentials are generated per object (inode and superblock) when it is
+  created in memory (read off disk or created). The credential works for all
+  operations and is used as long as the object remains in memory.
+
+* Async OSD operations are used whenever possible, but the target may execute
+  them out of order. The operations that concern us are create, delete,
+  readpage, writepage, update_inode, and truncate. The following pairs of
+  operations should execute in the order written, and we need to prevent them
+  from executing in reverse order:
+	- The following are handled with the OBJ_CREATED and OBJ_2BCREATED
+	  flags. OBJ_CREATED is set when we know the object exists on the OSD -
+	  in create's callback function, and when we successfully do a read_inode.
+	  OBJ_2BCREATED is set in the beginning of the create function, so we
+	  know that we should wait.
+		- create/delete: delete should wait until the object is created
+		  on the OSD.
+		- create/readpage: readpage should be able to return a page
+		  full of zeroes in this case. If there was a write already
+		  en-route (i.e. create, writepage, readpage) then the page
+		  would be locked, and so it would really be the same as
+		  create/writepage.
+		- create/writepage: if writepage is called for a sync write, it
+		  should wait until the object is created on the OSD.
+		  Otherwise, it should just return.
+		- create/truncate: truncate should wait until the object is
+		  created on the OSD.
+		- create/update_inode: update_inode should wait until the
+		  object is created on the OSD.
+	- Handled by VFS locks:
+		- readpage/delete: shouldn't happen because of page lock.
+		- writepage/delete: shouldn't happen because of page lock.
+		- readpage/writepage: shouldn't happen because of page lock.
+
+===============================================================================
+LICENSE/COPYRIGHT
+===============================================================================
+The exofs file system is based on ext2 v0.5b (distributed with the Linux kernel
+version 2.6.10).  All files include the original copyrights, and the license
+is GPL version 2 (only version 2, as is true for the Linux kernel).  The
+Linux kernel can be downloaded from www.kernel.org.
diff --git a/fs/exofs/BUGS b/fs/exofs/BUGS
new file mode 100644
index 0000000..1b2d4c6
--- /dev/null
+++ b/fs/exofs/BUGS
@@ -0,0 +1,3 @@
+- Out-of-space may cause a severe problem if the object (and directory entry)
+  were written, but the inode attributes failed. Then if the filesystem was
+  unmounted and mounted the kernel can get into an endless loop doing a readdir.
-- 
1.6.0.1
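
The create-ordering rules in the DESIGN section of exofs.txt above can be
read as a small decision table. The sketch below is a user-space model only:
the flag values, the enum names and next_step() are hypothetical, and the
real code sleeps with wait_event() on the inode's wait queue instead of
returning a verdict to the caller.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag bits modelled on OBJ_2BCREATED / OBJ_CREATED; the values are made up. */
enum { OBJ_2BCREATED = 1 << 0, OBJ_CREATED = 1 << 1 };

enum op { OP_DELETE, OP_READPAGE, OP_WRITEPAGE, OP_TRUNCATE, OP_UPDATE_INODE };

enum verdict {
	V_PROCEED,	/* object exists on the OSD, issue the command */
	V_WAIT,		/* create is in flight, wait for its callback */
	V_ZERO_PAGE,	/* readpage: hand back a page full of zeroes */
	V_SKIP		/* async writepage: just return */
};

/* Decide what an operation should do given the object's creation state. */
enum verdict next_step(unsigned int flags, enum op op, bool sync)
{
	if (flags & OBJ_CREATED)
		return V_PROCEED;

	/* Not created yet, so a create must at least have been started. */
	assert(flags & OBJ_2BCREATED);

	switch (op) {
	case OP_READPAGE:
		return V_ZERO_PAGE;
	case OP_WRITEPAGE:
		return sync ? V_WAIT : V_SKIP;
	case OP_DELETE:
	case OP_TRUNCATE:
	case OP_UPDATE_INODE:
	default:
		return V_WAIT;
	}
}
```

The assert plays the role of the BUG_ON(!obj_2bcreated(oi)) check in
exofs_delete_inode(): an object that is neither created nor about to be
created should never reach these paths.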


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 8/8] fs: Add exofs to Kernel build
  2009-02-09 13:07 [PATCHSET 0/8 version 3] exofs Boaz Harrosh
                   ` (6 preceding siblings ...)
  2009-02-09 13:29 ` [PATCH 7/8] exofs: Documentation Boaz Harrosh
@ 2009-02-09 13:31 ` Boaz Harrosh
  7 siblings, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2009-02-09 13:31 UTC (permalink / raw)
  To: Avishay Traeger, Jeff Garzik, Andrew Morton, linux-fsdevel, open-osd
  Cc: linux-kernel, James Bottomley

- Add exofs to fs/Kconfig under "menu 'Miscellaneous filesystems'"
- Add exofs to fs/Makefile

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 fs/Kconfig  |    2 ++
 fs/Makefile |    1 +
 2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 93945dd..d0c544c 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -223,6 +223,8 @@ source "fs/romfs/Kconfig"
 source "fs/sysv/Kconfig"
 source "fs/ufs/Kconfig"
 
+source "fs/exofs/Kconfig"
+
 endif # MISC_FILESYSTEMS
 
 menuconfig NETWORK_FILESYSTEMS
diff --git a/fs/Makefile b/fs/Makefile
index 38bc735..c907161 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -122,3 +122,4 @@ obj-$(CONFIG_DEBUG_FS)		+= debugfs/
 obj-$(CONFIG_OCFS2_FS)		+= ocfs2/
 obj-$(CONFIG_BTRFS_FS)		+= btrfs/
 obj-$(CONFIG_GFS2_FS)           += gfs2/
+obj-$(CONFIG_EXOFS_FS)          += exofs/
-- 
1.6.0.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 5/8] exofs: dir_inode and directory operations
  2009-02-09 13:24 ` [PATCH 5/8] exofs: dir_inode and directory operations Boaz Harrosh
@ 2009-02-15 17:08   ` Evgeniy Polyakov
  2009-02-16  9:31     ` Boaz Harrosh
  0 siblings, 1 reply; 36+ messages in thread
From: Evgeniy Polyakov @ 2009-02-15 17:08 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Avishay Traeger, Jeff Garzik, Andrew Morton, linux-fsdevel,
	open-osd, linux-kernel, James Bottomley

Hi.

On Mon, Feb 09, 2009 at 03:24:13PM +0200, Boaz Harrosh (bharrosh@panasas.com) wrote:

> +void exofs_set_link(struct inode *dir, struct exofs_dir_entry *de,
> +			struct page *page, struct inode *inode)
> +{
> +	loff_t pos = page_offset(page) +
> +			(char *) de - (char *) page_address(page);
> +	unsigned len = le16_to_cpu(de->rec_len);
> +	int err;
> +
> +	lock_page(page);
> +	err = exofs_write_begin(NULL, page->mapping, pos, len,
> +				AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
> +	BUG_ON(err);

How unfriendly :)
simple_write_begin() may fail if there is no memory or the relevant
cgroup does not allow more memory to be charged.

> +	de->inode_no = cpu_to_le64(inode->i_ino);
> +	exofs_set_de_type(de, inode);
> +	err = exofs_commit_chunk(page, pos, len);
> +	exofs_put_page(page);
> +	dir->i_mtime = dir->i_ctime = CURRENT_TIME;
> +	mark_inode_dirty(dir);
> +}
> +
> +int exofs_add_link(struct dentry *dentry, struct inode *inode)
> +{
> +	struct inode *dir = dentry->d_parent->d_inode;
> +	const unsigned char *name = dentry->d_name.name;
> +	int namelen = dentry->d_name.len;
> +	unsigned chunk_size = exofs_chunk_size(dir);
> +	unsigned reclen = EXOFS_DIR_REC_LEN(namelen);
> +	unsigned short rec_len, name_len;
> +	struct page *page = NULL;
> +	struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
> +	struct exofs_dir_entry *de;
> +	unsigned long npages = dir_pages(dir);
> +	unsigned long n;
> +	char *kaddr;
> +	loff_t pos;
> +	int err;
> +
> +	for (n = 0; n <= npages; n++) {
> +		char *dir_end;
> +
> +		page = exofs_get_page(dir, n);
> +		err = PTR_ERR(page);
> +		if (IS_ERR(page))
> +			goto out;
> +		lock_page(page);
> +		kaddr = page_address(page);
> +		dir_end = kaddr + exofs_last_byte(dir, n);
> +		de = (struct exofs_dir_entry *)kaddr;
> +		kaddr += PAGE_CACHE_SIZE - reclen;
> +		while ((char *)de <= kaddr) {
> +			if ((char *)de == dir_end) {
> +				name_len = 0;
> +				rec_len = chunk_size;
> +				de->rec_len = cpu_to_le16(chunk_size);
> +				de->inode_no = 0;
> +				goto got_it;
> +			}
> +			if (de->rec_len == 0) {
> +				EXOFS_ERR("ERROR: exofs_add_link: "
> +					"zero-length directory entry");
> +				err = -EIO;
> +				goto out_unlock;
> +			}
> +			err = -EEXIST;
> +			if (exofs_match(namelen, name, de))
> +				goto out_unlock;
> +			name_len = EXOFS_DIR_REC_LEN(de->name_len);
> +			rec_len = le16_to_cpu(de->rec_len);
> +			if (!de->inode_no && rec_len >= reclen)
> +				goto got_it;
> +			if (rec_len >= name_len + reclen)
> +				goto got_it;
> +			de = (struct exofs_dir_entry *) ((char *) de + rec_len);
> +		}
> +		unlock_page(page);
> +		exofs_put_page(page);
> +	}
> +	BUG();
> +	return -EINVAL;
> +

So it will crash the system if the directory entry does not contain any
data? What was wrong with -EINVAL?
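
To make the suggestion concrete, here is a user-space sketch of a slot search
over variable-length records that reports corruption and "no room" through
return codes and never aborts. It is an illustration, not the exofs code:
struct rec is a simplified, host-endian stand-in for struct exofs_dir_entry,
and find_slot() is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified record header (not the on-disk layout). */
struct rec {
	uint16_t rec_len;	/* total bytes this record occupies */
	uint16_t used;		/* bytes actually in use, 0 if the slot is free */
};

/*
 * Find a record with at least "need" spare bytes in buf[0..size).
 * Returns the byte offset of that record,
 * -EIO on a corrupt (too short, overrunning or overfull) record,
 * -EINVAL when the scan ends without finding room.
 */
long find_slot(const unsigned char *buf, size_t size, uint16_t need)
{
	size_t off = 0;

	assert(buf);
	while (off + sizeof(struct rec) <= size) {
		struct rec r;

		memcpy(&r, buf + off, sizeof(r));	/* avoid unaligned reads */
		if (r.rec_len < sizeof(r) || r.rec_len > size - off ||
		    r.used > r.rec_len)
			return -EIO;
		if (r.rec_len - r.used >= need)
			return (long)off;
		off += r.rec_len;
	}
	return -EINVAL;		/* an error code instead of BUG() */
}
```

The caller can then unwind and propagate the error to user space, which is
what the zero-length check in the quoted loop already does with -EIO.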

Also, dir_pages(), readpage_done() and similar functions scream for less
generic names, and at least dir_pages() is already implemented in another
5 filesystems.

> +int exofs_delete_entry(struct exofs_dir_entry *dir, struct page *page)
> +{
> +	struct address_space *mapping = page->mapping;
> +	struct inode *inode = mapping->host;
> +	struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
> +	char *kaddr = page_address(page);
> +	unsigned from = ((char *)dir - kaddr) & ~(exofs_chunk_size(inode)-1);
> +	unsigned to = ((char *)dir - kaddr) + le16_to_cpu(dir->rec_len);
> +	loff_t pos;
> +	struct exofs_dir_entry *pde = NULL;
> +	struct exofs_dir_entry *de = (struct exofs_dir_entry *) (kaddr + from);
> +	int err;
> +
> +	while ((char *)de < (char *)dir) {

They have the same type, why do they need to be cast to char pointers?

> +		if (de->rec_len == 0) {
> +			EXOFS_ERR("ERROR: exofs_delete_entry:"
> +				"zero-length directory entry");
> +			err = -EIO;
> +			goto out;
> +		}
> +		pde = de;
> +		de = exofs_next_entry(de);
> +	}
> +	if (pde)
> +		from = (char *)pde - (char *)page_address(page);
> +	pos = page_offset(page) + from;
> +	lock_page(page);
> +	err = exofs_write_begin(NULL, page->mapping, pos, to - from, 0,
> +							&page, NULL);
> +	BUG_ON(err);

Ugh, in exofs_make_empty() it is handled without such visible
pain.

> +	if (pde)
> +		pde->rec_len = cpu_to_le16(to - from);
> +	dir->inode_no = 0;
> +	err = exofs_commit_chunk(page, pos, to - from);
> +	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
> +	mark_inode_dirty(inode);
> +	sbi->s_numfiles--;
> +out:
> +	exofs_put_page(page);
> +	return err;
> +}
> +
> +int exofs_make_empty(struct inode *inode, struct inode *parent)
> +{
> +	struct address_space *mapping = inode->i_mapping;
> +	struct page *page = grab_cache_page(mapping, 0);
> +	unsigned chunk_size = exofs_chunk_size(inode);
> +	struct exofs_dir_entry *de;
> +	int err;
> +	void *kaddr;
> +
> +	if (!page)
> +		return -ENOMEM;
> +
> +	err = exofs_write_begin(NULL, page->mapping, 0, chunk_size, 0,
> +							&page, NULL);
> +	if (err) {
> +		unlock_page(page);
> +		goto fail;
> +	}
> +
> +	kaddr = kmap_atomic(page, KM_USER0);
> +	de = (struct exofs_dir_entry *)kaddr;
> +	de->name_len = 1;
> +	de->rec_len = cpu_to_le16(EXOFS_DIR_REC_LEN(1));
> +	memcpy(de->name, ".\0\0", 4);

Plus one byte from the stack?

> +	de->inode_no = cpu_to_le64(inode->i_ino);
> +	exofs_set_de_type(de, inode);
> +
> +	de = (struct exofs_dir_entry *)(kaddr + EXOFS_DIR_REC_LEN(1));
> +	de->name_len = 2;
> +	de->rec_len = cpu_to_le16(chunk_size - EXOFS_DIR_REC_LEN(1));
> +	de->inode_no = cpu_to_le64(parent->i_ino);
> +	memcpy(de->name, "..\0", 4);

And another one.
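
A note on the two memcpy() calls: in C a string literal carries an implicit
terminating NUL, so ".\0\0" and "..\0" are both four-byte arrays and a 4-byte
copy stays inside the literal. The stand-alone check below is a sketch with
made-up function names, useful only for confirming the sizes:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Size in bytes of each literal, including the implicit trailing NUL. */
size_t dot_literal_size(void)
{
	return sizeof(".\0\0");	/* '.', 0, 0, plus the implicit 0 */
}

size_t dotdot_literal_size(void)
{
	return sizeof("..\0");	/* '.', '.', 0, plus the implicit 0 */
}

/*
 * Copy 4 bytes the way exofs_make_empty() does and report whether the
 * destination ends up as "." followed only by NUL padding.
 */
int copy_is_nul_padded(void)
{
	char name[4] = { 'x', 'x', 'x', 'x' };

	assert(sizeof(".\0\0") == sizeof(name));
	memcpy(name, ".\0\0", 4);
	return name[0] == '.' && !name[1] && !name[2] && !name[3];
}
```

Both size functions return 4, so neither copy reads past its source.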

> +	exofs_set_de_type(de, inode);
> +	kunmap_atomic(page, KM_USER0);
> +	err = exofs_commit_chunk(page, 0, chunk_size);
> +fail:
> +	page_cache_release(page);
> +	return err;
> +}
> +

> +struct inode *exofs_new_inode(struct inode *dir, int mode)
> +{
> +	struct super_block *sb;
> +	struct inode *inode;
> +	struct exofs_i_info *oi;
> +	struct exofs_sb_info *sbi;
> +	struct osd_request *or;
> +	struct osd_obj_id obj;
> +	int ret;
> +
> +	sb = dir->i_sb;
> +	inode = new_inode(sb);
> +	if (!inode)
> +		return ERR_PTR(-ENOMEM);
> +
> +	oi = exofs_i(inode);
> +
> +	init_waitqueue_head(&oi->i_wq);
> +	set_obj_2bcreated(oi);
> +
> +	sbi = sb->s_fs_info;
> +
> +	sb->s_dirt = 1;
> +	inode->i_uid = current->cred->fsuid;
> +	if (dir->i_mode & S_ISGID) {
> +		inode->i_gid = dir->i_gid;
> +		if (S_ISDIR(mode))
> +			mode |= S_ISGID;
> +	} else {
> +		inode->i_gid = current->cred->fsgid;
> +	}
> +	inode->i_mode = mode;
> +
> +	inode->i_ino = sbi->s_nextid++;
> +	inode->i_blkbits = EXOFS_BLKSHIFT;
> +	inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
> +	oi->i_commit_size = inode->i_size = 0;
> +	spin_lock(&sbi->s_next_gen_lock);
> +	inode->i_generation = sbi->s_next_generation++;
> +	spin_unlock(&sbi->s_next_gen_lock);
> +	insert_inode_hash(inode);
> +
> +	mark_inode_dirty(inode);
> +
> +	obj.partition = sbi->s_pid;
> +	obj.id = inode->i_ino + EXOFS_OBJ_OFF;
> +	exofs_make_credential(oi->i_cred, &obj);
> +
> +	or = osd_start_request(sbi->s_dev, GFP_KERNEL);
> +	if (unlikely(!or)) {
> +		EXOFS_ERR("exofs_new_inode: osd_start_request failed\n");
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	osd_req_create_object(or, &obj);
> +
> +	/* increment the refcount so that the inode will still be around when we
> +	 * reach the callback
> +	 */
> +	atomic_inc(&inode->i_count);
> +
> +	ret = exofs_async_op(or, create_done, inode, oi->i_cred);
> +	if (ret) {
> +		atomic_dec(&inode->i_count);

igrab()/iput()?

> +		osd_end_request(or);
> +		return ERR_PTR(-EIO);
> +	}
> +	atomic_inc(&sbi->s_curr_pending);
> +
> +	return inode;
> +}

> +static int exofs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
> +{
> +	struct inode *inode;
> +	int err = -EMLINK;
> +
> +	if (dir->i_nlink >= EXOFS_LINK_MAX)
> +		goto out;
> +
> +	inode_inc_link_count(dir);
> +
> +	inode = exofs_new_inode(dir, S_IFDIR | mode);
> +	err = PTR_ERR(inode);
> +	if (IS_ERR(inode))
> +		goto out_dir;
> +
> +	inode->i_op = &exofs_dir_inode_operations;
> +	inode->i_fop = &exofs_dir_operations;
> +	inode->i_mapping->a_ops = &exofs_aops;
> +
> +	inode_inc_link_count(inode);
> +
> +	err = exofs_make_empty(inode, dir);
> +	if (err)
> +		goto out_fail;
> +
> +	err = exofs_add_link(dentry, inode);
> +	if (err)
> +		goto out_fail;
> +
> +	d_instantiate(dentry, inode);
> +out:
> +	return err;
> +
> +out_fail:
> +	inode_dec_link_count(inode);
> +	inode_dec_link_count(inode);

Why two decrements? Will it be OK after exofs_make_empty() fails, when it
was incremented only once?

> +	iput(inode);
> +out_dir:
> +	inode_dec_link_count(dir);
> +	goto out;
> +}
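
On the double decrement: a minimal userspace sketch of the link-count
arithmetic in the failure path above. It assumes that a freshly allocated
inode starts with i_nlink == 1 (which is what I believe new_inode() gives
you) and that exofs_new_inode() does not touch i_nlink, as in the quoted
code. The model_* names are made up for illustration and are not kernel API.

```c
#include <assert.h>

/* Userspace stand-in for the one inode field that matters here. */
struct model_inode {
	unsigned int nlink;
};

/* Assumption: like new_inode(), a fresh inode starts with nlink == 1. */
static void model_new_inode(struct model_inode *inode)
{
	inode->nlink = 1;
}

static void model_inc_link(struct model_inode *inode)
{
	inode->nlink++;
}

static void model_dec_link(struct model_inode *inode)
{
	assert(inode->nlink > 0);	/* catch an underflow in the model */
	inode->nlink--;
}

/*
 * Mirrors the quoted exofs_mkdir() failure path: one increment after
 * exofs_new_inode(), then two decrements at out_fail. Returns the link
 * count that the final iput() would observe.
 */
static unsigned int model_mkdir_fail_path(void)
{
	struct model_inode inode;

	model_new_inode(&inode);	/* nlink == 1 */
	model_inc_link(&inode);		/* nlink == 2, as for a directory */
	model_dec_link(&inode);		/* out_fail: first decrement */
	model_dec_link(&inode);		/* out_fail: second decrement */
	return inode.nlink;
}
```

Under those assumptions the count goes 1 -> 2 -> 1 -> 0, so the final
iput() sees a zero link count. This only checks the arithmetic; it says
nothing about what exofs_make_empty() may have left behind.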

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 6/8] exofs: super_operations and file_system_type
  2009-02-09 13:25 ` [PATCH 6/8] exofs: super_operations and file_system_type Boaz Harrosh
@ 2009-02-15 17:24   ` Evgeniy Polyakov
  2009-02-16  9:59     ` Boaz Harrosh
  0 siblings, 1 reply; 36+ messages in thread
From: Evgeniy Polyakov @ 2009-02-15 17:24 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Avishay Traeger, Jeff Garzik, Andrew Morton, linux-fsdevel,
	open-osd, linux-kernel, James Bottomley

Hi.

On Mon, Feb 09, 2009 at 03:25:53PM +0200, Boaz Harrosh (bharrosh@panasas.com) wrote:
> +static int parse_options(char *options, struct exofs_mountopt *opts)
> +{
> +	char *p;
> +	substring_t args[MAX_OPT_ARGS];
> +	int option;
> +	bool s_pid = false;
> +
> +	EXOFS_DBGMSG("parse_options %s\n", options);
> +	/* defaults */
> +	memset(opts, 0, sizeof(*opts));
> +	opts->timeout = BLK_DEFAULT_SG_TIMEOUT;
> +
> +	while ((p = strsep(&options, ",")) != NULL) {
> +		int token;
> +		char str[32];
> +
> +		if (!*p)
> +			continue;
> +
> +		token = match_token(p, tokens, args);
> +		switch (token) {
> +		case Opt_pid:
> +			if (0 == match_strlcpy(str, &args[0], sizeof(str)))
> +				return -EINVAL;
> +			opts->pid = simple_strtoull(str, NULL, 0);
> +			if (opts->pid < EXOFS_MIN_PID) {
> +				EXOFS_ERR("Partition ID must be >= %u",
> +					  EXOFS_MIN_PID);
> +				return -EINVAL;
> +			}
> +			s_pid = 1;
> +			break;
> +		case Opt_to:
> +			if (match_int(&args[0], &option))
> +				return -EINVAL;
> +			if (option <= 0) {
> +				EXOFS_ERR("Timout must be > 0");
> +				return -EINVAL;
> +			}
> +			opts->timeout = option * HZ;

Is it intentional that the timeout differs on systems with different HZ
but the same mount option?

> +static struct inode *exofs_alloc_inode(struct super_block *sb)
> +{
> +	struct exofs_i_info *oi;
> +
> +	oi = kmem_cache_alloc(exofs_inode_cachep, GFP_KERNEL);

I'm curious if this should be GFP_NOFS or not?

> +	if (!oi)
> +		return NULL;
> +
> +	oi->vfs_inode.i_version = 1;
> +	return &oi->vfs_inode;
> +}

> +static void exofs_put_super(struct super_block *sb)
> +{
> +	int num_pend;
> +	struct exofs_sb_info *sbi = sb->s_fs_info;
> +
> +	/* make sure there are no pending commands */
> +	for (num_pend = atomic_read(&sbi->s_curr_pending); num_pend > 0;
> +	     num_pend = atomic_read(&sbi->s_curr_pending)) {

This raises a question. Let's check exofs_new_inode() for example (it is
a bad example, since an inode cannot be created when we are already in the
put_super() callback, but still there are others): it increments
s_curr_pending well after the inode was created, so is it possible that
some in-flight callback is about to be executed and its subsequent
s_curr_pending manipulation will not be detected by this loop?

Should the s_curr_pending increments be audited all over the code, so that
the counter is increased before the potentially postponed command starts
(which is not the case in exofs_new_inode() above)?

> +		wait_queue_head_t wq;
> +		init_waitqueue_head(&wq);
> +		wait_event_timeout(wq,
> +				  (atomic_read(&sbi->s_curr_pending) == 0),
> +				  msecs_to_jiffies(100));
> +	}
> +
> +	osduld_put_device(sbi->s_dev);
> +	kfree(sb->s_fs_info);
> +	sb->s_fs_info = NULL;
> +}
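
To illustrate the ordering concern: a small userspace model, assuming the
completion callback (create_done in the patch) is what decrements
s_curr_pending, and taking the worst case where the command completes
before the submit call returns. All model_* names are hypothetical, and
C11 atomics stand in for atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>

struct model_sb {
	atomic_int pending;	/* stands in for sbi->s_curr_pending */
	int min_seen;		/* lowest counter value observed so far */
};

/* Record the lowest value the counter ever reaches. */
static void model_note(struct model_sb *sb)
{
	int v = atomic_load(&sb->pending);

	if (v < sb->min_seen)
		sb->min_seen = v;
}

/* Completion callback: assumed to drop the pending count. */
static void model_done(struct model_sb *sb)
{
	atomic_fetch_sub(&sb->pending, 1);
	model_note(sb);
}

/* Worst case for ordering: the command completes before submit returns. */
static int model_submit(struct model_sb *sb)
{
	model_done(sb);
	return 0;
}

/* Order used in the quoted patch: submit first, count afterwards. */
static int model_count_after(struct model_sb *sb)
{
	int ret = model_submit(sb);

	if (ret)
		return ret;
	atomic_fetch_add(&sb->pending, 1);
	model_note(sb);
	return 0;
}

/* Alternative order: count first, undo the count if submit fails. */
static int model_count_before(struct model_sb *sb)
{
	int ret;

	atomic_fetch_add(&sb->pending, 1);
	model_note(sb);
	ret = model_submit(sb);
	if (ret)
		atomic_fetch_sub(&sb->pending, 1);
	return ret;
}

/* Runs one of the two orders and returns the lowest counter value seen. */
static int model_run(int count_first)
{
	struct model_sb sb;

	atomic_init(&sb.pending, 0);
	sb.min_seen = 0;
	if (count_first)
		model_count_before(&sb);
	else
		model_count_after(&sb);
	assert(atomic_load(&sb.pending) == 0);	/* both orders end balanced */
	return sb.min_seen;
}
```

In this model, counting after submit lets the counter dip below zero, which
means a command was in flight while the counter did not account for it.
Counting before submit keeps it at zero or above. Whether the real code can
hit that window depends on what else serializes against put_super().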

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/8] exofs: Kbuild, Headers and osd utils
  2009-02-09 13:12 ` [PATCH 1/8] exofs: Kbuild, Headers and osd utils Boaz Harrosh
@ 2009-02-16  4:18   ` FUJITA Tomonori
  2009-02-16  8:49     ` Boaz Harrosh
  0 siblings, 1 reply; 36+ messages in thread
From: FUJITA Tomonori @ 2009-02-16  4:18 UTC (permalink / raw)
  To: bharrosh
  Cc: avishay, jeff, akpm, linux-fsdevel, osd-dev, linux-kernel,
	James.Bottomley, jens.axboe, linux-scsi

On Mon,  9 Feb 2009 15:12:09 +0200
Boaz Harrosh <bharrosh@panasas.com> wrote:

> This patch includes osd infrastructure that will be used later by
> the file system.
> 
> Also the declarations of constants, on disk structures,
> and prototypes.
> 
> And the Kbuild+Kconfig files needed to build the exofs module.
> 
> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
> ---
>  fs/exofs/Kbuild   |   30 +++++++
>  fs/exofs/Kconfig  |   13 +++
>  fs/exofs/common.h |  181 +++++++++++++++++++++++++++++++++++++++++
>  fs/exofs/exofs.h  |  139 ++++++++++++++++++++++++++++++++
>  fs/exofs/osd.c    |  230 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 593 insertions(+), 0 deletions(-)
>  create mode 100644 fs/exofs/Kbuild
>  create mode 100644 fs/exofs/Kconfig
>  create mode 100644 fs/exofs/common.h
>  create mode 100644 fs/exofs/exofs.h
>  create mode 100644 fs/exofs/osd.c

> +static void _osd_read(struct osd_request *or,
> +	const struct osd_obj_id *obj, uint64_t offset, struct bio *bio)
> +{
> +	osd_req_read(or, obj, bio, offset);
> +	EXOFS_DBGMSG("osd_req_read(p=%llX, ob=%llX, l=%llu, of=%llu)\n",
> +		_LLU(obj->partition), _LLU(obj->id), _LLU(bio->bi_size),
> +		_LLU(offset));
> +}
> +
> +#ifdef __KERNEL__

Hmm?

> +static struct bio *_bio_map_pages(struct request_queue *req_q,
> +				  struct page **pages, unsigned page_count,
> +				  size_t length, gfp_t gfp_mask)
> +{
> +	struct bio *bio;
> +	int i;
> +
> +	bio = bio_alloc(gfp_mask, page_count);
> +	if (!bio) {
> +		EXOFS_DBGMSG("Failed to bio_alloc page_count=%d\n", page_count);
> +		return NULL;
> +	}
> +
> +	for (i = 0; i < page_count && length; i++) {
> +		size_t use_len = min(length, PAGE_SIZE);
> +
> +		if (use_len !=
> +		    bio_add_pc_page(req_q, bio, pages[i], use_len, 0)) {
> +			EXOFS_ERR("Failed bio_add_pc_page req_q=%p pages[i]=%p "
> +				  "use_len=%Zd page_count=%d length=%Zd\n",
> +				  req_q, pages[i], use_len, page_count, length);
> +			bio_put(bio);
> +			return NULL;
> +		}
> +
> +		length -= use_len;
> +	}
> +
> +	WARN_ON(length);
> +	return bio;
> +}

1) exofs builds bios by hand.
2) exofs passes bio to OSD SCSI ULD.

As a result, exofs and OSD SCSI ULD need to know the internals of bio;
that is, you reinvent the bio handling infrastructure, as pointed out
in another thread on scsi-ml.

_bio_map_pages is called where the VFS passes an array of pointers to
page frames.

Why can't you simply pass the array to OSD SCSI ULD? Then OSD SCSI ULD
can use the block layer helper functions to build a request out of
pages without knowing the internals of bio.
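
For reference, the page-splitting arithmetic in _bio_map_pages() does not
itself depend on bio. A minimal userspace sketch of just that loop follows,
with a hypothetical MODEL_PAGE_SIZE of 4096 and made-up model_* names. It
records how many bytes each page would carry and returns whatever did not
fit (the WARN_ON(length) case in the patch).

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/*
 * Fills lens[i] with the number of bytes page i would carry and returns
 * the number of bytes left over once the pages run out.
 */
static size_t model_split_pages(size_t length, unsigned int page_count,
				size_t *lens)
{
	unsigned int i;

	assert(lens != NULL);

	for (i = 0; i < page_count; i++)
		lens[i] = 0;

	for (i = 0; i < page_count && length; i++) {
		size_t use_len = length < MODEL_PAGE_SIZE ?
				 length : MODEL_PAGE_SIZE;

		lens[i] = use_len;
		length -= use_len;
	}

	return length;
}
```

Whichever layer ends up owning this loop, the caller only needs to supply
the page array, the page count and the total length.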


> +int osd_req_read_pages(struct osd_request *or,
> +	const struct osd_obj_id *obj, u64 offset, u64 length,
> +	struct page **pages, int page_count)
> +{
> +	struct request_queue *req_q = or->osd_dev->scsi_device->request_queue;
> +	struct bio *bio = _bio_map_pages(req_q, pages, page_count, length,
> +					 GFP_KERNEL);
> +
> +	if (!bio)
> +		return -ENOMEM;
> +
> +	_osd_read(or, obj, offset, bio);
> +	return 0;
> +}
> +#endif /* def __KERNEL__ */
> +
> +int osd_req_read_kern(struct osd_request *or,
> +	const struct osd_obj_id *obj, u64 offset, void* buff, u64 len)
> +{
> +	struct request_queue *req_q = or->osd_dev->scsi_device->request_queue;
> +	struct bio *bio = bio_map_kern(req_q, buff, len, GFP_KERNEL);
> +
> +	if (!bio)
> +		return -ENOMEM;
> +
> +	_osd_read(or, obj, offset, bio);
> +	return 0;
> +}
> +
> +static void _osd_write(struct osd_request *or,
> +	const struct osd_obj_id *obj, uint64_t offset, struct bio *bio)
> +{
> +	osd_req_write(or, obj, bio, offset);
> +	EXOFS_DBGMSG("osd_req_write(p=%llX, ob=%llX, l=%llu, of=%llu)\n",
> +		_LLU(obj->partition), _LLU(obj->id), _LLU(bio->bi_size),
> +		_LLU(offset));
> +}
> +
> +#ifdef __KERNEL__
> +int osd_req_write_pages(struct osd_request *or,
> +	const struct osd_obj_id *obj, u64 offset, u64 length,
> +	struct page **pages, int page_count)
> +{
> +	struct request_queue *req_q = or->osd_dev->scsi_device->request_queue;
> +	struct bio *bio = _bio_map_pages(req_q, pages, page_count, length,
> +					 GFP_KERNEL);
> +
> +	if (!bio)
> +		return -ENOMEM;
> +
> +	_osd_write(or, obj, offset, bio);
> +	return 0;
> +}
> +#endif /* def __KERNEL__ */
> +
> +int osd_req_write_kern(struct osd_request *or,
> +	const struct osd_obj_id *obj, u64 offset, void* buff, u64 len)
> +{
> +	struct request_queue *req_q = or->osd_dev->scsi_device->request_queue;
> +	struct bio *bio = bio_map_kern(req_q, buff, len, GFP_KERNEL);
> +
> +	if (!bio)
> +		return -ENOMEM;
> +
> +	_osd_write(or, obj, offset, bio);
> +	return 0;
> +}
> -- 
> 1.6.0.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/8] exofs: Kbuild, Headers and osd utils
  2009-02-16  4:18   ` FUJITA Tomonori
@ 2009-02-16  8:49     ` Boaz Harrosh
  2009-02-16  9:00       ` FUJITA Tomonori
  0 siblings, 1 reply; 36+ messages in thread
From: Boaz Harrosh @ 2009-02-16  8:49 UTC (permalink / raw)
  To: FUJITA Tomonori
  Cc: avishay, jeff, akpm, linux-fsdevel, osd-dev, linux-kernel,
	James.Bottomley, jens.axboe, linux-scsi

FUJITA Tomonori wrote:
> On Mon,  9 Feb 2009 15:12:09 +0200
> Boaz Harrosh <bharrosh@panasas.com> wrote:
> 
>> This patch includes osd infrastructure that will be used later by
>> the file system.
>>
>> Also the declarations of constants, on disk structures,
>> and prototypes.
>>
>> And the Kbuild+Kconfig files needed to build the exofs module.
>>
>> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
>> ---
>>  fs/exofs/Kbuild   |   30 +++++++
>>  fs/exofs/Kconfig  |   13 +++
>>  fs/exofs/common.h |  181 +++++++++++++++++++++++++++++++++++++++++
>>  fs/exofs/exofs.h  |  139 ++++++++++++++++++++++++++++++++
>>  fs/exofs/osd.c    |  230 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  5 files changed, 593 insertions(+), 0 deletions(-)
>>  create mode 100644 fs/exofs/Kbuild
>>  create mode 100644 fs/exofs/Kconfig
>>  create mode 100644 fs/exofs/common.h
>>  create mode 100644 fs/exofs/exofs.h
>>  create mode 100644 fs/exofs/osd.c
> 
>> +static void _osd_read(struct osd_request *or,
>> +	const struct osd_obj_id *obj, uint64_t offset, struct bio *bio)
>> +{
>> +	osd_req_read(or, obj, bio, offset);
>> +	EXOFS_DBGMSG("osd_req_read(p=%llX, ob=%llX, l=%llu, of=%llu)\n",
>> +		_LLU(obj->partition), _LLU(obj->id), _LLU(bio->bi_size),
>> +		_LLU(offset));
>> +}
>> +
>> +#ifdef __KERNEL__
> 
> Hmm?
> 

Yep, this file also compiles in user mode.

>> +static struct bio *_bio_map_pages(struct request_queue *req_q,
>> +				  struct page **pages, unsigned page_count,
>> +				  size_t length, gfp_t gfp_mask)
>> +{
>> +	struct bio *bio;
>> +	int i;
>> +
>> +	bio = bio_alloc(gfp_mask, page_count);
>> +	if (!bio) {
>> +		EXOFS_DBGMSG("Failed to bio_alloc page_count=%d\n", page_count);
>> +		return NULL;
>> +	}
>> +
>> +	for (i = 0; i < page_count && length; i++) {
>> +		size_t use_len = min(length, PAGE_SIZE);
>> +
>> +		if (use_len !=
>> +		    bio_add_pc_page(req_q, bio, pages[i], use_len, 0)) {
>> +			EXOFS_ERR("Failed bio_add_pc_page req_q=%p pages[i]=%p "
>> +				  "use_len=%Zd page_count=%d length=%Zd\n",
>> +				  req_q, pages[i], use_len, page_count, length);
>> +			bio_put(bio);
>> +			return NULL;
>> +		}
>> +
>> +		length -= use_len;
>> +	}
>> +
>> +	WARN_ON(length);
>> +	return bio;
>> +}
> 
> 1) exofs builds bios by hand.
> 2) exofs passes bio to OSD SCSI ULD.
> 
> As a result, exofs and OSD SCSI ULD need to know the internal of bio,
> that is, you reinvent the bio handling infrastructure, as pointed out
> in another thread in scsi-ml.
> 
> _bio_map_pages is called where the VFS passes an array of a pointer to
> a page frame.
> 
> Why can't you simply pass the array to OSD SCSI ULD? Then OSD SCSI ULD
> can use the block layer helper functions to build a request out of
> pages without knowing the internal of bio.
> 
> 

Because this code is actually wrong and temporary, and will change soon.
At vfs write_pages I do not get a linear array of page pointers but a
linked list of pages. This will not fit any current model. Also, looking
ahead, I will have RAID 0, 1, 5, and 6 on objects of different devices. bio
is the perfect collector for memory information in this situation.

exofs is not the first or only file system that uses bios. Proof of
the matter is that block exports a bio submit routine.

As I said on the other thread, I could live without it for now, for a short while,
but I will regret it badly and it will hurt performance in the long run.

<snip>

Boaz

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/8] exofs: Kbuild, Headers and osd utils
  2009-02-16  8:49     ` Boaz Harrosh
@ 2009-02-16  9:00       ` FUJITA Tomonori
  2009-02-16  9:19         ` Boaz Harrosh
  0 siblings, 1 reply; 36+ messages in thread
From: FUJITA Tomonori @ 2009-02-16  9:00 UTC (permalink / raw)
  To: bharrosh
  Cc: fujita.tomonori, avishay, jeff, akpm, linux-fsdevel, osd-dev,
	linux-kernel, James.Bottomley, jens.axboe, linux-scsi

On Mon, 16 Feb 2009 10:49:39 +0200
Boaz Harrosh <bharrosh@panasas.com> wrote:

> FUJITA Tomonori wrote:
> > On Mon,  9 Feb 2009 15:12:09 +0200
> > Boaz Harrosh <bharrosh@panasas.com> wrote:
> > 
> >> This patch includes osd infrastructure that will be used later by
> >> the file system.
> >>
> >> Also the declarations of constants, on disk structures,
> >> and prototypes.
> >>
> >> And the Kbuild+Kconfig files needed to build the exofs module.
> >>
> >> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
> >> ---
> >>  fs/exofs/Kbuild   |   30 +++++++
> >>  fs/exofs/Kconfig  |   13 +++
> >>  fs/exofs/common.h |  181 +++++++++++++++++++++++++++++++++++++++++
> >>  fs/exofs/exofs.h  |  139 ++++++++++++++++++++++++++++++++
> >>  fs/exofs/osd.c    |  230 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>  5 files changed, 593 insertions(+), 0 deletions(-)
> >>  create mode 100644 fs/exofs/Kbuild
> >>  create mode 100644 fs/exofs/Kconfig
> >>  create mode 100644 fs/exofs/common.h
> >>  create mode 100644 fs/exofs/exofs.h
> >>  create mode 100644 fs/exofs/osd.c
> > 
> >> +static void _osd_read(struct osd_request *or,
> >> +	const struct osd_obj_id *obj, uint64_t offset, struct bio *bio)
> >> +{
> >> +	osd_req_read(or, obj, bio, offset);
> >> +	EXOFS_DBGMSG("osd_req_read(p=%llX, ob=%llX, l=%llu, of=%llu)\n",
> >> +		_LLU(obj->partition), _LLU(obj->id), _LLU(bio->bi_size),
> >> +		_LLU(offset));
> >> +}
> >> +
> >> +#ifdef __KERNEL__
> > 
> > Hmm?
> > 
> 
> Yep, this file also compiles in user mode.

Even if it does, is it a good thing to add __KERNEL__ to fs/*.c?


> >> +static struct bio *_bio_map_pages(struct request_queue *req_q,
> >> +				  struct page **pages, unsigned page_count,
> >> +				  size_t length, gfp_t gfp_mask)
> >> +{
> >> +	struct bio *bio;
> >> +	int i;
> >> +
> >> +	bio = bio_alloc(gfp_mask, page_count);
> >> +	if (!bio) {
> >> +		EXOFS_DBGMSG("Failed to bio_alloc page_count=%d\n", page_count);
> >> +		return NULL;
> >> +	}
> >> +
> >> +	for (i = 0; i < page_count && length; i++) {
> >> +		size_t use_len = min(length, PAGE_SIZE);
> >> +
> >> +		if (use_len !=
> >> +		    bio_add_pc_page(req_q, bio, pages[i], use_len, 0)) {
> >> +			EXOFS_ERR("Failed bio_add_pc_page req_q=%p pages[i]=%p "
> >> +				  "use_len=%Zd page_count=%d length=%Zd\n",
> >> +				  req_q, pages[i], use_len, page_count, length);
> >> +			bio_put(bio);
> >> +			return NULL;
> >> +		}
> >> +
> >> +		length -= use_len;
> >> +	}
> >> +
> >> +	WARN_ON(length);
> >> +	return bio;
> >> +}
> > 
> > 1) exofs builds bios by hand.
> > 2) exofs passes bio to OSD SCSI ULD.
> > 
> > As a result, exofs and OSD SCSI ULD need to know the internal of bio,
> > that is, you reinvent the bio handling infrastructure, as pointed out
> > in another thread in scsi-ml.
> > 
> > _bio_map_pages is called where the VFS passes an array of a pointer to
> > a page frame.
> > 
> > Why can't you simply pass the array to OSD SCSI ULD? Then OSD SCSI ULD
> > can use the block layer helper functions to build a request out of
> > pages without knowing the internal of bio.
> > 
> > 
> 
> Because actually this code is wrong and temporary and will change soon.
> At vfs write_pages I do not get a linear array of page pointers but a
> link-list of pages. This will not fit any current model.

Then, why can't you pass a linked list of pages?


> Also looking
> ahead I will have RAID 0, 1, 5, and 6 on objects of different devices. bio
> is the perfect collector for memory information in this situation.

You will add such features to exofs, handling multiple devices
internally?


> exofs is not the first and only file system who is using bios. Proof of
> the matter is that block exports a bio submit routine.

Seems that exofs just passes pages and the ULD sends a SCSI command
including these pages. I don't see how exofs needs to handle bio
directly.


> As I said on the other thread, I could live without it for now, for a short while,
> but I will regret it badly and it will hurt performance in the long run.
> 
> <snip>
> 
> Boaz
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/8] exofs: Kbuild, Headers and osd utils
  2009-02-16  9:00       ` FUJITA Tomonori
@ 2009-02-16  9:19         ` Boaz Harrosh
  2009-02-16  9:27           ` Jeff Garzik
  2009-02-16  9:38           ` [PATCH 1/8] exofs: Kbuild, Headers and osd utils FUJITA Tomonori
  0 siblings, 2 replies; 36+ messages in thread
From: Boaz Harrosh @ 2009-02-16  9:19 UTC (permalink / raw)
  To: FUJITA Tomonori
  Cc: avishay, jeff, akpm, linux-fsdevel, osd-dev, linux-kernel,
	James.Bottomley, jens.axboe, linux-scsi

FUJITA Tomonori wrote:
> On Mon, 16 Feb 2009 10:49:39 +0200
> Boaz Harrosh <bharrosh@panasas.com> wrote:
> 
>> FUJITA Tomonori wrote:
>>> On Mon,  9 Feb 2009 15:12:09 +0200
>>> Boaz Harrosh <bharrosh@panasas.com> wrote:
>>>
>>>> This patch includes osd infrastructure that will be used later by
>>>> the file system.
>>>>
>>>> Also the declarations of constants, on disk structures,
>>>> and prototypes.
>>>>
>>>> And the Kbuild+Kconfig files needed to build the exofs module.
>>>>
>>>> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
>>>> ---
>>>>  fs/exofs/Kbuild   |   30 +++++++
>>>>  fs/exofs/Kconfig  |   13 +++
>>>>  fs/exofs/common.h |  181 +++++++++++++++++++++++++++++++++++++++++
>>>>  fs/exofs/exofs.h  |  139 ++++++++++++++++++++++++++++++++
>>>>  fs/exofs/osd.c    |  230 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  5 files changed, 593 insertions(+), 0 deletions(-)
>>>>  create mode 100644 fs/exofs/Kbuild
>>>>  create mode 100644 fs/exofs/Kconfig
>>>>  create mode 100644 fs/exofs/common.h
>>>>  create mode 100644 fs/exofs/exofs.h
>>>>  create mode 100644 fs/exofs/osd.c
>>>> +static void _osd_read(struct osd_request *or,
>>>> +	const struct osd_obj_id *obj, uint64_t offset, struct bio *bio)
>>>> +{
>>>> +	osd_req_read(or, obj, bio, offset);
>>>> +	EXOFS_DBGMSG("osd_req_read(p=%llX, ob=%llX, l=%llu, of=%llu)\n",
>>>> +		_LLU(obj->partition), _LLU(obj->id), _LLU(bio->bi_size),
>>>> +		_LLU(offset));
>>>> +}
>>>> +
>>>> +#ifdef __KERNEL__
>>> Hmm?
>>>
>> Yep, this file also compiles in user mode.
> 
> Even if you do, it's a good thing to add __KERNEL__ to fs/*.c?
> 
> 
>>>> +static struct bio *_bio_map_pages(struct request_queue *req_q,
>>>> +				  struct page **pages, unsigned page_count,
>>>> +				  size_t length, gfp_t gfp_mask)
>>>> +{
>>>> +	struct bio *bio;
>>>> +	int i;
>>>> +
>>>> +	bio = bio_alloc(gfp_mask, page_count);
>>>> +	if (!bio) {
>>>> +		EXOFS_DBGMSG("Failed to bio_alloc page_count=%d\n", page_count);
>>>> +		return NULL;
>>>> +	}
>>>> +
>>>> +	for (i = 0; i < page_count && length; i++) {
>>>> +		size_t use_len = min(length, PAGE_SIZE);
>>>> +
>>>> +		if (use_len !=
>>>> +		    bio_add_pc_page(req_q, bio, pages[i], use_len, 0)) {
>>>> +			EXOFS_ERR("Failed bio_add_pc_page req_q=%p pages[i]=%p "
>>>> +				  "use_len=%Zd page_count=%d length=%Zd\n",
>>>> +				  req_q, pages[i], use_len, page_count, length);
>>>> +			bio_put(bio);
>>>> +			return NULL;
>>>> +		}
>>>> +
>>>> +		length -= use_len;
>>>> +	}
>>>> +
>>>> +	WARN_ON(length);
>>>> +	return bio;
>>>> +}
>>> 1) exofs builds bios by hand.
>>> 2) exofs passes bio to OSD SCSI ULD.
>>>
>>> As a result, exofs and OSD SCSI ULD need to know the internal of bio,
>>> that is, you reinvent the bio handling infrastructure, as pointed out
>>> in another thread in scsi-ml.
>>>
>>> _bio_map_pages is called where the VFS passes an array of a pointer to
>>> a page frame.
>>>
>>> Why can't you simply pass the array to OSD SCSI ULD? Then OSD SCSI ULD
>>> can use the block layer helper functions to build a request out of
>>> pages without knowing the internal of bio.
>>>
>>>
>> Because actually this code is wrong and temporary and will change soon.
>> At vfs write_pages I do not get a linear array of page pointers but a
>> link-list of pages. This will not fit any current model.
> 
> Then, why can't you pass a link-list of pages?
> 

What? How would I do that? I mean, how do I get from a linked list of pages
to a request?

> 
>> Also looking
>> ahead I will have RAID 0, 1, 5, and 6 on objects of different devices. bio
>> is the perfect collector for memory information in this situation.
> 
> You will add such features to exofs, handling multiple devices
> internally?
> 

Multiple objects on multiple devices, yes.

> 
>> exofs is not the first and only file system who is using bios. Proof of
>> the matter is that block exports a bio submit routine.
> 
> Seems that exofs just passes pages and the ULD sends a SCSI command
> including these pages. I don't see how exofs needs to handle bio
> directly.
> 

How do you propose to collect these pages and keep them, without allocating
an extra list, without pre-allocating a struct request, and without re-inventing
the bio structure?

> 
>> As I said on the other thread, I could live without it for now, for a short while,
>> but I will regret it badly and it will hurt performance in the long run.
>>
>> <snip>
>>
>> Boaz

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/8] exofs: Kbuild, Headers and osd utils
  2009-02-16  9:19         ` Boaz Harrosh
@ 2009-02-16  9:27           ` Jeff Garzik
  2009-02-16 10:19             ` Boaz Harrosh
  2009-02-16  9:38           ` [PATCH 1/8] exofs: Kbuild, Headers and osd utils FUJITA Tomonori
  1 sibling, 1 reply; 36+ messages in thread
From: Jeff Garzik @ 2009-02-16  9:27 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: FUJITA Tomonori, avishay, akpm, linux-fsdevel, osd-dev,
	linux-kernel, James.Bottomley, jens.axboe, linux-scsi

Boaz Harrosh wrote:
> FUJITA Tomonori wrote:
>> Boaz Harrosh <bharrosh@panasas.com> wrote:
>>> Also looking
>>> ahead I will have RAID 0, 1, 5, and 6 on objects of different devices. bio
>>> is the perfect collector for memory information in this situation.

>> You will add such features to exofs, handling multiple devices
>> internally?

> Multiple objects on Multiple devices, Yes.

That sort of feature does not belong in exofs, but somewhere separate.
Ideally we should be able to share "MD for OSD" with other OSD
filesystems, and with the "osdblk" device that I will produce once libosd
hits upstream.

	Jeff





^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 5/8] exofs: dir_inode and directory operations
  2009-02-15 17:08   ` Evgeniy Polyakov
@ 2009-02-16  9:31     ` Boaz Harrosh
  2009-03-15 18:10       ` Boaz Harrosh
  0 siblings, 1 reply; 36+ messages in thread
From: Boaz Harrosh @ 2009-02-16  9:31 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Avishay Traeger, Jeff Garzik, Andrew Morton, linux-fsdevel,
	open-osd, linux-kernel, James Bottomley

Evgeniy Polyakov wrote:
> Hi.
> 
> On Mon, Feb 09, 2009 at 03:24:13PM +0200, Boaz Harrosh (bharrosh@panasas.com) wrote:
> 
>> +void exofs_set_link(struct inode *dir, struct exofs_dir_entry *de,
>> +			struct page *page, struct inode *inode)
>> +{
>> +	loff_t pos = page_offset(page) +
>> +			(char *) de - (char *) page_address(page);
>> +	unsigned len = le16_to_cpu(de->rec_len);
>> +	int err;
>> +
>> +	lock_page(page);
>> +	err = exofs_write_begin(NULL, page->mapping, pos, len,
>> +				AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
>> +	BUG_ON(err);
> 
> How unfriendly :)
> simple_write_begin() may fail if there is no memory or appropriate
> cgroup does not allow to charge more memory.
> 

You are right on the money. I'll go and revisit all the BUGs and BUG_ONs.
Thanks, good catch.

>> +		unlock_page(page);
>> +		exofs_put_page(page);
>> +	}
>> +	BUG();
>> +	return -EINVAL;
>> +
> 
> So it will crash the system if directory entry does not contain any
> data? What was wrong with -EINVAL?
> 

Yes, thanks, will fix

> Also, dir_pages(), readpage_done() and similar functions scream for less
> generic names, and at least dir_pages() is already implemented in another
> 5 filesystems.
> 

I will fix that too. I thought I changed all these, I must have missed
a few.

We are so used to filesystems being loadable modules that we never try to
compile a few in-kernel.

>> +int exofs_delete_entry(struct exofs_dir_entry *dir, struct page *page)
>> +{
>> +	struct address_space *mapping = page->mapping;
>> +	struct inode *inode = mapping->host;
>> +	struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
>> +	char *kaddr = page_address(page);
>> +	unsigned from = ((char *)dir - kaddr) & ~(exofs_chunk_size(inode)-1);
>> +	unsigned to = ((char *)dir - kaddr) + le16_to_cpu(dir->rec_len);
>> +	loff_t pos;
>> +	struct exofs_dir_entry *pde = NULL;
>> +	struct exofs_dir_entry *de = (struct exofs_dir_entry *) (kaddr + from);
>> +	int err;
>> +
>> +	while ((char *)de < (char *)dir) {
> 
> They have the same type, why is it needed to cast them to char pointer?
> 

Will fix, thanks

>> +		if (de->rec_len == 0) {
>> +			EXOFS_ERR("ERROR: exofs_delete_entry:"
>> +				"zero-length directory entry");
>> +			err = -EIO;
>> +			goto out;
>> +		}
>> +		pde = de;
>> +		de = exofs_next_entry(de);
>> +	}
>> +	if (pde)
>> +		from = (char *)pde - (char *)page_address(page);
>> +	pos = page_offset(page) + from;
>> +	lock_page(page);
>> +	err = exofs_write_begin(NULL, page->mapping, pos, to - from, 0,
>> +							&page, NULL);
>> +	BUG_ON(err);
> 
> Ugh, in the exofs_make_empty() it is handled without so visible
> pain.
> 

Yep, will fix

>> +	if (pde)
>> +		pde->rec_len = cpu_to_le16(to - from);
>> +	dir->inode_no = 0;
>> +	err = exofs_commit_chunk(page, pos, to - from);
>> +	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
>> +	mark_inode_dirty(inode);
>> +	sbi->s_numfiles--;
>> +out:
>> +	exofs_put_page(page);
>> +	return err;
>> +}

<snip>
>> +
>> +	atomic_inc(&inode->i_count);
>> +
>> +	ret = exofs_async_op(or, create_done, inode, oi->i_cred);
>> +	if (ret) {
>> +		atomic_dec(&inode->i_count);
> 
> igrab()/iput()?
> 

Thanks, makes much more sense. Sorry, leftovers from 2.6.10.

>> +		osd_end_request(or);
>> +		return ERR_PTR(-EIO);
>> +	}
>> +	atomic_inc(&sbi->s_curr_pending);
>> +
>> +	return inode;
>> +}
> 
>> +static int exofs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
>> +{
>> +	struct inode *inode;
>> +	int err = -EMLINK;
>> +
>> +	if (dir->i_nlink >= EXOFS_LINK_MAX)
>> +		goto out;
>> +
>> +	inode_inc_link_count(dir);
>> +
>> +	inode = exofs_new_inode(dir, S_IFDIR | mode);
>> +	err = PTR_ERR(inode);
>> +	if (IS_ERR(inode))
>> +		goto out_dir;
>> +
>> +	inode->i_op = &exofs_dir_inode_operations;
>> +	inode->i_fop = &exofs_dir_operations;
>> +	inode->i_mapping->a_ops = &exofs_aops;
>> +
>> +	inode_inc_link_count(inode);
>> +
>> +	err = exofs_make_empty(inode, dir);
>> +	if (err)
>> +		goto out_fail;
>> +
>> +	err = exofs_add_link(dentry, inode);
>> +	if (err)
>> +		goto out_fail;
>> +
>> +	d_instantiate(dentry, inode);
>> +out:
>> +	return err;
>> +
>> +out_fail:
>> +	inode_dec_link_count(inode);
>> +	inode_dec_link_count(inode);
> 
> Why two decrements, will it be ok after exofs_make_empty() fail when it
> was incremented only once?
> 

That's hard to say, I'll investigate it some more.
Thanks

>> +	iput(inode);
>> +out_dir:
>> +	inode_dec_link_count(dir);
>> +	goto out;
>> +}
> 

Most valuable input, thank you for taking the time to review.

Boaz

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/8] exofs: Kbuild, Headers and osd utils
  2009-02-16  9:19         ` Boaz Harrosh
  2009-02-16  9:27           ` Jeff Garzik
@ 2009-02-16  9:38           ` FUJITA Tomonori
  2009-02-16 10:29             ` Boaz Harrosh
  1 sibling, 1 reply; 36+ messages in thread
From: FUJITA Tomonori @ 2009-02-16  9:38 UTC (permalink / raw)
  To: bharrosh
  Cc: fujita.tomonori, avishay, jeff, akpm, linux-fsdevel, osd-dev,
	linux-kernel, James.Bottomley, jens.axboe, linux-scsi

On Mon, 16 Feb 2009 11:19:21 +0200
Boaz Harrosh <bharrosh@panasas.com> wrote:

> >> Also looking
> >> ahead I will have RAID 0, 1, 5, and 6 on objects of different devices. bio
> >> is the perfect collector for memory information in this situation.
> > 
> > You will add such features to exofs, handling multiple devices
> > internally?
> > 
> 
> Multiple objects on Multiple devices, Yes.

I thought that exofs is kinda example (reference) file system.

Nobody has seen your code. Let's discuss when we have the
code. Over-designing for what we've not seen is not a good idea.


> >> exofs is not the first and only file system who is using bios. Proof of
> >> the matter is that block exports a bio submit routine.
> > 
> > Seems that exofs just passes pages and the ULD sends a SCSI command
> > including these pages. I don't see how exofs needs to handle bio
> > directly.
> > 
> 
> How do you propose to collect these pages? and keep them without allocating
> an extra list? without pre-allocating a struct request? and without re-inventing
> the bio structure?

I don't think that allocating an extra list (or something) to keep
them hurts performance. We can talk about it when you have the real
performance results.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 6/8] exofs: super_operations and file_system_type
  2009-02-15 17:24   ` Evgeniy Polyakov
@ 2009-02-16  9:59     ` Boaz Harrosh
  0 siblings, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2009-02-16  9:59 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Avishay Traeger, Jeff Garzik, Andrew Morton, linux-fsdevel,
	open-osd, linux-kernel, James Bottomley

Evgeniy Polyakov wrote:
> Hi.
> 

Hi

> On Mon, Feb 09, 2009 at 03:25:53PM +0200, Boaz Harrosh (bharrosh@panasas.com) wrote:
>> +		case Opt_to:
>> +			if (match_int(&args[0], &option))
>> +				return -EINVAL;
>> +			if (option <= 0) {
>> +				EXOFS_ERR("Timout must be > 0");
>> +				return -EINVAL;
>> +			}
>> +			opts->timeout = option * HZ;
> 
> Is it intentional to be a different timeout on systems with different HZ
> but the same mount option?
> 

Doesn't "option * HZ" mean that "option" is in seconds? Multiplying by HZ
converts seconds to jiffies, so the wall-clock timeout is the same whatever
HZ is.
Anyway, I would not bother; it is an undocumented option for debugging only.
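
To illustrate the point about units, here is a minimal userspace sketch. It
assumes the tick rate is passed in as a plain parameter (in the kernel, HZ is
a build-time constant), and secs_to_ticks() and ticks_to_msecs() are
hypothetical helper names, not exofs code. It shows that a value in seconds
multiplied by the tick rate gives a different tick count per HZ but the same
wall-clock duration.

```c
#include <assert.h>

/* Hypothetical helper: convert a timeout in seconds to timer ticks. */
static long secs_to_ticks(long seconds, long hz)
{
	assert(seconds > 0 && hz > 0);
	return seconds * hz;
}

/* Hypothetical helper: wall-clock duration, in ms, of a tick count. */
static long ticks_to_msecs(long ticks, long hz)
{
	assert(hz > 0);
	return ticks * 1000 / hz;
}
```

With a 5 second option, a rate of 100 gives 500 ticks and a rate of 1000 gives
5000 ticks; both convert back to 5000 ms.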

>> +static struct inode *exofs_alloc_inode(struct super_block *sb)
>> +{
>> +	struct exofs_i_info *oi;
>> +
>> +	oi = kmem_cache_alloc(exofs_inode_cachep, GFP_KERNEL);
> 
> I'm curious if this should be GFP_NOFS or not?
> 

Currently none of the OSD transports (Shhhh ... including the osd initiator
library) are SWAP safe.

I will revisit all that, far in the future, when I need SWAPyness.

>> +	if (!oi)
>> +		return NULL;
>> +
>> +	oi->vfs_inode.i_version = 1;
>> +	return &oi->vfs_inode;
>> +}
> 
>> +static void exofs_put_super(struct super_block *sb)
>> +{
>> +	int num_pend;
>> +	struct exofs_sb_info *sbi = sb->s_fs_info;
>> +
>> +	/* make sure there are no pending commands */
>> +	for (num_pend = atomic_read(&sbi->s_curr_pending); num_pend > 0;
>> +	     num_pend = atomic_read(&sbi->s_curr_pending)) {
> 
> This rises a question. Let's check exofs_new_inode() for example (it is
> a bad example, since inode can not be created when we already in the
> put_super() callback, but still there are others), it increments
> s_curr_pending way after inode was created, so is it possible that
> some in-flight callback is about to be executed and its subsequent
> s_curr_pending manipulation will not be detected by this loop?
> 
> Should s_curr_pending increment be audited all over the code to be
> increased before the potential postponing command starts (which is not
> the case in exofs_new_inode() above)?
> 

I have experimented with this a bit in the past, and did not find any
problems. To the best of my understanding, I'm protected by the VFS.
If the FS is busy, i.e. file handles are open, it will refuse an unmount. Once
all handles are closed, it will remove visibility from the FS and only then
unmount. So the loop above will only wait for old commands. Actually, I put
prints in there and I never once got a non-zero count of pending commands.

It is hard to test because it is hard to hit the time slot after all file
handles are closed, but with heavy pending IO, and before the umount kicks
in.

Also note that I've decided to fsync on file close so I think most/all IO
should be finished by the time a file is closed.
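
For reference, the ordering Evgeniy is asking about can be sketched in
userspace as below. This is a single-threaded simulation under stated
assumptions, not exofs code: struct sb_info, submit_async(), complete_one()
and drain() are hypothetical names, and the "transport" is just an array of
queued completions. It does not reproduce a race; it only shows the counter
being raised before the command is handed off, so a drain loop can never see
zero while a completion is still queued.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_INFLIGHT 16

/* Hypothetical stand-in for the superblock info: only the counter. */
struct sb_info {
	atomic_int curr_pending;
};

/* Simulated transport: completions that are queued but not yet run. */
static struct sb_info *inflight[MAX_INFLIGHT];
static size_t n_inflight;

/* Completion callback: the command is done, so drop the count. */
static void cmd_done(struct sb_info *sbi)
{
	atomic_fetch_sub(&sbi->curr_pending, 1);
}

/* Count the command first, and only then hand it to the transport. */
static int submit_async(struct sb_info *sbi)
{
	assert(sbi != NULL);
	if (n_inflight == MAX_INFLIGHT)
		return -1;
	atomic_fetch_add(&sbi->curr_pending, 1);
	inflight[n_inflight++] = sbi;
	return 0;
}

/* Run one queued completion; returns 0 if nothing was queued. */
static int complete_one(void)
{
	if (n_inflight == 0)
		return 0;
	cmd_done(inflight[--n_inflight]);
	return 1;
}

/* Unmount-style drain: returns how many completions it waited for. */
static int drain(struct sb_info *sbi)
{
	int waited = 0;

	while (atomic_load(&sbi->curr_pending) > 0) {
		if (!complete_one())
			break;	/* counter is stuck: nothing left to run */
		waited++;
	}
	return waited;
}
```

If the increment were moved after the hand-off, a real asynchronous transport
could complete the command in between, which is the window the question is
about.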

>> +		wait_queue_head_t wq;
>> +		init_waitqueue_head(&wq);
>> +		wait_event_timeout(wq,
>> +				  (atomic_read(&sbi->s_curr_pending) == 0),
>> +				  msecs_to_jiffies(100));
>> +	}
>> +
>> +	osduld_put_device(sbi->s_dev);
>> +	kfree(sb->s_fs_info);
>> +	sb->s_fs_info = NULL;
>> +}
> 

Thanks, I appreciate your comments; they make me think.

Boaz

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/8] exofs: Kbuild, Headers and osd utils
  2009-02-16  9:27           ` Jeff Garzik
@ 2009-02-16 10:19             ` Boaz Harrosh
  2009-02-16 11:05               ` pNFS rant (was Re: [PATCH 1/8] exofs: Kbuild, Headers and osd utils) Jeff Garzik
  0 siblings, 1 reply; 36+ messages in thread
From: Boaz Harrosh @ 2009-02-16 10:19 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: FUJITA Tomonori, avishay, akpm, linux-fsdevel, osd-dev,
	linux-kernel, James.Bottomley, jens.axboe, linux-scsi

Jeff Garzik wrote:
> Boaz Harrosh wrote:
>> FUJITA Tomonori wrote:
>>> Boaz Harrosh <bharrosh@panasas.com> wrote:
>>>> Also looking
>>>> ahead I will have RAID 0, 1, 5, and 6 on objects of different devices. bio
>>>> is the perfect collector for memory information in this situation.
> 
>>> You will add such features to exofs, handling multiple devices
>>> internally?
> 
>> Multiple objects on Multiple devices, Yes.
> 
> That sort of feature does not belong in exofs, but somewhat separate. 
> Ideally we should be able to share "MD for OSD" with other OSD 
> filesystems, and the "osdblk" device that I will produce once libosd 
> hits upstream.
> 

No can do. exofs is meant to be a reference implementation of a pNFS-objects
file serving system. Have you read the spec of the pNFS-objects layout? It
defines RAID 0, 1, 5, and 6. In pNFS the MDS is supposed to be able to write
the data for its clients as NFS, so it needs to have all the infrastructure
and knowledge of a client pNFS-objects layout driver.

But don't worry, the plan is that the layout driver and exofs will reuse all
the same library code that does all that. There will not be a single line of
duplicate code.

In fact one of the things I wanted to talk about at LSF is a generic, BIO based
(or something else) RAID engine that could be used by all RAIDers in the Kernel:
DM, MD, btrfs, exofs pNFS-objects, TUX3, ZFS and so on. And I don't mean just the
low-level memory-pointer XOR functions, but the higher level of memory
splitters/collectors, abstract-device lists, and RAID description structures.
(Because RAIDs can be stacked like 10, 50, and all kinds of crazy things.)
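
To make "RAID description structures" and "splitters" a bit more concrete,
here is a minimal sketch of a RAID 0 address mapping. The structures and the
raid0_map() name are hypothetical, invented for illustration; they are not
taken from exofs, the pNFS-objects spec, MD or DM. Given a stripe unit and a
device count, it maps a logical byte offset to a component index and an
offset inside that component.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical RAID 0 description. */
struct stripe_layout {
	uint64_t stripe_unit;	/* bytes placed on one device per turn */
	unsigned num_devices;	/* number of component objects/devices */
};

/* Where one logical byte lands. */
struct stripe_target {
	unsigned dev_index;	/* which component holds the byte */
	uint64_t dev_offset;	/* byte offset inside that component */
};

static struct stripe_target raid0_map(const struct stripe_layout *l,
				      uint64_t offset)
{
	struct stripe_target t;
	uint64_t unit_no;

	assert(l->stripe_unit > 0 && l->num_devices > 0);

	/* Index of the stripe unit that contains this byte. */
	unit_no = offset / l->stripe_unit;

	/* Units are dealt round-robin across the devices. */
	t.dev_index = (unsigned)(unit_no % l->num_devices);

	/* Full units already on this device, plus the offset in the unit. */
	t.dev_offset = (unit_no / l->num_devices) * l->stripe_unit
		       + offset % l->stripe_unit;
	return t;
}
```

With a 4096 byte stripe unit over 3 devices, logical offset 12388 falls in
unit 3, so it maps to device 0 at offset 4196. A stacked layout such as RAID
10 would apply a second mapping of the same kind on top of this one.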

> 	Jeff
> 
> 
> 
> 

Boaz

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/8] exofs: Kbuild, Headers and osd utils
  2009-02-16  9:38           ` [PATCH 1/8] exofs: Kbuild, Headers and osd utils FUJITA Tomonori
@ 2009-02-16 10:29             ` Boaz Harrosh
  2009-02-17  0:20               ` FUJITA Tomonori
  0 siblings, 1 reply; 36+ messages in thread
From: Boaz Harrosh @ 2009-02-16 10:29 UTC (permalink / raw)
  To: FUJITA Tomonori
  Cc: avishay, jeff, akpm, linux-fsdevel, osd-dev, linux-kernel,
	James.Bottomley, jens.axboe, linux-scsi

FUJITA Tomonori wrote:
> On Mon, 16 Feb 2009 11:19:21 +0200
> Boaz Harrosh <bharrosh@panasas.com> wrote:
> 
>>>> Also looking
>>>> ahead I will have RAID 0, 1, 5, and 6 on objects of different devices. bio
>>>> is the perfect collector for memory information in this situation.
>>> You will add such features to exofs, handling multiple devices
>>> internally?
>>>
>> Multiple objects on Multiple devices, Yes.
> 
> I thought that exofs is kinda example (reference) file system.
> 
> Nobody has seen your code. Let's discuss when we have the
> code. Over-designing for what we've not seen is not a good idea.
> 

Thanks for the insults, and high credit ;)

Yes, it's a "kinda example (reference) file system" for a pNFS-objects
file system. What can I do? Life is tough.

> 
>>>> exofs is not the first and only file system who is using bios. Proof of
>>>> the matter is that block exports a bio submit routine.
>>> Seems that exofs just passes pages and the ULD sends a SCSI command
>>> including these pages. I don't see how exofs needs to handle bio
>>> directly.
>>>
>> How do you propose to collect these pages? and keep them without allocating
>> an extra list? without pre-allocating a struct request? and without re-inventing
>> the bio structure?
> 
> I don't think that allocating an extra list (or something) to keep
> them hurts performance. We can talk about it when you have the real
> performance results.

So you are the one who is starting to reinvent the wheel here. I thought I was
the one doing that, though you only called me names and never showed
me where.

But please answer one question for me, and please don't write back if you do
not answer it:

Why are other filesystems allowed to use bios? Are they going to stop? Who is
going to remove that?

And as I said, I am going to remove it for now, please be patient. You have
never heard me refuse to do it, have you?

Boaz

^ permalink raw reply	[flat|nested] 36+ messages in thread

* pNFS rant (was Re: [PATCH 1/8] exofs: Kbuild, Headers and osd utils)
  2009-02-16 10:19             ` Boaz Harrosh
@ 2009-02-16 11:05               ` Jeff Garzik
  2009-02-16 12:45                 ` Boaz Harrosh
                                   ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Jeff Garzik @ 2009-02-16 11:05 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: FUJITA Tomonori, avishay, akpm, linux-fsdevel, osd-dev,
	linux-kernel, James.Bottomley, jens.axboe, linux-scsi

Boaz Harrosh wrote:
> No can do. exofs is meant to be a reference implementation of a pNFS-objects
> file serving system. Have you read the spec of pNFS-objects layout? they define
> RAID 0, 1, 5, and 6. In pNFS the MDS is suppose to be able to write the data
> for its clients as NFS, so it needs to have all the infra structure and knowledge
> of an Client pNFS-object layout drive.

Yes, I have studied pNFS!  I plan to add v4.1 and pNFS support to my NFS 
server, once v4.0 support is working well.


pNFS The Theory:   is wise and necessary:  permit clients to directly 
connect to data storage, rather than copying through the metadata 
server(s).  This is what every distributed filesystem is doing these 
days -- direct to data server for bulk data read/write.

pNFS The Specification:   is an utter piece of shit.  I can only presume 
some shady backroom deal in a smoke-filled room was the reason this saw 
the light of day.


In a sane world, NFS clients would speak... NFS.

In the crazy world of pNFS, NFS clients are now forced to speak NFS, 
SCSI, RAID, and any number of proprietary layout types.  When will HTTP 
be added to the list?  :)

But anything beyond the NFS protocol for talking client <-> data servers 
is code bloat complexity madness for an NFS client that wishes to be 
compatible with "most of the NFS 4.1 world".

An ideal NFS client for pNFS should be asked to do these two things, and 
nothing more:

1) send metadata transactions to one or more metadata servers, using 
well-known NFS protocol

2) send data to one or more data servers, using well-known NFS protocol 
subset designed for storage (v4.1, section 13.6)

But no.

pNFS has forced a huge complexity on the NFS client, by permitting an 
unbounded number of network protocols.  A "layout plugin" layer is 
required.  SCSI and OSD support are REQUIRED for any reasonably 
compatible setup going forward.

But even more than the technical complexity, this is the first time in 
NFS history that NFS has required a protocol besides... NFS.

pNFS means that a useful, compatible NFS client must know all these 
storage protocols, in addition to NFS.

Furthermore, enabling proprietary layout types means that it is easy for 
a "compatible" v4.1 client to be denied parallel access to data 
available to other "compatible" v4.1 clients:

	Client A: Linux, fully open source

	Client B: Linux, with closed source module for
		  layout type SuperWhizBang storage

	Both Client A and Client B can claim to be NFS v4.1 and pNFS
	compatible,
	yet Client A must read data through the metadata
	server because it lacks the SuperWhizBang storage plugin.
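
The Client A / Client B scenario can be sketched as a small lookup. This is
an illustration under assumptions, not client code: the three LAYOUT4_*
names and numbers are the layout types named in the NFSv4.1 specification,
while LAYOUT_SUPERWHIZBANG, its arbitrary value, and the client_supports()
and choose_io_path() helpers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int layout_type_t;

/* Layout types named in the NFSv4.1 specification. */
#define LAYOUT4_NFSV4_1_FILES	1u
#define LAYOUT4_OSD2_OBJECTS	2u
#define LAYOUT4_BLOCK_VOLUME	3u
/* Hypothetical vendor layout type; the value is arbitrary. */
#define LAYOUT_SUPERWHIZBANG	0x80000001u

enum io_path {
	IO_VIA_MDS,	/* fall back to plain NFS through the metadata server */
	IO_DIRECT	/* talk to the data storage in parallel */
};

/* Does this client have a layout driver for the given type? */
static int client_supports(const layout_type_t *drivers, size_t n,
			   layout_type_t type)
{
	size_t i;

	assert(drivers != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (drivers[i] == type)
			return 1;
	return 0;
}

/* Pick the I/O path for a file whose server offers one layout type. */
static enum io_path choose_io_path(const layout_type_t *drivers, size_t n,
				   layout_type_t offered)
{
	return client_supports(drivers, n, offered) ? IO_DIRECT : IO_VIA_MDS;
}
```

A client with only the files and objects drivers gets IO_VIA_MDS for a
SuperWhizBang server, while a client that also carries the vendor driver
gets IO_DIRECT, even though both advertise v4.1 and pNFS.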

pNFS means a never-ending arms race for the best storage layout, where 
NFS clients are inevitably compatible with a __random subset__ of total 
available layout types.  pNFS will be a continuing train wreck of 
fly-by-night storage companies, and their pet layout types & storage 
protocols.

It is a support nightmare, an admin nightmare, a firewall nightmare, a 
client implementor's nightmare, but a storage vendor's wet dream.

NFS was never beautiful, but at least until v4.0 it was well known and 
widely cross-compatible.  And only required one network protocol: NFS.

	Jeff




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: pNFS rant (was Re: [PATCH 1/8] exofs: Kbuild, Headers and osd utils)
  2009-02-16 11:05               ` pNFS rant (was Re: [PATCH 1/8] exofs: Kbuild, Headers and osd utils) Jeff Garzik
@ 2009-02-16 12:45                 ` Boaz Harrosh
  2009-02-16 15:50                 ` James Bottomley
  2009-02-16 16:23                 ` Benny Halevy
  2 siblings, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2009-02-16 12:45 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: FUJITA Tomonori, avishay, akpm, linux-fsdevel, osd-dev,
	linux-kernel, James.Bottomley, jens.axboe, linux-scsi

Jeff Garzik wrote:
> Boaz Harrosh wrote:
>> No can do. exofs is meant to be a reference implementation of a pNFS-objects
>> file serving system. Have you read the spec of pNFS-objects layout? they define
>> RAID 0, 1, 5, and 6. In pNFS the MDS is suppose to be able to write the data
>> for its clients as NFS, so it needs to have all the infra structure and knowledge
>> of an Client pNFS-object layout drive.
> 
> Yes, I have studied pNFS!  I plan to add v4.1 and pNFS support to my NFS 
> server, once v4.0 support is working well.
> 
> 
> pNFS The Theory:   is wise and necessary:  permit clients to directly 
> connect to data storage, rather than copying through the metadata 
> server(s).  This is what every distributed filesystem is doing these 
> days -- direct to data server for bulk data read/write.
> 
> pNFS The Specification:   is an utter piece of shit.  I can only presume 
> some shady backroom deal in a smoke-filled room was the reason this saw 
> the light of day.
> 
> 
> In a sane world, NFS clients would speak... NFS.
> 
> In the crazy world of pNFS, NFS clients are now forced to speak NFS, 
> SCSI, RAID, and any number of proprietary layout types.  When will HTTP 
> be added to the list?  :)
> 
> But anything beyond the NFS protocol for talking client <-> data servers 
> is code bloat complexity madness for an NFS client that wishes to be 
> compatible with "most of the NFS 4.1 world".
> 
> An ideal NFS client for pNFS should be asked to do these two things, and 
> nothing more:
> 
> 1) send metadata transactions to one or more metadata servers, using 
> well-known NFS protocol
> 
> 2) send data to one or more data servers, using well-known NFS protocol 
> subset designed for storage (v4.1, section 13.6)
> 
> But no.
> 
> pNFS has forced a huge complexity on the NFS client, by permitting an 
> unbounded number of network protocols.  A "layout plugin" layer is 
> required.  SCSI and OSD support are REQUIRED for any reasonably 
> compatible setup going forward.
> 
> But even more than the technical complexity, this is the first time in 
> NFS history that NFS has required a protocol besides... NFS.
> 
> pNFS means that a useful. compatible NFS client must know all these 
> storage protocols, in addition to NFS.
> 
> Furthermore, enabling proprietary layout types means that it is easy for 
> a "compatible" v4.1 client to be denied parallel access to data 
> available to other "compatible" v4.1 clients:
> 
> 	Client A: Linux, fully open source
> 
> 	Client B: Linux, with closed source module for
> 		  layout type SuperWhizBang storage
> 
> 	Both Client A and Client B can claim to be NFS v4.1 and pNFS
> 	compatible,
> 	yet Client A must read data through the metadata
> 	server because it lacks the SuperWhizBang storage plugin.
> 
> pNFS means a never-ending arms race for the best storage layout, where 
> NFS clients are inevitably compatibly with a __random subset__ of total 
> available layout types.  pNFS will be a continuing train wreck of 
> fly-by-night storage companies, and their pet layout types & storage 
> protocols.
> 
> It is a support nightmare, an admin nightmare, a firewall nightmare, a 
> client implementor's nightmare, but a storage vendor's wet dream.
> 
> NFS was never beautiful, but at least until v4.0 it was well known and 
> widely cross-compatible.  And only required one network protocol: NFS.
> 
> 	Jeff
> 

I hear you. I'm paying close attention and noting down all of the above
warning signs. However, please allow me my own outlook on the matter.
Perhaps one day soon (probably not at LSF, no travel budget approval yet)
we will meet and we can talk about it more closely, and maybe I could
convince you of other aspects as well.

But pragmatically speaking, there is nothing I can do about any of the above.
My job is to show an OO reference implementation of pNFS-objects, a public
and signed protocol. I admit that pNFS-objects is a pet of Panasas, which is
my boss and the inventor of pNFS. I hope to remove it from your above
SuperWhizBang category, please. Actually, my job is to make sure it does not
belong there. I want an open standard implementation from day one, so there
will be no questions. I understand that you argue about the do or die of the
OSD protocol under pNFS. For me it is just that much more of a challenge,
swimming upstream like a salmon. Everyone is doing "Files"; I get to do
"Objects". I hope that when the code finally arrives, soon, and gets to be
used, its merits, performance, security, and ease of use will win users over,
big time.
(Let's compare notes: what is the minimal NFS DS implementation you can imagine?
 What would you say an OSD's target is? Not counting all the extras an OSD gives
 you, like no proprietary back-channel protocol between MDS and DS inside the cluster.)

But I do hear you, really; you have very valid points that must be taken into
consideration.

Thanks
Boaz

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: pNFS rant (was Re: [PATCH 1/8] exofs: Kbuild, Headers and osd utils)
  2009-02-16 11:05               ` pNFS rant (was Re: [PATCH 1/8] exofs: Kbuild, Headers and osd utils) Jeff Garzik
  2009-02-16 12:45                 ` Boaz Harrosh
@ 2009-02-16 15:50                 ` James Bottomley
  2009-02-16 16:27                   ` Benny Halevy
  2009-02-16 16:23                 ` Benny Halevy
  2 siblings, 1 reply; 36+ messages in thread
From: James Bottomley @ 2009-02-16 15:50 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Boaz Harrosh, FUJITA Tomonori, avishay, akpm, linux-fsdevel,
	osd-dev, linux-kernel, jens.axboe, linux-scsi

On Mon, 2009-02-16 at 06:05 -0500, Jeff Garzik wrote:
> Boaz Harrosh wrote:
> > No can do. exofs is meant to be a reference implementation of a pNFS-objects
> > file serving system. Have you read the spec of pNFS-objects layout? they define
> > RAID 0, 1, 5, and 6. In pNFS the MDS is suppose to be able to write the data
> > for its clients as NFS, so it needs to have all the infra structure and knowledge
> > of an Client pNFS-object layout drive.
> 
> Yes, I have studied pNFS!  I plan to add v4.1 and pNFS support to my NFS 
> server, once v4.0 support is working well.
> 
> 
> pNFS The Theory:   is wise and necessary:  permit clients to directly 
> connect to data storage, rather than copying through the metadata 
> server(s).  This is what every distributed filesystem is doing these 
> days -- direct to data server for bulk data read/write.
> 
> pNFS The Specification:   is an utter piece of shit.  I can only presume 
> some shady backroom deal in a smoke-filled room was the reason this saw 
> the light of day.
> 
> 
> In a sane world, NFS clients would speak... NFS.
> 
> In the crazy world of pNFS, NFS clients are now forced to speak NFS, 
> SCSI, RAID, and any number of proprietary layout types.  When will HTTP 
> be added to the list?  :)

Heh, it's one of the endearing faults of the storage industry that we
never learn from our mistakes ... particularly in storage protocols.

Actually, perhaps that's mischaracterised: we never actually learn
from our successes.  For example, most popular storage protocols solve
about 80% of the problem (NFSv2), get something bolted on to take that to
95% (locking) and rule for decades.  We end up obsessing about the 5%
and produce something that has something like 10x the overhead to solve it.
Customers, for some unfathomable reason, hate complexity  (I suspect
principally because it in some measure equals expense) so the 100%
solution (which actually turns out to be a 95% one because the over
engineered complexity adds another 5% of different problems that take
years to find) tends to work its way into a niche and stay there ...
eventually fading.

If you're really lucky, the niche evolves into something sustainable.
For example iSCSI: blew its early promise, pulled a bunch of unnecessary
networking into the protocol and ended up too big to fit in disk
firmware (thus destroying the ability to have a simple network tap to
replace storage fabric).  It's been slowly fading until Virtualisation
came along.  Now all the other solutions to getting storage into virtual
machines are so horrible and arcane that iSCSI looks like a winner (if
the alternative is Frankenstein's monster, Grendel's mother suddenly
looks more attractive as a partner).

So, trust the customer ... if it's so horrible it shouldn't have seen
the light of day, the chances are that no-one will buy it anyway.

James



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: pNFS rant (was Re: [PATCH 1/8] exofs: Kbuild, Headers and osd utils)
  2009-02-16 11:05               ` pNFS rant (was Re: [PATCH 1/8] exofs: Kbuild, Headers and osd utils) Jeff Garzik
  2009-02-16 12:45                 ` Boaz Harrosh
  2009-02-16 15:50                 ` James Bottomley
@ 2009-02-16 16:23                 ` Benny Halevy
  2 siblings, 0 replies; 36+ messages in thread
From: Benny Halevy @ 2009-02-16 16:23 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Boaz Harrosh, FUJITA Tomonori, avishay, akpm, linux-fsdevel,
	osd-dev, linux-kernel, James.Bottomley, jens.axboe, linux-scsi

On Feb. 16, 2009, 13:05 +0200, Jeff Garzik <jeff@garzik.org> wrote:
> Boaz Harrosh wrote:
>> No can do. exofs is meant to be a reference implementation of a pNFS-objects
>> file serving system. Have you read the spec of pNFS-objects layout? they define
>> RAID 0, 1, 5, and 6. In pNFS the MDS is suppose to be able to write the data
>> for its clients as NFS, so it needs to have all the infra structure and knowledge
>> of an Client pNFS-object layout drive.
> 
> Yes, I have studied pNFS!  I plan to add v4.1 and pNFS support to my NFS 
> server, once v4.0 support is working well.
> 
> 
> pNFS The Theory:   is wise and necessary:  permit clients to directly 
> connect to data storage, rather than copying through the metadata 
> server(s).  This is what every distributed filesystem is doing these 
> days -- direct to data server for bulk data read/write.
> 
> pNFS The Specification:   is an utter piece of shit.  I can only presume 
> some shady backroom deal in a smoke-filled room was the reason this saw 
> the light of day.
> 
> 
> In a sane world, NFS clients would speak... NFS.
> 
> In the crazy world of pNFS, NFS clients are now forced to speak NFS, 
> SCSI, RAID, and any number of proprietary layout types.  When will HTTP 
> be added to the list?  :)
> 
> But anything beyond the NFS protocol for talking client <-> data servers 
> is code bloat complexity madness for an NFS client that wishes to be 
> compatible with "most of the NFS 4.1 world".
> 
> An ideal NFS client for pNFS should be asked to do these two things, and 
> nothing more:
> 
> 1) send metadata transactions to one or more metadata servers, using 
> well-known NFS protocol
> 
> 2) send data to one or more data servers, using well-known NFS protocol 
> subset designed for storage (v4.1, section 13.6)
> 
> But no.
> 
> pNFS has forced a huge complexity on the NFS client, by permitting an 
> unbounded number of network protocols.  A "layout plugin" layer is 
> required.  SCSI and OSD support are REQUIRED for any reasonably 
> compatible setup going forward.
> 
> But even more than the technical complexity, this is the first time in 
> NFS history that NFS has required a protocol besides... NFS.
> 
> pNFS means that a useful. compatible NFS client must know all these 
> storage protocols, in addition to NFS.
> 
> Furthermore, enabling proprietary layout types means that it is easy for 
> a "compatible" v4.1 client to be denied parallel access to data 
> available to other "compatible" v4.1 clients:
> 
> 	Client A: Linux, fully open source
> 
> 	Client B: Linux, with closed source module for
> 		  layout type SuperWhizBang storage
> 
> 	Both Client A and Client B can claim to be NFS v4.1 and pNFS
> 	compatible,
> 	yet Client A must read data through the metadata
> 	server because it lacks the SuperWhizBang storage plugin.

At least, for SuperWhizBang to comply with NFSv4.1 requirements it
needs to follow the IETF process as an open protocol and address
important semantic details (as listed by NFSv4.1) on top of the
wire data structures, like security considerations and client fencing.

Being open source or not is somewhat orthogonal to that, since
from the protocol specification one should be able to implement
a fully compliant client and/or server.

> 
> pNFS means a never-ending arms race for the best storage layout, where 
> NFS clients are inevitably compatibly with a __random subset__ of total 
> available layout types.  pNFS will be a continuing train wreck of 
> fly-by-night storage companies, and their pet layout types & storage 
> protocols.
> 
> It is a support nightmare, an admin nightmare, a firewall nightmare, a 
> client implementor's nightmare, but a storage vendor's wet dream.

What you're basically saying is similar to rejecting the filesystem export
kABI since this will cause a never-ending arms race for the best file
system.  (a.k.a. "640KB of RAM and FAT file system is likely anything
that a sane user will ever need"... ;-)

I believe that competition is good.  Good for customers, for whom the pNFS
protocol was designed, and good for vendors as well.  The extra complexity
is there since one size does not fit all.  Different storage technologies
fit certain applications better than others, and exposing storage for
direct client access can convey these strengths all the way up to the host
running the application.

Besides, the pNFS specification was also driven by customers that use
proprietary clustered filesystems today, which use block or object-based
storage.  These customers want pNFS to be a standard to stimulate
competition for several reasons, like:
- encourage open source implementations
- second source availability
- building best-of-breed systems by integrating
  parts from different vendors
- reuse of existing storage and networking infrastructure

Benny

> 
> NFS was never beautiful, but at least until v4.0 it was well known and 
> widely cross-compatible.  And only required one network protocol: NFS.
> 
> 	Jeff
> 
> 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: pNFS rant (was Re: [PATCH 1/8] exofs: Kbuild, Headers and osd utils)
  2009-02-16 15:50                 ` James Bottomley
@ 2009-02-16 16:27                   ` Benny Halevy
  0 siblings, 0 replies; 36+ messages in thread
From: Benny Halevy @ 2009-02-16 16:27 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jeff Garzik, Boaz Harrosh, FUJITA Tomonori, avishay, akpm,
	linux-fsdevel, osd-dev, linux-kernel, jens.axboe, linux-scsi

On Feb. 16, 2009, 17:50 +0200, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> On Mon, 2009-02-16 at 06:05 -0500, Jeff Garzik wrote:
>> Boaz Harrosh wrote:
>>> No can do. exofs is meant to be a reference implementation of a pNFS-objects
>>> file serving system. Have you read the spec of pNFS-objects layout? they define
>>> RAID 0, 1, 5, and 6. In pNFS the MDS is suppose to be able to write the data
>>> for its clients as NFS, so it needs to have all the infra structure and knowledge
>>> of an Client pNFS-object layout drive.
>> Yes, I have studied pNFS!  I plan to add v4.1 and pNFS support to my NFS 
>> server, once v4.0 support is working well.
>>
>>
>> pNFS The Theory:   is wise and necessary:  permit clients to directly 
>> connect to data storage, rather than copying through the metadata 
>> server(s).  This is what every distributed filesystem is doing these 
>> days -- direct to data server for bulk data read/write.
>>
>> pNFS The Specification:   is an utter piece of shit.  I can only presume 
>> some shady backroom deal in a smoke-filled room was the reason this saw 
>> the light of day.
>>
>>
>> In a sane world, NFS clients would speak... NFS.
>>
>> In the crazy world of pNFS, NFS clients are now forced to speak NFS, 
>> SCSI, RAID, and any number of proprietary layout types.  When will HTTP 
>> be added to the list?  :)
> 
> Heh, it's one of the endearing faults of the storage industry that we
> never learn from our mistakes ... particularly in storage protocols.
> 
> Actually, perhaps that's a mischaracterised: we never actually learn
> from our successes.  For example, most popular storage protocols solve
> about 80% of the problem (NFSv2) get something bolted on to take that to
> 95% (locking) and rule for decades.  We end up obsessing about the 5%
> and produce something that's like 10x the overhead to solve it.
> Customers, for some unfathomable reason, hate complexity  (I suspect
> principally because it in some measure equals expense) so the 100%
> solution (which actually turns out to be a 95% one because the over
> engineered complexity adds another 5% of different problems that take
> years to find) tends to work its way into a niche and stay there ...
> eventually fading.
> 
> If you're really lucky, the niche evolves into something sustainable.
> For example iSCSI: blew its early promise, pulled a bunch of unnecessary
> networking into the protocol and ended up too big to fit in disk
> firmware (thus destroying the ability to have a simple network tap to
> replace storage fabric).  It's been slowly fading until Virtualisation
> came along.  Now all the other solutions to getting storage into virtual
> machines are so horrible and arcane that iSCSI looks like a winner (if
> the alternative is Frankenstein's monster, Grendel's mother suddenly
> looks more attractive as a partner).
> 
> So, trust the customer ... if it's so horrible it shouldn't have seen
> the light of day, the chances are that no-one will buy it anyway.

I completely agree with this sentence.
And no customer, whatsoever, that I've talked to about pNFS had
any reservations about supporting multiple layout types.  On the
contrary...

Benny

> 
> James
> 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/8] exofs: Kbuild, Headers and osd utils
  2009-02-16 10:29             ` Boaz Harrosh
@ 2009-02-17  0:20               ` FUJITA Tomonori
  2009-02-17  8:10                 ` [osd-dev] " Boaz Harrosh
  0 siblings, 1 reply; 36+ messages in thread
From: FUJITA Tomonori @ 2009-02-17  0:20 UTC (permalink / raw)
  To: bharrosh
  Cc: fujita.tomonori, avishay, jeff, akpm, linux-fsdevel, osd-dev,
	linux-kernel, James.Bottomley, jens.axboe, linux-scsi

On Mon, 16 Feb 2009 12:29:06 +0200
Boaz Harrosh <bharrosh@panasas.com> wrote:

> >>>> exofs is not the first and only file system who is using bios. Proof of
> >>>> the matter is that block exports a bio submit routine.
> >>> Seems that exofs just passes pages and the ULD sends a SCSI command
> >>> including these pages. I don't see how exofs needs to handle bio
> >>> directly.
> >>>
> >> How do you propose to collect these pages? and keep them without allocating
> >> an extra list? without pre-allocating a struct request? and without re-inventing
> >> the bio structure?
> > 
> > I don't think that allocating an extra list (or something) to keep
> > them hurts performance. We can talk about it when you have the real
> > performance results.
> 
> So you are the one that starts to invent the wheel here. I thought I was
> the one that does that, only you only called me by names, because you never showed
> me where.
> 
> But please only answer one question for me: Please don't write back if you do not
> answer this question:
> 
> Why do other filesystems allow to use bios? are they going to stop? Who is going
> to remove that?

Can you stop the argument, "exofs is similar to the existing
traditional file systems hence it should be treated equally"? It's
simply untrue. Does anyone except for Panasas people insist on the same
argument?

We are talking about the design of exofs, which also affects the
design of OSD ULD (including the library) living in SCSI
mid-layer. It's something completely different from existing
traditional file systems that work nicely on the top of the block
layer.

As discussed in another thread, now OSD ULD reinvents the bio handling
infrastructure because of the design of exofs. But OSD ULD can use the
block layer helper functions to avoid the re-invention if we change
the exofs design to take pages instead of bios. For now, it works
perfectly for exofs. In the future, we might change it but we don't
know until you submit patches (or the performance results) that show
taking pages doesn't work for exofs nicely.

I guess that we need to evolve the block layer to support OSD stuff
more cleanly than we've discussed recently. But again, we can do that
when we definitely need to.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [osd-dev] [PATCH 1/8] exofs: Kbuild, Headers and osd utils
  2009-02-17  0:20               ` FUJITA Tomonori
@ 2009-02-17  8:10                 ` Boaz Harrosh
  2009-02-27  8:09                   ` FUJITA Tomonori
  0 siblings, 1 reply; 36+ messages in thread
From: Boaz Harrosh @ 2009-02-17  8:10 UTC (permalink / raw)
  To: FUJITA Tomonori
  Cc: James.Bottomley, linux-scsi, jeff, linux-kernel, avishay,
	osd-dev, jens.axboe, linux-fsdevel, akpm

FUJITA Tomonori wrote:
> 
> Can you stop the argument, "exofs is similar to the existing
> traditional file systems hence it should be treated equally". It's
> simply untrue. Does anyone except for panasas people insist the same
> argument?
> 

No I will not, it is true. exofs is just a regular old filesystem
nothing different.

> We are talking about the design of exofs, which also affects the
> design of OSD ULD (including the library) living in SCSI
> mid-layer. 

The ULD belongs to SCSI, but the library could sit elsewhere. How
is that an argument?

> It's something completely different from existing
> traditional file systems that work nicely on the top of the block
> layer.
> 

Nicely is a matter of opinion. I think that building a bio in stages
in the background, then at the point of execution building a
request-from-bio and executing it, is a nice design that makes sure
nothing is duplicated or copied, and wheels are not re-invented. The
current kernel design is nice, why change it?

> As discussed in another thread, now OSD ULD reinvents the bio handling
> infrastructure because of the design of exofs. 

Not true, show me where? You keep saying that. Where in the code is it
reinvented?

> But OSD ULD can use the
> block layer helper functions to avoid the re-invention if we change
> the exofs design to take pages instead of bios.

That, above, is exactly a re-invention of the block layer. What was all
that scatterlist-pointer and scsi_execute_async() cleanup that you
worked so hard to get rid of? It was a list of pages+offsets+lengths,
that's what it was. Now you ask me to do the same: keep an external
structure of pages+offsets+lengths, pass it three layers down, and at
some point in time force new block-layer interfaces, which do not fully
exist today, to prepare a request for submission.

No! The decision was: keep preparation of the request local and submit
it in place, without intermediate structures. From-memory-to-request
in one stage.

That's what I want. The bio already lets me do that today, and lots of
file systems already do it today.

All I'm asking for is one small blk_make_request() that is a parallel
of generic_make_request() of the BLOCK_FS, for the BLOCK_PC requests.

If someone wanted a filesystem over tape drives, over st.c or osst.c,
he would design it similarly: collect bios in the background, point and
shoot. The blk_map_xxx functions were made to satisfy user-mode
interfaces; for filesystems it has been bio for ages.

> For now, it works
> perfectly for exofs. In the future, we might change it but we don't
> know until you submit patches (or the performance results) that show
> taking pages doesn't work for exofs nicely.
> 

I don't know about you, but me, I don't have to do some work to know
it's bad. I can imagine beforehand that it is bad. I usually run
hundreds of simulations in my head, discarding any bad options until I
find the one way I like. Usually the shortest, easiest way is also the
best. (Since I'm very lazy.)
Like with bidi, for example: why not just take two requests instead of
one? But I was sent to do all that gigantic work so everyone would see
that.

> I guess that we need to evolve the block layer to support OSD stuff
> cleanly than we've discussed recently. But again we can do when we
> definitely need to do.

It's not that big or long an evolution. It is simply:

struct request *blk_make_request(struct bio*, gfp_t gfp);

And we are done. Simpler than that? I don't know.

Boaz

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [osd-dev] [PATCH 1/8] exofs: Kbuild, Headers and osd utils
  2009-02-17  8:10                 ` [osd-dev] " Boaz Harrosh
@ 2009-02-27  8:09                   ` FUJITA Tomonori
  2009-03-01 10:43                     ` Boaz Harrosh
  0 siblings, 1 reply; 36+ messages in thread
From: FUJITA Tomonori @ 2009-02-27  8:09 UTC (permalink / raw)
  To: bharrosh
  Cc: fujita.tomonori, James.Bottomley, linux-scsi, jeff, linux-kernel,
	avishay, osd-dev, jens.axboe, linux-fsdevel, akpm

On Tue, 17 Feb 2009 10:10:15 +0200
Boaz Harrosh <bharrosh@panasas.com> wrote:

> FUJITA Tomonori wrote:
> > 
> > Can you stop the argument, "exofs is similar to the existing
> > traditional file systems hence it should be treated equally". It's
> > simply untrue. Does anyone except for panasas people insist the same
> > argument?
> > 
> 
> No I will not, it is true. exofs is just a regular old filesystem
> nothing different.

After reading this, I gave up discussing this issue with you, but I'm
still waiting for the fixes that you promised:

http://marc.info/?l=linux-scsi&m=123445759718253&w=2


Thanks,

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [osd-dev] [PATCH 1/8] exofs: Kbuild, Headers and osd utils
  2009-02-27  8:09                   ` FUJITA Tomonori
@ 2009-03-01 10:43                     ` Boaz Harrosh
  0 siblings, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2009-03-01 10:43 UTC (permalink / raw)
  To: FUJITA Tomonori
  Cc: James.Bottomley, linux-scsi, jeff, linux-kernel, avishay,
	osd-dev, jens.axboe, linux-fsdevel, akpm

FUJITA Tomonori wrote:
> On Tue, 17 Feb 2009 10:10:15 +0200
> Boaz Harrosh <bharrosh@panasas.com> wrote:
> 
>> FUJITA Tomonori wrote:
>>> Can you stop the argument, "exofs is similar to the existing
>>> traditional file systems hence it should be treated equally". It's
>>> simply untrue. Does anyone except for panasas people insist the same
>>> argument?
>>>
>> No I will not, it is true. exofs is just a regular old filesystem
>> nothing different.
> 
> After reading this, I gave up discussing this issue with you but I
> still wait for your fixes that you promised:
> 
> http://marc.info/?l=linux-scsi&m=123445759718253&w=2
> 
> 
> Thanks,
> --

They are on the way; I have not forgotten.

Boaz


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 5/8] exofs: dir_inode and directory operations
  2009-02-16  9:31     ` Boaz Harrosh
@ 2009-03-15 18:10       ` Boaz Harrosh
  2009-03-15 18:37         ` Evgeniy Polyakov
  0 siblings, 1 reply; 36+ messages in thread
From: Boaz Harrosh @ 2009-03-15 18:10 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Avishay Traeger, Jeff Garzik, Andrew Morton, linux-fsdevel,
	open-osd, linux-kernel, James Bottomley

Boaz Harrosh wrote:
> Evgeniy Polyakov wrote:
>> Hi.
>>
>> On Mon, Feb 09, 2009 at 03:24:13PM +0200, Boaz Harrosh (bharrosh@panasas.com) wrote:
>>
<snip>
> <snip>
>>> +
>>> +	atomic_inc(&inode->i_count);
>>> +
>>> +	ret = exofs_async_op(or, create_done, inode, oi->i_cred);
>>> +	if (ret) {
>>> +		atomic_dec(&inode->i_count);
>> igrab()/iput()?
>>
> 
> Thanks, makes much more sense. Sorry leftovers from 2.6.10
> 

It's the same in ext2. I looked at the igrab()/iput() code; it takes some
extra locks, which I'm afraid of at this stage. I'll postpone this to the
next (next) merge window, after I have run with it for a while.

>>> +		osd_end_request(or);
>>> +		return ERR_PTR(-EIO);
>>> +	}
>>> +	atomic_inc(&sbi->s_curr_pending);
>>> +
>>> +	return inode;
>>> +}
>>> +static int exofs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
>>> +{
>>> +	struct inode *inode;
>>> +	int err = -EMLINK;
>>> +
>>> +	if (dir->i_nlink >= EXOFS_LINK_MAX)
>>> +		goto out;
>>> +
>>> +	inode_inc_link_count(dir);
>>> +
>>> +	inode = exofs_new_inode(dir, S_IFDIR | mode);
>>> +	err = PTR_ERR(inode);
>>> +	if (IS_ERR(inode))
>>> +		goto out_dir;
>>> +
>>> +	inode->i_op = &exofs_dir_inode_operations;
>>> +	inode->i_fop = &exofs_dir_operations;
>>> +	inode->i_mapping->a_ops = &exofs_aops;
>>> +
>>> +	inode_inc_link_count(inode);
>>> +
>>> +	err = exofs_make_empty(inode, dir);
>>> +	if (err)
>>> +		goto out_fail;
>>> +
>>> +	err = exofs_add_link(dentry, inode);
>>> +	if (err)
>>> +		goto out_fail;
>>> +
>>> +	d_instantiate(dentry, inode);
>>> +out:
>>> +	return err;
>>> +
>>> +out_fail:
>>> +	inode_dec_link_count(inode);
>>> +	inode_dec_link_count(inode);
>> Why two decrements, will it be ok after exofs_make_empty() fail when it
>> was incremented only once?
>>
> 
> That's hard to say, I'll investigate it some more.
> Thanks
> 

I've added some prints, and the current code is correct. inode->i_nlink
(the target of inode_dec_link_count()) is 1 after exofs_new_inode() and 2
after inode_inc_link_count() (just before exofs_make_empty()). Very
confusing, but exofs_add_link() does not touch inode->i_nlink.
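
To make the counting concrete, here is a minimal user-space sketch of the
link-count arithmetic described above. It is not the exofs code:
toy_inode, toy_inc_link(), toy_dec_link() and toy_mkdir() are hypothetical
stand-ins, and the starting value of 1 is taken from the prints reported
above.

```c
#include <assert.h>

/* Hypothetical stand-in for the i_nlink field of struct inode. */
struct toy_inode {
	unsigned int nlink;
};

static void toy_inc_link(struct toy_inode *inode) { inode->nlink++; }
static void toy_dec_link(struct toy_inode *inode) { inode->nlink--; }

/*
 * Models only the counting in exofs_mkdir(): the new inode starts at 1
 * (as reported for exofs_new_inode()), is bumped to 2 before
 * exofs_make_empty(), and the out_fail path drops it twice.
 * fail != 0 simulates exofs_make_empty() or exofs_add_link() failing.
 * Returns the child's final link count.
 */
static unsigned int toy_mkdir(int fail)
{
	struct toy_inode child = { .nlink = 1 };	/* after exofs_new_inode() */

	toy_inc_link(&child);		/* inode_inc_link_count(inode) */
	assert(child.nlink == 2);

	if (fail) {
		/* out_fail: two decrements bring the count back to zero */
		toy_dec_link(&child);
		toy_dec_link(&child);
	}
	return child.nlink;
}
```

On success the sketch leaves the child at 2; on failure both decrements
are needed to reach 0, which is why the out_fail label drops the count
twice.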

I will be posting a new set tomorrow

Thanks
Boaz


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 5/8] exofs: dir_inode and directory operations
  2009-03-15 18:10       ` Boaz Harrosh
@ 2009-03-15 18:37         ` Evgeniy Polyakov
  0 siblings, 0 replies; 36+ messages in thread
From: Evgeniy Polyakov @ 2009-03-15 18:37 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Avishay Traeger, Jeff Garzik, Andrew Morton, linux-fsdevel,
	open-osd, linux-kernel, James Bottomley

On Sun, Mar 15, 2009 at 08:10:36PM +0200, Boaz Harrosh (bharrosh@panasas.com) wrote:
> >>> +	atomic_inc(&inode->i_count);
> >>> +
> >>> +	ret = exofs_async_op(or, create_done, inode, oi->i_cred);
> >>> +	if (ret) {
> >>> +		atomic_dec(&inode->i_count);
> >> igrab()/iput()?
> >>
> > 
> > Thanks, makes much more sense. Sorry leftovers from 2.6.10
> > 
> 
> It's the same at ext2. I looked at the igrab()/iput() code it does some extra
> locks which I'm afraid of at this stage. I'll postpone this to the next (next)
> merge window, after I ran with it for a while.

igrab() does not allow working with an inode that is about to be freed.
Given that this is a fresh inode, it should be OK just to increase the
reference counter, but iput() may highlight problems if the inode's
reference counter can be decreased, and apparently it cannot, since
otherwise the increment would not have been added around
exofs_async_op().

What if exofs_async_op() drops a reference and returns an error? The
inode will not be freed and will not be placed on the to-be-freed list,
which in turn will break accounting and potentially prevent the
superblock from being freed.
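
The concern can be illustrated with a small user-space model. This is only
a sketch under stated assumptions: toy_ref_inode, toy_raw_dec(), toy_put()
and toy_error_path() are hypothetical names, and the model assumes the
failing operation has already dropped one reference before the caller
undoes its own increment.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_ref_inode {
	int count;	/* stand-in for i_count */
	bool freed;	/* set when the last reference is released */
};

/* Raw decrement, in the style of atomic_dec(&inode->i_count): never frees. */
static void toy_raw_dec(struct toy_ref_inode *inode)
{
	inode->count--;
}

/* put-style release: the drop that reaches zero also frees the object. */
static void toy_put(struct toy_ref_inode *inode)
{
	assert(inode->count > 0);
	if (--inode->count == 0)
		inode->freed = true;
}

/*
 * Start with one reference plus the extra one taken around the async op.
 * The op is assumed to fail after dropping one reference (the
 * hypothetical case raised above); the caller then releases its extra
 * reference with either toy_put() or a raw decrement.
 */
static struct toy_ref_inode toy_error_path(bool use_put)
{
	struct toy_ref_inode inode = { .count = 2, .freed = false };

	toy_put(&inode);		/* the failing op drops a reference */
	if (use_put)
		toy_put(&inode);
	else
		toy_raw_dec(&inode);
	return inode;
}
```

With the raw decrement the count reaches zero but nothing is freed; with
the put-style release the same sequence frees the object.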

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/8] exofs: Kbuild, Headers and osd utils
  2009-03-31  8:04   ` Andrew Morton
@ 2009-03-31  8:57     ` Boaz Harrosh
  0 siblings, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2009-03-31  8:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Avishay Traeger, Jeff Garzik, Evgeniy Polyakov, linux-fsdevel,
	open-osd, linux-kernel, James Bottomley, FUJITA Tomonori

On 03/31/2009 11:04 AM, Andrew Morton wrote:
> On Wed, 18 Mar 2009 19:57:36 +0200 Boaz Harrosh <bharrosh@panasas.com> wrote:
> 
>> This patch includes osd infrastructure that will be used later by
>> the file system.
>>
>> Also the declarations of constants, on disk structures,
>> and prototypes.
>>
>> And the Kbuild+Kconfig files needed to build the exofs module.
>>
>> ...
>>
>> --- /dev/null
>> +++ b/fs/exofs/Kbuild
>> @@ -0,0 +1,30 @@
>> +#
>> +# Kbuild for the EXOFS module
>> +#
>> +# Copyright (C) 2008 Panasas Inc.  All rights reserved.
>> +#
>> +# Authors:
>> +#   Boaz Harrosh <bharrosh@panasas.com>
>> +#
>> +# This program is free software; you can redistribute it and/or modify
>> +# it under the terms of the GNU General Public License version 2
>> +#
>> +# Kbuild - Gets included from the Kernels Makefile and build system
>> +#
>> +
>> +ifneq ($(OSD_INC),)
>> +# we are built out-of-tree Kconfigure everything as on
>> +
>> +CONFIG_EXOFS_FS=m
>> +ccflags-y += -DCONFIG_EXOFS_FS -DCONFIG_EXOFS_FS_MODULE
>> +# ccflags-y += -DCONFIG_EXOFS_DEBUG
>> +
>> +# if we are built out-of-tree and the hosting kernel has OSD headers
>> +# then "ccflags-y +=" will not pick the out-off-tree headers. Only by doing
>> +# this it will work. This might break in future kernels
>> +KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)
>> +
>> +endif
> 
> But this patch is putting the fs into the tree, so all the above is unneeded.
> 
>> ...
>>
>> + * Object IDs 0, 1, and 2 are always in use (see above defines).
>> + */
>> +enum {
>> +	EXOFS_UINT64_MAX = (~0LL),
> 
> Use ULLONG_MAX?
> 
> ~0ULL would be more consistent.
> 
>> +	EXOFS_MAX_INO_ID = (sizeof(ino_t) * 8 == 64) ? EXOFS_UINT64_MAX :
>> +					(1LL << (sizeof(ino_t) * 8 - 1)),
> 
> Tricky, needs a comment.
> 
> Would be clearer to use 1ULL.
> 
>> +	EXOFS_MAX_ID	 = (EXOFS_MAX_INO_ID - 1 - EXOFS_OBJ_OFF),
>> +};
>> +

OK, OK, OK

>> +/****************************************************************************
>> + * Misc.
>> + ****************************************************************************/
>> +#define EXOFS_BLKSHIFT	12
>> +#define EXOFS_BLKSIZE	(1UL << EXOFS_BLKSHIFT)
>> +
>> +/****************************************************************************
>> + * superblock-related things
>> + ****************************************************************************/
>> +#define EXOFS_SUPER_MAGIC	0x5DF5
> 
> Should be in include/linux/magic.h
> 

Is this relevant for OSD? I guess if there are going to
be more OSD filesystems, then yes.

I will do it, thanks.

>> ...
>>
>> +/*
>> + * The file control block - stored in an object's attributes.  This is where
>> + * the in-memory inode is stored on disk.
>> + */
>> +struct exofs_fcb {
>> +	__le64  i_size;			/* Size of the file */
>> +	__le16  i_mode;         	/* File mode */
>> +	__le16  i_links_count;  	/* Links count */
>> +	__le32  i_uid;          	/* Owner Uid */
>> +	__le32  i_gid;          	/* Group Id */
>> +	__le32  i_atime;        	/* Access time */
>> +	__le32  i_ctime;        	/* Creation time */
>> +	__le32  i_mtime;        	/* Modification time */
>> +	__le32  i_flags;        	/* File flags (unused for now)*/
>> +	__le32  i_generation;   	/* File version (for NFS) */
>> +	__le32  i_data[EXOFS_IDATA];	/* Short symlink names and device #s */
>> +};
> 
> There is no room for future expansion.  Would that be appropriate/wise?
> I guess it would need versioning information somewhere too.
> 

In OSD we have the size of the attribute it sits in. So if in the future
we add members, we can switch according to size. We can also just stick
it in a different attribute number, e.g. an EXOFS_ATTR_INODE_DATA_VER1 or
EXOFS_ATTR_INODE_DATA_VER2 attribute; the presence of an attribute means
support. Hell, we can even be backward compatible by having two or three
versions at once.
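
A rough user-space sketch of the "switch according to size" idea follows.
The two layouts and every toy_* name are hypothetical; they are not the
exofs on-disk format, just two structures of different sizes to select
between.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical first layout of the inode attribute. */
struct toy_fcb_v1 {
	uint64_t i_size;
	uint16_t i_mode;
	uint16_t i_links_count;
	uint32_t i_uid;
};

/* Hypothetical later layout: the old members plus one new field. */
struct toy_fcb_v2 {
	struct toy_fcb_v1 v1;
	uint64_t i_new_field;
};

enum toy_fcb_version {
	TOY_FCB_UNKNOWN = 0,
	TOY_FCB_V1 = 1,
	TOY_FCB_V2 = 2,
};

/* Pick the layout from the length of the attribute that was read back. */
static enum toy_fcb_version toy_fcb_version_from_len(size_t attr_len)
{
	if (attr_len == sizeof(struct toy_fcb_v1))
		return TOY_FCB_V1;
	if (attr_len == sizeof(struct toy_fcb_v2))
		return TOY_FCB_V2;
	return TOY_FCB_UNKNOWN;	/* neither layout: refuse to interpret it */
}
```

The scheme only works if every version has a distinct size, which is
one reason the separate-attribute-number variant may be the safer of the
two options.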

>> ...
>>
>> +/* u64 has problems with printk this will cast it to unsigned long long */
>> +#define _LLU(x) (unsigned long long)(x)
> 
> ug.
> 
> Normally the response is "please open-code this".  But given that one
> day real soon this printk(u64) problem will be fixed, I guess the use
> of _LLU will make it easy to find and delete all the now-unneeded
> casts.
> 

Exactly my thoughts
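
For readers unfamiliar with the problem, a user-space analogue: a 64-bit
type may be "unsigned long" on some architectures and "unsigned long long"
on others, so a single format string needs a cast. The sketch below uses
snprintf() rather than printk(); TOY_LLU and toy_format_id() are
hypothetical names, and the macro is written with outer parentheses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Cast helper in the spirit of _LLU, fully parenthesized. */
#define TOY_LLU(x) ((unsigned long long)(x))

/*
 * Format a 64-bit object ID portably: the cast makes the argument match
 * %llx regardless of how uint64_t is defined on this platform.
 * Returns the number of characters snprintf() would have written.
 */
static int toy_format_id(char *buf, size_t len, uint64_t id)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "obj=0x%llx", TOY_LLU(id));
}
```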

>> ...
>>
>> +/*
>> + * our inode flags
>> + */
>> +#define OBJ_2BCREATED	0	/* object will be created soon*/
>> +#define OBJ_CREATED	1	/* object has been created on the osd*/
>> +
>> +static inline int obj_2bcreated(struct exofs_i_info *oi)
>> +{
>> +	return test_bit(OBJ_2BCREATED, &(oi->i_flags));
>> +}
> 
> unneeded parentheses around oi->i_flags.
> 
>> +static inline void set_obj_2bcreated(struct exofs_i_info *oi)
>> +{
>> +	set_bit(OBJ_2BCREATED, &(oi->i_flags));
>> +}
>> +
>> +static inline int obj_created(struct exofs_i_info *oi)
>> +{
>> +	return test_bit(OBJ_CREATED, &(oi->i_flags));
>> +}
>> +
>> +static inline void set_obj_created(struct exofs_i_info *oi)
>> +{
>> +	set_bit(OBJ_CREATED, &(oi->i_flags));
>> +}
> 
> dittoes.
> 
>> ...
>>
> 

Thanks
will fix

Boaz

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/8] exofs: Kbuild, Headers and osd utils
  2009-03-18 17:57 ` Boaz Harrosh
@ 2009-03-31  8:04   ` Andrew Morton
  2009-03-31  8:57     ` Boaz Harrosh
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2009-03-31  8:04 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Avishay Traeger, Jeff Garzik, Evgeniy Polyakov, linux-fsdevel,
	open-osd, linux-kernel, James Bottomley, FUJITA Tomonori

On Wed, 18 Mar 2009 19:57:36 +0200 Boaz Harrosh <bharrosh@panasas.com> wrote:

> This patch includes osd infrastructure that will be used later by
> the file system.
> 
> Also the declarations of constants, on disk structures,
> and prototypes.
> 
> And the Kbuild+Kconfig files needed to build the exofs module.
> 
> ...
>
> --- /dev/null
> +++ b/fs/exofs/Kbuild
> @@ -0,0 +1,30 @@
> +#
> +# Kbuild for the EXOFS module
> +#
> +# Copyright (C) 2008 Panasas Inc.  All rights reserved.
> +#
> +# Authors:
> +#   Boaz Harrosh <bharrosh@panasas.com>
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License version 2
> +#
> +# Kbuild - Gets included from the Kernels Makefile and build system
> +#
> +
> +ifneq ($(OSD_INC),)
> +# we are built out-of-tree Kconfigure everything as on
> +
> +CONFIG_EXOFS_FS=m
> +ccflags-y += -DCONFIG_EXOFS_FS -DCONFIG_EXOFS_FS_MODULE
> +# ccflags-y += -DCONFIG_EXOFS_DEBUG
> +
> +# if we are built out-of-tree and the hosting kernel has OSD headers
> +# then "ccflags-y +=" will not pick the out-off-tree headers. Only by doing
> +# this it will work. This might break in future kernels
> +KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)
> +
> +endif

But this patch is putting the fs into the tree, so all the above is unneeded.

>
> ...
>
> + * Object IDs 0, 1, and 2 are always in use (see above defines).
> + */
> +enum {
> +	EXOFS_UINT64_MAX = (~0LL),

Use ULLONG_MAX?

~0ULL would be more consistent.

> +	EXOFS_MAX_INO_ID = (sizeof(ino_t) * 8 == 64) ? EXOFS_UINT64_MAX :
> +					(1LL << (sizeof(ino_t) * 8 - 1)),

Tricky, needs a comment.

Would be clearer to use 1ULL.
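
As an illustration of the two suggestions, here is a user-space sketch of
the same arithmetic done with unsigned constants. The toy_* names are
hypothetical, the inode width is passed in as a parameter instead of
using sizeof(ino_t), and TOY_OBJ_OFF simply copies the value of
EXOFS_OBJ_OFF from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_OBJ_OFF 0x10000ULL	/* same value as EXOFS_OBJ_OFF */

/*
 * Largest inode-derived ID for an inode number of ino_bytes bytes:
 * all ones for a 64-bit inode number, otherwise only the top bit of
 * the narrower type (2^(bits - 1)), following the expression in the
 * patch.
 */
static uint64_t toy_max_ino_id(size_t ino_bytes)
{
	assert(ino_bytes >= 1 && ino_bytes <= 8);
	if (ino_bytes * 8 == 64)
		return ~0ULL;
	return 1ULL << (ino_bytes * 8 - 1);
}

/* Largest object ID, leaving room for the object-ID offset. */
static uint64_t toy_max_id(size_t ino_bytes)
{
	return toy_max_ino_id(ino_bytes) - 1 - TOY_OBJ_OFF;
}
```

With a 4-byte inode number this gives 0x80000000 for the maximum inode
ID; with an 8-byte one it gives the all-ones 64-bit value.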

> +	EXOFS_MAX_ID	 = (EXOFS_MAX_INO_ID - 1 - EXOFS_OBJ_OFF),
> +};
> +
> +/****************************************************************************
> + * Misc.
> + ****************************************************************************/
> +#define EXOFS_BLKSHIFT	12
> +#define EXOFS_BLKSIZE	(1UL << EXOFS_BLKSHIFT)
> +
> +/****************************************************************************
> + * superblock-related things
> + ****************************************************************************/
> +#define EXOFS_SUPER_MAGIC	0x5DF5

Should be in include/linux/magic.h

>
> ...
>
> +/*
> + * The file control block - stored in an object's attributes.  This is where
> + * the in-memory inode is stored on disk.
> + */
> +struct exofs_fcb {
> +	__le64  i_size;			/* Size of the file */
> +	__le16  i_mode;         	/* File mode */
> +	__le16  i_links_count;  	/* Links count */
> +	__le32  i_uid;          	/* Owner Uid */
> +	__le32  i_gid;          	/* Group Id */
> +	__le32  i_atime;        	/* Access time */
> +	__le32  i_ctime;        	/* Creation time */
> +	__le32  i_mtime;        	/* Modification time */
> +	__le32  i_flags;        	/* File flags (unused for now)*/
> +	__le32  i_generation;   	/* File version (for NFS) */
> +	__le32  i_data[EXOFS_IDATA];	/* Short symlink names and device #s */
> +};

There is no room for future expansion.  Would that be appropriate/wise?
I guess it would need versioning information somewhere too.

>
> ...
>
> +/* u64 has problems with printk this will cast it to unsigned long long */
> +#define _LLU(x) (unsigned long long)(x)

ug.

Normally the response is "please open-code this".  But given that one
day real soon this printk(u64) problem will be fixed, I guess the use
of _LLU will make it easy to find and delete all the now-unneeded
casts.

>
> ...
>
> +/*
> + * our inode flags
> + */
> +#define OBJ_2BCREATED	0	/* object will be created soon*/
> +#define OBJ_CREATED	1	/* object has been created on the osd*/
> +
> +static inline int obj_2bcreated(struct exofs_i_info *oi)
> +{
> +	return test_bit(OBJ_2BCREATED, &(oi->i_flags));
> +}

unneeded parentheses around oi->i_flags.

> +static inline void set_obj_2bcreated(struct exofs_i_info *oi)
> +{
> +	set_bit(OBJ_2BCREATED, &(oi->i_flags));
> +}
> +
> +static inline int obj_created(struct exofs_i_info *oi)
> +{
> +	return test_bit(OBJ_CREATED, &(oi->i_flags));
> +}
> +
> +static inline void set_obj_created(struct exofs_i_info *oi)
> +{
> +	set_bit(OBJ_CREATED, &(oi->i_flags));
> +}

dittoes.

>
> ...
>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/8] exofs: Kbuild, Headers and osd utils
  2009-03-18 17:45 [PATCHSET 0/8 version 4] exofs for kernel 2.6.30 Boaz Harrosh
  2009-03-18 17:57 ` [PATCH 1/8] exofs: Kbuild, Headers and osd utils Boaz Harrosh
@ 2009-03-18 17:57 ` Boaz Harrosh
  2009-03-31  8:04   ` Andrew Morton
  1 sibling, 1 reply; 36+ messages in thread
From: Boaz Harrosh @ 2009-03-18 17:57 UTC (permalink / raw)
  To: Avishay Traeger, Jeff Garzik, Andrew Morton, Evgeniy Polyakov,
	linux-fsdevel, open-osd
  Cc: linux-kernel, James Bottomley, FUJITA Tomonori

This patch includes osd infrastructure that will be used later by
the file system.

Also the declarations of constants, on disk structures,
and prototypes.

And the Kbuild+Kconfig files needed to build the exofs module.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 fs/exofs/Kbuild   |   30 +++++++++
 fs/exofs/Kconfig  |   13 ++++
 fs/exofs/common.h |  185 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/exofs/exofs.h  |  127 ++++++++++++++++++++++++++++++++++++
 fs/exofs/osd.c    |  153 +++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 508 insertions(+), 0 deletions(-)
 create mode 100644 fs/exofs/Kbuild
 create mode 100644 fs/exofs/Kconfig
 create mode 100644 fs/exofs/common.h
 create mode 100644 fs/exofs/exofs.h
 create mode 100644 fs/exofs/osd.c

diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
new file mode 100644
index 0000000..63d822c
--- /dev/null
+++ b/fs/exofs/Kbuild
@@ -0,0 +1,30 @@
+#
+# Kbuild for the EXOFS module
+#
+# Copyright (C) 2008 Panasas Inc.  All rights reserved.
+#
+# Authors:
+#   Boaz Harrosh <bharrosh@panasas.com>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License version 2
+#
+# Kbuild - Gets included from the Kernels Makefile and build system
+#
+
+ifneq ($(OSD_INC),)
+# we are built out-of-tree Kconfigure everything as on
+
+CONFIG_EXOFS_FS=m
+ccflags-y += -DCONFIG_EXOFS_FS -DCONFIG_EXOFS_FS_MODULE
+# ccflags-y += -DCONFIG_EXOFS_DEBUG
+
+# if we are built out-of-tree and the hosting kernel has OSD headers
+# then "ccflags-y +=" will not pick the out-off-tree headers. Only by doing
+# this it will work. This might break in future kernels
+KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)
+
+endif
+
+exofs-y := osd.o
+obj-$(CONFIG_EXOFS_FS) += exofs.o
diff --git a/fs/exofs/Kconfig b/fs/exofs/Kconfig
new file mode 100644
index 0000000..86194b2
--- /dev/null
+++ b/fs/exofs/Kconfig
@@ -0,0 +1,13 @@
+config EXOFS_FS
+	tristate "exofs: OSD based file system support"
+	depends on SCSI_OSD_ULD
+	help
+	  EXOFS is a file system that uses an OSD storage device,
+	  as its backing storage.
+
+# Debugging-related stuff
+config EXOFS_DEBUG
+	bool "Enable debugging"
+	depends on EXOFS_FS
+	help
+	  This option enables EXOFS debug prints.
diff --git a/fs/exofs/common.h b/fs/exofs/common.h
new file mode 100644
index 0000000..bcc4882
--- /dev/null
+++ b/fs/exofs/common.h
@@ -0,0 +1,185 @@
+/*
+ * common.h - Common definitions for both Kernel and user-mode utilities
+ *
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger (avishay@gmail.com) (avishay@il.ibm.com)
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <bharrosh@panasas.com>
+ *
+ * Copyrights for code taken from ext2:
+ *     Copyright (C) 1992, 1993, 1994, 1995
+ *     Remy Card (card@masi.ibp.fr)
+ *     Laboratoire MASI - Institut Blaise Pascal
+ *     Universite Pierre et Marie Curie (Paris VI)
+ *     from
+ *     linux/fs/minix/inode.c
+ *     Copyright (C) 1991, 1992  Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.  Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#ifndef __EXOFS_COM_H__
+#define __EXOFS_COM_H__
+
+#include <linux/types.h>
+
+#include <scsi/osd_attributes.h>
+#include <scsi/osd_initiator.h>
+#include <scsi/osd_sec.h>
+
+/****************************************************************************
+ * Object ID related defines
+ * NOTE: inode# = object ID - EXOFS_OBJ_OFF
+ ****************************************************************************/
+#define EXOFS_MIN_PID   0x10000	/* Smallest partition ID */
+#define EXOFS_OBJ_OFF	0x10000	/* offset for objects */
+#define EXOFS_SUPER_ID	0x10000	/* object ID for on-disk superblock */
+#define EXOFS_ROOT_ID	0x10002	/* object ID for root directory */
+
+/* exofs Application specific page/attribute */
+# define EXOFS_APAGE_FS_DATA	(OSD_APAGE_APP_DEFINED_FIRST + 3)
+# define EXOFS_ATTR_INODE_DATA	1
+
+/*
+ * The maximum number of files we can have is limited by the size of the
+ * inode number.  This is the largest object ID that the file system supports.
+ * Object IDs 0, 1, and 2 are always in use (see above defines).
+ */
+enum {
+	EXOFS_UINT64_MAX = (~0LL),
+	EXOFS_MAX_INO_ID = (sizeof(ino_t) * 8 == 64) ? EXOFS_UINT64_MAX :
+					(1LL << (sizeof(ino_t) * 8 - 1)),
+	EXOFS_MAX_ID	 = (EXOFS_MAX_INO_ID - 1 - EXOFS_OBJ_OFF),
+};
+
+/****************************************************************************
+ * Misc.
+ ****************************************************************************/
+#define EXOFS_BLKSHIFT	12
+#define EXOFS_BLKSIZE	(1UL << EXOFS_BLKSHIFT)
+
+/****************************************************************************
+ * superblock-related things
+ ****************************************************************************/
+#define EXOFS_SUPER_MAGIC	0x5DF5
+
+/*
+ * The file system control block - stored in an object's data (mainly, the one
+ * with ID EXOFS_SUPER_ID).  This is where the in-memory superblock is stored
+ * on disk.  Right now it just has a magic value, which is basically a sanity
+ * check on our ability to communicate with the object store.
+ */
+struct exofs_fscb {
+	__le64  s_nextid;	/* Highest object ID used */
+	__le32  s_numfiles;	/* Number of files on fs */
+	__le16  s_magic;	/* Magic signature */
+	__le16  s_newfs;	/* Non-zero if this is a new fs */
+};
+
+/****************************************************************************
+ * inode-related things
+ ****************************************************************************/
+#define EXOFS_IDATA		5
+
+/*
+ * The file control block - stored in an object's attributes.  This is where
+ * the in-memory inode is stored on disk.
+ */
+struct exofs_fcb {
+	__le64  i_size;			/* Size of the file */
+	__le16  i_mode;         	/* File mode */
+	__le16  i_links_count;  	/* Links count */
+	__le32  i_uid;          	/* Owner Uid */
+	__le32  i_gid;          	/* Group Id */
+	__le32  i_atime;        	/* Access time */
+	__le32  i_ctime;        	/* Creation time */
+	__le32  i_mtime;        	/* Modification time */
+	__le32  i_flags;        	/* File flags (unused for now)*/
+	__le32  i_generation;   	/* File version (for NFS) */
+	__le32  i_data[EXOFS_IDATA];	/* Short symlink names and device #s */
+};
+
+#define EXOFS_INO_ATTR_SIZE	sizeof(struct exofs_fcb)
+
+/* This is the Attribute the fcb is stored in */
+static const struct __weak osd_attr g_attr_inode_data = ATTR_DEF(
+	EXOFS_APAGE_FS_DATA,
+	EXOFS_ATTR_INODE_DATA,
+	EXOFS_INO_ATTR_SIZE);
+
+/****************************************************************************
+ * dentry-related things
+ ****************************************************************************/
+#define EXOFS_NAME_LEN	255
+
+/*
+ * The on-disk directory entry
+ */
+struct exofs_dir_entry {
+	__le64		inode_no;		/* inode number           */
+	__le16		rec_len;		/* directory entry length */
+	u8		name_len;		/* name length            */
+	u8		file_type;		/* umm...file type        */
+	char		name[EXOFS_NAME_LEN];	/* file name              */
+};
+
+enum {
+	EXOFS_FT_UNKNOWN,
+	EXOFS_FT_REG_FILE,
+	EXOFS_FT_DIR,
+	EXOFS_FT_CHRDEV,
+	EXOFS_FT_BLKDEV,
+	EXOFS_FT_FIFO,
+	EXOFS_FT_SOCK,
+	EXOFS_FT_SYMLINK,
+	EXOFS_FT_MAX
+};
+
+#define EXOFS_DIR_PAD			4
+#define EXOFS_DIR_ROUND			(EXOFS_DIR_PAD - 1)
+#define EXOFS_DIR_REC_LEN(name_len) \
+	(((name_len) + offsetof(struct exofs_dir_entry, name)  + \
+	  EXOFS_DIR_ROUND) & ~EXOFS_DIR_ROUND)
+
+/*************************
+ * function declarations *
+ *************************/
+/* osd.c                 */
+void exofs_make_credential(u8 cred_a[OSD_CAP_LEN],
+			   const struct osd_obj_id *obj);
+
+int exofs_check_ok_resid(struct osd_request *or, u64 *in_resid, u64 *out_resid);
+static inline int exofs_check_ok(struct osd_request *or)
+{
+	return exofs_check_ok_resid(or, NULL, NULL);
+}
+int exofs_sync_op(struct osd_request *or, int timeout, u8 *cred);
+int exofs_async_op(struct osd_request *or,
+	osd_req_done_fn *async_done, void *caller_context, u8 *cred);
+
+int extract_attr_from_req(struct osd_request *or, struct osd_attr *attr);
+
+int osd_req_read_kern(struct osd_request *or,
+	const struct osd_obj_id *obj, u64 offset, void *buff, u64 len);
+
+int osd_req_write_kern(struct osd_request *or,
+	const struct osd_obj_id *obj, u64 offset, void *buff, u64 len);
+
+#endif /*ifndef __EXOFS_COM_H__*/
diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
new file mode 100644
index 0000000..304e052
--- /dev/null
+++ b/fs/exofs/exofs.h
@@ -0,0 +1,127 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger (avishay@gmail.com) (avishay@il.ibm.com)
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <bharrosh@panasas.com>
+ *
+ * Copyrights for code taken from ext2:
+ *     Copyright (C) 1992, 1993, 1994, 1995
+ *     Remy Card (card@masi.ibp.fr)
+ *     Laboratoire MASI - Institut Blaise Pascal
+ *     Universite Pierre et Marie Curie (Paris VI)
+ *     from
+ *     linux/fs/minix/inode.c
+ *     Copyright (C) 1991, 1992  Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.  Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include <linux/fs.h>
+#include <linux/time.h>
+#include "common.h"
+
+#ifndef __EXOFS_H__
+#define __EXOFS_H__
+
+#define EXOFS_ERR(fmt, a...) printk(KERN_ERR "exofs: " fmt, ##a)
+
+#ifdef CONFIG_EXOFS_DEBUG
+#define EXOFS_DBGMSG(fmt, a...) \
+	printk(KERN_NOTICE "exofs @%s:%d: " fmt, __func__, __LINE__, ##a)
+#else
+#define EXOFS_DBGMSG(fmt, a...) \
+	do {} while (0)
+#endif
+
+/* u64 is not always unsigned long long; cast it so printk can use %llu */
+#define _LLU(x) ((unsigned long long)(x))
+
+/*
+ * our extension to the in-memory superblock
+ */
+struct exofs_sb_info {
+	struct osd_dev	*s_dev;			/* returned by get_osd_dev    */
+	osd_id		s_pid;			/* partition ID of file system*/
+	int		s_timeout;		/* timeout for OSD operations */
+	uint64_t	s_nextid;		/* highest object ID used     */
+	uint32_t	s_numfiles;		/* number of files on fs      */
+	spinlock_t	s_next_gen_lock;	/* spinlock for gen # update  */
+	u32		s_next_generation;	/* next gen # to use          */
+	atomic_t	s_curr_pending;		/* number of pending commands */
+	uint8_t		s_cred[OSD_CAP_LEN];	/* all-powerful credential    */
+};
+
+/*
+ * our extension to the in-memory inode
+ */
+struct exofs_i_info {
+	unsigned long  i_flags;            /* various atomic flags            */
+	uint32_t       i_data[EXOFS_IDATA];/*short symlink names and device #s*/
+	uint32_t       i_dir_start_lookup; /* which page to start lookup      */
+	wait_queue_head_t i_wq;            /* wait queue for inode            */
+	uint64_t       i_commit_size;      /* the object's written length     */
+	uint8_t        i_cred[OSD_CAP_LEN];/* all-powerful credential         */
+	struct inode   vfs_inode;          /* normal in-memory inode          */
+};
+
+/*
+ * our inode flags
+ */
+#define OBJ_2BCREATED	0	/* object will be created soon*/
+#define OBJ_CREATED	1	/* object has been created on the osd*/
+
+static inline int obj_2bcreated(struct exofs_i_info *oi)
+{
+	return test_bit(OBJ_2BCREATED, &(oi->i_flags));
+}
+
+static inline void set_obj_2bcreated(struct exofs_i_info *oi)
+{
+	set_bit(OBJ_2BCREATED, &(oi->i_flags));
+}
+
+static inline int obj_created(struct exofs_i_info *oi)
+{
+	return test_bit(OBJ_CREATED, &(oi->i_flags));
+}
+
+static inline void set_obj_created(struct exofs_i_info *oi)
+{
+	set_bit(OBJ_CREATED, &(oi->i_flags));
+}
+
+int __exofs_wait_obj_created(struct exofs_i_info *oi);
+static inline int wait_obj_created(struct exofs_i_info *oi)
+{
+	if (likely(obj_created(oi)))
+		return 0;
+
+	return __exofs_wait_obj_created(oi);
+}
+
+/*
+ * get to our inode from the vfs inode
+ */
+static inline struct exofs_i_info *exofs_i(struct inode *inode)
+{
+	return container_of(inode, struct exofs_i_info, vfs_inode);
+}
+
+#endif
diff --git a/fs/exofs/osd.c b/fs/exofs/osd.c
new file mode 100644
index 0000000..b249ae9
--- /dev/null
+++ b/fs/exofs/osd.c
@@ -0,0 +1,153 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger (avishay@gmail.com) (avishay@il.ibm.com)
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <bharrosh@panasas.com>
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.  Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include <scsi/scsi_device.h>
+#include <scsi/osd_sense.h>
+
+#include "exofs.h"
+
+int exofs_check_ok_resid(struct osd_request *or, u64 *in_resid, u64 *out_resid)
+{
+	struct osd_sense_info osi;
+	int ret = osd_req_decode_sense(or, &osi);
+
+	if (ret) { /* translate to Linux codes */
+		if (osi.additional_code == scsi_invalid_field_in_cdb) {
+			if (osi.cdb_field_offset == OSD_CFO_STARTING_BYTE)
+				ret = -EFAULT;
+			else if (osi.cdb_field_offset == OSD_CFO_OBJECT_ID)
+				ret = -ENOENT;
+			else
+				ret = -EINVAL;
+		} else if (osi.additional_code == osd_quota_error)
+			ret = -ENOSPC;
+		else
+			ret = -EIO;
+	}
+
+	/* FIXME: should be included in osd_sense_info */
+	if (in_resid)
+		*in_resid = or->in.req ? or->in.req->data_len : 0;
+
+	if (out_resid)
+		*out_resid = or->out.req ? or->out.req->data_len : 0;
+
+	return ret;
+}
+
+void exofs_make_credential(u8 cred_a[OSD_CAP_LEN], const struct osd_obj_id *obj)
+{
+	osd_sec_init_nosec_doall_caps(cred_a, obj, false, true);
+}
+
+/*
+ * Perform a synchronous OSD operation.
+ */
+int exofs_sync_op(struct osd_request *or, int timeout, uint8_t *credential)
+{
+	int ret;
+
+	or->timeout = timeout;
+	ret = osd_finalize_request(or, 0, credential, NULL);
+	if (ret) {
+		EXOFS_DBGMSG("Failed to osd_finalize_request() => %d\n", ret);
+		return ret;
+	}
+
+	ret = osd_execute_request(or);
+
+	if (ret)
+		EXOFS_DBGMSG("osd_execute_request() => %d\n", ret);
+	/* osd_req_decode_sense(or, ret); */
+	return ret;
+}
+
+/*
+ * Perform an asynchronous OSD operation.
+ */
+int exofs_async_op(struct osd_request *or, osd_req_done_fn *async_done,
+		   void *caller_context, u8 *cred)
+{
+	int ret;
+
+	ret = osd_finalize_request(or, 0, cred, NULL);
+	if (ret) {
+		EXOFS_DBGMSG("Failed to osd_finalize_request() => %d\n", ret);
+		return ret;
+	}
+
+	ret = osd_execute_request_async(or, async_done, caller_context);
+
+	if (ret)
+		EXOFS_DBGMSG("osd_execute_request_async() => %d\n", ret);
+	return ret;
+}
+
+int extract_attr_from_req(struct osd_request *or, struct osd_attr *attr)
+{
+	struct osd_attr cur_attr = {.attr_page = 0}; /* start with zeros */
+	void *iter = NULL;
+	int nelem;
+
+	do {
+		nelem = 1;
+		osd_req_decode_get_attr_list(or, &cur_attr, &nelem, &iter);
+		if ((cur_attr.attr_page == attr->attr_page) &&
+		    (cur_attr.attr_id == attr->attr_id)) {
+			attr->len = cur_attr.len;
+			attr->val_ptr = cur_attr.val_ptr;
+			return 0;
+		}
+	} while (iter);
+
+	return -EIO;
+}
+
+int osd_req_read_kern(struct osd_request *or,
+	const struct osd_obj_id *obj, u64 offset, void *buff, u64 len)
+{
+	struct request_queue *req_q = or->osd_dev->scsi_device->request_queue;
+	struct bio *bio = bio_map_kern(req_q, buff, len, GFP_KERNEL);
+
+	if (!bio)
+		return -ENOMEM;
+
+	osd_req_read(or, obj, bio, offset);
+	return 0;
+}
+
+int osd_req_write_kern(struct osd_request *or,
+	const struct osd_obj_id *obj, u64 offset, void *buff, u64 len)
+{
+	struct request_queue *req_q = or->osd_dev->scsi_device->request_queue;
+	struct bio *bio = bio_map_kern(req_q, buff, len, GFP_KERNEL);
+
+	if (!bio)
+		return -ENOMEM;
+
+	osd_req_write(or, obj, bio, offset);
+	return 0;
+}
-- 
1.6.2.1
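
For anyone checking the directory-entry arithmetic in common.h, here is a
minimal userspace sketch of EXOFS_DIR_REC_LEN. It is not part of the patch.
The fixed-width types stand in for __le64/__le16/u8 (no endian handling), and
rec_len_for() is a hypothetical helper added only for illustration. The sketch
assumes the usual layout where name[] starts at byte 12.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXOFS_NAME_LEN	255

/* Userspace stand-in for the on-disk entry (assumption: no endian handling). */
struct exofs_dir_entry {
	uint64_t	inode_no;		/* inode number           */
	uint16_t	rec_len;		/* directory entry length */
	uint8_t		name_len;		/* name length            */
	uint8_t		file_type;		/* file type              */
	char		name[EXOFS_NAME_LEN];	/* file name              */
};

#define EXOFS_DIR_PAD		4
#define EXOFS_DIR_ROUND		(EXOFS_DIR_PAD - 1)
#define EXOFS_DIR_REC_LEN(name_len) \
	(((name_len) + offsetof(struct exofs_dir_entry, name) + \
	  EXOFS_DIR_ROUND) & ~EXOFS_DIR_ROUND)

/* The fixed header is 8 + 2 + 1 + 1 bytes, so name[] starts at offset 12. */
static_assert(offsetof(struct exofs_dir_entry, name) == 12,
	      "unexpected padding before name[]");

/* Hypothetical helper: bytes needed for an entry whose name is name_len long. */
size_t rec_len_for(size_t name_len)
{
	return EXOFS_DIR_REC_LEN(name_len);
}
```

Under these assumptions a 1-byte name and a 4-byte name both need 16 bytes,
a 5-byte name needs 20, and the 255-byte maximum needs 268: the header plus
the name, rounded up to the next multiple of EXOFS_DIR_PAD.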


* [PATCH 1/8] exofs: Kbuild, Headers and osd utils
  2009-03-18 17:45 [PATCHSET 0/8 version 4] exofs for kernel 2.6.30 Boaz Harrosh
@ 2009-03-18 17:57 ` Boaz Harrosh
  2009-03-18 17:57 ` Boaz Harrosh
  1 sibling, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2009-03-18 17:57 UTC (permalink / raw)
  To: Avishay Traeger, Jeff Garzik, Andrew Morton, Evgeniy Polyakov,
	linux-fsdevel
  Cc: linux-kernel, James Bottomley, FUJITA Tomonori

This patch adds the OSD infrastructure that the file system will use
in later patches.

It also adds the declarations of constants, on-disk structures,
and function prototypes.

Finally, it adds the Kbuild and Kconfig files needed to build the
exofs module.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 fs/exofs/Kbuild   |   30 +++++++++
 fs/exofs/Kconfig  |   13 ++++
 fs/exofs/common.h |  185 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/exofs/exofs.h  |  127 ++++++++++++++++++++++++++++++++++++
 fs/exofs/osd.c    |  153 +++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 508 insertions(+), 0 deletions(-)
 create mode 100644 fs/exofs/Kbuild
 create mode 100644 fs/exofs/Kconfig
 create mode 100644 fs/exofs/common.h
 create mode 100644 fs/exofs/exofs.h
 create mode 100644 fs/exofs/osd.c
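
The NOTE in common.h (inode# = object ID - EXOFS_OBJ_OFF) can be sketched in
userspace as below. This is only an illustration and not part of the patch.
The three defines are copied from common.h; sketch_oid_to_ino() and
sketch_ino_to_oid() are hypothetical helper names, not functions that exofs
provides.

```c
#include <assert.h>
#include <stdint.h>

#define EXOFS_OBJ_OFF	0x10000	/* offset for objects */
#define EXOFS_SUPER_ID	0x10000	/* object ID for on-disk superblock */
#define EXOFS_ROOT_ID	0x10002	/* object ID for root directory */

/* With the offset removed, the root directory lands on inode number 2. */
static_assert(EXOFS_ROOT_ID - EXOFS_OBJ_OFF == 2, "root directory is inode 2");

/* Hypothetical helper: inode number for a given OSD object ID. */
uint64_t sketch_oid_to_ino(uint64_t oid)
{
	return oid - EXOFS_OBJ_OFF;
}

/* Hypothetical helper: OSD object ID for a given inode number. */
uint64_t sketch_ino_to_oid(uint64_t ino)
{
	return ino + EXOFS_OBJ_OFF;
}
```

So the superblock object maps to inode number 0 and the root directory to
inode number 2, which matches the comment that object IDs 0, 1, and 2 (after
the offset is removed) are always in use.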

diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
new file mode 100644
index 0000000..63d822c
--- /dev/null
+++ b/fs/exofs/Kbuild
@@ -0,0 +1,30 @@
+#
+# Kbuild for the EXOFS module
+#
+# Copyright (C) 2008 Panasas Inc.  All rights reserved.
+#
+# Authors:
+#   Boaz Harrosh <bharrosh@panasas.com>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License version 2
+#
+# Kbuild - Gets included from the Kernel's Makefile and build system
+#
+
+ifneq ($(OSD_INC),)
+# we are built out-of-tree, so configure everything as on here
+
+CONFIG_EXOFS_FS=m
+ccflags-y += -DCONFIG_EXOFS_FS -DCONFIG_EXOFS_FS_MODULE
+# ccflags-y += -DCONFIG_EXOFS_DEBUG
+
+# If we are built out-of-tree and the hosting kernel has its own OSD headers,
+# "ccflags-y +=" will not pick up the out-of-tree headers. Prepending to
+# KBUILD_CPPFLAGS does. This might break in future kernels.
+KBUILD_CPPFLAGS := -I$(OSD_INC) $(KBUILD_CPPFLAGS)
+
+endif
+
+exofs-y := osd.o
+obj-$(CONFIG_EXOFS_FS) += exofs.o
diff --git a/fs/exofs/Kconfig b/fs/exofs/Kconfig
new file mode 100644
index 0000000..86194b2
--- /dev/null
+++ b/fs/exofs/Kconfig
@@ -0,0 +1,13 @@
+config EXOFS_FS
+	tristate "exofs: OSD based file system support"
+	depends on SCSI_OSD_ULD
+	help
+	  EXOFS is a file system that uses an OSD storage device
+	  as its backing storage.
+
+# Debugging-related stuff
+config EXOFS_DEBUG
+	bool "Enable debugging"
+	depends on EXOFS_FS
+	help
+	  This option enables EXOFS debug prints.
diff --git a/fs/exofs/common.h b/fs/exofs/common.h
new file mode 100644
index 0000000..bcc4882
--- /dev/null
+++ b/fs/exofs/common.h
@@ -0,0 +1,185 @@
+/*
+ * common.h - Common definitions for both Kernel and user-mode utilities
+ *
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger (avishay@gmail.com) (avishay@il.ibm.com)
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <bharrosh@panasas.com>
+ *
+ * Copyrights for code taken from ext2:
+ *     Copyright (C) 1992, 1993, 1994, 1995
+ *     Remy Card (card@masi.ibp.fr)
+ *     Laboratoire MASI - Institut Blaise Pascal
+ *     Universite Pierre et Marie Curie (Paris VI)
+ *     from
+ *     linux/fs/minix/inode.c
+ *     Copyright (C) 1991, 1992  Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.  Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#ifndef __EXOFS_COM_H__
+#define __EXOFS_COM_H__
+
+#include <linux/types.h>
+
+#include <scsi/osd_attributes.h>
+#include <scsi/osd_initiator.h>
+#include <scsi/osd_sec.h>
+
+/****************************************************************************
+ * Object ID related defines
+ * NOTE: inode# = object ID - EXOFS_OBJ_OFF
+ ****************************************************************************/
+#define EXOFS_MIN_PID   0x10000	/* Smallest partition ID */
+#define EXOFS_OBJ_OFF	0x10000	/* offset for objects */
+#define EXOFS_SUPER_ID	0x10000	/* object ID for on-disk superblock */
+#define EXOFS_ROOT_ID	0x10002	/* object ID for root directory */
+
+/* exofs Application specific page/attribute */
+# define EXOFS_APAGE_FS_DATA	(OSD_APAGE_APP_DEFINED_FIRST + 3)
+# define EXOFS_ATTR_INODE_DATA	1
+
+/*
+ * The maximum number of files we can have is limited by the size of the
+ * inode number.  This is the largest object ID that the file system supports.
+ * Object IDs 0, 1, and 2 are always in use (see above defines).
+ */
+enum {
+	EXOFS_UINT64_MAX = (~0LL),
+	EXOFS_MAX_INO_ID = (sizeof(ino_t) * 8 == 64) ? EXOFS_UINT64_MAX :
+					(1LL << (sizeof(ino_t) * 8 - 1)),
+	EXOFS_MAX_ID	 = (EXOFS_MAX_INO_ID - 1 - EXOFS_OBJ_OFF),
+};
+
+/****************************************************************************
+ * Misc.
+ ****************************************************************************/
+#define EXOFS_BLKSHIFT	12
+#define EXOFS_BLKSIZE	(1UL << EXOFS_BLKSHIFT)
+
+/****************************************************************************
+ * superblock-related things
+ ****************************************************************************/
+#define EXOFS_SUPER_MAGIC	0x5DF5
+
+/*
+ * The file system control block - stored in an object's data (mainly, the one
+ * with ID EXOFS_SUPER_ID).  This is where the in-memory superblock is stored
+ * on disk.  Right now it just has a magic value, which is basically a sanity
+ * check on our ability to communicate with the object store.
+ */
+struct exofs_fscb {
+	__le64  s_nextid;	/* Highest object ID used */
+	__le32  s_numfiles;	/* Number of files on fs */
+	__le16  s_magic;	/* Magic signature */
+	__le16  s_newfs;	/* Non-zero if this is a new fs */
+};
+
+/****************************************************************************
+ * inode-related things
+ ****************************************************************************/
+#define EXOFS_IDATA		5
+
+/*
+ * The file control block - stored in an object's attributes.  This is where
+ * the in-memory inode is stored on disk.
+ */
+struct exofs_fcb {
+	__le64  i_size;			/* Size of the file */
+	__le16  i_mode;         	/* File mode */
+	__le16  i_links_count;  	/* Links count */
+	__le32  i_uid;          	/* Owner Uid */
+	__le32  i_gid;          	/* Group Id */
+	__le32  i_atime;        	/* Access time */
+	__le32  i_ctime;        	/* Inode change time */
+	__le32  i_mtime;        	/* Modification time */
+	__le32  i_flags;        	/* File flags (unused for now)*/
+	__le32  i_generation;   	/* File version (for NFS) */
+	__le32  i_data[EXOFS_IDATA];	/* Short symlink names and device #s */
+};
+
+#define EXOFS_INO_ATTR_SIZE	sizeof(struct exofs_fcb)
+
+/* This is the Attribute the fcb is stored in */
+static const struct __weak osd_attr g_attr_inode_data = ATTR_DEF(
+	EXOFS_APAGE_FS_DATA,
+	EXOFS_ATTR_INODE_DATA,
+	EXOFS_INO_ATTR_SIZE);
+
+/****************************************************************************
+ * dentry-related things
+ ****************************************************************************/
+#define EXOFS_NAME_LEN	255
+
+/*
+ * The on-disk directory entry
+ */
+struct exofs_dir_entry {
+	__le64		inode_no;		/* inode number           */
+	__le16		rec_len;		/* directory entry length */
+	u8		name_len;		/* name length            */
+	u8		file_type;		/* umm...file type        */
+	char		name[EXOFS_NAME_LEN];	/* file name              */
+};
+
+enum {
+	EXOFS_FT_UNKNOWN,
+	EXOFS_FT_REG_FILE,
+	EXOFS_FT_DIR,
+	EXOFS_FT_CHRDEV,
+	EXOFS_FT_BLKDEV,
+	EXOFS_FT_FIFO,
+	EXOFS_FT_SOCK,
+	EXOFS_FT_SYMLINK,
+	EXOFS_FT_MAX
+};
+
+#define EXOFS_DIR_PAD			4
+#define EXOFS_DIR_ROUND			(EXOFS_DIR_PAD - 1)
+#define EXOFS_DIR_REC_LEN(name_len) \
+	(((name_len) + offsetof(struct exofs_dir_entry, name)  + \
+	  EXOFS_DIR_ROUND) & ~EXOFS_DIR_ROUND)
+
+/*************************
+ * function declarations *
+ *************************/
+/* osd.c                 */
+void exofs_make_credential(u8 cred_a[OSD_CAP_LEN],
+			   const struct osd_obj_id *obj);
+
+int exofs_check_ok_resid(struct osd_request *or, u64 *in_resid, u64 *out_resid);
+static inline int exofs_check_ok(struct osd_request *or)
+{
+	return exofs_check_ok_resid(or, NULL, NULL);
+}
+int exofs_sync_op(struct osd_request *or, int timeout, u8 *cred);
+int exofs_async_op(struct osd_request *or,
+	osd_req_done_fn *async_done, void *caller_context, u8 *cred);
+
+int extract_attr_from_req(struct osd_request *or, struct osd_attr *attr);
+
+int osd_req_read_kern(struct osd_request *or,
+	const struct osd_obj_id *obj, u64 offset, void *buff, u64 len);
+
+int osd_req_write_kern(struct osd_request *or,
+	const struct osd_obj_id *obj, u64 offset, void *buff, u64 len);
+
+#endif /*ifndef __EXOFS_COM_H__*/
diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
new file mode 100644
index 0000000..304e052
--- /dev/null
+++ b/fs/exofs/exofs.h
@@ -0,0 +1,127 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger (avishay@gmail.com) (avishay@il.ibm.com)
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <bharrosh@panasas.com>
+ *
+ * Copyrights for code taken from ext2:
+ *     Copyright (C) 1992, 1993, 1994, 1995
+ *     Remy Card (card@masi.ibp.fr)
+ *     Laboratoire MASI - Institut Blaise Pascal
+ *     Universite Pierre et Marie Curie (Paris VI)
+ *     from
+ *     linux/fs/minix/inode.c
+ *     Copyright (C) 1991, 1992  Linus Torvalds
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.  Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include <linux/fs.h>
+#include <linux/time.h>
+#include "common.h"
+
+#ifndef __EXOFS_H__
+#define __EXOFS_H__
+
+#define EXOFS_ERR(fmt, a...) printk(KERN_ERR "exofs: " fmt, ##a)
+
+#ifdef CONFIG_EXOFS_DEBUG
+#define EXOFS_DBGMSG(fmt, a...) \
+	printk(KERN_NOTICE "exofs @%s:%d: " fmt, __func__, __LINE__, ##a)
+#else
+#define EXOFS_DBGMSG(fmt, a...) \
+	do {} while (0)
+#endif
+
+/* u64 is not always unsigned long long; cast it so printk can use %llu */
+#define _LLU(x) ((unsigned long long)(x))
+
+/*
+ * our extension to the in-memory superblock
+ */
+struct exofs_sb_info {
+	struct osd_dev	*s_dev;			/* returned by get_osd_dev    */
+	osd_id		s_pid;			/* partition ID of file system*/
+	int		s_timeout;		/* timeout for OSD operations */
+	uint64_t	s_nextid;		/* highest object ID used     */
+	uint32_t	s_numfiles;		/* number of files on fs      */
+	spinlock_t	s_next_gen_lock;	/* spinlock for gen # update  */
+	u32		s_next_generation;	/* next gen # to use          */
+	atomic_t	s_curr_pending;		/* number of pending commands */
+	uint8_t		s_cred[OSD_CAP_LEN];	/* all-powerful credential    */
+};
+
+/*
+ * our extension to the in-memory inode
+ */
+struct exofs_i_info {
+	unsigned long  i_flags;            /* various atomic flags            */
+	uint32_t       i_data[EXOFS_IDATA];/*short symlink names and device #s*/
+	uint32_t       i_dir_start_lookup; /* which page to start lookup      */
+	wait_queue_head_t i_wq;            /* wait queue for inode            */
+	uint64_t       i_commit_size;      /* the object's written length     */
+	uint8_t        i_cred[OSD_CAP_LEN];/* all-powerful credential         */
+	struct inode   vfs_inode;          /* normal in-memory inode          */
+};
+
+/*
+ * our inode flags
+ */
+#define OBJ_2BCREATED	0	/* object will be created soon*/
+#define OBJ_CREATED	1	/* object has been created on the osd*/
+
+static inline int obj_2bcreated(struct exofs_i_info *oi)
+{
+	return test_bit(OBJ_2BCREATED, &(oi->i_flags));
+}
+
+static inline void set_obj_2bcreated(struct exofs_i_info *oi)
+{
+	set_bit(OBJ_2BCREATED, &(oi->i_flags));
+}
+
+static inline int obj_created(struct exofs_i_info *oi)
+{
+	return test_bit(OBJ_CREATED, &(oi->i_flags));
+}
+
+static inline void set_obj_created(struct exofs_i_info *oi)
+{
+	set_bit(OBJ_CREATED, &(oi->i_flags));
+}
+
+int __exofs_wait_obj_created(struct exofs_i_info *oi);
+static inline int wait_obj_created(struct exofs_i_info *oi)
+{
+	if (likely(obj_created(oi)))
+		return 0;
+
+	return __exofs_wait_obj_created(oi);
+}
+
+/*
+ * get to our inode from the vfs inode
+ */
+static inline struct exofs_i_info *exofs_i(struct inode *inode)
+{
+	return container_of(inode, struct exofs_i_info, vfs_inode);
+}
+
+#endif
diff --git a/fs/exofs/osd.c b/fs/exofs/osd.c
new file mode 100644
index 0000000..b249ae9
--- /dev/null
+++ b/fs/exofs/osd.c
@@ -0,0 +1,153 @@
+/*
+ * Copyright (C) 2005, 2006
+ * Avishay Traeger (avishay@gmail.com) (avishay@il.ibm.com)
+ * Copyright (C) 2005, 2006
+ * International Business Machines
+ * Copyright (C) 2008, 2009
+ * Boaz Harrosh <bharrosh@panasas.com>
+ *
+ * This file is part of exofs.
+ *
+ * exofs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.  Since it is based on ext2, and the only
+ * valid version of GPL for the Linux kernel is version 2, the only valid
+ * version of GPL for exofs is version 2.
+ *
+ * exofs is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with exofs; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include <scsi/scsi_device.h>
+#include <scsi/osd_sense.h>
+
+#include "exofs.h"
+
+int exofs_check_ok_resid(struct osd_request *or, u64 *in_resid, u64 *out_resid)
+{
+	struct osd_sense_info osi;
+	int ret = osd_req_decode_sense(or, &osi);
+
+	if (ret) { /* translate to Linux codes */
+		if (osi.additional_code == scsi_invalid_field_in_cdb) {
+			if (osi.cdb_field_offset == OSD_CFO_STARTING_BYTE)
+				ret = -EFAULT;
+			else if (osi.cdb_field_offset == OSD_CFO_OBJECT_ID)
+				ret = -ENOENT;
+			else
+				ret = -EINVAL;
+		} else if (osi.additional_code == osd_quota_error)
+			ret = -ENOSPC;
+		else
+			ret = -EIO;
+	}
+
+	/* FIXME: should be included in osd_sense_info */
+	if (in_resid)
+		*in_resid = or->in.req ? or->in.req->data_len : 0;
+
+	if (out_resid)
+		*out_resid = or->out.req ? or->out.req->data_len : 0;
+
+	return ret;
+}
+
+void exofs_make_credential(u8 cred_a[OSD_CAP_LEN], const struct osd_obj_id *obj)
+{
+	osd_sec_init_nosec_doall_caps(cred_a, obj, false, true);
+}
+
+/*
+ * Perform a synchronous OSD operation.
+ */
+int exofs_sync_op(struct osd_request *or, int timeout, uint8_t *credential)
+{
+	int ret;
+
+	or->timeout = timeout;
+	ret = osd_finalize_request(or, 0, credential, NULL);
+	if (ret) {
+		EXOFS_DBGMSG("Failed to osd_finalize_request() => %d\n", ret);
+		return ret;
+	}
+
+	ret = osd_execute_request(or);
+
+	if (ret)
+		EXOFS_DBGMSG("osd_execute_request() => %d\n", ret);
+	/* osd_req_decode_sense(or, ret); */
+	return ret;
+}
+
+/*
+ * Perform an asynchronous OSD operation.
+ */
+int exofs_async_op(struct osd_request *or, osd_req_done_fn *async_done,
+		   void *caller_context, u8 *cred)
+{
+	int ret;
+
+	ret = osd_finalize_request(or, 0, cred, NULL);
+	if (ret) {
+		EXOFS_DBGMSG("Failed to osd_finalize_request() => %d\n", ret);
+		return ret;
+	}
+
+	ret = osd_execute_request_async(or, async_done, caller_context);
+
+	if (ret)
+		EXOFS_DBGMSG("osd_execute_request_async() => %d\n", ret);
+	return ret;
+}
+
+int extract_attr_from_req(struct osd_request *or, struct osd_attr *attr)
+{
+	struct osd_attr cur_attr = {.attr_page = 0}; /* start with zeros */
+	void *iter = NULL;
+	int nelem;
+
+	do {
+		nelem = 1;
+		osd_req_decode_get_attr_list(or, &cur_attr, &nelem, &iter);
+		if ((cur_attr.attr_page == attr->attr_page) &&
+		    (cur_attr.attr_id == attr->attr_id)) {
+			attr->len = cur_attr.len;
+			attr->val_ptr = cur_attr.val_ptr;
+			return 0;
+		}
+	} while (iter);
+
+	return -EIO;
+}
+
+int osd_req_read_kern(struct osd_request *or,
+	const struct osd_obj_id *obj, u64 offset, void *buff, u64 len)
+{
+	struct request_queue *req_q = or->osd_dev->scsi_device->request_queue;
+	struct bio *bio = bio_map_kern(req_q, buff, len, GFP_KERNEL);
+
+	if (!bio)
+		return -ENOMEM;
+
+	osd_req_read(or, obj, bio, offset);
+	return 0;
+}
+
+int osd_req_write_kern(struct osd_request *or,
+	const struct osd_obj_id *obj, u64 offset, void *buff, u64 len)
+{
+	struct request_queue *req_q = or->osd_dev->scsi_device->request_queue;
+	struct bio *bio = bio_map_kern(req_q, buff, len, GFP_KERNEL);
+
+	if (!bio)
+		return -ENOMEM;
+
+	osd_req_write(or, obj, bio, offset);
+	return 0;
+}
-- 
1.6.2.1


end of thread, other threads:[~2009-03-31  9:00 UTC | newest]

Thread overview: 36+ messages
2009-02-09 13:07 [PATCHSET 0/8 version 3] exofs Boaz Harrosh
2009-02-09 13:12 ` [PATCH 1/8] exofs: Kbuild, Headers and osd utils Boaz Harrosh
2009-02-16  4:18   ` FUJITA Tomonori
2009-02-16  8:49     ` Boaz Harrosh
2009-02-16  9:00       ` FUJITA Tomonori
2009-02-16  9:19         ` Boaz Harrosh
2009-02-16  9:27           ` Jeff Garzik
2009-02-16 10:19             ` Boaz Harrosh
2009-02-16 11:05               ` pNFS rant (was Re: [PATCH 1/8] exofs: Kbuild, Headers and osd utils) Jeff Garzik
2009-02-16 12:45                 ` Boaz Harrosh
2009-02-16 15:50                 ` James Bottomley
2009-02-16 16:27                   ` Benny Halevy
2009-02-16 16:23                 ` Benny Halevy
2009-02-16  9:38           ` [PATCH 1/8] exofs: Kbuild, Headers and osd utils FUJITA Tomonori
2009-02-16 10:29             ` Boaz Harrosh
2009-02-17  0:20               ` FUJITA Tomonori
2009-02-17  8:10                 ` [osd-dev] " Boaz Harrosh
2009-02-27  8:09                   ` FUJITA Tomonori
2009-03-01 10:43                     ` Boaz Harrosh
2009-02-09 13:18 ` [PATCH 2/8] exofs: file and file_inode operations Boaz Harrosh
2009-02-09 13:20 ` [PATCH 3/8] exofs: symlink_inode and fast_symlink_inode operations Boaz Harrosh
2009-02-09 13:22 ` [PATCH 4/8] exofs: address_space_operations Boaz Harrosh
2009-02-09 13:24 ` [PATCH 5/8] exofs: dir_inode and directory operations Boaz Harrosh
2009-02-15 17:08   ` Evgeniy Polyakov
2009-02-16  9:31     ` Boaz Harrosh
2009-03-15 18:10       ` Boaz Harrosh
2009-03-15 18:37         ` Evgeniy Polyakov
2009-02-09 13:25 ` [PATCH 6/8] exofs: super_operations and file_system_type Boaz Harrosh
2009-02-15 17:24   ` Evgeniy Polyakov
2009-02-16  9:59     ` Boaz Harrosh
2009-02-09 13:29 ` [PATCH 7/8] exofs: Documentation Boaz Harrosh
2009-02-09 13:31 ` [PATCH 8/8] fs: Add exofs to Kernel build Boaz Harrosh
2009-03-18 17:45 [PATCHSET 0/8 version 4] exofs for kernel 2.6.30 Boaz Harrosh
2009-03-18 17:57 ` [PATCH 1/8] exofs: Kbuild, Headers and osd utils Boaz Harrosh
2009-03-18 17:57 ` Boaz Harrosh
2009-03-31  8:04   ` Andrew Morton
2009-03-31  8:57     ` Boaz Harrosh
