linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* ->direct_IO API change in current 2.4 BK
@ 2003-07-09 12:31 Christoph Hellwig
  2003-07-09 17:03 ` Andreas Dilger
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2003-07-09 12:31 UTC (permalink / raw)
  To: marcelo, trond.myklebust; +Cc: linux-kernel

I just got a nice XFS oops due to the direct_IO API change in
2.4.  Guys, this is a STABLE series and APIs are supposed to be exactly
that, _STABLE_.  If you really think O_DIRECT on NFS is soo important
please add a ->direct_IO2 for NFS like the reiserfs read_inode2 hack.

But what's the use of it anyway?  AFAIK it's mostly for whoracle setups
that have their data on netapps but that needs a certified vendor kernel
not mainline..

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 12:31 ->direct_IO API change in current 2.4 BK Christoph Hellwig
@ 2003-07-09 17:03 ` Andreas Dilger
  2003-07-09 17:24   ` Marcelo Tosatti
  0 siblings, 1 reply; 27+ messages in thread
From: Andreas Dilger @ 2003-07-09 17:03 UTC (permalink / raw)
  To: Christoph Hellwig, marcelo, trond.myklebust, linux-kernel

On Jul 09, 2003  13:31 +0100, Christoph Hellwig wrote:
> I just got a nice XFS oops due to the direct_IO API change in
> 2.4.  Guys, this is a STABLE series and APIs are supposed to be exactly
> that, _STABLE_.  If you really think O_DIRECT on NFS is soo important
> please add a ->direct_IO2 for NFS like the reiserfs read_inode2 hack.

I would have to agree with that sentiment - we shouldn't change the
API in an "almost compatible" way, although I would have hoped that
compile warnings and/or module symbol versioning would have avoided
a crash.

> But what's the use of it anyway?  AFAIK it's mostly for whoracle setups
> that have their data on netapps but that needs a certified vendor kernel
> not mainline..

Actually, it is useful for Lustre to do this, because it allows us to have
a file handle (which, naturally, holds per-file data) at the time the IO is
sent over the wire, instead of the "anonymous" writes that happen now.
This helps us with readahead on the server and other minor improvements.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 17:03 ` Andreas Dilger
@ 2003-07-09 17:24   ` Marcelo Tosatti
  2003-07-09 17:43     ` Marc-Christian Petersen
  2003-07-09 18:29     ` Trond Myklebust
  0 siblings, 2 replies; 27+ messages in thread
From: Marcelo Tosatti @ 2003-07-09 17:24 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Christoph Hellwig, marcelo, Trond Myklebust, lkml



On Wed, 9 Jul 2003, Andreas Dilger wrote:

> On Jul 09, 2003  13:31 +0100, Christoph Hellwig wrote:
> > I just got a nice XFS oops due to the direct_IO API change in
> > 2.4.  Guys, this is a STABLE series and APIs are supposed to be exactly
> > that, _STABLE_.  If you really think O_DIRECT on NFS is soo important
> > please add a ->direct_IO2 for NFS like the reiserfs read_inode2 hack.
>
> I would have to agree with that sentiment - we shouldn't change the
> API in an "almost compatible" way, although I would have hoped that
> compile warnings and/or module symbol versioning would have avoided
> a crash.
>
> > But what's the use of it anyway?  AFAIK it's mostly for whoracle setups
> > that have their data on netapps but that needs a certified vendor kernel
> > not mainline..
>
> Actually, it is useful for Lustre to do this, because it allows us to have
> a file handle (which, naturally, holds per-file data) at the time the IO is
> sent over the wire, instead of the "anonymous" writes that happen now.
> This helps us with readahead on the server and other minor improvements.

Fine, I agree. Trond, I'll have to revert your direct IO patches because
they break the _stable_ API, indeed.

Please come up with another solution for the problem (->direct_IO2 ? its
ugly, but...).

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 17:24   ` Marcelo Tosatti
@ 2003-07-09 17:43     ` Marc-Christian Petersen
  2003-07-09 17:46       ` Marcelo Tosatti
  2003-07-09 18:29     ` Trond Myklebust
  1 sibling, 1 reply; 27+ messages in thread
From: Marc-Christian Petersen @ 2003-07-09 17:43 UTC (permalink / raw)
  To: Marcelo Tosatti, Andreas Dilger, Andrea Arcangeli
  Cc: Christoph Hellwig, marcelo, Trond Myklebust, lkml

On Wednesday 09 July 2003 19:24, Marcelo Tosatti wrote:

Hi,

> > > I just got a nice XFS oops due to the direct_IO API change in
> > > 2.4.  Guys, this is a STABLE series and APIs are supposed to be exactly
> > > that, _STABLE_.  If you really think O_DIRECT on NFS is soo important
> > > please add a ->direct_IO2 for NFS like the reiserfs read_inode2 hack.
I wonder why -aa and -wolk don't have these problems with O_DIRECT vs. XFS.

ciao, Marc


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 17:43     ` Marc-Christian Petersen
@ 2003-07-09 17:46       ` Marcelo Tosatti
  2003-07-09 17:55         ` Marc-Christian Petersen
  0 siblings, 1 reply; 27+ messages in thread
From: Marcelo Tosatti @ 2003-07-09 17:46 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: Andreas Dilger, Andrea Arcangeli, Christoph Hellwig, marcelo,
	Trond Myklebust, lkml



On Wed, 9 Jul 2003, Marc-Christian Petersen wrote:

> On Wednesday 09 July 2003 19:24, Marcelo Tosatti wrote:
>
> Hi,
>
> > > > I just got a nice XFS oops due to the direct_IO API change in
> > > > 2.4.  Guys, this is a STABLE series and APIs are supposed to be exactly
> > > > that, _STABLE_.  If you really think O_DIRECT on NFS is soo important
> > > > please add a ->direct_IO2 for NFS like the reiserfs read_inode2 hack.
> I wonder why -aa and -wolk don't have these problems with O_DIRECT vs. XFS.

Do they have the NFS DIRECT IO patch?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 17:46       ` Marcelo Tosatti
@ 2003-07-09 17:55         ` Marc-Christian Petersen
  2003-07-09 18:08           ` Marcelo Tosatti
  0 siblings, 1 reply; 27+ messages in thread
From: Marc-Christian Petersen @ 2003-07-09 17:55 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Andreas Dilger, Andrea Arcangeli, Christoph Hellwig, marcelo,
	Trond Myklebust, lkml

On Wednesday 09 July 2003 19:46, Marcelo Tosatti wrote:

Hi Marcelo,

> > > > > I just got a nice XFS oops due to the direct_IO API change in
> > > > > 2.4.  Guys, this is a STABLE series and APIs are supposed to be
> > > > > exactly that, _STABLE_.  If you really think O_DIRECT on NFS is soo
> > > > > important please add a ->direct_IO2 for NFS like the reiserfs
> > > > > read_inode2 hack.
> >
> > I wonder why -aa and -wolk don't have these problems with O_DIRECT vs.
> > XFS.
> Do they have the NFS DIRECT IO patch?
Yes. If not, my sentence would be superflous ;)

ciao, Marc


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 17:55         ` Marc-Christian Petersen
@ 2003-07-09 18:08           ` Marcelo Tosatti
  2003-07-09 18:22             ` Marc-Christian Petersen
  2003-07-09 18:33             ` Trond Myklebust
  0 siblings, 2 replies; 27+ messages in thread
From: Marcelo Tosatti @ 2003-07-09 18:08 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: Andreas Dilger, Andrea Arcangeli, Christoph Hellwig, marcelo,
	Trond Myklebust, lkml



On Wed, 9 Jul 2003, Marc-Christian Petersen wrote:

> On Wednesday 09 July 2003 19:46, Marcelo Tosatti wrote:
>
> Hi Marcelo,
>
> > > > > > I just got a nice XFS oops due to the direct_IO API change in
> > > > > > 2.4.  Guys, this is a STABLE series and APIs are supposed to be
> > > > > > exactly that, _STABLE_.  If you really think O_DIRECT on NFS is soo
> > > > > > important please add a ->direct_IO2 for NFS like the reiserfs
> > > > > > read_inode2 hack.
> > >
> > > I wonder why -aa and -wolk don't have these problems with O_DIRECT vs.
> > > XFS.
> > Do they have the NFS DIRECT IO patch?
> Yes. If not, my sentence would be superflous ;)

Ok, right. Well, I dont know why it doesnt happen there. Maybe not enough
testing?

Anyway, I'm going to revert the NFS DIRECT IO patch because, as Christoph
mentioned, breaks the API.

I except another solution from Trond (maybe ->direct_IO2).


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 18:08           ` Marcelo Tosatti
@ 2003-07-09 18:22             ` Marc-Christian Petersen
  2003-07-09 19:13               ` Marcelo Tosatti
  2003-07-09 18:33             ` Trond Myklebust
  1 sibling, 1 reply; 27+ messages in thread
From: Marc-Christian Petersen @ 2003-07-09 18:22 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Andreas Dilger, Andrea Arcangeli, Christoph Hellwig, marcelo,
	Trond Myklebust, lkml

On Wednesday 09 July 2003 20:08, Marcelo Tosatti wrote:

Hi Marcelo,

> Ok, right. Well, I dont know why it doesnt happen there. Maybe not enough
> testing?
aehm, I'll bet -aa is tested _alot_ and also -wolk has tons of testers, but 
well, this is a different talk ;)

> Anyway, I'm going to revert the NFS DIRECT IO patch because, as Christoph
> mentioned, breaks the API.
> I except another solution from Trond (maybe ->direct_IO2).
I wonder why you merge such stuff then in the first place, if you agree with 
hch's opinion, that an API should not change in stable kernel series in the 
2nd place? ... Did you temporarly forget it? ;)

ciao, Marc


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 17:24   ` Marcelo Tosatti
  2003-07-09 17:43     ` Marc-Christian Petersen
@ 2003-07-09 18:29     ` Trond Myklebust
  2003-07-09 18:51       ` Andreas Dilger
  1 sibling, 1 reply; 27+ messages in thread
From: Trond Myklebust @ 2003-07-09 18:29 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andreas Dilger, Christoph Hellwig, marcelo, lkml

>>>>> " " == Marcelo Tosatti <marcelo@conectiva.com.br> writes:

     > Fine, I agree. Trond, I'll have to revert your direct IO
     > patches because they break the _stable_ API, indeed.

     > Please come up with another solution for the problem
     > (->direct_IO2 ? its ugly, but...).

Like this? Blech... 

How about instead following Alan's suggestion to replace
KERNEL_HAS_O_DIRECT with a KERNEL_HAS_O_DIRECT2 that can be used to
switch between the old and new O_DIRECT format in the XFS patches?

Cheers,
  Trond

diff -u --recursive --new-file linux-2.4.22-nodirect/Documentation/Configure.help linux-2.4.22-blech/Documentation/Configure.help
--- linux-2.4.22-nodirect/Documentation/Configure.help	2003-07-09 19:39:39.000000000 +0200
+++ linux-2.4.22-blech/Documentation/Configure.help	2003-07-09 20:07:36.000000000 +0200
@@ -15938,6 +15938,30 @@
 
   If unsure, say N.
 
+Allow direct I/O on files in NFS
+CONFIG_NFS_DIRECTIO
+  There are important applications whose performance or correctness
+  depends on uncached access to file data.  Database clusters (multiple
+  copies of the same instance running on separate hosts) implement their
+  own cache coherency protocol that subsumes the NFS cache protocols.
+  Applications that process datasets considerably larger than the client's
+  memory do not always benefit from a local cache.  A streaming video
+  server, for instance, has no need to cache the contents of a file.
+
+  This option enables applications to perform direct I/O on files in NFS
+  file systems using the O_DIRECT open() flag.  When O_DIRECT is set for
+  files, their data is not cached in the system's page cache.  Direct
+  read and write operations are aligned to block boundaries.  Data is
+  moved to and from user-level application buffers directly.
+
+  Unless your program is designed to use O_DIRECT properly, you are much
+  better off allowing the NFS client to manage caching for you.  Misusing
+  O_DIRECT can cause poor server performance or network storms.  This
+  kernel build option defaults OFF to avoid exposing system administrators
+  unwittingly to a potentially hazardous feature.
+
+  If unsure, say N.
+
 Root file system on NFS
 CONFIG_ROOT_NFS
   If you want your Linux box to mount its whole root file system (the
diff -u --recursive --new-file linux-2.4.22-nodirect/fs/Config.in linux-2.4.22-blech/fs/Config.in
--- linux-2.4.22-nodirect/fs/Config.in	2003-07-09 19:39:39.000000000 +0200
+++ linux-2.4.22-blech/fs/Config.in	2003-07-09 20:08:20.000000000 +0200
@@ -106,6 +106,7 @@
    dep_tristate 'InterMezzo file system support (replicating fs) (EXPERIMENTAL)' CONFIG_INTERMEZZO_FS $CONFIG_INET $CONFIG_EXPERIMENTAL
    dep_tristate 'NFS file system support' CONFIG_NFS_FS $CONFIG_INET
    dep_mbool '  Provide NFSv3 client support' CONFIG_NFS_V3 $CONFIG_NFS_FS
+   dep_mbool '  Allow direct I/O on NFS files (EXPERIMENTAL)' CONFIG_NFS_DIRECTIO $CONFIG_NFS_FS $CONFIG_EXPERIMENTAL
    dep_bool '  Root file system on NFS' CONFIG_ROOT_NFS $CONFIG_NFS_FS $CONFIG_IP_PNP
 
    dep_tristate 'NFS server support' CONFIG_NFSD $CONFIG_INET
diff -u --recursive --new-file linux-2.4.22-nodirect/fs/nfs/direct.c linux-2.4.22-blech/fs/nfs/direct.c
--- linux-2.4.22-nodirect/fs/nfs/direct.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.4.22-blech/fs/nfs/direct.c	2003-07-08 11:52:37.000000000 +0200
@@ -0,0 +1,382 @@
+/*
+ * linux/fs/nfs/direct.c
+ *
+ * High-performance direct I/O for the NFS client
+ *
+ * When an application requests uncached I/O, all read and write requests
+ * are made directly to the server; data stored or fetched via these
+ * requests is not cached in the Linux page cache.  The client does not
+ * correct unaligned requests from applications.  All requested bytes are
+ * held on permanent storage before a direct write system call returns to
+ * an application.  Applications that manage their own data caching, such
+ * as databases, make very good use of direct I/O on local file systems.
+ *
+ * Solaris implements an uncached I/O facility called directio() that
+ * is used for backups and sequential I/O to very large files.  Solaris
+ * also supports uncaching whole NFS partitions with "-o forcedirectio,"
+ * an undocumented mount option.
+ *
+ * Note that I/O to read in executables (e.g. kernel_read) cannot use
+ * direct (kiobuf) reads because there is no vma backing the passed-in
+ * data buffer.
+ *
+ * Designed by Jeff Kimmel, Chuck Lever, and Trond Myklebust.
+ *
+ * Initial implementation:	12/2001 by Chuck Lever <cel@netapp.com>
+ *
+ * TODO:
+ *
+ * 1.  Use concurrent asynchronous network requests rather than
+ *     serialized synchronous network requests for normal (non-sync)
+ *     direct I/O.
+ */
+
+#include <linux/config.h>
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/file.h>
+#include <linux/errno.h>
+#include <linux/nfs_fs.h>
+#include <linux/smp_lock.h>
+#include <linux/sunrpc/clnt.h>
+#include <linux/iobuf.h>
+
+#include <asm/system.h>
+#include <asm/uaccess.h>
+
+#define NFSDBG_FACILITY		(NFSDBG_PAGECACHE | NFSDBG_VFS)
+#define VERF_SIZE		(2 * sizeof(__u32))
+
+static inline int
+nfs_direct_read_rpc(struct file *file, struct nfs_readargs *arg)
+{
+	int result;
+	struct inode * inode = file->f_dentry->d_inode;
+	struct nfs_fattr fattr;
+        struct rpc_message msg;
+        struct nfs_readres res = { &fattr, arg->count, 0 };
+
+#ifdef CONFIG_NFS_V3
+	msg.rpc_proc = (NFS_PROTO(inode)->version == 3) ?
+						NFS3PROC_READ : NFSPROC_READ;
+#else
+	msg.rpc_proc = NFSPROC_READ;
+#endif
+	msg.rpc_argp = arg;
+        msg.rpc_resp = &res;
+
+	lock_kernel();
+        msg.rpc_cred = nfs_file_cred(file);
+        fattr.valid = 0;
+        result = rpc_call_sync(NFS_CLIENT(inode), &msg, 0);
+	nfs_refresh_inode(inode, &fattr);
+	unlock_kernel();
+
+	return result;
+}
+
+static inline int
+nfs_direct_write_rpc(struct file *file, struct nfs_writeargs *arg,
+	struct nfs_writeverf *verf)
+{
+	int result;
+	struct inode *inode = file->f_dentry->d_inode;
+	struct nfs_fattr fattr;
+        struct rpc_message msg;
+        struct nfs_writeres res = { &fattr, verf, 0 };
+
+#ifdef CONFIG_NFS_V3
+	msg.rpc_proc = (NFS_PROTO(inode)->version == 3) ?
+						NFS3PROC_WRITE : NFSPROC_WRITE;
+#else
+	msg.rpc_proc = NFSPROC_WRITE;
+#endif
+	msg.rpc_argp = arg;
+        msg.rpc_resp = &res;
+
+	lock_kernel();
+	msg.rpc_cred = get_rpccred(nfs_file_cred(file));
+	fattr.valid = 0;
+        result = rpc_call_sync(NFS_CLIENT(inode), &msg, 0);
+	nfs_write_attributes(inode, &fattr);
+	put_rpccred(msg.rpc_cred);
+	unlock_kernel();
+
+#ifdef CONFIG_NFS_V3
+	if (NFS_PROTO(inode)->version == 3) {
+		if (result > 0) {
+			if ((arg->stable == NFS_FILE_SYNC) &&
+			    (verf->committed != NFS_FILE_SYNC)) {
+				printk(KERN_ERR
+				"%s: server didn't sync stable write request\n",
+				__FUNCTION__);
+				return -EIO;
+			}
+
+			if (result != arg->count) {
+				printk(KERN_INFO
+					"%s: short write, count=%u, result=%d\n",
+					__FUNCTION__, arg->count, result);
+			}
+		}
+		return result;
+	} else {
+#endif
+        	verf->committed = NFS_FILE_SYNC; /* NFSv2 always syncs data */
+		if (result == 0)
+			return arg->count;
+		return result;
+#ifdef CONFIG_NFS_V3
+	}
+#endif
+}
+
+#ifdef CONFIG_NFS_V3
+static inline int
+nfs_direct_commit_rpc(struct inode *inode, loff_t offset, size_t count,
+	struct nfs_writeverf *verf)
+{
+	int result;
+	struct nfs_fattr fattr;
+	struct nfs_writeargs	arg = { NFS_FH(inode), offset, count, 0, 0,
+					NULL };
+	struct nfs_writeres	res = { &fattr, verf, 0 };
+	struct rpc_message	msg = { NFS3PROC_COMMIT, &arg, &res, NULL };
+
+	fattr.valid = 0;
+
+	lock_kernel();
+	result = rpc_call_sync(NFS_CLIENT(inode), &msg, 0);
+	nfs_write_attributes(inode, &fattr);
+	unlock_kernel();
+
+	return result;
+}
+#else
+static inline int
+nfs_direct_commit_rpc(struct inode *inode, loff_t offset, size_t count,
+	struct nfs_writeverf *verf)
+{
+	return 0;
+}
+#endif
+
+/*
+ * Walk through the iobuf and create an iovec for each "rsize" bytes.
+ */
+static int
+nfs_direct_read(struct file *file, struct kiobuf *iobuf, loff_t offset,
+	size_t count)
+{
+	int curpage, total;
+	int result = 0;
+	struct inode *inode = file->f_dentry->d_inode;
+	int rsize = NFS_SERVER(inode)->rsize;
+	struct page *pages[NFS_READ_MAXIOV];
+	struct nfs_readargs args = { NFS_FH(inode), offset, 0, iobuf->offset,
+				     pages };
+
+	total = 0;
+	curpage = 0;
+        while (count) {
+		int len, request;
+		struct page **dest = pages;
+
+                request = count;
+                if (count > rsize)
+                        request = rsize;
+		args.count = request;
+		args.offset = offset;
+		args.pgbase = (iobuf->offset + total) & ~PAGE_MASK;
+		len = PAGE_SIZE - args.pgbase;
+
+		do {
+			struct page *page = iobuf->maplist[curpage];
+
+			if (curpage >= iobuf->nr_pages || !page) {
+				result = -EFAULT;
+				goto out_err;
+			}
+
+			*dest++ = page;
+			/* zero after the first iov */
+			if (request < len)
+				break;
+			request -= len;
+			len = PAGE_SIZE;
+			curpage++;
+		} while (request != 0);
+
+                result = nfs_direct_read_rpc(file, &args);
+
+                if (result < 0)
+                        break;
+
+                total += result;
+                if (result < args.count)   /* NFSv2ism */
+                        break;
+                count -= result;
+                offset += result;
+        };
+out_err:
+	if (!total)
+		return result;
+	return total;
+}
+
+/*
+ * Walk through the iobuf and create an iovec for each "wsize" bytes.
+ * If only one network write is necessary, or if the O_SYNC flag or
+ * 'sync' mount option are present, or if this is a V2 inode, use
+ * FILE_SYNC.  Otherwise, use UNSTABLE and finish with a COMMIT.
+ *
+ * The mechanics of this function are much the same as nfs_direct_read,
+ * with the added complexity of committing unstable writes.
+ */
+static int
+nfs_direct_write(struct file *file, struct kiobuf *iobuf,
+	loff_t offset, size_t count)
+{
+	int curpage, total;
+	int need_commit = 0;
+	int result = 0;
+	loff_t save_offset = offset;
+	struct inode *inode = file->f_dentry->d_inode;
+	int wsize = NFS_SERVER(inode)->wsize;
+	struct nfs_writeverf first_verf, ret_verf;
+	struct page *pages[NFS_WRITE_MAXIOV];
+        struct nfs_writeargs args = { NFS_FH(inode), 0, 0, NFS_FILE_SYNC, 0,
+				pages };
+
+#ifdef CONFIG_NFS_V3
+	if ((NFS_PROTO(inode)->version == 3) && (count > wsize) &&
+							(!IS_SYNC(inode)))
+		args.stable = NFS_UNSTABLE;
+#endif
+
+retry:
+	total = 0;
+	curpage = 0;
+        while (count) {
+		int len, request;
+		struct page **dest = pages;
+
+                request = count;
+                if (count > wsize)
+                        request = wsize;
+		args.count = request;
+		args.offset = offset;
+		args.pgbase = (iobuf->offset + total) & ~PAGE_MASK;
+		len = PAGE_SIZE - args.pgbase;
+
+		do {
+			struct page *page = iobuf->maplist[curpage];
+
+			if (curpage >= iobuf->nr_pages || !page) {
+				result = -EFAULT;
+				goto out_err;
+			}
+
+			*dest++ = page;
+			/* zero after the first iov */
+			if (request < len)
+				break;
+			request -= len;
+			len = PAGE_SIZE;
+			curpage++;
+		} while (request != 0);
+
+                result = nfs_direct_write_rpc(file, &args, &ret_verf);
+
+                if (result < 0)
+                        break;
+
+		if (!total)
+			memcpy(&first_verf.verifier, &ret_verf.verifier,
+								VERF_SIZE);
+		if (ret_verf.committed != NFS_FILE_SYNC) {
+			need_commit = 1;
+			if (memcmp(&first_verf.verifier, &ret_verf.verifier,
+								VERF_SIZE))
+				goto print_retry;
+		}
+
+                total += result;
+                count -= result;
+                offset += result;
+        };
+
+out_err:
+	/*
+	 * Commit data written so far, even in the event of an error
+	 */
+	if (need_commit) {
+		if (nfs_direct_commit_rpc(inode, save_offset,
+					iobuf->length - count, &ret_verf))
+			goto print_retry;
+		if (memcmp(&first_verf.verifier, &ret_verf.verifier,
+								VERF_SIZE))
+			goto print_retry;
+	}
+
+	if (!total)
+		return result;
+	return total;
+
+print_retry:
+	printk(KERN_INFO "%s: detected server restart; retrying with FILE_SYNC\n",
+			__FUNCTION__);
+	args.stable = NFS_FILE_SYNC;
+	offset = save_offset;
+	count = iobuf->length;
+	goto retry;
+}
+
+/*
+ * Read or write data, moving the data directly to/from the
+ * application's buffer without caching in the page cache.
+ *
+ * Rules for direct I/O
+ *
+ * 1.  block size = 512 bytes or more
+ * 2.  file byte offset is block aligned
+ * 3.  byte count is a multiple of block size
+ * 4.  user buffer is not aligned
+ * 5.  user buffer is faulted in and pinned
+ *
+ * These are verified before we get here.
+ */
+int
+nfs_direct_IO(int rw, struct file *file, struct kiobuf *iobuf,
+	unsigned long blocknr, int blocksize)
+{
+	int result = -EINVAL;
+	size_t count = iobuf->length;
+	struct dentry *dentry = file->f_dentry;
+	struct inode *inode = dentry->d_inode;
+	loff_t offset = blocknr << inode->i_blkbits;
+
+	switch (rw) {
+	case READ:
+		dfprintk(VFS,
+			"NFS: direct_IO(READ) (%s/%s) off/cnt(%Lu/%d)\n",
+				dentry->d_parent->d_name.name,
+					dentry->d_name.name, offset, count);
+
+		result = nfs_direct_read(file, iobuf, offset, count);
+		break;
+	case WRITE:
+		dfprintk(VFS,
+			"NFS: direct_IO(WRITE) (%s/%s) off/cnt(%Lu/%d)\n",
+				dentry->d_parent->d_name.name,
+					dentry->d_name.name, offset, count);
+
+		result = nfs_direct_write(file, iobuf, offset, count);
+		break;
+	default:
+		break;
+	}
+
+	dfprintk(VFS, "NFS: direct_IO result = %d\n", result);
+	return result;
+}
diff -u --recursive --new-file linux-2.4.22-nodirect/fs/nfs/file.c linux-2.4.22-blech/fs/nfs/file.c
--- linux-2.4.22-nodirect/fs/nfs/file.c	2003-07-09 19:39:39.000000000 +0200
+++ linux-2.4.22-blech/fs/nfs/file.c	2003-07-09 20:10:21.000000000 +0200
@@ -16,6 +16,7 @@
  *  nfs regular file handling functions
  */
 
+#include <linux/config.h>
 #include <linux/sched.h>
 #include <linux/kernel.h>
 #include <linux/errno.h>
@@ -199,6 +200,9 @@
 	readpage: nfs_readpage,
 	sync_page: nfs_sync_page,
 	writepage: nfs_writepage,
+#ifdef CONFIG_NFS_DIRECTIO
+	direct_fileIO: nfs_direct_IO,
+#endif
 	prepare_write: nfs_prepare_write,
 	commit_write: nfs_commit_write
 };
diff -u --recursive --new-file linux-2.4.22-nodirect/fs/nfs/Makefile linux-2.4.22-blech/fs/nfs/Makefile
--- linux-2.4.22-nodirect/fs/nfs/Makefile	2003-07-09 19:39:39.000000000 +0200
+++ linux-2.4.22-blech/fs/nfs/Makefile	2003-07-09 20:08:47.000000000 +0200
@@ -14,6 +14,7 @@
 
 obj-$(CONFIG_ROOT_NFS) += nfsroot.o mount_clnt.o      
 obj-$(CONFIG_NFS_V3) += nfs3proc.o nfs3xdr.o
+obj-$(CONFIG_NFS_DIRECTIO) += direct.o
 
 obj-m   := $(O_TARGET)
 
diff -u --recursive --new-file linux-2.4.22-nodirect/fs/nfs/write.c linux-2.4.22-blech/fs/nfs/write.c
--- linux-2.4.22-nodirect/fs/nfs/write.c	2003-07-09 19:39:39.000000000 +0200
+++ linux-2.4.22-blech/fs/nfs/write.c	2003-07-09 20:11:04.000000000 +0200
@@ -123,23 +123,6 @@
 }
 
 /*
- * This function will be used to simulate weak cache consistency
- * under NFSv2 when the NFSv3 attribute patch is included.
- * For the moment, we just call nfs_refresh_inode().
- */
-static __inline__ int
-nfs_write_attributes(struct inode *inode, struct nfs_fattr *fattr)
-{
-	if ((fattr->valid & NFS_ATTR_FATTR) && !(fattr->valid & NFS_ATTR_WCC)) {
-		fattr->pre_size  = NFS_CACHE_ISIZE(inode);
-		fattr->pre_mtime = NFS_CACHE_MTIME(inode);
-		fattr->pre_ctime = NFS_CACHE_CTIME(inode);
-		fattr->valid |= NFS_ATTR_WCC;
-	}
-	return nfs_refresh_inode(inode, fattr);
-}
-
-/*
  * Write a page synchronously.
  * Offset is the data offset within the page.
  */
diff -u --recursive --new-file linux-2.4.22-nodirect/include/linux/fs.h linux-2.4.22-blech/include/linux/fs.h
--- linux-2.4.22-nodirect/include/linux/fs.h	2003-07-09 19:39:39.000000000 +0200
+++ linux-2.4.22-blech/include/linux/fs.h	2003-07-09 19:42:37.000000000 +0200
@@ -396,6 +396,8 @@
 	int (*releasepage) (struct page *, int);
 #define KERNEL_HAS_O_DIRECT /* this is for modules out of the kernel */
 	int (*direct_IO)(int, struct inode *, struct kiobuf *, unsigned long, int);
+#define KERNEL_HAS_O_DIRECT2 /* Unfortunate kludge due to lack of foresight */
+	int (*direct_fileIO)(int, struct file *, struct kiobuf *, unsigned long, int);
 	void (*removepage)(struct page *); /* called when page gets removed from the inode */
 };
 
diff -u --recursive --new-file linux-2.4.22-nodirect/include/linux/nfs_fs.h linux-2.4.22-blech/include/linux/nfs_fs.h
--- linux-2.4.22-nodirect/include/linux/nfs_fs.h	2003-07-09 19:39:39.000000000 +0200
+++ linux-2.4.22-blech/include/linux/nfs_fs.h	2003-07-09 20:12:27.000000000 +0200
@@ -274,6 +274,11 @@
 #define NFS_TestClearPageSync(page)	test_and_clear_bit(PG_fs_1, &(page)->flags)
 
 /*
+ * linux/fs/nfs/direct.c
+ */
+extern int  nfs_direct_IO(int, struct file *, struct kiobuf *, unsigned long, int);
+
+/*
  * linux/fs/mount_clnt.c
  * (Used only by nfsroot module)
  */
@@ -302,6 +307,23 @@
 	return __nfs_refresh_inode(inode,fattr);
 }
 
+/*
+ * This function will be used to simulate weak cache consistency
+ * under NFSv2 when the NFSv3 attribute patch is included.
+ * For the moment, we just call nfs_refresh_inode().
+ */
+static __inline__ int
+nfs_write_attributes(struct inode *inode, struct nfs_fattr *fattr)
+{
+	if ((fattr->valid & NFS_ATTR_FATTR) && !(fattr->valid & NFS_ATTR_WCC)) {
+		fattr->pre_size  = NFS_CACHE_ISIZE(inode);
+		fattr->pre_mtime = NFS_CACHE_MTIME(inode);
+		fattr->pre_ctime = NFS_CACHE_CTIME(inode);
+		fattr->valid |= NFS_ATTR_WCC;
+	}
+	return nfs_refresh_inode(inode, fattr);
+}
+
 static inline loff_t
 nfs_size_to_loff_t(__u64 size)
 {
diff -u --recursive --new-file linux-2.4.22-nodirect/include/linux/nfs_xdr.h linux-2.4.22-blech/include/linux/nfs_xdr.h
--- linux-2.4.22-nodirect/include/linux/nfs_xdr.h	2003-07-09 19:39:39.000000000 +0200
+++ linux-2.4.22-blech/include/linux/nfs_xdr.h	2003-07-09 20:13:22.000000000 +0200
@@ -59,7 +59,7 @@
 /* Arguments to the read call.
  * Note that NFS_READ_MAXIOV must be <= (MAX_IOVEC-2) from sunrpc/xprt.h
  */
-#define NFS_READ_MAXIOV 8
+#define NFS_READ_MAXIOV		(9)
 
 struct nfs_readargs {
 	struct nfs_fh *		fh;
@@ -78,7 +78,7 @@
 /* Arguments to the write call.
  * Note that NFS_WRITE_MAXIOV must be <= (MAX_IOVEC-2) from sunrpc/xprt.h
  */
-#define NFS_WRITE_MAXIOV        8
+#define NFS_WRITE_MAXIOV        (9)
 struct nfs_writeargs {
 	struct nfs_fh *		fh;
 	__u64			offset;
diff -u --recursive --new-file linux-2.4.22-nodirect/mm/filemap.c linux-2.4.22-blech/mm/filemap.c
--- linux-2.4.22-nodirect/mm/filemap.c	2003-07-09 19:39:39.000000000 +0200
+++ linux-2.4.22-blech/mm/filemap.c	2003-07-09 19:57:09.000000000 +0200
@@ -1651,6 +1651,95 @@
 	return retval;
 }
 
+/* Differs from the above in that it uses the a_op->direct_fileIO() call */
+static ssize_t direct_fileIO(int rw, struct file * filp, char * buf, size_t count, loff_t offset)
+{
+	ssize_t retval;
+	int new_iobuf, chunk_size, blocksize_mask, blocksize, blocksize_bits, iosize, progress;
+	struct kiobuf * iobuf;
+	struct address_space * mapping = filp->f_dentry->d_inode->i_mapping;
+	struct inode * inode = mapping->host;
+	loff_t size = inode->i_size;
+
+	new_iobuf = 0;
+	iobuf = filp->f_iobuf;
+	if (test_and_set_bit(0, &filp->f_iobuf_lock)) {
+		/*
+		 * A parallel read/write is using the preallocated iobuf
+		 * so just run slow and allocate a new one.
+		 */
+		retval = alloc_kiovec(1, &iobuf);
+		if (retval)
+			goto out;
+		new_iobuf = 1;
+	}
+
+	blocksize = 1 << inode->i_blkbits;
+	blocksize_bits = inode->i_blkbits;
+	blocksize_mask = blocksize - 1;
+	chunk_size = KIO_MAX_ATOMIC_IO << 10;
+
+	retval = -EINVAL;
+	if ((offset & blocksize_mask) || (count & blocksize_mask) || ((unsigned long) buf & blocksize_mask))
+		goto out_free;
+	if (!mapping->a_ops->direct_fileIO)
+		goto out_free;
+
+	if ((rw == READ) && (offset + count > size))
+		count = size - offset;
+
+	/*
+	 * Flush to disk exclusively the _data_, metadata must remain
+	 * completly asynchronous or performance will go to /dev/null.
+	 */
+	retval = filemap_fdatasync(mapping);
+	if (retval == 0)
+		retval = fsync_inode_data_buffers(inode);
+	if (retval == 0)
+		retval = filemap_fdatawait(mapping);
+	if (retval < 0)
+		goto out_free;
+
+	progress = retval = 0;
+	while (count > 0) {
+		iosize = count;
+		if (iosize > chunk_size)
+			iosize = chunk_size;
+
+		retval = map_user_kiobuf(rw, iobuf, (unsigned long) buf, iosize);
+		if (retval)
+			break;
+
+		retval = mapping->a_ops->direct_fileIO(rw, filp, iobuf, (offset+progress) >> blocksize_bits, blocksize);
+
+		if (rw == READ && retval > 0)
+			mark_dirty_kiobuf(iobuf, retval);
+		
+		if (retval >= 0) {
+			count -= retval;
+			buf += retval;
+			/* warning: weird semantics here, we're reporting a read behind the end of the file */
+			progress += retval;
+		}
+
+		unmap_kiobuf(iobuf);
+
+		if (retval != iosize)
+			break;
+	}
+
+	if (progress)
+		retval = progress;
+
+ out_free:
+	if (!new_iobuf)
+		clear_bit(0, &filp->f_iobuf_lock);
+	else
+		free_kiovec(1, &iobuf);
+ out:	
+	return retval;
+}
+
 int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size)
 {
 	char *kaddr;
@@ -1721,7 +1810,10 @@
 		down(&inode->i_sem);
 		size = inode->i_size;
 		if (pos < size) {
-			retval = generic_file_direct_IO(READ, filp, buf, count, pos);
+			if (mapping->a_ops->direct_fileIO)
+				retval = direct_fileIO(READ, filp, buf, count, pos);
+			else
+				retval = generic_file_direct_IO(READ, filp, buf, count, pos);
 			if (retval > 0)
 				*ppos = pos + retval;
 		}
@@ -3230,7 +3322,10 @@
 	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
 	mark_inode_dirty_sync(inode);
 
-	written = generic_file_direct_IO(WRITE, file, (char *) buf, count, pos);
+	if (mapping->a_ops->direct_fileIO)
+		written = direct_fileIO(WRITE, file, (char *) buf, count, pos);
+	else
+		written = generic_file_direct_IO(WRITE, file, (char *) buf, count, pos);
 	if (written > 0) {
 		loff_t end = pos + written;
 		if (end > inode->i_size && !S_ISBLK(inode->i_mode)) {

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 18:08           ` Marcelo Tosatti
  2003-07-09 18:22             ` Marc-Christian Petersen
@ 2003-07-09 18:33             ` Trond Myklebust
  2003-07-09 18:41               ` Marc-Christian Petersen
  1 sibling, 1 reply; 27+ messages in thread
From: Trond Myklebust @ 2003-07-09 18:33 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Marc-Christian Petersen, lkml

>>>>> " " == Marcelo Tosatti <marcelo@conectiva.com.br> writes:

     > Ok, right. Well, I dont know why it doesnt happen there. Maybe
     > not enough testing?

Probably nobody is applying the XFS patches on those kernel?

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 18:33             ` Trond Myklebust
@ 2003-07-09 18:41               ` Marc-Christian Petersen
  2003-07-09 18:50                 ` Trond Myklebust
  0 siblings, 1 reply; 27+ messages in thread
From: Marc-Christian Petersen @ 2003-07-09 18:41 UTC (permalink / raw)
  To: trond.myklebust, Marcelo Tosatti; +Cc: lkml

On Wednesday 09 July 2003 20:33, Trond Myklebust wrote:

Hi Trond,

> >>>>> " " == Marcelo Tosatti <marcelo@conectiva.com.br> writes:
>      > Ok, right. Well, I dont know why it doesnt happen there. Maybe
>      > not enough testing?
> Probably nobody is applying the XFS patches on those kernel?
err, -aa has XFS per default, -wolk has XFS per default. So ... ;)

ciao, Marc


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 18:41               ` Marc-Christian Petersen
@ 2003-07-09 18:50                 ` Trond Myklebust
  2003-07-09 18:55                   ` Marc-Christian Petersen
                                     ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Trond Myklebust @ 2003-07-09 18:50 UTC (permalink / raw)
  To: Marc-Christian Petersen; +Cc: Marcelo Tosatti, lkml

>>>>> " " == Marc-Christian Petersen <m.c.p@wolk-project.de> writes:

     > err, -aa has XFS per default, -wolk has XFS per default. So
     > ... ;)

So they have both XFS + NFS O_DIRECT?

The answer to your question is then that somebody made the trivial
conversion on XFS... It's just a question of replacing the second
argument of the direct_IO() method with a filp, then extracting the
inode from that. A 2-liner patch at most...

The point here is that Marcelo's tree does not include XFS, so my
patch can't fix it up...
As I said, I suggest replacing KERNEL_HAS_O_DIRECT with
KERNEL_HAS_O_DIRECT2 so that the XFS patches can switch on that, and
hence provide the 2-liner on newer kernels...

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 18:29     ` Trond Myklebust
@ 2003-07-09 18:51       ` Andreas Dilger
  2003-07-09 19:18         ` Jeff Garzik
  0 siblings, 1 reply; 27+ messages in thread
From: Andreas Dilger @ 2003-07-09 18:51 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Marcelo Tosatti, Christoph Hellwig, marcelo, lkml

On Jul 09, 2003  20:29 +0200, Trond Myklebust wrote:
> How about instead following Alan's suggestion to replace
> KERNEL_HAS_O_DIRECT with a KERNEL_HAS_O_DIRECT2 that can be used to
> switch between the old and new O_DIRECT format in the XFS patches?

Actually, I like that a lot more, as it allows out-of-tree code to
know which interface to use, and we don't have to key on kernel version
(which is bogus if compiling against a vendor kernel that back-ports
this fix).

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 18:50                 ` Trond Myklebust
@ 2003-07-09 18:55                   ` Marc-Christian Petersen
  2003-07-09 19:05                   ` Jeff Garzik
  2003-07-09 23:40                   ` Alan Cox
  2 siblings, 0 replies; 27+ messages in thread
From: Marc-Christian Petersen @ 2003-07-09 18:55 UTC (permalink / raw)
  To: trond.myklebust; +Cc: Marcelo Tosatti, lkml

On Wednesday 09 July 2003 20:50, Trond Myklebust wrote:

Hi Trond,

> So they have both XFS + NFS O_DIRECT?
yes.

> The answer to your question is then that somebody made the trivial
> conversion on XFS... It's just a question of replacing the second
> argument of the direct_IO() method with a filp, then extracting the
> inode from that. A 2-liner patch at most...
EXACT! That was my intention with my small first post ;)

> The point here is that Marcelo's tree does not include XFS, so my
> patch can't fix it up...
> As I said, I suggest replacing KERNEL_HAS_O_DIRECT with
> KERNEL_HAS_O_DIRECT2 so that the XFS patches can switch on that, and
> hence provide the 2-liner on newer kernels...
It's very okay with me.

ciao, Marc


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 18:50                 ` Trond Myklebust
  2003-07-09 18:55                   ` Marc-Christian Petersen
@ 2003-07-09 19:05                   ` Jeff Garzik
  2003-07-09 19:08                     ` Trond Myklebust
  2003-07-09 23:40                   ` Alan Cox
  2 siblings, 1 reply; 27+ messages in thread
From: Jeff Garzik @ 2003-07-09 19:05 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Marc-Christian Petersen, Marcelo Tosatti, lkml

On Wed, Jul 09, 2003 at 08:50:59PM +0200, Trond Myklebust wrote:
> >>>>> " " == Marc-Christian Petersen <m.c.p@wolk-project.de> writes:
> 
>      > err, -aa has XFS per default, -wolk has XFS per default. So
>      > ... ;)
> 
> So they have both XFS + NFS O_DIRECT?
> 
> The answer to your question is then that somebody made the trivial
> conversion on XFS... It's just a question of replacing the second
> argument of the direct_IO() method with a filp, then extracting the
> inode from that. A 2-liner patch at most...
> 
> The point here is that Marcelo's tree does not include XFS, so my
> patch can't fix it up...
> As I said, I suggest replacing KERNEL_HAS_O_DIRECT with
> KERNEL_HAS_O_DIRECT2 so that the XFS patches can switch on that, and
> hence provide the 2-liner on newer kernels...

s/replacing/adding/

A new ->direct_IO2 hook would be an addition, so you really want to
simply add another feature flag.

Since the 2.5 direct_IO API is already different from current 2.4, I
would also suggestion considering KERNEL24_HAS_O_DIRECT2 as the name, to
specify the feature is specific to 2.4.

	Jeff




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 19:05                   ` Jeff Garzik
@ 2003-07-09 19:08                     ` Trond Myklebust
  2003-07-09 19:17                       ` Jeff Garzik
  0 siblings, 1 reply; 27+ messages in thread
From: Trond Myklebust @ 2003-07-09 19:08 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Marc-Christian Petersen, Marcelo Tosatti, lkml

>>>>> " " == Jeff Garzik <jgarzik@pobox.com> writes:

     > s/replacing/adding/

Revert ;-)

     > A new ->direct_IO2 hook would be an addition, so you really
     > want to simply add another feature flag.

You missed the point. This is *instead* of ->direct_IO2. It's only
purpose would be to distinguish between the old

 int (*direct_IO)(int, struct inode *, struct kiobuf *, unsigned long, int);


and the new

 int (*direct_IO)(int, struct file *, struct kiobuf *, unsigned long, int);

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 18:22             ` Marc-Christian Petersen
@ 2003-07-09 19:13               ` Marcelo Tosatti
  2003-07-09 19:45                 ` Marc-Christian Petersen
  2003-07-09 23:43                 ` Alan Cox
  0 siblings, 2 replies; 27+ messages in thread
From: Marcelo Tosatti @ 2003-07-09 19:13 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: Andreas Dilger, Andrea Arcangeli, Christoph Hellwig, marcelo,
	Trond Myklebust, lkml



On Wed, 9 Jul 2003, Marc-Christian Petersen wrote:

> On Wednesday 09 July 2003 20:08, Marcelo Tosatti wrote:
>
> Hi Marcelo,
>
> > Ok, right. Well, I dont know why it doesnt happen there. Maybe not enough
> > testing?
> aehm, I'll bet -aa is tested _alot_ and also -wolk has tons of testers, but
> well, this is a different talk ;)
>
> > Anyway, I'm going to revert the NFS DIRECT IO patch because, as Christoph
> > mentioned, breaks the API.
> > I except another solution from Trond (maybe ->direct_IO2).

> I wonder why you merge such stuff then in the first place, if you agree with
> hch's opinion, that an API should not change in stable kernel series in the
> 2nd place? ... Did you temporarly forget it? ;)

No, I did not.

I applied it because, in my ignorance, I did not noticed it would break
the stable API.

I applied it because I wanted comments useful from people (Like hch and
others did).

Thank you very much for your flaming.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 19:08                     ` Trond Myklebust
@ 2003-07-09 19:17                       ` Jeff Garzik
  2003-07-09 19:51                         ` Trond Myklebust
  2003-07-09 23:42                         ` Alan Cox
  0 siblings, 2 replies; 27+ messages in thread
From: Jeff Garzik @ 2003-07-09 19:17 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Marc-Christian Petersen, Marcelo Tosatti, lkml

On Wed, Jul 09, 2003 at 09:08:53PM +0200, Trond Myklebust wrote:
> >>>>> " " == Jeff Garzik <jgarzik@pobox.com> writes:
> 
>      > s/replacing/adding/
> 
> Revert ;-)
> 
>      > A new ->direct_IO2 hook would be an addition, so you really
>      > want to simply add another feature flag.
> 
> You missed the point. This is *instead* of ->direct_IO2. It's only
> purpose would be to distinguish between the old
> 
>  int (*direct_IO)(int, struct inode *, struct kiobuf *, unsigned long, int);
> 
> 
> and the new
> 
>  int (*direct_IO)(int, struct file *, struct kiobuf *, unsigned long, int);

I respectfully disagree, then.

The 2.5 direct_IO API is already different from 2.4, so, changing
the 2.4 stable API implies creating yet another different version of
the API.

When this situation presented itself earlier, with reiserfs, the
solution was ->read_inode2.  I think that same situation applies here.

Having the stable API change, conditional on a define, is really
nasty and IMO will create maintenance and support headaches down
the line.  I do not recall Linux VFS _ever_ having a hook's definition
conditional.  We should not start now...

We need a 2.4-specific solution for this.  ->direct_IO2 should suffice,
and it follows historical example.

XFS, ocfs, and others need only to test the HAVE_NEW_24_DIRECT_IO
define.

In the core kernel implementation, it is trivial to say "if direct_IO2
is non-NULL, use that, otherwise use direct_IO", without needing to make
any code conditional at all.

	Jeff




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 18:51       ` Andreas Dilger
@ 2003-07-09 19:18         ` Jeff Garzik
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff Garzik @ 2003-07-09 19:18 UTC (permalink / raw)
  To: Trond Myklebust, Marcelo Tosatti, Christoph Hellwig, marcelo, lkml

On Wed, Jul 09, 2003 at 11:51:48AM -0700, Andreas Dilger wrote:
> On Jul 09, 2003  20:29 +0200, Trond Myklebust wrote:
> > How about instead following Alan's suggestion to replace
> > KERNEL_HAS_O_DIRECT with a KERNEL_HAS_O_DIRECT2 that can be used to
> > switch between the old and new O_DIRECT format in the XFS patches?
> 
> Actually, I like that a lot more, as it allows out-of-tree code to
> know which interface to use, and we don't have to key on kernel version
> (which is bogus if compiling against a vendor kernel that back-ports
> this fix).

Feature test macros are definitely quite useful...

...but making the stable series VFS API definition _conditional_
is quite unprecedented, and IMO wrong.

	Jeff




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 19:13               ` Marcelo Tosatti
@ 2003-07-09 19:45                 ` Marc-Christian Petersen
  2003-07-09 23:43                 ` Alan Cox
  1 sibling, 0 replies; 27+ messages in thread
From: Marc-Christian Petersen @ 2003-07-09 19:45 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Andreas Dilger, Andrea Arcangeli, Christoph Hellwig, marcelo,
	Trond Myklebust, lkml

On Wednesday 09 July 2003 21:13, Marcelo Tosatti wrote:

Hi Marcelo,

> No, I did not.
> I applied it because, in my ignorance, I did not noticed it would break
> the stable API.
> I applied it because I wanted comments useful from people (Like hch and
> others did).
> Thank you very much for your flaming.
Somewhere deep in my head I knew you'd say that :-( .. When I flame, it looks 
very different.

It was not any kind of flaming or any kind of bad meaning or such, just a 
simple question because I was just curious!

ciao, Marc


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 19:17                       ` Jeff Garzik
@ 2003-07-09 19:51                         ` Trond Myklebust
  2003-07-09 21:43                           ` Jeff Garzik
  2003-07-09 23:42                         ` Alan Cox
  1 sibling, 1 reply; 27+ messages in thread
From: Trond Myklebust @ 2003-07-09 19:51 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Trond Myklebust, Marc-Christian Petersen, Marcelo Tosatti, lkml

>>>>> " " == Jeff Garzik <jgarzik@pobox.com> writes:

     > Having the stable API change, conditional on a define, is
     > really nasty and IMO will create maintenance and support
     > headaches down the line.  I do not recall Linux VFS _ever_
     > having a hook's definition conditional.  We should not start
     > now...

direct_IO() was precisely such a conditional hook definition. It
appeared in 2.4.15, and anybody who does not check for
KERNEL_HAS_O_DIRECT is not backward compatible.

To comment further: There is at least one example I can think of which
was exactly equivalent to the proposed change, namely the redefinition
of the filldir_t type in 2.4.9. It was admittedly not documented using
a define...

Note: We could at the same time replace the name direct_IO() with
direct_IO2() (that has several precedents).  There are currently only
a small number of filesystems that provide O_DIRECT, and converting
them all is (as has been pointed out before) trivial...

The problem with read_inode2() was rather that it overloaded the the
existing iget4() interface...

Cheers,
  Trond



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 19:51                         ` Trond Myklebust
@ 2003-07-09 21:43                           ` Jeff Garzik
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff Garzik @ 2003-07-09 21:43 UTC (permalink / raw)
  To: trond.myklebust; +Cc: Marc-Christian Petersen, Marcelo Tosatti, lkml

Trond Myklebust wrote:
>>>>>>" " == Jeff Garzik <jgarzik@pobox.com> writes:
> 
> 
>      > Having the stable API change, conditional on a define, is
>      > really nasty and IMO will create maintenance and support
>      > headaches down the line.  I do not recall Linux VFS _ever_
>      > having a hook's definition conditional.  We should not start
>      > now...
> 
> direct_IO() was precisely such a conditional hook definition. It
> appeared in 2.4.15, and anybody who does not check for
> KERNEL_HAS_O_DIRECT is not backward compatible.

You misunderstand.  The 2.4.15 direct_IO hook was _not_ conditionally 
defined.  It appeared in the middle of a stable series, yes.  It has a 
feature macro, yes.  But the definition of the hook in 
include/linux/fs.h does not _change_ based on a define.  That is what I 
mean by a conditional hook definition.

It is far less trouble for everyone to add a new hook, instead of 
changing an existing hook, in the middle of a stable series.


> To comment further: There is at least one example I can think of which
> was exactly equivalent to the proposed change, namely the redefinition
> of the filldir_t type in 2.4.9. It was admittedly not documented using
> a define...

No doubt you can find more :)  That doesn't make the right thing to do, 
though :)


> Note: We could at the same time replace the name direct_IO() with
> direct_IO2() (that has several precedents).  There are currently only
> a small number of filesystems that provide O_DIRECT, and converting
> them all is (as has been pointed out before) trivial...

We cannot just-fix-up filesystems which are not in-tree, which is what 
the KERNEL_HAS_O_DIRECT2 define would be mainly used for.  In-tree 
filesystems would just unconditionally use the new, or old, interface as 
they chose.


> The problem with read_inode2() was rather that it overloaded the the
> existing iget4() interface...

The higher-level problem was that we didn't want to change the VFS 
API...  otherwise we could have simply used the new interface, and 
converted all in-tree filesystems.

	Jeff




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 18:50                 ` Trond Myklebust
  2003-07-09 18:55                   ` Marc-Christian Petersen
  2003-07-09 19:05                   ` Jeff Garzik
@ 2003-07-09 23:40                   ` Alan Cox
  2 siblings, 0 replies; 27+ messages in thread
From: Alan Cox @ 2003-07-09 23:40 UTC (permalink / raw)
  To: trond.myklebust; +Cc: Marc-Christian Petersen, Marcelo Tosatti, lkml

On Mer, 2003-07-09 at 19:50, Trond Myklebust wrote:
> >>>>> " " == Marc-Christian Petersen <m.c.p@wolk-project.de> writes:
> 
>      > err, -aa has XFS per default, -wolk has XFS per default. So
>      > ... ;)
> 
> So they have both XFS + NFS O_DIRECT?

-ac has both



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 19:17                       ` Jeff Garzik
  2003-07-09 19:51                         ` Trond Myklebust
@ 2003-07-09 23:42                         ` Alan Cox
  2003-07-10  0:23                           ` Jeff Garzik
  1 sibling, 1 reply; 27+ messages in thread
From: Alan Cox @ 2003-07-09 23:42 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Trond Myklebust, Marc-Christian Petersen, Marcelo Tosatti, lkml

On Mer, 2003-07-09 at 20:17, Jeff Garzik wrote:
> Having the stable API change, conditional on a define, is really
> nasty and IMO will create maintenance and support headaches down
> the line.  I do not recall Linux VFS _ever_ having a hook's definition
> conditional.  We should not start now...

The new quota code in 2.4.22pre already changed the rules slightly, as
do the shmemfs fixes if you are pedantic



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 19:13               ` Marcelo Tosatti
  2003-07-09 19:45                 ` Marc-Christian Petersen
@ 2003-07-09 23:43                 ` Alan Cox
  2003-07-10  0:21                   ` Jeff Garzik
  1 sibling, 1 reply; 27+ messages in thread
From: Alan Cox @ 2003-07-09 23:43 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Marc-Christian Petersen, Andreas Dilger, Andrea Arcangeli,
	Christoph Hellwig, marcelo, Trond Myklebust, lkml

On Mer, 2003-07-09 at 20:13, Marcelo Tosatti wrote:
> I applied it because, in my ignorance, I did not noticed it would break
> the stable API.
> 
> I applied it because I wanted comments useful from people (Like hch and
> others did).

I'm not sure I see what the fuss is about a slight API change that is
safe since it spews warnings/breaks existing code that isnt fixed. At
least one vendor kernel also has the changed API anyway


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 23:43                 ` Alan Cox
@ 2003-07-10  0:21                   ` Jeff Garzik
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff Garzik @ 2003-07-10  0:21 UTC (permalink / raw)
  To: Alan Cox
  Cc: Marcelo Tosatti, Marc-Christian Petersen, Andreas Dilger,
	Andrea Arcangeli, Christoph Hellwig, marcelo, Trond Myklebust,
	lkml

Alan Cox wrote:
> On Mer, 2003-07-09 at 20:13, Marcelo Tosatti wrote:
> 
>>I applied it because, in my ignorance, I did not noticed it would break
>>the stable API.
>>
>>I applied it because I wanted comments useful from people (Like hch and
>>others did).
> 
> 
> I'm not sure I see what the fuss is about a slight API change that is
> safe since it spews warnings/breaks existing code that isnt fixed. At
> least one vendor kernel also has the changed API anyway


"safe" ignores the pain of people trying to support multiple kernels. 
Each API change like the direct_IO one introduces ifdefs.  Changing a 
function prototype is particularly annoying because you can't create a 
backwards-compat wrapper

I disagree with the AC97 codec changes being merged into 2.4, too, for 
the same reason.  Yes I recognize it is required to support new 
hardware.  Yes I realize it vastly simplifies supporting some existing 
hardware.  But I don't think you realize (or don't care?) about the 
maintenance pain created by the change.  If a vendor wishes their driver 
to support 2.4.21 _and_ 2.4.22 (not a lot to ask), they must add a bunch 
of ifdef crud in their OSS driver.

Feature and API additions are _far_ less painful than API changes in the 
middle of a stable series.

Overall, I think we are looking at a question which needs to be answered 
by the community:  what constitutes a stable series?  when do we stop 
changing the API and let it stabilize?  ... and I am writing a mail 
right now to ask that question (as requested by Marcelo and a couple 
others, though I wanted to do it for a while now).

	Jeff




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ->direct_IO API change in current 2.4 BK
  2003-07-09 23:42                         ` Alan Cox
@ 2003-07-10  0:23                           ` Jeff Garzik
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff Garzik @ 2003-07-10  0:23 UTC (permalink / raw)
  To: Alan Cox; +Cc: Trond Myklebust, Marc-Christian Petersen, Marcelo Tosatti, lkml

Alan Cox wrote:
> On Mer, 2003-07-09 at 20:17, Jeff Garzik wrote:
> 
>>Having the stable API change, conditional on a define, is really
>>nasty and IMO will create maintenance and support headaches down
>>the line.  I do not recall Linux VFS _ever_ having a hook's definition
>>conditional.  We should not start now...
> 
> 
> The new quota code in 2.4.22pre already changed the rules slightly, as
> do the shmemfs fixes if you are pedantic

I am ;-)


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2003-07-10  0:12 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-07-09 12:31 ->direct_IO API change in current 2.4 BK Christoph Hellwig
2003-07-09 17:03 ` Andreas Dilger
2003-07-09 17:24   ` Marcelo Tosatti
2003-07-09 17:43     ` Marc-Christian Petersen
2003-07-09 17:46       ` Marcelo Tosatti
2003-07-09 17:55         ` Marc-Christian Petersen
2003-07-09 18:08           ` Marcelo Tosatti
2003-07-09 18:22             ` Marc-Christian Petersen
2003-07-09 19:13               ` Marcelo Tosatti
2003-07-09 19:45                 ` Marc-Christian Petersen
2003-07-09 23:43                 ` Alan Cox
2003-07-10  0:21                   ` Jeff Garzik
2003-07-09 18:33             ` Trond Myklebust
2003-07-09 18:41               ` Marc-Christian Petersen
2003-07-09 18:50                 ` Trond Myklebust
2003-07-09 18:55                   ` Marc-Christian Petersen
2003-07-09 19:05                   ` Jeff Garzik
2003-07-09 19:08                     ` Trond Myklebust
2003-07-09 19:17                       ` Jeff Garzik
2003-07-09 19:51                         ` Trond Myklebust
2003-07-09 21:43                           ` Jeff Garzik
2003-07-09 23:42                         ` Alan Cox
2003-07-10  0:23                           ` Jeff Garzik
2003-07-09 23:40                   ` Alan Cox
2003-07-09 18:29     ` Trond Myklebust
2003-07-09 18:51       ` Andreas Dilger
2003-07-09 19:18         ` Jeff Garzik

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).