* RFC: [PATCH] staging/lustre/llite: fix O_TMPFILE/O_LOV_DELAY_CREATE conflict
From: Dilger, Andreas @ 2014-02-10 20:16 UTC (permalink / raw)
  To: Al Viro, Christoph Hellwig; +Cc: linux-fsdevel, Drokin, Oleg, Peng Tao, greg

On 2014/02/03, 5:09 PM, "Dilger, Andreas" <andreas.dilger@intel.com> wrote:
>On 2014/02/03, 4:07 PM, "Andreas Dilger" <andreas.dilger@intel.com> wrote:
>>In kernel 3.11 O_TMPFILE was introduced, but the open flag value
>>conflicts with the O_LOV_DELAY_CREATE flag 020000000 previously used
>>by Lustre-aware applications.  O_LOV_DELAY_CREATE allows applications
>>to defer file layout and object creation from open time (the default)
>>until it can instead be specified by the application using an ioctl.
>>
>>Instead of trying to find a non-conflicting O_LOV_DELAY_CREATE flag
>>or define a Lustre-specific flag that isn't of use to most/any other
>>filesystems, use (O_NOCTTY|FASYNC) as the new value.  These flags
>>are not meaningful for newly-created regular files and should be
>>OK since O_LOV_DELAY_CREATE is only meaningful for new files.
>>
>>I looked into using O_ACCMODE/FMODE_WRITE_IOCTL, which allows calling
>>ioctl() on the minimally-opened fd and is close to what is needed,
>>but that doesn't allow specifying the actual read or write mode for
>>the file, and fcntl(F_SETFL) doesn't allow O_RDONLY/O_WRONLY/O_RDWR
>>to be set after the file is opened.

Al, Christoph,
any comments on this approach?

>A few extra comments here that I wasn't sure should be in the commit
>comment.
>
>The main goal of the O_LOV_DELAY_CREATE flag is to allow the file to be
>opened in a "preliminary" manner to allow the application to specify the
>layout of the file across the Lustre storage targets (e.g. whether the
>app has millions of separate files, each one written to a single server,
>or there is a single huge file spread across all of the servers, or some
>combination of the two, whether it is RAID-0 or RAID-1, or whatever).
>
>I'm open to a separate flag value for this, but I don't know if that is
>of much interest to other filesystems since they don't have similar needs.
>We want to avoid the need to have lots of syscalls to do this, since they
>translate into multiple RPCs that we want to avoid when creating
>potentially millions of files over the network.
>
>Cheers, Andreas
>
>>Lustre-change: http://review.whamcloud.com/8312
>>Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-4209
>>Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
>>Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
>>Signed-off-by: Peng Tao <bergwolf@gmail.com>
>>---
>> .../lustre/lustre/include/lustre/lustre_user.h     |   12 ++++------
>> drivers/staging/lustre/lustre/include/lustre_mdc.h |   11 ++++++++++
>> drivers/staging/lustre/lustre/llite/file.c         |   21 ++++++++++---------
>> drivers/staging/lustre/lustre/mdc/mdc_lib.c        |    2 +-
>> 4 files changed, 28 insertions(+), 18 deletions(-)
>>
>>diff --git a/drivers/staging/lustre/lustre/include/lustre/lustre_user.h b/drivers/staging/lustre/lustre/include/lustre/lustre_user.h
>>index 6b6c0240..91f22a8 100644
>>--- a/drivers/staging/lustre/lustre/include/lustre/lustre_user.h
>>+++ b/drivers/staging/lustre/lustre/include/lustre/lustre_user.h
>>@@ -265,13 +265,11 @@ struct ost_id {
>> 
>> #define MAX_OBD_NAME 128 /* If this changes, a NEW ioctl must be added */
>> 
>>-/* Hopefully O_LOV_DELAY_CREATE does not conflict with standard O_xxx flags.
>>- * Previously it was defined as 0100000000 and conflicts with FMODE_NONOTIFY
>>- * which was added since kernel 2.6.36, so we redefine it as 020000000.
>>- * To be compatible with old version's statically linked binary, finally we
>>- * define it as (020000000 | 0100000000).
>>- * */
>>-#define O_LOV_DELAY_CREATE      0120000000
>>+/* Define O_LOV_DELAY_CREATE to be a mask of flags that are not useful
>>+ * for regular files, are unlikely to be used in practice, and are not
>>+ * harmful if used incorrectly.  O_NOCTTY and FASYNC are only meaningful
>>+ * for character devices and are safe on new files (see LU-812, LU-4209). */
>>+#define O_LOV_DELAY_CREATE	(O_NOCTTY | FASYNC)
>> 
>> #define LL_FILE_IGNORE_LOCK     0x00000001
>> #define LL_FILE_GROUP_LOCKED    0x00000002
>>diff --git a/drivers/staging/lustre/lustre/include/lustre_mdc.h b/drivers/staging/lustre/lustre/include/lustre_mdc.h
>>index c1e0270..468f363 100644
>>--- a/drivers/staging/lustre/lustre/include/lustre_mdc.h
>>+++ b/drivers/staging/lustre/lustre/include/lustre_mdc.h
>>@@ -166,6 +166,17 @@ void it_clear_disposition(struct lookup_intent *it, int flag);
>> void it_set_disposition(struct lookup_intent *it, int flag);
>> int it_open_error(int phase, struct lookup_intent *it);
>> 
>>+static inline bool cl_is_lov_delay_create(unsigned int flags)
>>+{
>>+	return (flags & O_LOV_DELAY_CREATE) == O_LOV_DELAY_CREATE;
>>+}
>>+
>>+static inline void cl_lov_delay_create_clear(unsigned int *flags)
>>+{
>>+	if ((*flags & O_LOV_DELAY_CREATE) == O_LOV_DELAY_CREATE)
>>+		*flags &= ~O_LOV_DELAY_CREATE;
>>+}
>>+
>> /** @} mdc */
>> 
>> #endif
>>diff --git a/drivers/staging/lustre/lustre/llite/file.c b/drivers/staging/lustre/lustre/llite/file.c
>>index c12821a..dc9da77 100644
>>--- a/drivers/staging/lustre/lustre/llite/file.c
>>+++ b/drivers/staging/lustre/lustre/llite/file.c
>>@@ -671,14 +671,13 @@ restart:
>> 
>> 	ll_capa_open(inode);
>> 
>>-	if (!lli->lli_has_smd) {
>>-		if (file->f_flags & O_LOV_DELAY_CREATE ||
>>-		    !(file->f_mode & FMODE_WRITE)) {
>>-			CDEBUG(D_INODE, "object creation was delayed\n");
>>-			GOTO(out_och_free, rc);
>>-		}
>>+	if (!lli->lli_has_smd &&
>>+	    (cl_is_lov_delay_create(file->f_flags) ||
>>+	     (file->f_mode & FMODE_WRITE) == 0)) {
>>+		CDEBUG(D_INODE, "object creation was delayed\n");
>>+		GOTO(out_och_free, rc);
>> 	}
>>-	file->f_flags &= ~O_LOV_DELAY_CREATE;
>>+	cl_lov_delay_create_clear(&file->f_flags);
>> 	GOTO(out_och_free, rc);
>> 
>> out_och_free:
>>@@ -1381,23 +1380,25 @@ int ll_lov_setstripe_ea_info(struct inode *inode, struct file *file,
>> 		ccc_inode_lsm_put(inode, lsm);
>> 		CDEBUG(D_IOCTL, "stripe already exists for ino %lu\n",
>> 		       inode->i_ino);
>>-		return -EEXIST;
>>+		GOTO(out, rc = -EEXIST);
>> 	}
>> 
>> 	ll_inode_size_lock(inode);
>> 	rc = ll_intent_file_open(file, lum, lum_size, &oit);
>> 	if (rc)
>>-		GOTO(out, rc);
>>+		GOTO(out_unlock, rc);
>> 	rc = oit.d.lustre.it_status;
>> 	if (rc < 0)
>> 		GOTO(out_req_free, rc);
>> 
>> 	ll_release_openhandle(file->f_dentry, &oit);
>> 
>>- out:
>>+out_unlock:
>> 	ll_inode_size_unlock(inode);
>> 	ll_intent_release(&oit);
>> 	ccc_inode_lsm_put(inode, lsm);
>>+out:
>>+	cl_lov_delay_create_clear(&file->f_flags);
>> 	return rc;
>> out_req_free:
>> 	ptlrpc_req_finished((struct ptlrpc_request *) oit.d.lustre.it_data);
>>diff --git a/drivers/staging/lustre/lustre/mdc/mdc_lib.c b/drivers/staging/lustre/lustre/mdc/mdc_lib.c
>>index 91f6876..5b9f371 100644
>>--- a/drivers/staging/lustre/lustre/mdc/mdc_lib.c
>>+++ b/drivers/staging/lustre/lustre/mdc/mdc_lib.c
>>@@ -197,7 +197,7 @@ static __u64 mds_pack_open_flags(__u64 flags, __u32 mode)
>> 	if (flags & FMODE_EXEC)
>> 		cr_flags |= MDS_FMODE_EXEC;
>> #endif
>>-	if (flags & O_LOV_DELAY_CREATE)
>>+	if (cl_is_lov_delay_create(flags))
>> 		cr_flags |= MDS_OPEN_DELAY_CREATE;
>> 
>> 	if (flags & O_NONBLOCK)
>>-- 
>>1.7.3.4
>>
>>
>
>
>Cheers, Andreas
>-- 
>Andreas Dilger
>
>Lustre Software Architect
>Intel High Performance Data Division
>
>
>


Cheers, Andreas
-- 
Andreas Dilger

Lustre Software Architect
Intel High Performance Data Division




* Re: RFC: [PATCH] staging/lustre/llite: fix O_TMPFILE/O_LOV_DELAY_CREATE conflict
From: Al Viro @ 2014-02-10 21:29 UTC (permalink / raw)
  To: Dilger, Andreas
  Cc: Christoph Hellwig, linux-fsdevel, Drokin, Oleg, Peng Tao, greg

On Mon, Feb 10, 2014 at 08:16:52PM +0000, Dilger, Andreas wrote:

> >>Instead of trying to find a non-conflicting O_LOV_DELAY_CREATE flag
> >>or define a Lustre-specific flag that isn't of use to most/any other
> >>filesystems, use (O_NOCTTY|FASYNC) as the new value.  These flags
> >>are not meaningful for newly-created regular files and should be
> >>OK since O_LOV_DELAY_CREATE is only meaningful for new files.

*shrug*

I can live with that; it's a kludge, but it's less broken than that
explicit constant - that one is a non-starter, since O_... flag
values are arch-dependent.

	I have another question about what you are doing there - the games
you are playing with crw_pos.  Is there any reason not to have ->ki_pos
updated immediately in lustre_generic_file_read()/lustre_generic_file_write()?

	These two are the only places in the entire tree where
generic_file_aio_{read,write}() does *not* have ppos argument
equal to &iocb->ki_pos and I would very much prefer to kill the
sucker off.
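
For reference, the signatures in question, as declared in 3.x-era
include/linux/fs.h (shown only to make the ppos point concrete):

	ssize_t generic_file_aio_read(struct kiocb *iocb,
				      const struct iovec *iov,
				      unsigned long nr_segs, loff_t pos);
	ssize_t generic_file_aio_write(struct kiocb *iocb,
				       const struct iovec *iov,
				       unsigned long nr_segs, loff_t pos);

Every other caller passes iocb->ki_pos as that last argument.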


* Re: RFC: [PATCH] staging/lustre/llite: fix O_TMPFILE/O_LOV_DELAY_CREATE conflict
From: Al Viro @ 2014-02-10 22:10 UTC (permalink / raw)
  To: Dilger, Andreas
  Cc: Christoph Hellwig, linux-fsdevel, Drokin, Oleg, Peng Tao, greg

On Mon, Feb 10, 2014 at 09:29:29PM +0000, Al Viro wrote:
> On Mon, Feb 10, 2014 at 08:16:52PM +0000, Dilger, Andreas wrote:
> 
> > >>Instead of trying to find a non-conflicting O_LOV_DELAY_CREATE flag
> > >>or define a Lustre-specific flag that isn't of use to most/any other
> > >>filesystems, use (O_NOCTTY|FASYNC) as the new value.  These flags
> > >>are not meaningful for newly-created regular files and should be
> > >>OK since O_LOV_DELAY_CREATE is only meaningful for new files.
> 
> *shrug*
> 
> I can live with that; it's a kludge, but it's less broken than that
> explicit constant - that one is a non-starter, since O_... flag
> values are arch-dependent.
> 
> 	I have another question about what you are doing there - the games
> you are playing with crw_pos.  Is there any reason not to have ->ki_pos
> updated immediately in lustre_generic_file_read()/lustre_generic_file_write()?
> 
> 	These two are the only places in the entire tree where
> generic_file_aio_{read,write}() does *not* have ppos argument
> equal to &iocb->ki_pos and I would very much prefer to kill the
> sucker off.

Ugh...  Sorry, I misread that code.  Why the devil do you have the
pos argument passed to lustre_generic_file_{read,write}() by address,
when both proceed to dereference it and pass the value on?


* Re: RFC: [PATCH] staging/lustre/llite: fix O_TMPFILE/O_LOV_DELAY_CREATE conflict
From: Al Viro @ 2014-02-10 22:51 UTC (permalink / raw)
  To: Dilger, Andreas
  Cc: Christoph Hellwig, linux-fsdevel, Drokin, Oleg, Peng Tao, greg

On Mon, Feb 10, 2014 at 10:10:30PM +0000, Al Viro wrote:

> Ugh...  Sorry, I misread that code.  Why the devil do you have the
> pos argument passed to lustre_generic_file_{read,write}() by address,
> when both proceed to dereference it and pass the value on?

Egads.  Correct me if I'm wrong, please, but it looks like you have

1) ->aio_write() stash its arguments in an off-stack struct and pass it
and &iocb->ki_pos to

2) ll_file_io_generic() which calls

3) cl_io_rw_init() which copies the ki_pos value passed to it into
io->u.ci_rw.crw_pos (in another off-stack struct) and calls

4) cl_io_init() which calls

5) cl_io_init0() which possibly calls a bunch of instances of ->coo_io_init(),
which may or may not return 0; I _hope_ that in this case that's what'll
happen and we get back to ll_file_io_generic(), where

3') we stash iocb and iov/iovlen into the third off-stack structure (cio)
and call

4') cl_io_loop() where we (in a loop) call cl_io_iter_init(), cl_io_lock() and

5') cl_io_start() which calls a bunch (how large in that case?) of

6') ->cio_start() instances.  Hopefully that'll be vvp_io_write_start()
which will pass the value of io->u.ci_rw.crw_pos (picked via an overlapping
field of union) to generic_file_aio_write().  Which, BTW, updates ->ki_pos.
Then we return into cl_io_loop(), where

4'') we call cl_io_end(), cl_io_unlock() and

5'') cl_io_rw_advance() which increments ....crw_pos, hopefully in sync with
what we have in iocb->ki_pos.  And calls a bunch of methods.  And return
into cl_io_loop(), where

4''') we call cl_io_iter_fini() (_another_ pile of methods called) and possibly
repeat everything from (4') on (apparently only if nothing had been done so
far).  Eventually we return into ll_file_io_generic() and there

3''') we copy ....crw_pos into iocb->ki_pos.  WTF do we need that?  Hadn't
generic_file_aio_write() been good enough?

Is that correct?  I have *not* traced it into all methods that might've
been called in process - stuff called from cl_io_loop() is chock-full of
those.  Have I missed anything relevant wrt file position handling in
there?

You guys really should be forced to hand-draw a call graph for that thing.
Both as a punishment and a deterrent...


* Re: RFC: [PATCH] staging/lustre/llite: fix O_TMPFILE/O_LOV_DELAY_CREATE conflict
From: Xiong, Jinshan @ 2014-02-11  0:18 UTC (permalink / raw)
  To: Al Viro
  Cc: Dilger, Andreas, Christoph Hellwig, linux-fsdevel, Drokin, Oleg,
	Peng Tao, greg



On Feb 10, 2014, at 2:10 PM, Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Mon, Feb 10, 2014 at 09:29:29PM +0000, Al Viro wrote:
>> On Mon, Feb 10, 2014 at 08:16:52PM +0000, Dilger, Andreas wrote:
>>
>>>>> Instead of trying to find a non-conflicting O_LOV_DELAY_CREATE flag
>>>>> or define a Lustre-specific flag that isn't of use to most/any other
>>>>> filesystems, use (O_NOCTTY|FASYNC) as the new value.  These flags
>>>>> are not meaningful for newly-created regular files and should be
>>>>> OK since O_LOV_DELAY_CREATE is only meaningful for new files.
>>
>> *shrug*
>>
>> I can live with that; it's a kludge, but it's less broken than that
>> explicit constant - that one is a non-starter, since O_... flag
>> values are arch-dependent.
>>
>>       I have another question about what you are doing there - the games
>> you are playing with crw_pos.  Is there any reason not to have ->ki_pos
>> updated immediately in lustre_generic_file_read()/lustre_generic_file_write()?
>>
>>       These two are the only places in the entire tree where
>> generic_file_aio_{read,write}() does *not* have ppos argument
>> equal to &iocb->ki_pos and I would very much prefer to kill the
>> sucker off.
>
> Ugh...  Sorry, I misread that code.  Why the devil do you have the
> pos argument passed to lustre_generic_file_{read,write}() by address,
> when both proceed to dereference it and pass the value on?

Indeed. The attached patch should fix the problem.

Jinshan






[-- Attachment #2: 0001-staging-lustre-llite-remove-lustre_generic_file_-rea.patch --]
[-- Type: application/octet-stream, Size: 3692 bytes --]

From 9a77c75976f60aa2ca76a91b5935cc7020558b34 Mon Sep 17 00:00:00 2001
From: Jinshan Xiong <jinshan.xiong@intel.com>
Date: Mon, 10 Feb 2014 14:38:21 -0800
Subject: [PATCH] staging/lustre/llite: remove
 lustre_generic_file_{read,write}

It looks like lustre_generic_file_{read,write} are a holdover from
2.6.19 where generic_file_aio_read() replaced generic_file_readv()
and cross-kernel interoperability was required for some period of
time. Lustre has since removed support for those older kernels, but
it looks like the wrappers were not deleted at that time. This patch
will delete them.

Pass &iocb->ki_pos as the last argument for these functions instead
of crw_pos, since this is the convention for other callers.  Verify
that this is the same as the current crw_pos argument.  This code can
likely be cleaned up further in a later patch.

Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
---
 drivers/staging/lustre/lustre/llite/vvp_io.c |   29 +++++++++-----------------
 1 file changed, 10 insertions(+), 19 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/vvp_io.c b/drivers/staging/lustre/lustre/llite/vvp_io.c
index 93cbfbb..aaee8a4 100644
--- a/drivers/staging/lustre/lustre/llite/vvp_io.c
+++ b/drivers/staging/lustre/lustre/llite/vvp_io.c
@@ -474,20 +474,6 @@ static void vvp_io_setattr_fini(const struct lu_env *env,
 	vvp_io_fini(env, ios);
 }
 
-static ssize_t lustre_generic_file_read(struct file *file,
-					struct ccc_io *vio, loff_t *ppos)
-{
-	return generic_file_aio_read(vio->cui_iocb, vio->cui_iov,
-				     vio->cui_nrsegs, *ppos);
-}
-
-static ssize_t lustre_generic_file_write(struct file *file,
-					struct ccc_io *vio, loff_t *ppos)
-{
-	return generic_file_aio_write(vio->cui_iocb, vio->cui_iov,
-				      vio->cui_nrsegs, *ppos);
-}
-
 static int vvp_io_read_start(const struct lu_env *env,
 			     const struct cl_io_slice *ios)
 {
@@ -506,6 +492,7 @@ static int vvp_io_read_start(const struct lu_env *env,
 	int     exceed = 0;
 
 	CLOBINVRNT(env, obj, ccc_object_invariant(obj));
+	LASSERT(cio->cui_iocb->ki_pos == pos);
 
 	CDEBUG(D_VFSTRACE, "read: -> [%lli, %lli)\n", pos, pos + cnt);
 
@@ -540,8 +527,10 @@ static int vvp_io_read_start(const struct lu_env *env,
 	file_accessed(file);
 	switch (vio->cui_io_subtype) {
 	case IO_NORMAL:
-		 result = lustre_generic_file_read(file, cio, &pos);
-		 break;
+		result = generic_file_aio_read(cio->cui_iocb,
+					       cio->cui_iov, cio->cui_nrsegs,
+					       cio->cui_iocb->ki_pos);
+		break;
 	case IO_SPLICE:
 		result = generic_file_splice_read(file, &pos,
 				vio->u.splice.cui_pipe, cnt,
@@ -586,7 +575,6 @@ static int vvp_io_write_start(const struct lu_env *env,
 	struct cl_io       *io    = ios->cis_io;
 	struct cl_object   *obj   = io->ci_obj;
 	struct inode       *inode = ccc_object_inode(obj);
-	struct file	*file  = cio->cui_fd->fd_file;
 	ssize_t result = 0;
 	loff_t pos = io->u.ci_wr.wr.crw_pos;
 	size_t cnt = io->u.ci_wr.wr.crw_count;
@@ -601,6 +589,8 @@ static int vvp_io_write_start(const struct lu_env *env,
 		 */
 		pos = io->u.ci_wr.wr.crw_pos = i_size_read(inode);
 		cio->cui_iocb->ki_pos = pos;
+	} else {
+		LASSERT(cio->cui_iocb->ki_pos == pos);
 	}
 
 	CDEBUG(D_VFSTRACE, "write: [%lli, %lli)\n", pos, pos + (long long)cnt);
@@ -608,8 +598,9 @@ static int vvp_io_write_start(const struct lu_env *env,
 	if (cio->cui_iov == NULL) /* from a temp io in ll_cl_init(). */
 		result = 0;
 	else
-		result = lustre_generic_file_write(file, cio, &pos);
-
+		result = generic_file_aio_write(cio->cui_iocb,
+						cio->cui_iov, cio->cui_nrsegs,
+						cio->cui_iocb->ki_pos);
 	if (result > 0) {
 		if (result < cnt)
 			io->ci_continue = 0;
-- 
1.7.9.5



* Re: RFC: [PATCH] staging/lustre/llite: fix O_TMPFILE/O_LOV_DELAY_CREATE conflict
From: Dilger, Andreas @ 2014-02-11  0:31 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, linux-fsdevel, Drokin, Oleg, Peng Tao, greg,
	Xiong, Jinshan

On 2014/02/10, 3:51 PM, "Al Viro" <viro@ZenIV.linux.org.uk> wrote:
>
>Egads.  Correct me if I'm wrong, please, but it looks like you have
>
>1) ->aio_write() stash its arguments in an off-stack struct and pass it
>and &iocb->ki_pos to
>
>2) ll_file_io_generic() which calls
>
>3) cl_io_rw_init() which copies the ki_pos value passed to it into
>io->u.ci_rw.crw_pos (in another off-stack struct) and calls

The off-stack storage of per-thread info is used to avoid stack overflow
because of the 4kB stack limitation, and to avoid kmalloc/kfree of the
same data structures for every call into the callpath.  Other code in
the kernel handles similar problems by having a separate thread do the
lower level IO, but that adds context switches to the IO path.  I'm not
a big fan of this, but that is what we live with due to limited stack
size.

>4) cl_io_init() which calls
>
>5) cl_io_init0() which possibly calls a bunch of instances of
>->coo_io_init(),
>which may or may not return 0; I _hope_ that in this case that's what'll
>happen and we get back to ll_file_io_generic(), where
>
>3') we stash iocb and iov/iovlen into the third off-stack structure (cio)
>and call

The reason for separate IO structures at different parts of the stack is
that the Lustre client code has different upper level VFS interfaces for
Linux, MacOS, WinNT, and a userspace library.  The kernel iocb is not
available in all of those cases.

Those non-Linux VFS interfaces are deprecated, and we are looking to
remove these abstractions again, but it will take time.

>4') cl_io_loop() where we (in a loop) call cl_io_iter_init(),
>cl_io_lock() and
>
>5') cl_io_start() which calls a bunch (how large in that case?) of
>
>6') ->cio_start() instances.  Hopefully that'll be vvp_io_write_start()
>which will pass the value of io->u.ci_rw.crw_pos (picked via an
>overlapping field of union) to generic_file_aio_write().  Which, BTW,
> updates ->ki_pos. Then we return into cl_io_loop(), where
>
>4'') we call cl_io_end(), cl_io_unlock() and
>
>5'') cl_io_rw_advance() which increments ....crw_pos, hopefully in sync
>with what we have in iocb->ki_pos.  And calls a bunch of methods.  And
>return into cl_io_loop(), where
>
>4''') we call cl_io_iter_fini() (_another_ pile of methods called) and
> possibly repeat everything from (4') on (apparently only if nothing had
> been done so far).  Eventually we return into ll_file_io_generic() and
>there
>
>3''') we copy ....crw_pos into iocb->ki_pos.  WTF do we need that?  Hadn't
>generic_file_aio_write() been good enough?
>
>Is that correct?  I have *not* traced it into all methods that might've
>been called in process - stuff called from cl_io_loop() is chock-full of
>those.  Have I missed anything relevant wrt file position handling in
>there?

This internal separation of the IO submission path was done to allow
parallel IO generation for multiple target devices and parallel RPC
generation.  Pretty much everyone (except the original author, who no
longer works on Lustre) agrees that the level of abstraction is too
much, and we need to simplify it so that more than a handful of people
can even understand it.

>You guys really should be forced to hand-draw a call graph for that thing.
>Both as a punishment and a deterrent...

Believe me, this is one of my least favourite and most complex parts
of the Lustre code, and we've had many internal discussions and plans
to clean it up.  Some of that cleanup work has already started, and more
is planned, but it will take a while to test it and push it upstream.

Cheers, Andreas
-- 
Andreas Dilger

Lustre Software Architect
Intel High Performance Data Division




* Re: RFC: [PATCH] staging/lustre/llite: fix O_TMPFILE/O_LOV_DELAY_CREATE conflict
From: Dilger, Andreas @ 2014-02-11  0:37 UTC (permalink / raw)
  To: greg; +Cc: Christoph Hellwig, linux-fsdevel, Drokin, Oleg, Peng Tao, Al Viro

On 2014/02/10, 2:29 PM, "Al Viro" <viro@ZenIV.linux.org.uk> wrote:
>On Mon, Feb 10, 2014 at 08:16:52PM +0000, Dilger, Andreas wrote:
>> >>Instead of trying to find a non-conflicting O_LOV_DELAY_CREATE flag
>> >>or define a Lustre-specific flag that isn't of use to most/any other
>> >>filesystems, use (O_NOCTTY|FASYNC) as the new value.  These flags
>> >>are not meaningful for newly-created regular files and should be
>> >>OK since O_LOV_DELAY_CREATE is only meaningful for new files.
>
>*shrug*
>
>I can live with that; it's a kludge, but it's less broken than that
>explicit constant - that one is a non-starter, since O_... flag
>values are arch-dependent.

Greg,
could you please merge the original patch.  We'd like to get this into
our pending release of the Lustre user tools and into the releases for
older kernels (which will support both the old and new flags until the
support for older kernels is removed).

Cheers, Andreas
-- 
Andreas Dilger

Lustre Software Architect
Intel High Performance Data Division




* Re: RFC: [PATCH] staging/lustre/llite: fix O_TMPFILE/O_LOV_DELAY_CREATE conflict
From: greg @ 2014-02-11  0:51 UTC (permalink / raw)
  To: Dilger, Andreas
  Cc: Christoph Hellwig, linux-fsdevel, Drokin, Oleg, Peng Tao, Al Viro

On Tue, Feb 11, 2014 at 12:37:09AM +0000, Dilger, Andreas wrote:
> On 2014/02/10, 2:29 PM, "Al Viro" <viro@ZenIV.linux.org.uk> wrote:
> >On Mon, Feb 10, 2014 at 08:16:52PM +0000, Dilger, Andreas wrote:
> >> >>Instead of trying to find a non-conflicting O_LOV_DELAY_CREATE flag
> >> >>or define a Lustre-specific flag that isn't of use to most/any other
> >> >>filesystems, use (O_NOCTTY|FASYNC) as the new value.  These flags
> >> >>are not meaningful for newly-created regular files and should be
> >> >>OK since O_LOV_DELAY_CREATE is only meaningful for new files.
> >
> >*shrug*
> >
> >I can live with that; it's a kludge, but it's less broken than that
> >explicit constant - that one is a non-starter, since O_... flag
> >values are arch-dependent.
> 
> Greg,
> could you please merge the original patch.  We'd like to get this into
> our pending release of the Lustre user tools and into the releases for
> older kernels (which will support both the old and new flags until the
> support for older kernels is removed).

Can you please resend the original patch, without the "RFC" line in the
subject, so I know to apply it now?

thanks,

greg k-h


* Re: RFC: [PATCH] staging/lustre/llite: fix O_TMPFILE/O_LOV_DELAY_CREATE conflict
From: Al Viro @ 2014-02-11  2:40 UTC (permalink / raw)
  To: Dilger, Andreas
  Cc: Christoph Hellwig, linux-fsdevel, Drokin, Oleg, Peng Tao, greg,
	Xiong, Jinshan

On Tue, Feb 11, 2014 at 12:31:39AM +0000, Dilger, Andreas wrote:

> The off-stack storage of per-thread info is used to avoid stack overflow
> because of the 4kB stack limitation, and to avoid kmalloc/kfree of the
> same data structures for every call into the callpath.  Other code in
> the kernel handles similar problems by having a separate thread do the
> lower level IO, but that adds context switches to the IO path.  I'm not
> a big fan of this, but that is what we live with due to limited stack
> size.

<wry> Reducing the depth of call chains also might help. </wry>
And yes, I understand the motivation; makes keeping track of what's going on
really unpleasant, though...

> >4) cl_io_init() which calls
> >
> >5) cl_io_init0() which possibly calls a bunch of instances of
> >->coo_io_init(),
> >which may or may not return 0; I _hope_ that in this case that's what'll
> >happen and we get back to ll_file_io_generic(), where

Umm...  So am I right about the thing returning 0 in this case and in
case of ->aio_read()?

> >5') cl_io_start() which calls a bunch (how large in that case?) of
> >
> >6') ->cio_start() instances.  Hopefully that'll be vvp_io_write_start()

Again, is that assumption correct?  I'm not interested in criticizing
overall design, etc.; what I am interested in is what does that code
end up doing wrt carrying iov/iovlen/iocb/position through the entire
thing.

> >which will pass the value of io->u.ci_rw.crw_pos (picked via an
> >overlapping field of union) to generic_file_aio_write().  Which, BTW,
> > updates ->ki_pos. Then we return into cl_io_loop(), where
> >
> >4'') we call cl_io_end(), cl_io_unlock() and
> >
> >5'') cl_io_rw_advance() which increments ....crw_pos, hopefully in sync
> >with what we have in iocb->ki_pos.  And calls a bunch of methods.  And
> >return into cl_io_loop(), where

> >3''') we copy ....crw_pos into iocb->ki_pos.  WTF do we need that?  Hadn't
> >generic_file_aio_write() been good enough?

Is it safe to assume that
	* these two calls of generic_file_aio_{read,write} will always
get pos equal to iocb->ki_pos
	* iocb/iov/iovlen will be unchanged since the moment they'd been
passed to ll_file_aio_{read,write}()
	* ->ki_pos assignment after the call of cl_io_loop() is a no-op for
IO_NORMAL case
	* these calls of generic_file_aio_{read,write} will always have
ll_file_aio_{read,write} in the call chain and can't be reached otherwise?

What I want to do is turn ->aio_read()/->aio_write() signature into
kiocb x iov_iter -> ssize_t.  With all instances, including
generic_file_aio_{read,write}() converted to that.   With pos delivered via
iocb and iov/nrsegs via iov_iter.  Supplying that in places where they
are called via method is trivial and for such calls we definitely have
pos == iocb->ki_pos.  Most of the places calling those instances directly
(not via method) are in other instances of the same methods and pass the
arguments unchanged.  These two in lustre are, in fact, the only real
exceptions.  IF the hypotheses above are correct, we can convert those
two - just store iov_iter reference instead of iov/nrsegs in vvp_io_args and
in ccc_io and that'll do the trick.  Probably put iov_iter alongside the
vti_local_iovec as well, but I really wonder if that's worth the trouble -
might be better to use do_sync_read/do_sync_write and be done with that.
Sure, it will slightly increase the stack footprint, but if you are _that_
tight on stack...  Anyway, that's a separate story - we can deal with that
later.
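
Roughly, the methods would end up looking like this (names illustrative
only - this is the shape, not a final API):

	/* hypothetical: the position lives in iocb->ki_pos, and the
	 * segments plus remaining byte count live in the iov_iter */
	ssize_t (*read_iter)(struct kiocb *iocb, struct iov_iter *iter);
	ssize_t (*write_iter)(struct kiocb *iocb, struct iov_iter *iter);

so the separate pos/iov/nrsegs arguments disappear from the signature
entirely.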

PS: I'm serious about the call graph, ideally along with the stack footprints
of all functions.  At least it would give some idea how much do various
paths contribute...  Of course, it's a real bitch to automate with the
amount of indirect calls you've got there - the tricky part is finding the
set of functions that might be called at given location ;-/


* Re: RFC: [PATCH] staging/lustre/llite: fix O_TMPFILE/O_LOV_DELAY_CREATE conflict
From: Drokin, Oleg @ 2014-02-11  2:54 UTC (permalink / raw)
  To: Al Viro
  Cc: Dilger, Andreas, Christoph Hellwig, linux-fsdevel, Peng Tao,
	greg, Xiong, Jinshan

Hello!

On Feb 10, 2014, at 9:40 PM, Al Viro wrote:

> PS: I'm serious about the call graph, ideally along with the stack footprints
> of all functions.  At least it would give some idea how much do various
> paths contribute...  Of course, it's a real bitch to automate with the
> amount of indirect calls you've got there - the tricky part is finding the
> set of functions that might be called at given location ;-/

If you are really curious, here are the three documents that used to be
true soon after the functionality was first implemented.
Since then a bunch of changes (sometimes passed off as simplification,
sometimes not) have been made. But this is the best set of documentation
available (other than the code, of course, which is, sadly, not super
readable as you have noticed already).
And it should give you some nightmares^Widea.

The documents have sample callgraphs and state machines as initially envisioned
(before meeting most of hard reality, should I add).

http://wiki.lustre.org/images/1/1d/CLIO.pdf (meat starts at slide 6, statemachines at slide 11)

http://wiki.lustre.org/images/6/66/CLIO-TOI.pdf (somewhat more in-depth version of the above).

http://wiki.lustre.org/images/3/37/CLIO-TOI-notes.pdf - text-only version with a
                       lot of ascii flowcharts and lengthy explanations that
                       was delivered to us when the primary designer/developer
                       of the code decided to leave our team.

Bye,
    Oleg


* Re: RFC: [PATCH] staging/lustre/llite: fix O_TMPFILE/O_LOV_DELAY_CREATE conflict
From: Xiong, Jinshan @ 2014-02-11  6:55 UTC (permalink / raw)
  To: Al Viro
  Cc: Dilger, Andreas, Christoph Hellwig, linux-fsdevel, Drokin, Oleg,
	Peng Tao, greg


On Feb 10, 2014, at 6:40 PM, Al Viro <viro@ZenIV.linux.org.uk> wrote:

On Tue, Feb 11, 2014 at 12:31:39AM +0000, Dilger, Andreas wrote:

The off-stack storage of per-thread info is used to avoid stack overflow
because of the 4kB stack limitation, and to avoid kmalloc/kfree of the
same data structures for every call into the callpath.  Other code in
the kernel handles similar problems by having a separate thread do the
lower level IO, but that adds context switches to the IO path.  I'm not
a big fan of this, but that is what we live with due to limited stack
size.

<wry> Reducing the depth of call chains also might help. </wry>
And yes, I understand the motivation; makes keeping track of what's going on
really unpleasant, though...

4) cl_io_init() which calls

5) cl_io_init0() which possibly calls a bunch of instances of
->coo_io_init(),
which may or may not return 0; I _hope_ that in this case that's what'll
happen and we get back to ll_file_io_generic(), where

Umm...  So am I right about the thing returning 0 in this case and in
case of ->aio_read()?

Yes, cl_io_init() can return one of the following integers:
 = 0: IO can go forward
 > 0: IO is not necessary
 < 0: error occurs

So it will call aio_read() if cl_io_init() returns zero.


5') cl_io_start() which calls a bunch (how large in that case?) of

6') ->cio_start() instances.  Hopefully that'll be vvp_io_write_start()

Again, is that assumption correct?  I'm not interested in criticizing
overall design, etc.; what I am interested in is what does that code
end up doing wrt carrying iov/iovlen/iocb/position through the entire
thing.

Yes, it calls vvp_io_write_start().

Lustre supports data striping, so a file can be composed of multiple OST (Object Storage Target) objects. When reading or writing a striped file, IO is performed stripe by stripe, which is why you see the loop of cl_io_lock(), cl_io_start(), and cl_io_unlock(), etc.

cl_io_init();              /* set up the IO, compute the layout */
while (still has data) {
    cl_io_iter_init();     /* split off the next stripe-sized chunk */
    cl_io_lock();          /* take the dlm lock for that chunk */
    cl_io_start();         /* do the actual IO */
    cl_io_end();
    cl_io_unlock();
}
cl_io_fini();

Essentially, cl_io_iter_init() splits the IO by stripes, followed by taking the dlm lock, doing the actual IO, releasing the lock, etc.

Let's take an example: say a file has two stripes, on OST0 and OST1, and the stripe size is 1M. When 2M of data is written to the file starting from zero, the above code will loop twice to finish the IO.
The first iteration will write [0, 1M) to OST0, and the second will write [1M, 2M) to OST1. As a result, generic_file_aio_write() will be called twice.

The reason we split the IO by stripes is cascading problems, which is another story.
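
To make the mapping concrete, this is roughly the RAID-0 arithmetic
involved (an illustrative sketch, not the actual LOV code):

	/* Sketch: map a file offset to a stripe (OST object) index and
	 * an offset within that object, for a plain RAID-0 layout. */
	static void stripe_map(loff_t off, unsigned int stripe_size,
			       unsigned int stripe_count,
			       unsigned int *stripe_idx, loff_t *obj_off)
	{
		loff_t chunk = off / stripe_size; /* stripe-sized chunk index */

		*stripe_idx = chunk % stripe_count; /* round-robin over objects */
		*obj_off = (chunk / stripe_count) * stripe_size +
			   off % stripe_size;
	}

With stripe_size = 1M and stripe_count = 2 as above, [0, 1M) maps to
object 0 and [1M, 2M) maps to object 1, each at object offset [0, 1M).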


which will pass the value of io->u.ci_rw.crw_pos (picked via an
overlapping field of union) to generic_file_aio_write().  Which, BTW,
updates ->ki_pos. Then we return into cl_io_loop(), where

4'') we call cl_io_end(), cl_io_unlock() and

5'') cl_io_rw_advance() which increments ....crw_pos, hopefully in sync
with what we have in iocb->ki_pos.  And calls a bunch of methods.  And
return into cl_io_loop(), where

3''') we copy ....crw_pos into iocb->ki_pos.  WTF do we need that?  Hadn't
generic_file_aio_write() been good enough?

Is it safe to assume that
* these two calls of generic_file_aio_{read,write} will always
get pos equal to iocb->ki_pos

Yes.

* iocb/iov/iovlen will be unchanged since the moment they'd been
passed to ll_file_aio_{read,write}()

Yes. They will be updated after an IO for one stripe is finished.

* ->ki_pos assignment after the call of cl_io_loop() is a no-op for
IO_NORMAL case

Yes. But as you have already noticed, the ppos in ll_file_io_generic() is NOT always &iocb->ki_pos. So it isn't totally wrong to have

"*ppos = io->u.ci_wr.wr.crw_pos;"

in ll_file_io_generic(); take a look at ll_file_splice_read().

* these calls of generic_file_aio_{read,write} will always have
ll_file_aio_{read,write} in the call chain and can't be reached otherwise?

Yes.


What I want to do is turn ->aio_read()/->aio_write() signature into
kiocb x iov_iter -> ssize_t.  With all instances, including
generic_file_aio_{read,write}() converted to that.   With pos delivered via
iocb and iov/nrsegs via iov_iter.  Supplying that in places where they
are called via method is trivial and for such calls we definitely have
pos == iocb->ki_pos.  Most of the places calling those instances directly

It's a good idea to use iov_iter{} to replace iov/nrsegs.

Even so, we still have to remember pos in cl_io{}, because the file position is used at the LOV layer to determine the stripe boundary.

(not via method) are in other instances of the same methods and pass the
arguments unchanged.  These two in lustre are, in fact, the only real
exceptions.  IF the hypotheses above are correct, we can convert those
two - just store iov_iter reference instead of iov/nrsegs in vvp_io_args and
in ccc_io and that'll do the trick.  Probably put iov_iter alongside the
vti_local_iovec as well, but I really wonder if that's worth the trouble -
might be better to use do_sync_read/do_sync_write and be done with that.

How do we handle ->splice_read() in this case? We can simplify read and write a little bit, but it seems we have to duplicate some code for splice_read(). Please advise.

Thank you very much for the advice.

Jinshan

Sure, it will slightly increase the stack footprint, but if you are _that_
tight on stack...  Anyway, that's a separate story - we can deal with that
later.

PS: I'm serious about the call graph, ideally along with the stack footprints
of all functions.  At least it would give some idea how much do various
paths contribute...  Of course, it's a real bitch to automate with the
amount of indirect calls you've got there - the tricky part is finding the
set of functions that might be called at given location ;-/



* Re: RFC: [PATCH] staging/lustre/llite: fix O_TMPFILE/O_LOV_DELAY_CREATE conflict
From: Christoph Hellwig @ 2014-02-11  9:13 UTC (permalink / raw)
  To: Al Viro
  Cc: Dilger, Andreas, Christoph Hellwig, linux-fsdevel, Drokin, Oleg,
	Peng Tao, greg

On Mon, Feb 10, 2014 at 09:29:29PM +0000, Al Viro wrote:
> I can live with that; it's a kludge, but it's less broken than that
> explicit constant - that one is a non-starter, since O_... flag
> values are arch-dependent.

Grabbing their own O_FLAG is of course not acceptable at all.
Personally I don't think this version is acceptable for real mainline
either.  What exactly are the semantics of the flag?  Why don't you do
object allocation on demand like all delalloc filesystems by default?


* Re: RFC: [PATCH] staging/lustre/llite: fix O_TMPFILE/O_LOV_DELAY_CREATE conflict
From: Dilger, Andreas @ 2014-02-11 11:01 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, Drokin, Oleg, Peng Tao, greg, Al Viro

On 2014/02/11, 2:13 AM, "Christoph Hellwig" <hch@infradead.org> wrote:
>On Mon, Feb 10, 2014 at 09:29:29PM +0000, Al Viro wrote:
>> I can live with that; it's a kludge, but it's less broken than that
>> explicit constant - that one is a non-starter, since O_... flag
>> values are arch-dependent.
>
>Grabbing their own O_FLAG is of course not acceptable at all.
>Personally I don't think this version is acceptable for real mainline
>either.  What exactly are the semantics of the flag?  Why don't you do
>object allocation on demand like all delalloc filesystems by default?

This was described in the original patch and follow-on email, but I'll
repeat it here, and expand the detail a bit further:

In kernel 3.11 O_TMPFILE was introduced, but the open flag value
conflicts with the O_LOV_DELAY_CREATE flag 020000000 previously used
by Lustre-aware applications.  O_LOV_DELAY_CREATE allows applications
to defer file layout and object creation from open time (the default)
until it can instead be specified by the application using an ioctl.

The main goal of the O_LOV_DELAY_CREATE flag is to allow the file to be
opened in a "preliminary" manner to allow the application to specify the
layout of the file across the Lustre storage targets (e.g. whether the
app has millions of separate files, each one written to a single server,
or there is a single huge file spread across all of the servers, or some
combination of the two, whether it is RAID-0 or RAID-1, or whatever).


FYI, an "object" in Lustre is not a fixed-size chunk of space like
Ceph or HDFS that needs to be continuously allocated as a file grows,
but rather a variable-sized inode-without-a-name that is written at
arbitrary byte offsets and can be sparse, so there is no need for
the client and metadata server to communicate after the initial
file layout has been decided.

The Lustre object(s) are normally allocated by the metadata server at
open time to avoid RPC round-trips and lock contention for files opened
by large numbers of nodes at once.  The layout is normally specified by
the filesystem default, or on the parent directory, but some applications
need fine-grained control over the layout to optimize for a particular
filesystem configuration.
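
For illustration, the intended usage looks roughly like the following
(a sketch assuming the LL_IOC_LOV_SETSTRIPE ioctl and struct
lov_user_md_v1 from lustre_user.h; error handling trimmed):

	#include <fcntl.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <lustre/lustre_user.h>

	/* Create the file without instantiating objects, then specify
	 * the layout with a single ioctl() before writing. */
	int create_striped(const char *path)
	{
		struct lov_user_md_v1 lum;
		int fd = open(path, O_CREAT | O_RDWR | O_LOV_DELAY_CREATE, 0644);

		if (fd < 0)
			return -1;

		memset(&lum, 0, sizeof(lum));
		lum.lmm_magic = LOV_USER_MAGIC_V1;
		lum.lmm_pattern = LOV_PATTERN_RAID0;
		lum.lmm_stripe_size = 1048576;	/* 1MB per stripe */
		lum.lmm_stripe_count = 4;	/* spread across four OSTs */
		lum.lmm_stripe_offset = -1;	/* let the MDS pick the first OST */

		if (ioctl(fd, LL_IOC_LOV_SETSTRIPE, &lum) < 0) {
			close(fd);
			return -1;
		}
		return fd;
	}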

Instead of trying to find a non-conflicting O_LOV_DELAY_CREATE flag
or define a Lustre-specific flag that isn't of use to most/any other
filesystems, use (O_NOCTTY|FASYNC) as the new value.  These flags
are not meaningful for newly-created regular files and should be
OK since O_LOV_DELAY_CREATE is only meaningful for new files.

I looked into using O_ACCMODE/FMODE_WRITE_IOCTL, which allows calling
ioctl() on the minimally-opened fd and is close to what is needed,
but that doesn't allow specifying the actual read or write mode for
the file, and fcntl(F_SETFL) doesn't allow O_RDONLY/O_WRONLY/O_RDWR
to be set after the file is opened.
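
(For anyone unfamiliar with the access mode 3 trick referred to above:
open(2) documents that Linux reserves the nonstandard access mode 3 to
return a descriptor usable only for ioctl(), e.g.

	/* fd is valid for ioctl() but not for read() or write() */
	int fd = open("/mnt/lustre/somefile", 3);

which is why it is close to, but not quite, what we need here.)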

We want to avoid the need to have lots of syscalls to do this, since
they translate into extra RPCs that we want to avoid when creating
potentially millions of files over the network.



Cheers, Andreas
-- 
Andreas Dilger

Lustre Software Architect
Intel High Performance Data Division




* Re: RFC: [PATCH] staging/lustre/llite: fix O_TMPFILE/O_LOV_DELAY_CREATE conflict
From: Al Viro @ 2014-02-11 14:25 UTC (permalink / raw)
  To: Xiong, Jinshan
  Cc: Dilger, Andreas, Christoph Hellwig, linux-fsdevel, Drokin, Oleg,
	Peng Tao, greg

On Tue, Feb 11, 2014 at 06:55:59AM +0000, Xiong, Jinshan wrote:
>> 4) cl_io_init() which calls
>> 
>> 5) cl_io_init0() which possibly calls a bunch of instances of
>> ->coo_io_init(),
>> which may or may not return 0; I _hope_ that in this case that's what'll
>> happen and we get back to ll_file_io_generic(), where
> 
>> Umm...  So am I right about the thing returning 0 in this case and in
>> case of ->aio_read()?
> 
> Yes, cl_io_init() can return one of the following integers:
>  = 0: IO can go forward
>  > 0: IO is not necessary
>  < 0: error occurs
> 
> So it will call aio_read() if cl_io_init() returns zero.

Er...  The question is the other way round, actually - will it return
zero when we arrive there from ll_file_aio_read()?  Or could it happen
that it returns something positive in that case?

> Let's take an example: say a file has two stripes, on OST0 and OST1, and the stripe size is 1M. When 2M of data is written to the file starting from zero, the above code will loop twice to finish the IO.
> The first iteration will write [0, 1M) to OST0, and the second will write [1M, 2M) to OST1. As a result, generic_file_aio_write() will be called twice.

_Twice_?  Oh, I see... ccc_io_advance(), right?  With iov_iter the whole
->cio_advance() thing would die, AFAICS.
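
(With an iov_iter the position state lives in the iterator itself, so
after each chunk the caller would simply do something like

	iov_iter_advance(&iter, bytes_done);	/* instead of ->cio_advance() */

and the per-layer advance callbacks go away.)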

> Yes. But as you have already noticed, the ppos in ll_file_io_generic() is NOT always &iocb->ki_pos. So it isn't totally wrong to have
> 
> "*ppos = io->u.ci_wr.wr.crw_pos;"
> 
> in ll_file_io_generic(); take a look at ll_file_splice_read().

->splice_read/->splice_write is another part of that story; eventually,
I hope to get rid of those guys, when we get polymorphic iov_iter sorted
out.  Short version of the story: turn iov_iter into a tagged structure with
variants; primitives working with it would check the tag to decide what to
do.  Think of *BSD struct uio on steroids; AFAIK, Solaris has kept it as
well.  One of the cases would be iovec-based, with or without splitting the
kernel-space iovec into separate case.  That's pretty much what *BSD one
provides and what iov_iter does right now.  Another case: array of
<struct page *page, size_t offset, size_t size> triples.  ->splice_write()
would simply set such an array for all non-empty pipe_buffers and pass
it to ->write_iter().  That gives a usable generic implementation, suitable
for pretty much all filesystems.  _Maybe_ it can even be taught to deal
with page stealing, in a way that would allow to kill ->splice_write() as
a method - I didn't get to the point where it would be easy to investigate.
->splice_read() ought to use another iov_iter case, where copy_page_to_iter()
would grab a reference to page and plain copy_to_iter() would allocate
a page if needed and copy over there.
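
Very roughly, the shape I mean - every name below invented purely for
illustration:

	/* hypothetical tagged iov_iter; primitives switch on ->type */
	struct page_triple {
		struct page *page;
		size_t offset;
		size_t size;
	};

	struct iov_iter {
		int type;		/* e.g. ITER_IOVEC, ITER_PAGES */
		size_t iov_offset;	/* offset into the first segment */
		size_t count;		/* bytes remaining */
		unsigned long nr_segs;
		union {
			const struct iovec *iov;	 /* iovec-based case */
			const struct page_triple *pages; /* <page,offset,size> case */
		};
	};

copy_page_to_iter() and plain copy_to_iter() would then do the right
thing based on the tag.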

Hell knows; I'm still dealing with preliminary messes and with splice_write
side of things.  We'll see what set of primitives will it shake down to.
So far it's been spread between several fsdevel threads (and some bits of
off-list mail when potentially nasty bugs got caught).  I hope to get enough
of the queue stable enough and large enough to make it worth discussing.
Hopefully today or tomorrow there'll be enough meat on it...  In any case
I'm going to bring it up on LSF/MM.


* Re: RFC: [PATCH] staging/lustre/llite: fix O_TMPFILE/O_LOV_DELAY_CREATE conflict
From: Xiong, Jinshan @ 2014-02-11 18:26 UTC (permalink / raw)
  To: Al Viro
  Cc: Dilger, Andreas, Christoph Hellwig, linux-fsdevel, Drokin, Oleg,
	Peng Tao, greg


On Feb 11, 2014, at 6:25 AM, Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Tue, Feb 11, 2014 at 06:55:59AM +0000, Xiong, Jinshan wrote:
>>> 4) cl_io_init() which calls
>>> 
>>> 5) cl_io_init0() which possibly calls a bunch of instances of
>>> ->coo_io_init(),
>>> which may or may not return 0; I _hope_ that in this case that's what'll
>>> happen and we get back to ll_file_io_generic(), where
>> 
>>> Umm...  So am I right about the thing returning 0 in this case and in
>>> case of ->aio_read()?
>> 
>> Yes, cl_io_init() can return one of the following integers:
>>  = 0: IO can go forward
>>  > 0: IO is not necessary
>>  < 0: error occurs
>> 
>> So it will call aio_read() if cl_io_init() returns zero.
> 
> Er...  The question is the other way round, actually - will it return
> zero when we arrive there from ll_file_aio_read()?  Or could it happen
> that it returns something positive in that case?

Ah sorry, I missed it.

For ->aio_read(), yes, mostly it will return zero so that the IO can go forward. It won't return a positive value in the read case.

It can also return -ENODATA for HSM (Hierarchical Storage Management), where the data has already been moved to secondary storage, so Lustre has to copy it back and then restart the IO.

> 
>> Let's take an example: say a file has two stripes, on OST0 and OST1, and the stripe size is 1M. When 2M of data is written to the file starting from zero, the above code will loop twice to finish the IO.
>> The first iteration will write [0, 1M) to OST0, and the second will write [1M, 2M) to OST1. As a result, generic_file_aio_write() will be called twice.
> 
> _Twice_?  Oh, I see... ccc_io_advance(), right?  With iov_iter the whole
> ->cio_advance() thing would die, AFAICS.

Yes, this will simplify it. This is a typical once-it-works-never-put-your-hands-on-it thing. I will create a patch for this.

> 
>> Yes. But as you have already noticed, the ppos in ll_file_io_generic() is NOT always &iocb->ki_pos. So it isn't totally wrong to have
>> 
>> "*ppos = io->u.ci_wr.wr.crw_pos;"
>> 
>> in ll_file_io_generic(); take a look at ll_file_splice_read().
> 
> ->splice_read/->splice_write is another part of that story; eventually,
> I hope to get rid of those guys, when we get polymorphic iov_iter sorted
> out.  Short version of the story: turn iov_iter into a tagged structure with
> variants; primitives working with it would check the tag to decide what to
> do.  Think of *BSD struct uio on steroids; AFAIK, Solaris has kept it as
> well.  One of the cases would be iovec-based, with or without splitting the
> kernel-space iovec into separate case.  That's pretty much what *BSD one
> provides and what iov_iter does right now.  Another case: array of
> <struct page *page, size_t offset, size_t size> triples.  ->splice_write()
> would simply set such an array for all non-empty pipe_buffers and pass
> it to ->write_iter().  That gives a usable generic implementation, suitable
> for pretty much all filesystems.  _Maybe_ it can even be taught to deal
> with page stealing, in a way that would allow to kill ->splice_write() as
> a method - I didn't get to the point where it would be easy to investigate.
> ->splice_read() ought to use another iov_iter case, where copy_page_to_iter()
> would grab a reference to page and plain copy_to_iter() would allocate
> a page if needed and copy over there.

So essentially, iov_iter{} will be extended to include enough information to perform all kinds of IO operations on file data. Therefore, generic_file_aio_{read,write} would become the only two interfaces to initiate file IO operations, with iov_iter{} as one of their parameters.

I will keep an eye on it, and once it's implemented I will revise the Lustre implementation correspondingly.

> 
> Hell knows; I'm still dealing with preliminary messes and with splice_write
> side of things.  We'll see what set of primitives will it shake down to.
> So far it's been spread between several fsdevel threads (and some bits of
> off-list mail when potentially nasty bugs got caught).  I hope to get enough
> of the queue stable enough and large enough to make it worth discussing.
> Hopefully today or tomorrow there'll be enough meat on it...  In any case
> I'm going to bring it up on LSF/MM.


