All of lore.kernel.org
 help / color / mirror / Atom feed
* updated orangefs tree at kernel.org
@ 2015-10-07 20:07 Mike Marshall
  2015-10-08 21:07 ` Al Viro
  0 siblings, 1 reply; 13+ messages in thread
From: Mike Marshall @ 2015-10-07 20:07 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel

Greetings Al and fs-devel...

The orangefs for-next tree over at kernel.org has
been updated. It is rebased to 4.3-rc3 and has
numerous new commits related to reviews received
during the merge window...

git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux.git
for-next

Martin Brandenburg, a Clemson student has worked on
several of the issues pointed out during the merge window
review, and Yi Liu, a Clemson sys-admin, is working on
a patch that changes most occurrences of pvfs to orangefs.
We hope to update our for-next branch one more time
during this rc series with Yi's changes, and probably
address an issue that the kbuild robots and other testers
have reported since the for-next branch was updated.
Arnd Bergmann even sent us a patch for the issue,
that will probably be in the next update.

Not only are more people helping and paying attention
now, but there are some lurkers out there who are
getting a lot of enjoyment out of the orangefs-related
fs-devel thread from the merge window where I got some
long emails from Linus with smiley-faces in them, and
even longer emails from Al with cuss-word acronyms
in them <g>...

-Mike

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: updated orangefs tree at kernel.org
  2015-10-07 20:07 updated orangefs tree at kernel.org Mike Marshall
@ 2015-10-08 21:07 ` Al Viro
  2015-10-08 23:03   ` Al Viro
  0 siblings, 1 reply; 13+ messages in thread
From: Al Viro @ 2015-10-08 21:07 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

	Something's fishy around pvfs2_inode_read().  Look:

* ->readpage (== read_one_page) is called in normal context; no set_fs()
in sight.

* read_one_page() does kmap() of the page it had been given.  Result is
a kernel pointer.  That pointer is (mis)annotated as userland one and
passed to pvfs2_inode_read().

* pvfs2_inode_read() shoves it into struct iovec and passes that to
wait_for_direct_io() (type == PVFS_IO_READ).  From there it gets
passed to postcopy_buffers(), where we do
                iov_iter_init(&iter, READ, vec, nr_segs, total_size);
                ret = pvfs_bufmap_copy_to_iovec(bufmap, 
                                                &iter,
                                                buffer_index);
which does copy_page_to_iter().

* after that iov_iter_init() we have iter.type equal to ITER_IOVEC | READ,
so copy_page_to_iter() uses __copy_to_user_inatomic() or falls back to
__copy_to_user().  Giving a kernel pointer to either is going to break on
any architecture where separate MMU contexts are used for kernel and userland.
On x86 we get away with that, but e.g. on sparc64 we won't.

More to the point, why do you bother with kmap() at all?  Just set a BVEC_ITER
over that struct page and let copy_page_to_iter() do the right thing when
it comes to it.  IOW, have iov_iter passed to pvfs2_inode_read() /
wait_for_direct_io() / postcopy_buffers() and be done with that - have
read_one_page() do
	struct iov_iter iter;
	struct bvec bv = {.bv_page = page, .bv_len = PAGE_SIZE, .bv_offset = 0};
	iov_iter_init_bvec(&iter, ITER_BVEC | READ, &bv, 1, PAGE_SIZE);

        max_block = ((inode->i_size / blocksize) + 1);

        if (page->index < max_block) {
                loff_t blockptr_offset = (((loff_t) page->index) << blockbits);

                bytes_read = pvfs2_inode_read(inode,
                                              &iter,
                                              blocksize,
                                              &blockptr_offset,
                                              inode->i_size);
	}
	/* will only zero remaining unread portions of the page data */
	iov_iter_zero(~0U, &iter);

and then proceed as you currently do, except that no kunmap() is needed
at all.

precopy_buffers() would also need to take iov_iter, and the other caller
of wait_for_direct_io() (i.e. do_readv_writev()) will need to pass it
an iov_iter as well.

Said that, do_readv_writev() also needs some taking care of.  Both callers
are giving it iter->iov (which, BTW, is only valid for IOV_ITER ones),
and then play very odd games with splitting iovecs if the total size exceeds
your pvfs_bufmap_size_query().  What for?

Sure, I understand not wanting to submit IO in chunks exceeding your limit.
But why do you need a separate iovec each time?  One thing that iov_iter
does is handling the copying to/from partially used iovecs.  copy_page_to_iter
and copy_page_from_iter are just fine with that.  IOW, why bother with
split_iovecs() at all?

wait_for_direct_io() uses 'vec' and 'nr_segs' only for passing them to
{pre,post}copy_buffers(), where they are used to only to initialize iov_iter.
And since copy_page_{to,from}_iter() advances the iterator, there's no
need to reinitialize the damn thing at all.  Just pass the amount to copy
as an explicit argument, rather than using iov_iter_count() in
pvfs_bufmap_copy_to_iovec().

Am I missing anything subtle in there?  Looks like do_readv_writev() would
get much simpler from that *and* ->read_iter()/->write_iter() will work
on arbitrary iov_iter after that...

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: updated orangefs tree at kernel.org
  2015-10-08 21:07 ` Al Viro
@ 2015-10-08 23:03   ` Al Viro
  2015-10-09  3:41     ` Al Viro
  2015-10-10  2:31     ` Al Viro
  0 siblings, 2 replies; 13+ messages in thread
From: Al Viro @ 2015-10-08 23:03 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

On Thu, Oct 08, 2015 at 10:07:45PM +0100, Al Viro wrote:

> Am I missing anything subtle in there?  Looks like do_readv_writev() would
> get much simpler from that *and* ->read_iter()/->write_iter() will work
> on arbitrary iov_iter after that...

What I have in mind is something like (*ABSOLUTELY* *UNTESTED*)
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git orangefs-untested

That's 8 commits on top of your #for-next:
      orangefs: explicitly pass the size to pvfs_bufmap_copy_to_iovec()
      pvfs_bufmap_copy_from_iovec(): don't rely upon size being equal to iov_iter_count(iter)
      orangefs: make postcopy_buffers() take iov_iter
      orangefs: make precopy_buffers() take iov_iter
      orangefs: make wait_for_direct_io() take iov_iter
      orangefs: don't bother with splitting iovecs
      orangefs: make do_readv_writev() take iov_iter
      orangefs: make pvfs2_inode_read() take iov_iter
 fs/orangefs/file.c         | 344 +++++++-------------------------------------------------------------------------------
 fs/orangefs/inode.c        |  17 ++---
 fs/orangefs/pvfs2-bufmap.c |  52 ++++++-------
 fs/orangefs/pvfs2-bufmap.h |   3 +-
 fs/orangefs/pvfs2-kernel.h |   3 +-
 5 files changed, 61 insertions(+), 358 deletions(-)

Might take care of iov_iter-related issues (and get rid of arseloads of manual
messing with iovecs, while we are at it).

Again, this is really, really untested - might chew your data, etc.  It
compiles, but that's it.  I easily might've missed something something
subtle (or trivial, for that matter).

And I haven't looked at the rest yet - some bugs I've mentioned are definitely
still there (e.g. bogus iput() on failure of d_make_root() - it's already
done by d_make_root() itself in that case, so what you are doing is double
iput(); just remove it and be done with that), but I hadn't checked the
whole list.  Will post more later tonight...

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: updated orangefs tree at kernel.org
  2015-10-08 23:03   ` Al Viro
@ 2015-10-09  3:41     ` Al Viro
  2015-10-09  6:13       ` Al Viro
  2015-10-09 19:22       ` Al Viro
  2015-10-10  2:31     ` Al Viro
  1 sibling, 2 replies; 13+ messages in thread
From: Al Viro @ 2015-10-09  3:41 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

On Fri, Oct 09, 2015 at 12:03:33AM +0100, Al Viro wrote:

> And I haven't looked at the rest yet - some bugs I've mentioned are definitely
> still there (e.g. bogus iput() on failure of d_make_root() - it's already
> done by d_make_root() itself in that case, so what you are doing is double
> iput(); just remove it and be done with that), but I hadn't checked the
> whole list.  Will post more later tonight...

Directories:

1) I don't understand what happens to position in directory (ctx->pos).
AFAICS, you are reading directories in moderate chunks (up to 96
entries?) and use a token stored in *(file->private_data) to tell which
chunk it is.  So far, so good, but...  AFAICS ctx->pos is reset to 0
on the beginning of each such chunk.  Am I right assuming that looping
through a directory with readdir(3) and watching for return values of
telldir(3) will yield repeating offsets every time you go into the next
chunk?

2) More seriously, rewinddir(3) *must* rewind the directory.  With all
Linux libc variants it means lseek() to 0; AFAICS, orangefs directories
won't do it properly - all lseek attempts flat-out fail with ESPIPE.
Userland won't be happy...

3) ->pvfs_dirent_outcount is u32 and it comes from server with no validation
attempts ever made.  Now, look at
        readdir->dirent_array = kmalloc(readdir->pvfs_dirent_outcount *
                                        sizeof(*readdir->dirent_array),
in decode_dirents().  ->dirent_array is a structure consisting of pointer,
int and a 16-byte 64bit-aligned array.  IOW, it's size is greater than 1
and this product might wrap around.  You really should use kcalloc() there.

What's more, the loop parsing the response body is 
        for (i = 0; i < readdir->pvfs_dirent_outcount; i++) {
                dec_string(pptr, &readdir->dirent_array[i].d_name,
                           &readdir->dirent_array[i].d_length);
                readdir->dirent_array[i].khandle =
                        *(struct pvfs2_khandle *) *pptr;
                *pptr += 16;
        }
where dec_string(pptr, pbuf, plen) is defined as
        __u32 len = (*(__u32 *) *(pptr)); \
        *pbuf = *(pptr) + 4; \
        *(pptr) += roundup8(4 + len + 1); \
        if (plen) \
                *plen = len;\

Note that we do *not* validate len, so this code might end up dereferencing
any address within +-2Gb from the buffer.  You really can't assume that
server is neither insane nor compromised; this is a blatant security hole.

And please, drop these macros - you are not using enc_string() at all while
dec_string() is used only in decode_dirents() and would be easier to
understand if it had been spelled out right there.  Especially since you
need to add sanity checks (and buggering off should they fail) to it.

4)
	if (pos == 0) {
                ino = get_ino_from_khandle(dentry->d_inode);
                gossip_debug(GOSSIP_DIR_DEBUG,
                             "%s: calling dir_emit of \".\" with pos = %llu\n",
                             __func__,
                             llu(pos));
                ret = dir_emit(ctx, ".", 1, ino, DT_DIR);
                pos += 1;
        }
in pvfs2_readdir() is bloody odd.  What's wrong with emit_dot(), or,
for that matter, dir_emit_dots() to cover both that and .. right after
it.  You *are* setting ->i_ino to the same hash of khandle you are
using here anyway, so why not use the standard helpers?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: updated orangefs tree at kernel.org
  2015-10-09  3:41     ` Al Viro
@ 2015-10-09  6:13       ` Al Viro
  2015-10-09 19:22       ` Al Viro
  1 sibling, 0 replies; 13+ messages in thread
From: Al Viro @ 2015-10-09  6:13 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

	Superblock handling:

1) aside of the pointlessness of struct pvfs2_mount_sb_info_s, now that
pvfs2_fill_sb() is called directly and isn't constrained to passing just
one pointer argument, you are mishandling its failures.  Note that
mount_nodev() follows a failure of callback with deactivate_locked_super(s);
pvfs2_mount() does not.  It simply ends up leaking a struct super_block in
such case.

2) ->kill_sb() is called for everything that had been created by sget().
IOW, your pvfs2_kill_sb() can be called for something that never got past
the attempt to allocate ->s_fs_info.  You seem to assume that it's only
called for fully set up superblock.

3) the question about protection of pvfs2_superblocks in
dispatch_ioctl_command() remains - handling of PVFS_DEV_REMOUNT_ALL
loops through that list with no locks in common with the call chain
leading through ->kill_sb() to remove_pvfs2_sb().

4) ditto for pvfs2_remount() vs. pvfs2_unmount_sb() - is it OK to have
the former called while the latter is running?  I don't see anything that
would provide an exclusion here.

5) are you sure that pvfs2_unmount_sb() should be done *before*
kill_anon_super()?  At that point we still might have dirty inodes
in cache, etc.  I don't know the protocol - can't tell if that's really
OK.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: updated orangefs tree at kernel.org
  2015-10-09  3:41     ` Al Viro
  2015-10-09  6:13       ` Al Viro
@ 2015-10-09 19:22       ` Al Viro
  1 sibling, 0 replies; 13+ messages in thread
From: Al Viro @ 2015-10-09 19:22 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

On Fri, Oct 09, 2015 at 04:41:26AM +0100, Al Viro wrote:

> where dec_string(pptr, pbuf, plen) is defined as
>         __u32 len = (*(__u32 *) *(pptr)); \
>         *pbuf = *(pptr) + 4; \
>         *(pptr) += roundup8(4 + len + 1); \
>         if (plen) \
>                 *plen = len;\
> 
> Note that we do *not* validate len, so this code might end up dereferencing
> any address within +-2Gb from the buffer.  You really can't assume that
> server is neither insane nor compromised; this is a blatant security hole.
> 
> And please, drop these macros - you are not using enc_string() at all while
> dec_string() is used only in decode_dirents() and would be easier to
> understand if it had been spelled out right there.  Especially since you
> need to add sanity checks (and buggering off should they fail) to it.

Actually, validation is bloody weak in pvfs2_devreq_writev() as well.
And frankly, the layout it's expecting is downright insane.

What you have is some initial segment of
	* 32bit protocol version.  Never checked, which renders it useless.
	* 32bit magic.  That one _is_ checked, so basically they work
together as 64bit protocol version, with bloody odd validation.
	* 64bit tag, used to match with requests
	* 32bit type, apparently never checked - that of the matching request is
used
	* 32bit status
	* 64bit trailer_size, apparently used only with readdir, so
probably non-zero only if type is PVFS2_VFS_OP_READDIR or
PVFS2_VFS_OP_READDIRPLUS (the latter doesn't seem to be ever issued, though).
	* pointer-sized junk.  Ignored.
	* big fat union *NOT* including anything for readdir
possibly followed by readdir response.

To make things even funnier, you require 4- or 5-element iovec array,
the first 4 covering the aforementioned "some initial segment".  The 5th
is quietly ignored unless trailer_size is positive and status is zero.
If trailer_size > 0 && status == 0, you verify that the length of the 5th
segment is no more than trailer_size and copy it to vmalloc'ed buffer.
Without bothering to zero the rest of that buffer out.

The member of that big fat union for getattr has a bit of additional insanity -
several pointer-sized gaps.  Entirely unused.

Far be it from me to support The War On Some Drugs, but really, staying away
from the stuff that induces trips _that_ bad is just plain common sense...

First of all, this "exactly 4 or 5 iovecs in array" thing is beyond ugly - 
it's a way to separate large variable-length segment, all right, but why
the hell not just declare e.g. that if request is at least 32 bytes long
and has trailer_size > 0 && status == 0, trailer_size must be no more than
request size - 32 and that much bytes in the end will go into vmalloc'ed
buffer?  Everything else must fit into MAX_ALIGNED_DEV_REQ_DOWNSIZE.
AFAICS, that would be compatible with what your server is doing now.

Next, and that can't be fixed without a protocol change, I'd suggest losing
those pointer-sized gaps (and probably reordering struct PVFS_sys_attr_s
to make sure that all u64 are naturally aligned there).  Why make your
protocol wordsize-sensitive, when it's trivial to avoid?

In any case, the current code is asking for serious trouble if the 5th segment
is there and _shorter_ than trailer_size.  As the absolute minimum we need to
zero the rest of vmalloc'ed buffer, and I seriously suspect that we need
to outright reject such requests.  And I would really prefer to get rid of
this "exactly 4 or 5" thing as well (see above)...

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: updated orangefs tree at kernel.org
  2015-10-08 23:03   ` Al Viro
  2015-10-09  3:41     ` Al Viro
@ 2015-10-10  2:31     ` Al Viro
  2015-10-10 20:23       ` Al Viro
  2015-10-10 23:10       ` orangefs: NAK until the ABI is documented (was Re: updated orangefs tree at kernel.org) Al Viro
  1 sibling, 2 replies; 13+ messages in thread
From: Al Viro @ 2015-10-10  2:31 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

On Fri, Oct 09, 2015 at 12:03:33AM +0100, Al Viro wrote:

> What I have in mind is something like (*ABSOLUTELY* *UNTESTED*)
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git orangefs-untested

OK, I've thrown several more fixes into that branch, but that's the
low-hanging fruit.  The really interesting questions I see right now
are:
	* it needs some form of lseek() for directories - currently absent,
not even rewind to the beginning.  Userland won't be happy
	* races around the umount - I don't understand the protocol well
enough to fix those, but this area looks fishy as hell
	* racing copy_attributes_to_inode() and the way it mangles the
metadata of live inodes.  ->i_mode updates in particular are seriously
broken, so are updates of symlink bodies, etc.
	* truncation of long symlinks and the lack of NUL-termination in
there.
	* atrocious userland API for downcalls (response to everything
other than readdir must come in 4-element iovec array passed to writev(),
no matter where the segment boundaries are; response to readdir must come
in 5-element iovec array passed to writev(), the boundaries among the
first 4 segments do not matter, the 5th segment covering exactly the payload).
IMO that needs to be fixed before we merge the damn thing - it's really too
ugly to live.  I would really like to hear from somebody familiar with the
userland side - what responses does it actually submit?  The validation
kernel-side is basically inexistent, and while I suspect that we could handle
the actually sent responses much cleaner, the full set of everything that
would be accepted by the current code is a bloody mess and would be much
harder to handle in a clean way.  What's more, the response layouts are
messy, and if it is still possible to change that API, we'd be much better off
if we cleaned them up.
	* for some reason the lookup request tells the server whether there
was LOOKUP_FOLLOW in flags.  This is really odd - no other filesystem cares
about that bit and until now its presence in ->lookup() flags had been
basically at convenience of fs/namei.c; it doesn't match anything useful
and I'm very surprised by seeing orangefs pass it to server.  LOOKUP_FOLLOW is
almost certainly a wrong approximation to whatever orangefs is trying to
achieve.  What is it about?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: updated orangefs tree at kernel.org
  2015-10-10  2:31     ` Al Viro
@ 2015-10-10 20:23       ` Al Viro
  2015-10-10 23:10       ` orangefs: NAK until the ABI is documented (was Re: updated orangefs tree at kernel.org) Al Viro
  1 sibling, 0 replies; 13+ messages in thread
From: Al Viro @ 2015-10-10 20:23 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

May I politely inquire about the intended meaning of the following
definition?
#define MAX_ALIGNED_DEV_REQ_DOWNSIZE                            \
                (MAX_DEV_REQ_DOWNSIZE +                         \
                        ((((MAX_DEV_REQ_DOWNSIZE /              \
                                (BITS_PER_LONG_DIV_8)) *        \
                                (BITS_PER_LONG_DIV_8)) +        \
                            (BITS_PER_LONG_DIV_8)) -            \
                        MAX_DEV_REQ_DOWNSIZE))

1) (x + (y - x)) is more commonly spelled as y
2) ((x / y) * y + y) might not be doing what you think it is doing
3) aligning the thing to multiple of sizeof(long) is an odd thing
to do, seeing that the value being aligned is 16 + sizeof
of a structure that contains pointers (which is an atrocity in its
own right, BTW).
4) on the related note, what is the meaning of
        static int max_downsize = MAX_ALIGNED_DEV_REQ_DOWNSIZE;
        int ret = 0, num_remaining = max_downsize;
(with no code other code touching that max_downsize thing)?
5) having collected the contents of the first 4 segments of your writev()
argument and having verified that their total size does not exceed
MAX_ALIGNED_DEV_REQ_DOWNSIZE, you proceed to do this:
                payload_size -= (2 * sizeof(__s32) + sizeof(__u64));
                if (payload_size <= sizeof(struct pvfs2_downcall_s))
                        /* copy the passed in downcall into the op */
                        memcpy(&op->downcall,
                               ptr,
                               sizeof(struct pvfs2_downcall_s));
                else
                        gossip_debug(GOSSIP_DEV_DEBUG,
                                     "writev: Ignoring %d bytes\n",
                                     payload_size);
upon the entry payload_size contains the amount of data collected.  What is the
intended meaning of the 'else' branch in there?  Note that it *is* possible
to hit, exactly because MAX_ALIGNED_DEV_REQ_DOWNSIZE is sizeof(long)
greater than 16 + sizeof(struct pvfs2_downcall_s).  What should happen if we
do hit it?  I do understand the "we shall whine" part, but then what?  Leave
the op->downcall uninitialized and proceed with whatever might've been there?

That code does marshalling of userland-supplied data.  Worse, the ABI is both
undocumented and utterly... I believe "idiosyncratic" would be a printable
term for that kind of thing.  Sigh...

^ permalink raw reply	[flat|nested] 13+ messages in thread

* orangefs: NAK until the ABI is documented (was Re: updated orangefs tree at kernel.org)
  2015-10-10  2:31     ` Al Viro
  2015-10-10 20:23       ` Al Viro
@ 2015-10-10 23:10       ` Al Viro
  2015-10-12 16:20         ` Mike Marshall
  1 sibling, 1 reply; 13+ messages in thread
From: Al Viro @ 2015-10-10 23:10 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

On Sat, Oct 10, 2015 at 03:31:57AM +0100, Al Viro wrote:
> 	* atrocious userland API for downcalls (response to everything
> other than readdir must come in 4-element iovec array passed to writev(),
> no matter where the segment boundaries are; response to readdir must come
> in 5-element iovec array passed to writev(), the boundaries among the
> first 4 segments do not matter, the 5th segment covering exactly the payload).
> IMO that needs to be fixed before we merge the damn thing - it's really too
> ugly to live.  I would really like to hear from somebody familiar with the
> userland side - what responses does it actually submit?  The validation
> kernel-side is basically inexistent, and while I suspect that we could handle
> the actually sent responses much cleaner, the full set of everything that
> would be accepted by the current code is a bloody mess and would be much
> harder to handle in a clean way.  What's more, the response layouts are
> messy, and if it is still possible to change that API, we'd be much better off
> if we cleaned them up.

Another API bogosity: when the 5th segment is present, successful writev()
returns the sum of sizes of the first 4.  Userland side just ignores that -
everything positive is treated as "everything's written".

Another piece of fun: header too large by less than sizeof(long) => print
a warning, leave op->downcall completely uninitialized and proceed.

Below is what the kernel side is actually doing:
	* less than 4 segments or more than 5 => whine and fail
	* if concatenation of the first 4 segments is longer than
16 + sizeof(struct pvfs2_downcall_s) + sizeof(long) => whine and fail
	* if concatenation of the first 4 segments is longer than
16 + sizeof(struct pvfs2_downcall_s) by no more than sizeof(long) => whine
and proceed with garbage.
	* if concatenation of the first 4 segments is no longer than
16 + sizeof(struct pvfs2_downcall_s), it is zero-padded to that size,
then the first 16 bytes are interpreted as 32bit protocol version (ignored)
+ 32bit magic (checked for one specific value) + 64bit tag (used to find the
matching request).  The rest is copied into op->downcall.
	* if the 32bit value 4 bytes into op->downcall is zero and 64bit
value following it is non-zero, the latter is interpreted as the size of
trailer data.
	* if there's no trailer, the 5th segment (if present) is completely
ignored.  Otherwise it must be present and is expected to contain the trailer
data.
	* if the 5th segment is longer than expected trailer data => whine
and fail (note that if the amount of expected trailer data is zero a non-empty
5th segment is quietly ignored).
	* if trailer expected, vmalloc a buffer for it and copy the 5th
segment contents in there; in case of the 5th segment being *shorter* than
expected trailer data, leave the rest of the buffer uninitialized (i.e.
expose the contents of random previously freed pages).
	* if vmalloc fails, act as if status (32bit at offset 5 into
op->downcall) had been -ENOMEM and don't look at the 5th segment at all.
	* in all cases ignore the amount of data taken from the 5th segment;
the return value is the sum of sizes of the first 4.

	What the current userland side appears to sends is 4-segment
version/magic/tag/struct pvfs2_downcall_s + possibly the 5th segment covering
the readdir results.  Trailer size in struct pvfs2_downcall_s actually is
equal to the size of the 5th segment if present and 0 otherwise.
	The 4th segment (struct pvfs2_downcall_s) contains a bunch of garbage
never used by the kernel (pointers and misalignment gaps).
	How much of that is guaranteed is, of course, known only to the
orangefs folks.  I only looked at the current version of the userland code;
hadn't checked the older ones and have no idea what kind of changes might
be planned.

	Folks, userland interfaces need to be documented (and preferably -
contain less ugliness).  At that point I think that we need the damn API
written up and reviewed, or it's a strong NAK.  Once it's in the tree, it's
cast in stone, so we'd better get it right and we need it explicitly
documented.  "Whatever orangefs userland code does" doesn't cut it, especially
since your protocol version check is inexistent.  Digging through the entire
history of your userland code and trying to deduce what would and what would
not be OK to change kernel-side is far past reasonable.

	I'm not asking for the description of semantics for individual requests
and responses (it would be nice to have as well, but that's a separate story),
but the rules for data marshalling are really needed.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: orangefs: NAK until the ABI is documented (was Re: updated orangefs tree at kernel.org)
  2015-10-10 23:10       ` orangefs: NAK until the ABI is documented (was Re: updated orangefs tree at kernel.org) Al Viro
@ 2015-10-12 16:20         ` Mike Marshall
  2015-10-12 19:16           ` Al Viro
  0 siblings, 1 reply; 13+ messages in thread
From: Mike Marshall @ 2015-10-12 16:20 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel

 > ... the ABI is ... utterly... "idiosyncratic"...

Al, do you mean idiosyncratic in a "distinctive and special" way,
or a "strange and bizarre" way? <g>

When I first started working with the kernel module, I grew
to think of the service_operation as the event horizon to
a black hole, but I figured that was just me...

First off, Al, I want to thank you for spending so much
time on the review. You have pointed out numerous ways
to make the kernel module better. Most importantly, though,
I believe you are saying (and I'm purposely restating
your words here to make sure we're on the same page) that there
will be no ack until we provide a usable detailed definition of
what is in every upcall and downcall and ioctl between
the kernel module and the userspace daemon. Not only do we need,
for example, to specify how LOOKUP_FOLLOW needs to be provided
in a lookup request, but it would be nice to also say why.

You're not promising an ack in return, but we can't go further
until we provide the definition.

 > Once [the userspace interface is] in the tree, it's cast in
 > stone, so we'd better get it right and we need it explicitly
 > documented.

Along these lines, we've done some well-intentioned "thrashing
around" (not Linus' favorite development model) towards ensuring
that an upstream kernel module will continue to work into the
future. The "khandle" code is an attempt to ensure that things
will continue to work when userspace transitions to 128 bit
uuids for handles, instead of the current 64 bit handles - we
use handles in the kernel module as inode numbers. And I added
an ioctl that allows the kernel module to become aware of
userspace's ever changing list of debug modes when the
userspace daemon is first started. Our intent to keep the
upstream kernel module working with userspace, current and future,
is also reflected in the upstream kernel modules' handling of
the the protocol version check - if userspace sees that the
kernel module is the upstream version then charge ahead because
we've planned not to change the protocol.

Anyhow, Martin and I plan to work on the protocol definition,
which may/might/probably-will lead to changes in the code to make
the protocol more sane. This will probably lead to us missing the
next merge window. And, as we proceed, we might ask for
feedback on what we are doing from Al and fs-devel...

Al, does this sound like how you think we should focus our
energy at this point?

-Mike

On Sat, Oct 10, 2015 at 7:10 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sat, Oct 10, 2015 at 03:31:57AM +0100, Al Viro wrote:
>>       * atrocious userland API for downcalls (response to everything
>> other than readdir must come in 4-element iovec array passed to writev(),
>> no matter where the segment boundaries are; response to readdir must come
>> in 5-element iovec array passed to writev(), the boundaries among the
>> first 4 segments do not matter, the 5th segment covering exactly the payload).
>> IMO that needs to be fixed before we merge the damn thing - it's really too
>> ugly to live.  I would really like to hear from somebody familiar with the
>> userland side - what responses does it actually submit?  The validation
>> kernel-side is basically inexistent, and while I suspect that we could handle
>> the actually sent responses much cleaner, the full set of everything that
>> would be accepted by the current code is a bloody mess and would be much
>> harder to handle in a clean way.  What's more, the response layouts are
>> messy, and if it is still possible to change that API, we'd be much better off
>> if we cleaned them up.
>
> Another API bogosity: when the 5th segment is present, successful writev()
> returns the sum of sizes of the first 4.  Userland side just ignores that -
> everything positive is treated as "everything's written".
>
> Another piece of fun: header too large by less than sizeof(long) => print
> a warning, leave op->downcall completely uninitialized and proceed.
>
> Below is what the kernel side is actually doing:
>         * less than 4 segments or more than 5 => whine and fail
>         * if concatenation of the first 4 segments is longer than
> 16 + sizeof(struct pvfs2_downcall_s) + sizeof(long) => whine and fail
>         * if concatenation of the first 4 segments is longer than
> 16 + sizeof(struct pvfs2_downcall_s) by no more than sizeof(long) => whine
> and proceed with garbage.
>         * if concatenation of the first 4 segments is no longer than
> 16 + sizeof(struct pvfs2_downcall_s), it is zero-padded to that size,
> then the first 16 bytes are interpreted as 32bit protocol version (ignored)
> + 32bit magic (checked for one specific value) + 64bit tag (used to find the
> matching request).  The rest is copied into op->downcall.
>         * if the 32bit value 4 bytes into op->downcall is zero and 64bit
> value following it is non-zero, the latter is interpreted as the size of
> trailer data.
>         * if there's no trailer, the 5th segment (if present) is completely
> ignored.  Otherwise it must be present and is expected to contain the trailer
> data.
>         * if the 5th segment is longer than expected trailer data => whine
> and fail (note that if the amount of expected trailer data is zero a non-empty
> 5th segment is quietly ignored).
>         * if trailer expected, vmalloc a buffer for it and copy the 5th
> segment contents in there; in case of the 5th segment being *shorter* than
> expected trailer data, leave the rest of the buffer uninitialized (i.e.
> expose the contents of random previously freed pages).
>         * if vmalloc fails, act as if status (32bit at offset 5 into
> op->downcall) had been -ENOMEM and don't look at the 5th segment at all.
>         * in all cases ignore the amount of data taken from the 5th segment;
> the return value is the sum of sizes of the first 4.
>
>         What the current userland side appears to sends is 4-segment
> version/magic/tag/struct pvfs2_downcall_s + possibly the 5th segment covering
> the readdir results.  Trailer size in struct pvfs2_downcall_s actually is
> equal to the size of the 5th segment if present and 0 otherwise.
>         The 4th segment (struct pvfs2_downcall_s) contains a bunch of garbage
> never used by the kernel (pointers and misalignment gaps).
>         How much of that is guaranteed is, of course, known only to the
> orangefs folks.  I only looked at the current version of the userland code;
> hadn't checked the older ones and have no idea what kind of changes might
> be planned.
>
>         Folks, userland interfaces need to be documented (and preferably -
> contain less ugliness).  At that point I think that we need the damn API
> written up and reviewed, or it's a strong NAK.  Once it's in the tree, it's
> cast in stone, so we'd better get it right and we need it explicitly
> documented.  "Whatever orangefs userland code does" doesn't cut it, especially
> since your protocol version check is inexistent.  Digging through the entire
> history of your userland code and trying to deduce what would and what would
> not be OK to change kernel-side is far past reasonable.
>
>         I'm not asking for the description of semantics for individual requests
> and responses (it would be nice to have as well, but that's a separate story),
> but the rules for data marshalling are really needed.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: orangefs: NAK until the ABI is documented (was Re: updated orangefs tree at kernel.org)
  2015-10-12 16:20         ` Mike Marshall
@ 2015-10-12 19:16           ` Al Viro
  2015-10-13 17:46             ` Mike Marshall
  0 siblings, 1 reply; 13+ messages in thread
From: Al Viro @ 2015-10-12 19:16 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

On Mon, Oct 12, 2015 at 12:20:42PM -0400, Mike Marshall wrote:
>  > ... the ABI is ... utterly... "idiosyncratic"...
> 
> Al, do you mean idiosyncratic in a "distinctive and special" way,
> or a "strange and bizarre" way? <g>

The latter, but much stronger ;-)

> When I first started working with the kernel module, I grew
> to think of the service_operation as the event horizon to
> a black hole, but I figured that was just me...

*snort*

Anything you get back is, by definition, from before the event horizon...

> First off, Al, I want to thank you for spending so much
> time on the review. You have pointed out numerous ways
> to make the kernel module better. Most importantly, though,
> I believe you are saying (and I'm purposely restating
> your words here to make sure we're on the same page) that there
> will be no ack until we provide a usable detailed definition of
> what is in every upcall and downcall and ioctl between
> the kernel module and the userspace daemon. Not only do we need,
> for example, to specify how LOOKUP_FOLLOW needs to be provided
> in a lookup request, but it would be nice to also say why.

LOOKUP_FOLLOW is a separate story - the issue here is that you are
exposing internal details of the pathname resolution to the userland
and that's essentially a demand to keep them stable from now on; however,
I'm seriously at loss as to _what_ to keep stable.  "This call of ->lookup()
gets LOOKUP_FOLLOW in flags" doesn't map to any natural semantics I can
think of; it might accidentally happen to match something useful, but I
don't see what it is.

Your userland code does something rather odd with it, but it's buried inside
your state machine and I don't understand it well enough to tell what's going
on.  Is it trying to follow symlinks on its own upon lookup?  If so, it's
_very_ bogus.  If nothing else, userland code doesn't have enough information
to do that - it has no idea what the originating process is chrooted to,
for starters.  As the result, having your filesystem mounted inside a chroot
jail would open a nice way for non-priveleged process to break out - just
create an absolute symlink somewhere on that filesystem and go through it.
Again, I'm not sure what that userland code is doing, so it (hopefully)
might be something completely different.

So yes, I really do want details on this one, but for a different reason.

What I want from API documentation is something along the lines of "writev()
on /dev/pvfs2-req is used to pass responses to the requests made by the
kernel side; the data structure being passed is represented in the following
way <...>; representations do not statisfying constraints <...> are to be
rejected".

What you do with the payload _after_ it's been transferred is a separate
story; it's also nice to have, but right now we don't even have the calling
conventions, nevermind the description of behaviour.

How is one supposed to review the marshalling code?  Sure, I can tell that
if you have the iov[4].iov_len greater than .trailer_size, you end up with
uninitialized tail of vmalloc'ed buffer.  I can also see that your userland
code doesn't ever pass a positive .trailer_size with iov[4].iov_len different
from it.  What's the proper fix?  To declare that they must always be equal,
and reject mismatches, as you already do for inequality in other direction?
Or should we quietly zero-pad the buffer?  Either will work for your current
userland code, but only you can tell which is the right fix - I don't know
what the older versions used to do and I don't know what changes are planned.

And yes, I understand the desire to separate a potentially large payload
kernel-side; is there any deep reason why one doesn't simply use .trailer_size
to decide how much in the end of the area being transferred should go there?

Are there any plans for having the first 4 segments shorter than the full
16 + sizeof(struct pvfs2_downcall_s)?  If so, what's the minimal amount to
be expected in all cases?  I'm not nitpicking here - having e.g. short
packets ending at (non-zero) .status would work with the current kernel
code and isn't inconceivable as a planned change.

For that matter, what's .type for?  It looks like a selector for that
union you have in the end, but it is not used that way - originating
upcall is used for that instead.  Is it just a junk, or is it a currently
unimplemented sanity check (if .type in downcall must be the same as in
matching upcall)?

> You're not promising an ack in return, but we can't go further
> until we provide the definition.

*shrug*

Sure, I can try and pull it out of you guys bit by bit instead (see above
for some initial questions), but I suspect that it would be faster and less
painful for everyone involved if you just describe the calling conventions
and we'll see what might be missing.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: orangefs: NAK until the ABI is documented (was Re: updated orangefs tree at kernel.org)
  2015-10-12 19:16           ` Al Viro
@ 2015-10-13 17:46             ` Mike Marshall
  2015-10-13 23:34               ` Al Viro
  0 siblings, 1 reply; 13+ messages in thread
From: Mike Marshall @ 2015-10-13 17:46 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel

 > *shrug*

Al, I guess what I wrote sounded pi**y, but I promise my attitude
is 180 degrees away from that... I'm just trying to focus on what is
most important... you've identified numerous issues, all of which
need to be addressed, but the state of the API used by userspace
is a show stopper: it needs documented, and then probably improved...

When we have something coherent started, we'll send it your way
so you can help make sure we're on the right track...

Thanks!

-Mike

On Mon, Oct 12, 2015 at 3:16 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Oct 12, 2015 at 12:20:42PM -0400, Mike Marshall wrote:
>>  > ... the ABI is ... utterly... "idiosyncratic"...
>>
>> Al, do you mean idiosyncratic in a "distinctive and special" way,
>> or a "strange and bizarre" way? <g>
>
> The latter, but much stronger ;-)
>
>> When I first started working with the kernel module, I grew
>> to think of the service_operation as the event horizon to
>> a black hole, but I figured that was just me...
>
> *snort*
>
> Anything you get back is, by definition, from before the event horizon...
>
>> First off, Al, I want to thank you for spending so much
>> time on the review. You have pointed out numerous ways
>> to make the kernel module better. Most importantly, though,
>> I believe you are saying (and I'm purposely restating
>> your words here to make sure we're on the same page) that there
>> will be no ack until we provide a usable detailed definition of
>> what is in every upcall and downcall and ioctl between
>> the kernel module and the userspace daemon. Not only do we need,
>> for example, to specify how LOOKUP_FOLLOW needs to be provided
>> in a lookup request, but it would be nice to also say why.
>
> LOOKUP_FOLLOW is a separate story - the issue here is that you are
> exposing internal details of the pathname resolution to the userland
> and that's essentially a demand to keep them stable from now on; however,
> I'm seriously at loss as to _what_ to keep stable.  "This call of ->lookup()
> gets LOOKUP_FOLLOW in flags" doesn't map to any natural semantics I can
> think of; it might accidentally happen to match something useful, but I
> don't see what it is.
>
> Your userland code does something rather odd with it, but it's buried inside
> your state machine and I don't understand it well enough to tell what's going
> on.  Is it trying to follow symlinks on its own upon lookup?  If so, it's
> _very_ bogus.  If nothing else, userland code doesn't have enough information
> to do that - it has no idea what the originating process is chrooted to,
> for starters.  As the result, having your filesystem mounted inside a chroot
> jail would open a nice way for non-priveleged process to break out - just
> create an absolute symlink somewhere on that filesystem and go through it.
> Again, I'm not sure what that userland code is doing, so it (hopefully)
> might be something completely different.
>
> So yes, I really do want details on this one, but for a different reason.
>
> What I want from API documentation is something along the lines of "writev()
> on /dev/pvfs2-req is used to pass responses to the requests made by the
> kernel side; the data structure being passed is represented in the following
> way <...>; representations do not statisfying constraints <...> are to be
> rejected".
>
> What you do with the payload _after_ it's been transferred is a separate
> story; it's also nice to have, but right now we don't even have the calling
> conventions, nevermind the description of behaviour.
>
> How is one supposed to review the marshalling code?  Sure, I can tell that
> if you have the iov[4].iov_len greater than .trailer_size, you end up with
> uninitialized tail of vmalloc'ed buffer.  I can also see that your userland
> code doesn't ever pass a positive .trailer_size with iov[4].iov_len different
> from it.  What's the proper fix?  To declare that they must always be equal,
> and reject mismatches, as you already do for inequality in other direction?
> Or should we quietly zero-pad the buffer?  Either will work for your current
> userland code, but only you can tell which is the right fix - I don't know
> what the older versions used to do and I don't know what changes are planned.
>
> And yes, I understand the desire to separate a potentially large payload
> kernel-side; is there any deep reason why one doesn't simply use .trailer_size
> to decide how much in the end of the area being transferred should go there?
>
> Are there any plans for having the first 4 segments shorter than the full
> 16 + sizeof(struct pvfs2_downcall_s)?  If so, what's the minimal amount to
> be expected in all cases?  I'm not nitpicking here - having e.g. short
> packets ending at (non-zero) .status would work with the current kernel
> code and isn't inconceivable as a planned change.
>
> For that matter, what's .type for?  It looks like a selector for that
> union you have in the end, but it is not used that way - originating
> upcall is used for that instead.  Is it just a junk, or is it a currently
> unimplemented sanity check (if .type in downcall must be the same as in
> matching upcall)?
>
>> You're not promising an ack in return, but we can't go further
>> until we provide the definition.
>
> *shrug*
>
> Sure, I can try and pull it out of you guys bit by bit instead (see above
> for some initial questions), but I suspect that it would be faster and less
> painful for everyone involved if you just describe the calling conventions
> and we'll see what might be missing.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: orangefs: NAK until the ABI is documented (was Re: updated orangefs tree at kernel.org)
  2015-10-13 17:46             ` Mike Marshall
@ 2015-10-13 23:34               ` Al Viro
  0 siblings, 0 replies; 13+ messages in thread
From: Al Viro @ 2015-10-13 23:34 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel

On Tue, Oct 13, 2015 at 01:46:54PM -0400, Mike Marshall wrote:
>  > *shrug*
> 
> Al, I guess what I wrote sounded pi**y, but I promise my attitude
> is 180 degrees away from that... I'm just trying to focus on what is
> most important... you've identified numerous issues, all of which
> need to be addressed, but the state of the API used by userspace
> is a show stopper: it needs documented, and then probably improved...
> 
> When we have something coherent started, we'll send it your way
> so you can help make sure we're on the right track...

No problem...  BTW, one note regarding the iov_iter - unlike an array of
iovec, things like "copy <that many> bytes" are easy; you don't need the
thing to start on the iovec boundary or anything like that.  Simple
copy_from_iter(dest, size, iter) will take care of everything.

Something like
	struct {
		__u32 version;
		__u32 magic;
		__u64 tag;
	} head;

	total = iov_iter_count(iter);
	if (total < 24)
		short packet
	n = copy_from_iter(&head, 16, iter);
	if (n != 16)
		failed copy
	<check head.magic, find the matching upcall by head.tag>
	if (not found)
		stray response

	/* get the beginning of response proper */
	memset(op->downcall, 0, 16);
	n = copy_from_iter(&op->downcall, 16, iter);
	if (n < 16) {
		/* it might be really short... */
		if (iov_iter_count())
			failed copy
		/* yes, and that's too short to have a trailer */
		/* just zero-pad and that's it */
		memset((char *)&op->downcall + n, 0,
			sizeof(op->downcall) - n);
		goto done;
	}

	/* will there be a trailer? */
	trailer_count = 0;
	if (op->downcall.status == 0 && op->downcall.trailer_count) {
		trailer_count = op->downcall.trailer_count;
		if (total - 32 < trailer_count)
			packet too short
		buf = vmalloc(trailer_count)
		if (!buf)
			op->downcall.status = -ENOMEM;
	}

	n = total - 32 - trailer_count;
	if (sizeof(op->downcall) - 16 < n)
		packet too long
	if (copy_from_iter((char *)op->downcall + 16, 0, n) != n)
		failed copy
	memset((char *)op->downcall + 16 + n, 0,
		sizeof(op->downcall) - n - 16);

	if (trailer_count) {
		/* read or skip the trailer */
		if (!buf) {
			iov_iter_advance(trailer_count, iter);
		} else {
			n = copy_from_iter(buf, trailer_count, iter);
			if (n < trailer_count)
				failed copy
		}
	}
done:
	...
	return total;

would do marshalling and accept all cases where your current code doesn't
produce an obvious breakage.  It's a lot more flexible than your current
*userland* code needs and probably more flexible than it would make sense
to have; e.g. if packets are not allowed to be just 24 bytes long, we
could turn that if (n < 16) {...} into if (n != 16) failed copy,
if we can always assume that packet is at least 16 + sizeof(op->downcall),
if could be simplified even more, etc.

There's no real benefit in using the boundary of the magic 5th segment to
tell where the trailer starts.  The above is just a demo, but hopefully it
does demonstrate how to use those primitives.  Basically, use copy_from_iter()
as you would use read() on stdin when writing a userland filter, with
iov_iter_advance() used to skip a given amount and iov_iter_count() telling
how much input is left.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2015-10-13 23:34 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-07 20:07 updated orangefs tree at kernel.org Mike Marshall
2015-10-08 21:07 ` Al Viro
2015-10-08 23:03   ` Al Viro
2015-10-09  3:41     ` Al Viro
2015-10-09  6:13       ` Al Viro
2015-10-09 19:22       ` Al Viro
2015-10-10  2:31     ` Al Viro
2015-10-10 20:23       ` Al Viro
2015-10-10 23:10       ` orangefs: NAK until the ABI is documented (was Re: updated orangefs tree at kernel.org) Al Viro
2015-10-12 16:20         ` Mike Marshall
2015-10-12 19:16           ` Al Viro
2015-10-13 17:46             ` Mike Marshall
2015-10-13 23:34               ` Al Viro

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.