All of lore.kernel.org
 help / color / mirror / Atom feed
* [GIT PULL] Orangefs (text only resend)
@ 2015-09-01 15:42 Mike Marshall
  2015-09-02 23:34 ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Marshall @ 2015-09-01 15:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Christoph Hellwig, linux-fsdevel, Andrew Morton,
	Stephen Rothwell, Boaz Harrosh, Greg Kroah-Hartman,
	Mike Marshall

Linus -

Please consider pulling Orangefs from

git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux.git
for-next

OrangeFS is an LGPL userspace, scale-out, parallel storage system. It
is ideal for large storage challenges faced by HPC, Big Data, Genomics,
Bioinformatics, Video Streaming and Rendering.

Features:

  * Distributed Metadata, including distributed metadata inspired by
    Giga+ for directory entries

  * Support for multiple network infrastructures leveraging BMI to
    adapt to TCP, IB, Portals and others

  * Stateless Servers

  * User-level Implementation

  * Multiple interface integration levels for easy system and application
    integration

  * Server-to-server collective communication for improved metadata
    operation scalability

  * Scalable, Portable and Flexible

  * Proven Research Platform

  * Integrated capability-based security

  * Extensive Documentation

  * Optimized MPI-IO Support

In addition to the Linux kernel client, OrangeFS supports multiple client
platforms and applications including WEBDAV, S3, Windows and Hadoop
Integration.

History:

OrangeFS is the third development phase of the PVFS project. PFVS was
first developed in 1993 by Walt Ligon and Eric Blumer as a parallel
file system for Parallel Virtual Machine (PVM) as part of a NASA grant
to study the I/O patterns of parallel programs. In 2001-2004 a complete
rewrite PVFS2 was developed by Walt Ligon, Rob Ross, Phil Carns, Pete Wyckoff,
Neil Miller, Rob Latham, Sam Lang and others from many research institutions
all over the world, providing for many of the modern distributed and parallel
file system concepts. Omnibond is collaborating with the OrangeFS community
providing development, support and marketing resources.

Support:

Omnibond and other community members are committed to maintaining and
improving both the kernel and user space code, including compatibility
between versions.

Future:

OrangeFS has worldwide adoption in the cloud and on premise for a wide
variety of large scale scientific and computational problems with its
stability and easy integration with technologies such as MPI and even
replacing HDFS for Hadoop and Spark workloads. OrangeFS is available in
the AWS marketplace and is also an integral part of CloudyCluster which
aims to democratize Linux based computational science by making HPC easily
available for researchers in a diverse set of disciplines.


$ git diff --stat 64291f7db5bd8150a74ad2036f1037e6a0428df2
 Documentation/ABI/stable/sysfs-fs-orangefs |   87 ++
 Documentation/filesystems/orangefs.txt     |  137 +++
 fs/Kconfig                                 |    1 +
 fs/Makefile                                |    1 +
 fs/orangefs/Kconfig                        |    6 +
 fs/orangefs/Makefile                       |   10 +
 fs/orangefs/acl.c                          |  175 +++
 fs/orangefs/dcache.c                       |  142 +++
 fs/orangefs/devpvfs2-req.c                 | 1004 ++++++++++++++++
 fs/orangefs/dir.c                          |  378 ++++++
 fs/orangefs/downcall.h                     |  138 +++
 fs/orangefs/file.c                         | 1016 ++++++++++++++++
 fs/orangefs/inode.c                        |  469 ++++++++
 fs/orangefs/namei.c                        |  473 ++++++++
 fs/orangefs/protocol.h                     |  672 +++++++++++
 fs/orangefs/pvfs2-bufmap.c                 |  974 +++++++++++++++
 fs/orangefs/pvfs2-bufmap.h                 |   76 ++
 fs/orangefs/pvfs2-cache.c                  |  260 ++++
 fs/orangefs/pvfs2-debug.h                  |  290 +++++
 fs/orangefs/pvfs2-debugfs.c                |  458 +++++++
 fs/orangefs/pvfs2-debugfs.h                |    3 +
 fs/orangefs/pvfs2-dev-proto.h              |  102 ++
 fs/orangefs/pvfs2-kernel.h                 |  864 ++++++++++++++
 fs/orangefs/pvfs2-mod.c                    |  315 +++++
 fs/orangefs/pvfs2-sysfs.c                  | 1787 ++++++++++++++++++++++++++++
 fs/orangefs/pvfs2-sysfs.h                  |    2 +
 fs/orangefs/pvfs2-utils.c                  | 1129 ++++++++++++++++++
 fs/orangefs/super.c                        |  558 +++++++++
 fs/orangefs/symlink.c                      |   31 +
 fs/orangefs/upcall.h                       |  255 ++++
 fs/orangefs/waitqueue.c                    |  529 ++++++++
 fs/orangefs/xattr.c                        |  530 +++++++++
 32 files changed, 12872 insertions(+)

Thanks

Mike Marshall

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-01 15:42 [GIT PULL] Orangefs (text only resend) Mike Marshall
@ 2015-09-02 23:34 ` Linus Torvalds
  2015-09-03  1:13   ` Mike Marshall
  2015-09-06  6:35   ` Christoph Hellwig
  0 siblings, 2 replies; 26+ messages in thread
From: Linus Torvalds @ 2015-09-02 23:34 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Al Viro, Christoph Hellwig, linux-fsdevel, Andrew Morton,
	Stephen Rothwell, Boaz Harrosh, Greg Kroah-Hartman

[ I haven't gotten to the filesystem pulls yet, and by now won't until
tomorrow, but I'm looking a bit ahead here.. ]

On Tue, Sep 1, 2015 at 8:42 AM, Mike Marshall <hubcap@omnibond.com> wrote:
>
> Please consider pulling Orangefs from
>
> git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux.git for-next

I'd love to have comments and ack's from people. From the Cc I'd hope
we have people who have followed this, but there's no indication of
acks etc in the commits, so..

A few quick comments from just skimming the code and the commits:

 - the commit messages aren't really helpful. "Orangefs: kernel client
part 4". Yeah, that doesn't really say anything

 - the sign-offs are odd (omnibond _and_ clemson.edu?)

 - the code doesn't look obviously horrible, but I did notice that

   (a) the iovecs are walked manually (eg
pvfs_bufmap_copy_to_user_iovec()). I would really want to see the code
use the iov_iter infrastructure

   (b) iovec_iter may be new, but the wait_event() helpers are not,
and the code (eg pvfs2_devreq_writev) is using the really
old-fashioned way to wait for things (with add_wait_queue,
set_current_state, etc etc). I'd *really* want to see the much
simpler-to-read (and less bug-prone) wait_event() models used.

 - naming is an odd mix of "orangefs" and "pvfs2", both in the code
and in the filenames.

It would be lovely to get some comments from other people. Al? Anybody
who has been involved with this?

I'd also like to have more of an idea of who expects to maintain this?
I'm assuming that's Mark (and omnibond?), but it would be good to hear
who the users are and what the long-term support is supposed to be. We
have had a tradition of filesystems that don't then get used very
much, and they bit-rot.

               Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-02 23:34 ` Linus Torvalds
@ 2015-09-03  1:13   ` Mike Marshall
  2015-09-03  1:28     ` Linus Torvalds
  2015-09-06  6:35   ` Christoph Hellwig
  1 sibling, 1 reply; 26+ messages in thread
From: Mike Marshall @ 2015-09-03  1:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Christoph Hellwig, linux-fsdevel, Andrew Morton,
	Stephen Rothwell, Boaz Harrosh, Greg Kroah-Hartman

The "part 1", "part 2" commit messages are a result of me building
the linux-next tree directly from a patch series I posted to fs-devel.

I didn't know how much commit history you might want to see, but
we've been focusing on trying to go upstream for several years and
my github tree has several years of commit history representing
numerous branches across rebases:

https://github.com/hubcapsc/linux

The double sign-off's are also a side-effect of my inelegant use of
of git-am while I built the linux-next tree from the patch series I
had posted (and already signed off on)...

If the code doesn't look obviously horrible, it's because a lot of good people
have worked on the kernel module in the past, it has been in use since 2.4.
When I started working with the kernel module a couple of years ago it didn't
work past 2.6, so I've done a lot of "code janitor" type work and fixups,
a lot of work with the CodingStyle document. Christoph Hellwig helped
immensely with improvements that were beyond my depth of knowledge.

We can commit to working on "a" and "b" in your message, either before or
after going upstream, whichever you specify. As one result of
Greg KH's review, I replaced all the "proc" code with "sysfs" and "debugfs"
code... we are eager to be responsive to reviews. If some reviewers are
off-put at the idea of having to build the userspace part of Orangefs, there
is a cheat-sheet for that in orangefs.txt.

Walt Ligon has been with pvfs (now orangefs) since the beginning, and
is still the lead architect/programmer of the userspace filesystem. A major
new revision is in the works, and I've put a lot of work into making sure
the kernel module will work across revisions.

Omnibond is committed to maintaining the kernel module, we won't just
throw it over the wall and walk away.

-Mike "not Mark <g>"

On Wed, Sep 2, 2015 at 7:34 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> [ I haven't gotten to the filesystem pulls yet, and by now won't until
> tomorrow, but I'm looking a bit ahead here.. ]
>
> On Tue, Sep 1, 2015 at 8:42 AM, Mike Marshall <hubcap@omnibond.com> wrote:
>>
>> Please consider pulling Orangefs from
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux.git for-next
>
> I'd love to have comments and ack's from people. From the Cc I'd hope
> we have people who have followed this, but there's no indication of
> acks etc in the commits, so..
>
> A few quick comments from just skimming the code and the commits:
>
>  - the commit messages aren't really helpful. "Orangefs: kernel client
> part 4". Yeah, that doesn't really say anything
>
>  - the sign-offs are odd (omnibond _and_ clemson.edu?)
>
>  - the code doesn't look obviously horrible, but I did notice that
>
>    (a) the iovecs are walked manually (eg
> pvfs_bufmap_copy_to_user_iovec()). I would really want to see the code
> use the iov_iter infrastructure
>
>    (b) iovec_iter may be new, but the wait_event() helpers are not,
> and the code (eg pvfs2_devreq_writev) is using the really
> old-fashioned way to wait for things (with add_wait_queue,
> set_current_state, etc etc). I'd *really* want to see the much
> simpler-to-read (and less bug-prone) wait_event() models used.
>
>  - naming is an odd mix of "orangefs" and "pvfs2", both in the code
> and in the filenames.
>
> It would be lovely to get some comments from other people. Al? Anybody
> who has been involved with this?
>
> I'd also like to have more of an idea of who expects to maintain this?
> I'm assuming that's Mark (and omnibond?), but it would be good to hear
> who the users are and what the long-term support is supposed to be. We
> have had a tradition of filesystems that don't then get used very
> much, and they bit-rot.
>
>                Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-03  1:13   ` Mike Marshall
@ 2015-09-03  1:28     ` Linus Torvalds
  2015-09-03 20:22       ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2015-09-03  1:28 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Al Viro, Christoph Hellwig, linux-fsdevel, Andrew Morton,
	Stephen Rothwell, Boaz Harrosh, Greg Kroah-Hartman

On Wed, Sep 2, 2015 at 6:13 PM, Mike Marshall <hubcap@omnibond.com> wrote:
>
> If the code doesn't look obviously horrible, it's because a lot of good people
> have worked on the kernel module in the past, it has been in use since 2.4.
> When I started working with the kernel module a couple of years ago it didn't
> work past 2.6, so I've done a lot of "code janitor" type work and fixups,
> a lot of work with the CodingStyle document. Christoph Hellwig helped
> immensely with improvements that were beyond my depth of knowledge.

I'd *love* to get ack's from the people who were involved.

Greg? Christoph?

> We can commit to working on "a" and "b" in your message, either before or
> after going upstream, whichever you specify.

I don't think that's anything to hold things up for, as long as it
gets fixed. But I would like to see the people who reviewed it and
were involved in making it ready to ship actually stand up and say
so..

> -Mike "not Mark <g>"

Oops.

             Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-03  1:28     ` Linus Torvalds
@ 2015-09-03 20:22       ` Linus Torvalds
  2015-09-03 22:18         ` Mike Marshall
  2015-09-03 22:44         ` Greg Kroah-Hartman
  0 siblings, 2 replies; 26+ messages in thread
From: Linus Torvalds @ 2015-09-03 20:22 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Al Viro, Christoph Hellwig, linux-fsdevel, Andrew Morton,
	Stephen Rothwell, Boaz Harrosh, Greg Kroah-Hartman

On Wed, Sep 2, 2015 at 6:28 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>             But I would like to see the people who reviewed it and
> were involved in making it ready to ship actually stand up and say
> so..

[ sound of crickets in the background ]

Ping? Anybody?

I'm doing my filesystem merges now, but I think I'll delay this in the
hope to get more feedback from people.

                      Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-03 20:22       ` Linus Torvalds
@ 2015-09-03 22:18         ` Mike Marshall
  2015-09-03 22:44         ` Greg Kroah-Hartman
  1 sibling, 0 replies; 26+ messages in thread
From: Mike Marshall @ 2015-09-03 22:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Christoph Hellwig, linux-fsdevel, Andrew Morton,
	Stephen Rothwell, Boaz Harrosh, Greg Kroah-Hartman

I pinged Christoph today, he must be out there somewhere... At Seattle
Greg told me he
was going on vacation, away from phones and email, for two weeks... he
might not come
out of the woodwork until next week...

I started reading Jonathan Corbet's "iov_iter interface" article, and
reading code, today...

-Mike

On Thu, Sep 3, 2015 at 4:22 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Sep 2, 2015 at 6:28 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>>             But I would like to see the people who reviewed it and
>> were involved in making it ready to ship actually stand up and say
>> so..
>
> [ sound of crickets in the background ]
>
> Ping? Anybody?
>
> I'm doing my filesystem merges now, but I think I'll delay this in the
> hope to get more feedback from people.
>
>                       Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-03 20:22       ` Linus Torvalds
  2015-09-03 22:18         ` Mike Marshall
@ 2015-09-03 22:44         ` Greg Kroah-Hartman
  1 sibling, 0 replies; 26+ messages in thread
From: Greg Kroah-Hartman @ 2015-09-03 22:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mike Marshall, Al Viro, Christoph Hellwig, linux-fsdevel,
	Andrew Morton, Stephen Rothwell, Boaz Harrosh

On Thu, Sep 03, 2015 at 01:22:33PM -0700, Linus Torvalds wrote:
> On Wed, Sep 2, 2015 at 6:28 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> >             But I would like to see the people who reviewed it and
> > were involved in making it ready to ship actually stand up and say
> > so..
> 
> [ sound of crickets in the background ]
> 
> Ping? Anybody?
> 
> I'm doing my filesystem merges now, but I think I'll delay this in the
> hope to get more feedback from people.

Sorry for the delay.

I've not reviewed the filesystem side, only the kobject handling and
debugfs portions of the filesystem, and they look sane to me.  Much
nicer than the old procfs interface that the code used to have.  So for
that portion, no objection to me for this to be merged.

As for long-term maintenance, I'm pretty sure that Mike is up for it, and
has the resources to do it, he's been working on this for many years now
already, and I doubt that will stop any time soon.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-02 23:34 ` Linus Torvalds
  2015-09-03  1:13   ` Mike Marshall
@ 2015-09-06  6:35   ` Christoph Hellwig
  2015-09-06  9:08     ` Al Viro
  2015-09-06 14:35     ` Mike Marshall
  1 sibling, 2 replies; 26+ messages in thread
From: Christoph Hellwig @ 2015-09-06  6:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mike Marshall, Al Viro, Christoph Hellwig, linux-fsdevel,
	Andrew Morton, Stephen Rothwell, Boaz Harrosh,
	Greg Kroah-Hartman

Hi Linus,

sorry, I've been away from mail for a few days.

I've been pretty happy with the code when I looked over it, which
was a while ago.

On Wed, Sep 02, 2015 at 04:34:41PM -0700, Linus Torvalds wrote:
>    (a) the iovecs are walked manually (eg
> pvfs_bufmap_copy_to_user_iovec()). I would really want to see the code
> use the iov_iter infrastructure

And that was before we had the full blown iov_iter code.  Note that
orangefs always does O_DIRECT-style I/O and doesn't go through the
page cache or does any other similar client side caching except for mmap,
so it will only use the low-level iov_iter helpers.

>  - naming is an odd mix of "orangefs" and "pvfs2", both in the code
> and in the filenames.

I found this a bit odd to - pvfs2 was the original version and now
Clemson has done orangefs based on it.  Mike might have more comments
on why he wants to keep both names.

> I'd also like to have more of an idea of who expects to maintain this?
> I'm assuming that's Mark (and omnibond?), but it would be good to hear
> who the users are and what the long-term support is supposed to be. We
> have had a tradition of filesystems that don't then get used very
> much, and they bit-rot.

PVFS2 has been around forever, and orangefs for quite a while.  I know
Mike has been working on getting this in shape for a couple years, and
been good at fixing review feedback. 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-06  6:35   ` Christoph Hellwig
@ 2015-09-06  9:08     ` Al Viro
  2015-09-06 14:52       ` Mike Marshall
  2015-09-06 14:35     ` Mike Marshall
  1 sibling, 1 reply; 26+ messages in thread
From: Al Viro @ 2015-09-06  9:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linus Torvalds, Mike Marshall, linux-fsdevel, Andrew Morton,
	Stephen Rothwell, Boaz Harrosh, Greg Kroah-Hartman

On Sun, Sep 06, 2015 at 08:35:52AM +0200, Christoph Hellwig wrote:

> On Wed, Sep 02, 2015 at 04:34:41PM -0700, Linus Torvalds wrote:
> >    (a) the iovecs are walked manually (eg
> > pvfs_bufmap_copy_to_user_iovec()). I would really want to see the code
> > use the iov_iter infrastructure
> 
> And that was before we had the full blown iov_iter code.  Note that
> orangefs always does O_DIRECT-style I/O and doesn't go through the
> page cache or does any other similar client side caching except for mmap,
> so it will only use the low-level iov_iter helpers.

FWIW, the thing that looks rather wrong there is playing fast and loose with
kvec -> iovec -> kvec casts.  This kind of thing
+               /* Are we copying to User Virtual Addresses? */
+               if (to_user)
+                       ret = pvfs_bufmap_copy_to_user_iovec(
+                               bufmap,
+                               buffer_index,
+                               vec,
+                               nr_segs,
+                               total_size);
+               /* Are we copying to Kern Virtual Addresses? */
+               else
+                       ret = pvfs_bufmap_copy_to_kernel_iovec(
+                               bufmap,
+                               buffer_index,
+                               vec,
+                               nr_segs,
+                               total_size);
is almost always a sign of really bad API.  "is that a kernel one?"
kind of flags is a bloody bad idea.  So's playing wiht copying iovecs,
modifying those copies, etc.

We might end up with new primitives added out of that, but this kind of
stuff definitely shouldn't be open-coded, especially that way.

Some random notes:

Could we please put
#define PVFS_util_min(x1, x2) (((x1) > (x2)) ? (x2) : (x1))
out of its misery?  It's pretty much the textbook example of the evils
of reinventing the wheels - side effects, arithmetic promotions, etc.

Another thing: why does server want to know about LOOKUP_FOLLOW, of all
things?  Unless I'm misreading what ->lookup() is doing there...

This
+ * pvfs2_link() is only implemented here to make sure that we return a
+ * reasonable error code (the kernel will return a misleading EPERM
+ * otherwise).  PVFS2 does not support hard links.
+ */
+static int pvfs2_link(struct dentry *old_dentry,
+                     struct inode *dir,
+                     struct dentry *dentry)
+{
+       return -EOPNOTSUPP;
+}
is stupid.  Expected error value is not EOPNOTSUPP; pardon the bluntness,
but your idea of what would be less misleading doesn't matter - what matters
is what the _callers_ of link(2), mknod(2), etc. are expecting.  Which is to
say, what does the userland code expect to get.  It's outright promised in
POSIX, actually.

symlink(2) - what happens to too long symlink bodies?  Silent truncation?

+static inline void PVFS_khandle_to(const struct pvfs2_khandle *kh,
+                                  void *p, int size)
+{
+       int i;
+       unsigned char *c = p;
+
+       memset(p, 0, size);
+
+       for (i = 0; i < 16 && i < size; i++)
+               c[i] = kh->u[i];
+}

Er...  If you are using memset(), why open-code memcpy()?  ->u is an array
of unsigned char, so this loop is just a straight memcpy(); nothing fancy
going on there...

May I politely inquire about the reasons for DECLARE_ERRNO_MAPPING_AND_FN 
being in a header?  Or existing at all, while we are at it...  Ditto for
the bit under the tail of that animal - DECLARE_ERRNO_MAPPING, that is.

mask_blocked_signals()/unmask_blocked_signals(): Oleg might want to take
a look at that one (and parties responsible might want to dive for cover,
especially when he figures out what hides behind pvfs2_current_sigaction
and pvfs2_current_signal_lock).

+       /* fill in temporary structure passed to fill_sb method */
+       mount_sb_info.data = data;
+       mount_sb_info.root_khandle =
+               new_op->downcall.resp.fs_mount.root_khandle;
+       mount_sb_info.fs_id = new_op->downcall.resp.fs_mount.fs_id;
+       mount_sb_info.id = new_op->downcall.resp.fs_mount.id;
+
+       /*
+        * the mount_sb_info structure looks odd, but it's used because
+        * the private sb info isn't allocated until we call
+        * pvfs2_fill_sb, yet we have the info we need to fill it with
+        * here.  so we store it temporarily and pass all of the info
+        * to fill_sb where it's properly copied out
+        */
+       mnt_sb_d = mount_nodev(fst,
+                              flags,
+                              (void *)&mount_sb_info,
+                              pvfs2_fill_sb);

If you are fighting an interface, that might be because you are using the
wrong one...  Consider the definition of mount_nodev():
struct dentry *mount_nodev(struct file_system_type *fs_type,
        int flags, void *data,
        int (*fill_super)(struct super_block *, void *, int))
{
        int error;
        struct super_block *s = sget(fs_type, NULL, set_anon_super, flags, NULL);

        if (IS_ERR(s))  
                return ERR_CAST(s);

        error = fill_super(s, data, flags & MS_SILENT ? 1 : 0);
        if (error) {
                deactivate_locked_super(s);
                return ERR_PTR(error);
        }
        s->s_flags |= MS_ACTIVE;
        return dget(s->s_root);
}
and observe that everything it calls is exported.  It's a trivial convenience
wrapper for sget(), and if it turns out to be inconvenient - just use sget()
itself and be done with that.  No need to bother with callbacks, having
that mount_sb_info thing, etc.

Is pvfs2_remount() safe to call during pvfs2_unmount_sb() on the same
filesystem?  Incidentally, what's protecting the list of superblocks
while you are walking it and calling pvfs2_remount()?

While we are at it, a lot of GFP_KERNEL allocations seem to be done
with a bunch of locks (including global ones, such as request_mutex) held.
Why won't that deadlock?

+#define pvfs2_lock_inode(inode) spin_lock(&inode->i_lock)
+#define pvfs2_unlock_inode(inode) spin_unlock(&inode->i_lock)

Inhume with extreme prejudice.  ->i_bytes and ->i_blocks are the least
of your worries if two threads hit copy_attributes_to_inode() on the
same inode in parallel.  Which can happen, AFAICS...

+       /*
+        * PVFS2 cannot set size with a setattr operation.  Probably not likely
+        * to be requested through the VFS, but just in case, don't worry about
+        * ATTR_SIZE
+        */
What about truncate(2)?

+       sb->s_fs_info =
+               kmalloc(sizeof(struct pvfs2_sb_info_s), PVFS2_GFP_FLAGS);
+       if (!PVFS2_SB(sb))
+               return -ENOMEM;
+       memset(sb->s_fs_info, 0, sizeof(struct pvfs2_sb_info_s));
+       PVFS2_SB(sb)->sb = sb;
kzalloc()?

+       root_dentry = d_make_root(root);
+       if (!root_dentry) {
+               iput(root);
+               return -ENOMEM;
Double iput() here...

Anyway, bedtime for me - it's 5am here.  I'll post more detailed one after
I get some sleep...

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-06  6:35   ` Christoph Hellwig
  2015-09-06  9:08     ` Al Viro
@ 2015-09-06 14:35     ` Mike Marshall
  1 sibling, 0 replies; 26+ messages in thread
From: Mike Marshall @ 2015-09-06 14:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linus Torvalds, Al Viro, linux-fsdevel, Andrew Morton,
	Stephen Rothwell, Boaz Harrosh, Greg Kroah-Hartman

> I found this a bit odd to - pvfs2 was the original version and now
> Clemson has done orangefs based on it.  Mike might have more comments
> on why he wants to keep both names.

We talk in meetings about changing all the occurrences of pvfs to orangefs in
the code, but so far actual work on how the the userspace code functions
has taken precedence. The abilities of Orangefs continue to improve,
Walt gave a talk at the Boston Linux Foundation conference - as
"just a code janitor" you wouldn't want to grill me on it.

-Mike

On Sun, Sep 6, 2015 at 2:35 AM, Christoph Hellwig <hch@lst.de> wrote:
> Hi Linus,
>
> sorry, I've been away from mail for a few days.
>
> I've been pretty happy with the code when I looked over it, which
> was a while ago.
>
> On Wed, Sep 02, 2015 at 04:34:41PM -0700, Linus Torvalds wrote:
>>    (a) the iovecs are walked manually (eg
>> pvfs_bufmap_copy_to_user_iovec()). I would really want to see the code
>> use the iov_iter infrastructure
>
> And that was before we had the full blown iov_iter code.  Note that
> orangefs always does O_DIRECT-style I/O and doesn't go through the
> page cache or does any other similar client side caching except for mmap,
> so it will only use the low-level iov_iter helpers.
>
>>  - naming is an odd mix of "orangefs" and "pvfs2", both in the code
>> and in the filenames.
>
> I found this a bit odd to - pvfs2 was the original version and now
> Clemson has done orangefs based on it.  Mike might have more comments
> on why he wants to keep both names.
>
>> I'd also like to have more of an idea of who expects to maintain this?
>> I'm assuming that's Mark (and omnibond?), but it would be good to hear
>> who the users are and what the long-term support is supposed to be. We
>> have had a tradition of filesystems that don't then get used very
>> much, and they bit-rot.
>
> PVFS2 has been around forever, and orangefs for quite a while.  I know
> Mike has been working on getting this in shape for a couple years, and
> been good at fixing review feedback.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-06  9:08     ` Al Viro
@ 2015-09-06 14:52       ` Mike Marshall
  2015-09-06 15:00         ` Mike Marshall
  2015-09-06 20:20         ` Al Viro
  0 siblings, 2 replies; 26+ messages in thread
From: Mike Marshall @ 2015-09-06 14:52 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, Linus Torvalds, linux-fsdevel, Andrew Morton,
	Stephen Rothwell, Boaz Harrosh, Greg Kroah-Hartman

Al's "depth of knowledge" review is the kind I have been
yearning for.

+               /* Are we copying to User Virtual Addresses? */
+               if (to_user)
+                       ret = pvfs_bufmap_copy_to_user_iovec(
+                               bufmap,
+                               buffer_index,
+                               vec,
+                               nr_segs,
+                               total_size);
+               /* Are we copying to Kern Virtual Addresses? */
+               else
+                       ret = pvfs_bufmap_copy_to_kernel_iovec(
+                               bufmap,
+                               buffer_index,
+                               vec,
+                               nr_segs,
+                               total_size);

In a new branch of my private tree I changed out the very complex
 pvfs_bufmap_copy_to_user with:

int pvfs_bufmap_copy_to_user_iovec2(struct pvfs2_bufmap *bufmap,
                                                        struct iov_iter *iter,
                                                        size_t total_size,
                                                        int buffer_index)
{
        struct pvfs_bufmap_desc *from;
        int ret = 0; void *from_kaddr = NULL;
        from = &bufmap->desc_array[buffer_index];

        from_kaddr = pvfs2_kmap(from->page_array[0]);
        ret = copy_to_iter(from_kaddr, total_size, iter);
        return ret;
}

"cat filename" is a simple test that drives execution
down the "Are we copying to User Virtual Addresses?"
code path, and big and small files still copy
out.

I haven't found a test case that drives execution down
the "Are we copying to Kern Virtual Addresses?" code
path yet.

Anyhow, whether Al's (and anyone else's) concerns
are worked on before or after (or if) we get accepted
upstream, we are committed to working on them.

I worked for several months on debugfs and sysfs
(and some "future proofing"), so you can guess
that we would wish to be out from under the gun
of going upstream, but code has to go on merit.

-Mike

On Sun, Sep 6, 2015 at 5:08 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sun, Sep 06, 2015 at 08:35:52AM +0200, Christoph Hellwig wrote:
>
>> On Wed, Sep 02, 2015 at 04:34:41PM -0700, Linus Torvalds wrote:
>> >    (a) the iovecs are walked manually (eg
>> > pvfs_bufmap_copy_to_user_iovec()). I would really want to see the code
>> > use the iov_iter infrastructure
>>
>> And that was before we had the full blown iov_iter code.  Note that
>> orangefs always does O_DIRECT-style I/O and doesn't go through the
>> page cache or does any other similar client side caching except for mmap,
>> so it will only use the low-level iov_iter helpers.
>
> FWIW, the thing that looks rather wrong there is playing fast and loose with
> kvec -> iovec -> kvec casts.  This kind of thing
> +               /* Are we copying to User Virtual Addresses? */
> +               if (to_user)
> +                       ret = pvfs_bufmap_copy_to_user_iovec(
> +                               bufmap,
> +                               buffer_index,
> +                               vec,
> +                               nr_segs,
> +                               total_size);
> +               /* Are we copying to Kern Virtual Addresses? */
> +               else
> +                       ret = pvfs_bufmap_copy_to_kernel_iovec(
> +                               bufmap,
> +                               buffer_index,
> +                               vec,
> +                               nr_segs,
> +                               total_size);
> is almost always a sign of really bad API.  "is that a kernel one?"
> kind of flags is a bloody bad idea.  So's playing wiht copying iovecs,
> modifying those copies, etc.
>
> We might end up with new primitives added out of that, but this kind of
> stuff definitely shouldn't be open-coded, especially that way.
>
> Some random notes:
>
> Could we please put
> #define PVFS_util_min(x1, x2) (((x1) > (x2)) ? (x2) : (x1))
> out of its misery?  It's pretty much the textbook example of the evils
> of reinventing the wheels - side effects, arithmetic promotions, etc.
>
> Another thing: why does server want to know about LOOKUP_FOLLOW, of all
> things?  Unless I'm misreading what ->lookup() is doing there...
>
> This
> + * pvfs2_link() is only implemented here to make sure that we return a
> + * reasonable error code (the kernel will return a misleading EPERM
> + * otherwise).  PVFS2 does not support hard links.
> + */
> +static int pvfs2_link(struct dentry *old_dentry,
> +                     struct inode *dir,
> +                     struct dentry *dentry)
> +{
> +       return -EOPNOTSUPP;
> +}
> is stupid.  Expected error value is not EOPNOTSUPP; pardon the bluntness,
> but your idea of what would be less misleading doesn't matter - what matters
> is what the _callers_ of link(2), mknod(2), etc. are expecting.  Which is to
> say, what does the userland code expect to get.  It's outright promised in
> POSIX, actually.
>
> symlink(2) - what happens to too long symlink bodies?  Silent truncation?
>
> +static inline void PVFS_khandle_to(const struct pvfs2_khandle *kh,
> +                                  void *p, int size)
> +{
> +       int i;
> +       unsigned char *c = p;
> +
> +       memset(p, 0, size);
> +
> +       for (i = 0; i < 16 && i < size; i++)
> +               c[i] = kh->u[i];
> +}
>
> Er...  If you are using memset(), why open-code memcpy()?  ->u is an array
> of unsigned char, so this loop is just a straight memcpy(); nothing fancy
> going on there...
>
> May I politely inquire about the reasons for DECLARE_ERRNO_MAPPING_AND_FN
> being in a header?  Or existing at all, while we are at it...  Ditto for
> the bit under the tail of that animal - DECLARE_ERRNO_MAPPING, that is.
>
> mask_blocked_signals()/unmask_blocked_signals(): Oleg might want to take
> a look at that one (and parties responsible might want to dive for cover,
> especially when he figures out what hides behind pvfs2_current_sigaction
> and pvfs2_current_signal_lock).
>
> +       /* fill in temporary structure passed to fill_sb method */
> +       mount_sb_info.data = data;
> +       mount_sb_info.root_khandle =
> +               new_op->downcall.resp.fs_mount.root_khandle;
> +       mount_sb_info.fs_id = new_op->downcall.resp.fs_mount.fs_id;
> +       mount_sb_info.id = new_op->downcall.resp.fs_mount.id;
> +
> +       /*
> +        * the mount_sb_info structure looks odd, but it's used because
> +        * the private sb info isn't allocated until we call
> +        * pvfs2_fill_sb, yet we have the info we need to fill it with
> +        * here.  so we store it temporarily and pass all of the info
> +        * to fill_sb where it's properly copied out
> +        */
> +       mnt_sb_d = mount_nodev(fst,
> +                              flags,
> +                              (void *)&mount_sb_info,
> +                              pvfs2_fill_sb);
>
> If you are fighting an interface, that might be because you are using the
> wrong one...  Consider the definition of mount_nodev():
> struct dentry *mount_nodev(struct file_system_type *fs_type,
>         int flags, void *data,
>         int (*fill_super)(struct super_block *, void *, int))
> {
>         int error;
>         struct super_block *s = sget(fs_type, NULL, set_anon_super, flags, NULL);
>
>         if (IS_ERR(s))
>                 return ERR_CAST(s);
>
>         error = fill_super(s, data, flags & MS_SILENT ? 1 : 0);
>         if (error) {
>                 deactivate_locked_super(s);
>                 return ERR_PTR(error);
>         }
>         s->s_flags |= MS_ACTIVE;
>         return dget(s->s_root);
> }
> and observe that everything it calls is exported.  It's a trivial convenience
> wrapper for sget(), and if it turns out to be inconvenient - just use sget()
> itself and be done with that.  No need to bother with callbacks, having
> that mount_sb_info thing, etc.
>
> Is pvfs2_remount() safe to call during pvfs2_unmount_sb() on the same
> filesystem?  Incidentally, what's protecting the list of superblocks
> while you are walking it and calling pvfs2_remount()?
>
> While we are at it, a lot of GFP_KERNEL allocations seem to be done
> with a bunch of locks (including global ones, such as request_mutex) held.
> Why won't that deadlock?
>
> +#define pvfs2_lock_inode(inode) spin_lock(&inode->i_lock)
> +#define pvfs2_unlock_inode(inode) spin_unlock(&inode->i_lock)
>
> Inhume with extreme prejudice.  ->i_bytes and ->i_blocks are the least
> of your worries if two threads hit copy_attributes_to_inode() on the
> same inode in parallel.  Which can happen, AFAICS...
>
> +       /*
> +        * PVFS2 cannot set size with a setattr operation.  Probably not likely
> +        * to be requested through the VFS, but just in case, don't worry about
> +        * ATTR_SIZE
> +        */
> What about truncate(2)?
>
> +       sb->s_fs_info =
> +               kmalloc(sizeof(struct pvfs2_sb_info_s), PVFS2_GFP_FLAGS);
> +       if (!PVFS2_SB(sb))
> +               return -ENOMEM;
> +       memset(sb->s_fs_info, 0, sizeof(struct pvfs2_sb_info_s));
> +       PVFS2_SB(sb)->sb = sb;
> kzalloc()?
>
> +       root_dentry = d_make_root(root);
> +       if (!root_dentry) {
> +               iput(root);
> +               return -ENOMEM;
> Double iput() here...
>
> Anyway, bedtime for me - it's 5am here.  I'll post more detailed one after
> I get some sleep...

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-06 14:52       ` Mike Marshall
@ 2015-09-06 15:00         ` Mike Marshall
  2015-09-06 20:20         ` Al Viro
  1 sibling, 0 replies; 26+ messages in thread
From: Mike Marshall @ 2015-09-06 15:00 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, Linus Torvalds, linux-fsdevel, Andrew Morton,
	Stephen Rothwell, Boaz Harrosh, Greg Kroah-Hartman

Oh... and I also added a iov_iter_init before
calling the new version of pvfs_bufmap_copy_to_user...

if (to_user) {
   iov_iter_init(&iter, READ, vec, nr_segs, total_size);
   ret = pvfs_bufmap_copy_to_user_iovec2(bufmap,
                                                              &iter,
                                                              total_size,
                                                              buffer_index);
   ret = 0;
}

This is just me trying to figure out what to do,
not finished code...

-Mike

On Sun, Sep 6, 2015 at 10:52 AM, Mike Marshall <hubcap@omnibond.com> wrote:
> Al's "depth of knowledge" review is the kind I have been
> yearning for.
>
> +               /* Are we copying to User Virtual Addresses? */
> +               if (to_user)
> +                       ret = pvfs_bufmap_copy_to_user_iovec(
> +                               bufmap,
> +                               buffer_index,
> +                               vec,
> +                               nr_segs,
> +                               total_size);
> +               /* Are we copying to Kern Virtual Addresses? */
> +               else
> +                       ret = pvfs_bufmap_copy_to_kernel_iovec(
> +                               bufmap,
> +                               buffer_index,
> +                               vec,
> +                               nr_segs,
> +                               total_size);
>
> In a new branch of my private tree I changed out the very complex
>  pvfs_bufmap_copy_to_user with:
>
> int pvfs_bufmap_copy_to_user_iovec2(struct pvfs2_bufmap *bufmap,
>                                                         struct iov_iter *iter,
>                                                         size_t total_size,
>                                                         int buffer_index)
> {
>         struct pvfs_bufmap_desc *from;
>         int ret = 0; void *from_kaddr = NULL;
>         from = &bufmap->desc_array[buffer_index];
>
>         from_kaddr = pvfs2_kmap(from->page_array[0]);
>         ret = copy_to_iter(from_kaddr, total_size, iter);
>         return ret;
> }
>
> "cat filename" is a simple test that drives execution
> down the "Are we copying to User Virtual Addresses?"
> code path, and big and small files still copy
> out.
>
> I haven't found a test case that drives execution down
> the "Are we copying to Kern Virtual Addresses?" code
> path yet.
>
> Anyhow, whether Al's (and anyone else's) concerns
> are worked on before or after (or if) we get accepted
> upstream, we are committed to working on them.
>
> I worked for several months on debugfs and sysfs
> (and some "future proofing"), so you can guess
> that we would wish to be out from under the gun
> of going upstream, but code has to go on merit.
>
> -Mike
>
> On Sun, Sep 6, 2015 at 5:08 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Sun, Sep 06, 2015 at 08:35:52AM +0200, Christoph Hellwig wrote:
>>
>>> On Wed, Sep 02, 2015 at 04:34:41PM -0700, Linus Torvalds wrote:
>>> >    (a) the iovecs are walked manually (eg
>>> > pvfs_bufmap_copy_to_user_iovec()). I would really want to see the code
>>> > use the iov_iter infrastructure
>>>
>>> And that was before we had the full blown iov_iter code.  Note that
>>> orangefs always does O_DIRECT-style I/O and doesn't go through the
>>> page cache or does any other similar client side caching except for mmap,
>>> so it will only use the low-level iov_iter helpers.
>>
>> FWIW, the thing that looks rather wrong there is playing fast and loose with
>> kvec -> iovec -> kvec casts.  This kind of thing
>> +               /* Are we copying to User Virtual Addresses? */
>> +               if (to_user)
>> +                       ret = pvfs_bufmap_copy_to_user_iovec(
>> +                               bufmap,
>> +                               buffer_index,
>> +                               vec,
>> +                               nr_segs,
>> +                               total_size);
>> +               /* Are we copying to Kern Virtual Addresses? */
>> +               else
>> +                       ret = pvfs_bufmap_copy_to_kernel_iovec(
>> +                               bufmap,
>> +                               buffer_index,
>> +                               vec,
>> +                               nr_segs,
>> +                               total_size);
>> is almost always a sign of really bad API.  "is that a kernel one?"
>> kind of flags is a bloody bad idea.  So's playing wiht copying iovecs,
>> modifying those copies, etc.
>>
>> We might end up with new primitives added out of that, but this kind of
>> stuff definitely shouldn't be open-coded, especially that way.
>>
>> Some random notes:
>>
>> Could we please put
>> #define PVFS_util_min(x1, x2) (((x1) > (x2)) ? (x2) : (x1))
>> out of its misery?  It's pretty much the textbook example of the evils
>> of reinventing the wheels - side effects, arithmetic promotions, etc.
>>
>> Another thing: why does server want to know about LOOKUP_FOLLOW, of all
>> things?  Unless I'm misreading what ->lookup() is doing there...
>>
>> This
>> + * pvfs2_link() is only implemented here to make sure that we return a
>> + * reasonable error code (the kernel will return a misleading EPERM
>> + * otherwise).  PVFS2 does not support hard links.
>> + */
>> +static int pvfs2_link(struct dentry *old_dentry,
>> +                     struct inode *dir,
>> +                     struct dentry *dentry)
>> +{
>> +       return -EOPNOTSUPP;
>> +}
>> is stupid.  Expected error value is not EOPNOTSUPP; pardon the bluntness,
>> but your idea of what would be less misleading doesn't matter - what matters
>> is what the _callers_ of link(2), mknod(2), etc. are expecting.  Which is to
>> say, what does the userland code expect to get.  It's outright promised in
>> POSIX, actually.
>>
>> symlink(2) - what happens to too long symlink bodies?  Silent truncation?
>>
>> +static inline void PVFS_khandle_to(const struct pvfs2_khandle *kh,
>> +                                  void *p, int size)
>> +{
>> +       int i;
>> +       unsigned char *c = p;
>> +
>> +       memset(p, 0, size);
>> +
>> +       for (i = 0; i < 16 && i < size; i++)
>> +               c[i] = kh->u[i];
>> +}
>>
>> Er...  If you are using memset(), why open-code memcpy()?  ->u is an array
>> of unsigned char, so this loop is just a straight memcpy(); nothing fancy
>> going on there...
>>
>> May I politely inquire about the reasons for DECLARE_ERRNO_MAPPING_AND_FN
>> being in a header?  Or existing at all, while we are at it...  Ditto for
>> the bit under the tail of that animal - DECLARE_ERRNO_MAPPING, that is.
>>
>> mask_blocked_signals()/unmask_blocked_signals(): Oleg might want to take
>> a look at that one (and parties responsible might want to dive for cover,
>> especially when he figures out what hides behind pvfs2_current_sigaction
>> and pvfs2_current_signal_lock).
>>
>> +       /* fill in temporary structure passed to fill_sb method */
>> +       mount_sb_info.data = data;
>> +       mount_sb_info.root_khandle =
>> +               new_op->downcall.resp.fs_mount.root_khandle;
>> +       mount_sb_info.fs_id = new_op->downcall.resp.fs_mount.fs_id;
>> +       mount_sb_info.id = new_op->downcall.resp.fs_mount.id;
>> +
>> +       /*
>> +        * the mount_sb_info structure looks odd, but it's used because
>> +        * the private sb info isn't allocated until we call
>> +        * pvfs2_fill_sb, yet we have the info we need to fill it with
>> +        * here.  so we store it temporarily and pass all of the info
>> +        * to fill_sb where it's properly copied out
>> +        */
>> +       mnt_sb_d = mount_nodev(fst,
>> +                              flags,
>> +                              (void *)&mount_sb_info,
>> +                              pvfs2_fill_sb);
>>
>> If you are fighting an interface, that might be because you are using the
>> wrong one...  Consider the definition of mount_nodev():
>> struct dentry *mount_nodev(struct file_system_type *fs_type,
>>         int flags, void *data,
>>         int (*fill_super)(struct super_block *, void *, int))
>> {
>>         int error;
>>         struct super_block *s = sget(fs_type, NULL, set_anon_super, flags, NULL);
>>
>>         if (IS_ERR(s))
>>                 return ERR_CAST(s);
>>
>>         error = fill_super(s, data, flags & MS_SILENT ? 1 : 0);
>>         if (error) {
>>                 deactivate_locked_super(s);
>>                 return ERR_PTR(error);
>>         }
>>         s->s_flags |= MS_ACTIVE;
>>         return dget(s->s_root);
>> }
>> and observe that everything it calls is exported.  It's a trivial convenience
>> wrapper for sget(), and if it turns out to be inconvenient - just use sget()
>> itself and be done with that.  No need to bother with callbacks, having
>> that mount_sb_info thing, etc.
>>
>> Is pvfs2_remount() safe to call during pvfs2_unmount_sb() on the same
>> filesystem?  Incidentally, what's protecting the list of superblocks
>> while you are walking it and calling pvfs2_remount()?
>>
>> While we are at it, a lot of GFP_KERNEL allocations seem to be done
>> with a bunch of locks (including global ones, such as request_mutex) held.
>> Why won't that deadlock?
>>
>> +#define pvfs2_lock_inode(inode) spin_lock(&inode->i_lock)
>> +#define pvfs2_unlock_inode(inode) spin_unlock(&inode->i_lock)
>>
>> Inhume with extreme prejudice.  ->i_bytes and ->i_blocks are the least
>> of your worries if two threads hit copy_attributes_to_inode() on the
>> same inode in parallel.  Which can happen, AFAICS...
>>
>> +       /*
>> +        * PVFS2 cannot set size with a setattr operation.  Probably not likely
>> +        * to be requested through the VFS, but just in case, don't worry about
>> +        * ATTR_SIZE
>> +        */
>> What about truncate(2)?
>>
>> +       sb->s_fs_info =
>> +               kmalloc(sizeof(struct pvfs2_sb_info_s), PVFS2_GFP_FLAGS);
>> +       if (!PVFS2_SB(sb))
>> +               return -ENOMEM;
>> +       memset(sb->s_fs_info, 0, sizeof(struct pvfs2_sb_info_s));
>> +       PVFS2_SB(sb)->sb = sb;
>> kzalloc()?
>>
>> +       root_dentry = d_make_root(root);
>> +       if (!root_dentry) {
>> +               iput(root);
>> +               return -ENOMEM;
>> Double iput() here...
>>
>> Anyway, bedtime for me - it's 5am here.  I'll post more detailed one after
>> I get some sleep...

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-06 14:52       ` Mike Marshall
  2015-09-06 15:00         ` Mike Marshall
@ 2015-09-06 20:20         ` Al Viro
  2015-09-07  6:37           ` Al Viro
  1 sibling, 1 reply; 26+ messages in thread
From: Al Viro @ 2015-09-06 20:20 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Christoph Hellwig, Linus Torvalds, linux-fsdevel, Andrew Morton,
	Stephen Rothwell, Boaz Harrosh, Greg Kroah-Hartman

On Sun, Sep 06, 2015 at 10:52:16AM -0400, Mike Marshall wrote:
> int pvfs_bufmap_copy_to_user_iovec2(struct pvfs2_bufmap *bufmap,
>                                                         struct iov_iter *iter,
>                                                         size_t total_size,
>                                                         int buffer_index)
> {
>         struct pvfs_bufmap_desc *from;
>         int ret = 0; void *from_kaddr = NULL;
>         from = &bufmap->desc_array[buffer_index];
> 
>         from_kaddr = pvfs2_kmap(from->page_array[0]);
>         ret = copy_to_iter(from_kaddr, total_size, iter);
>         return ret;
> }

Wrappers Are Evil: pvfs2_kmap().  It obfuscates the things for no reason
whatsoever, is actively wrong for a lot of reasons (kmap() slots are
system-wide resoure, and not too plentiful one, at that) _and_ obfuscates
the open-coding of existing primitive, badly: copy_page_to_iter()
does the same thing with less headache for caller and without a leak
(you forgot kunmap the damn thing afterwards).

> "cat filename" is a simple test that drives execution
> down the "Are we copying to User Virtual Addresses?"
> code path, and big and small files still copy
> out.
> 
> I haven't found a test case that drives execution down
> the "Are we copying to Kern Virtual Addresses?" code
> path yet.

Ummm...  GrepTFS?  pvfs_bufmap_copy_to_kernel_iovec() called from
postcopy_buffers() when the last argument is 0, which is called from
wait_for_direct_io() in the same conditions plus the first argument being
PVFS_IO_READ.  Said function has only two callers, one in do_readv_writev()
(the last argument is 1, out of consideration), another in pvfs2_inode_read().
Which does
+       ret = wait_for_direct_io(PVFS_IO_READ, inode, offset, &vec, 1,
+                       count, readahead_size, 0);
which should hit pvfs_bufmap_copy_to_kernel_iovec() just fine...

Incidentally, it's misannotated -
+ssize_t pvfs2_inode_read(struct inode *inode,
+                        char __user *buf,
+                        size_t count,
+                        loff_t *offset,
+                        loff_t readahead_size)
+{
claims buf to be a userland pointer, while its only caller is passing
kmap(some page) there.  This
                bytes_read = pvfs2_inode_read(inode,
-                                             page_data,
+                                             (char __user *) page_data,
                                              blocksize,
                                              &blockptr_offset,
in later commit forces sparce to STFU, but the point of __user annotations is
to be able to keep track which pointers are which, after all...

Oh, and to continue GTFS, the only caller of pvfs2_inode_read() appears
to be read_one_page(), called from pvfs2_readpage() and pvfs2_readpages(),
which are ->readpage() and ->readpages() methods in pvfs2_address_operations.
So I'd suggest trying to mmap and dereference.  Or execve() of a binary
there...

BTW, this
+       /* map the pages */
+       down_write(&current->mm->mmap_sem);
+       ret = get_user_pages(current,
+                            current->mm,
+                            (unsigned long)user_desc->ptr,
+                            bufmap->page_count,
+                            1,
+                            0,
+                            bufmap->page_array,
+                            NULL);
+       up_write(&current->mm->mmap_sem);
should almost certainly be replaced with
	ret = get_user_pages_fast((unsigned long)user_desc->ptr,
				  bufmap->page_count, 1,bufmap->page_array);
and let it deal with ->mmap_sem.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-06 20:20         ` Al Viro
@ 2015-09-07  6:37           ` Al Viro
  2015-09-07 21:10             ` Mike Marshall
  0 siblings, 1 reply; 26+ messages in thread
From: Al Viro @ 2015-09-07  6:37 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Christoph Hellwig, Linus Torvalds, linux-fsdevel, Andrew Morton,
	Stephen Rothwell, Boaz Harrosh, Greg Kroah-Hartman

	* symlink handling - why the hell do we set non-NULL cookie for
symlinks that do not have any ->put_link()?  What's worse, it looks like
a bogus server could trick us into overwriting ->link_target of a live
inode (e.g. via ->d_revalidate() -> pvfs2_inode_getattr() ->
copy_attributes_to_inode()).  Am I misreading it?
	* in copy_attributes_to_inode() we have
        case PVFS_TYPE_SYMLINK:
                if (symname != NULL) {
                        inode->i_size = (loff_t) strlen(symname);
                        break;
                }
                /*FALLTHRU*/
and the only caller passes is
        if (copy_attributes_to_inode(inode,
                        &new_op->downcall.resp.getattr.attributes,
                        new_op->downcall.resp.getattr.link_target)) {
How can symname possibly be NULL?  .resp.getattr.link_target is an array,
for crying out loud!
	* who sets (or uses) struct PVFS_sys_attr_s .target_link?  Or
.dist_name, or .dist_params, while we are at it...
	* can a symlink be longer than 256 bytes?  And why are
fs/orangefs/downcall.h:40:      char link_target[PVFS2_NAME_LEN];
and
fs/orangefs/pvfs2-kernel.h:317: char link_target[PVFS_NAME_MAX];
spelled out differently?  _Both_ constants are 256; what's the point of
obfuscating here?
	* what's to guarantee that this .resp.getattr.link_target is
NUL-terminated?  Passing it to strlen() in copy_attributes_to_inode()
is a bad idea unless we _have_ NUL in there...
	* in the same copy_attributes_to_inode() you have
                /* copy link target to inode private data */
                if (pvfs2_inode && symname) {
                        strncpy(pvfs2_inode->link_target,
                                symname,
                                PVFS_NAME_MAX);
                        gossip_debug(GOSSIP_UTILS_DEBUG,
                                     "Copied attr link target %s\n",
                                     pvfs2_inode->link_target);
                }
Again, what's to guarantee that ->link_target in pvfs2_inode will be
NUL-terminated?
	* handling of ->i_mode in the same function is completely broken -
you are building it in place *ON* *LIVE* *INODE*:
        inode->i_mode = perm_mode;
...
                inode->i_mode |= S_IFDIR;
(and similar for regulars and symlinks).  Again, that sucker is called from
your ->d_revalidate().  With no exclusion whatsoever, not that it would be
easy to provide one for all ->i_mode readers (it is possible, but you'd
have to replace a lot of generic helper functions with ones of your own;
not done and almost certainly not worth attempting).
	* pvfs2_symlink() can't get NULL symname; what it *can* get is
symname up to 4Kb long.  Doing
        strncpy(new_op->upcall.req.sym.target, symname, PVFS2_NAME_LEN);
will not only quietly truncate it, it might leave the copy without NUL...
	* what's up with compulsive casts? E.g.
                        attrs->mtime =
                            pvfs2_convert_time_field((void *)&iattr->ia_mtime);
with
__u64 pvfs2_convert_time_field(void *time_ptr)
{
        __u64 pvfs2_time;
        struct timespec *tspec = (struct timespec *)time_ptr;

        pvfs2_time = (__u64) ((time_t) tspec->tv_sec);
        return pvfs2_time;
}
Leaving aside the casts in (__u64) ((time_t) tspec->tv_sec), ->ia_mtime is
struct timespec; what's the point of taking its address, casting to void *,
passing to that sucker, only to cast to back to struct timespec * (and
only reading from it, so const struct timespec * would do just fine)?
	* wrappers must die.  It's not IOCCC, damn it.  While having
pvfs2_inode_lock and pvfs2_lock_inode (both bogus) in the same header has
a certain charm, please don't do it.
	* in dir.c:
        for (i = 0; i < rhandle.readdir_response.pvfs_dirent_outcount; i++) {
                len = rhandle.readdir_response.dirent_array[i].d_length;
                current_entry = rhandle.readdir_response.dirent_array[i].d_name;
                current_ino = pvfs2_khandle_to_ino(
                        &(rhandle.readdir_response.dirent_array[i].khandle));

                gossip_debug(GOSSIP_DIR_DEBUG,
                             "calling dir_emit for %s with len %d, pos %ld\n",
                             current_entry,
                             len,
                             (unsigned long)pos);
                ret =
                    dir_emit(ctx, current_entry, len, current_ino, DT_UNKNOWN);
                ctx->pos++;
                gossip_ldebug(GOSSIP_DIR_DEBUG,
                              "%s: ctx->pos:%lld\n",
                              __func__,
                              lld(ctx->pos));

                pos++;
        }

        /* this means that all of the dir_emit calls succeeded */
        if (i == rhandle.readdir_response.pvfs_dirent_outcount) {

No, it doesn't.  You ignore the return value of dir_emit()...

	* same file, decode_dirents():
        readdir->dirent_array = kmalloc(readdir->pvfs_dirent_outcount *
                                        sizeof(*readdir->dirent_array),
                                        GFP_KERNEL);
What's to prevent a wraparound in multiplication here?
	* same file, same function:
        for (i = 0; i < readdir->pvfs_dirent_outcount; i++) {
                dec_string(pptr, &readdir->dirent_array[i].d_name,
                           &readdir->dirent_array[i].d_length);
where
#define dec_string(pptr, pbuf, plen) do { \
        __u32 len = (*(__u32 *) *(pptr)); \
        *pbuf = *(pptr) + 4; \
        *(pptr) += roundup8(4 + len + 1); \
        if (plen) \
                *plen = len;\
} while (0)
	Just what will happen if this 32bit length will be well beyond
the size of response we are parsing here?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-07  6:37           ` Al Viro
@ 2015-09-07 21:10             ` Mike Marshall
  2015-09-07 23:22               ` Al Viro
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Marshall @ 2015-09-07 21:10 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, Linus Torvalds, linux-fsdevel, Andrew Morton,
	Stephen Rothwell, Boaz Harrosh, Greg Kroah-Hartman

Al, you've given me a lot more to look at than I can comment on intelligently
in short order. Some of the things you've said are clear to me, and some
I'll have to work to figure out...

So, a couple of things:

I have failed so far today to figure out what you mean by GrepTFS or GTFS...
I know what you mean by STFU, so I hope its not something like that <g>...

I have looked at all the usages of GFP_KERNEL, and locks, focusing on
request_mutex... along with reading stuff like https://lwn.net/Articles/274695/
and https://lwn.net/lwn/kernel/LDD2/ch07.html... I don't yet see where we're
doing GFP_KERNEL allocations within held locks... could you describe in more
detail one of the ones you saw?

-Mike

On Mon, Sep 7, 2015 at 2:37 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>         * symlink handling - why the hell do we set non-NULL cookie for
> symlinks that do not have any ->put_link()?  What's worse, it looks like
> a bogus server could trick us into overwriting ->link_target of a live
> inode (e.g. via ->d_revalidate() -> pvfs2_inode_getattr() ->
> copy_attributes_to_inode()).  Am I misreading it?
>         * in copy_attributes_to_inode() we have
>         case PVFS_TYPE_SYMLINK:
>                 if (symname != NULL) {
>                         inode->i_size = (loff_t) strlen(symname);
>                         break;
>                 }
>                 /*FALLTHRU*/
> and the only caller passes is
>         if (copy_attributes_to_inode(inode,
>                         &new_op->downcall.resp.getattr.attributes,
>                         new_op->downcall.resp.getattr.link_target)) {
> How can symname possibly be NULL?  .resp.getattr.link_target is an array,
> for crying out loud!
>         * who sets (or uses) struct PVFS_sys_attr_s .target_link?  Or
> .dist_name, or .dist_params, while we are at it...
>         * can a symlink be longer than 256 bytes?  And why are
> fs/orangefs/downcall.h:40:      char link_target[PVFS2_NAME_LEN];
> and
> fs/orangefs/pvfs2-kernel.h:317: char link_target[PVFS_NAME_MAX];
> spelled out differently?  _Both_ constants are 256; what's the point of
> obfuscating here?
>         * what's to guarantee that this .resp.getattr.link_target is
> NUL-terminated?  Passing it to strlen() in copy_attributes_to_inode()
> is a bad idea unless we _have_ NUL in there...
>         * in the same copy_attributes_to_inode() you have
>                 /* copy link target to inode private data */
>                 if (pvfs2_inode && symname) {
>                         strncpy(pvfs2_inode->link_target,
>                                 symname,
>                                 PVFS_NAME_MAX);
>                         gossip_debug(GOSSIP_UTILS_DEBUG,
>                                      "Copied attr link target %s\n",
>                                      pvfs2_inode->link_target);
>                 }
> Again, what's to guarantee that ->link_target in pvfs2_inode will be
> NUL-terminated?
>         * handling of ->i_mode in the same function is completely broken -
> you are building it in place *ON* *LIVE* *INODE*:
>         inode->i_mode = perm_mode;
> ...
>                 inode->i_mode |= S_IFDIR;
> (and similar for regulars and symlinks).  Again, that sucker is called from
> your ->d_revalidate().  With no exclusion whatsoever, not that it would be
> easy to provide one for all ->i_mode readers (it is possible, but you'd
> have to replace a lot of generic helper functions with ones of your own;
> not done and almost certainly not worth attempting).
>         * pvfs2_symlink() can't get NULL symname; what it *can* get is
> symname up to 4Kb long.  Doing
>         strncpy(new_op->upcall.req.sym.target, symname, PVFS2_NAME_LEN);
> will not only quietly truncate it, it might leave the copy without NUL...
>         * what's up with compulsive casts? E.g.
>                         attrs->mtime =
>                             pvfs2_convert_time_field((void *)&iattr->ia_mtime);
> with
> __u64 pvfs2_convert_time_field(void *time_ptr)
> {
>         __u64 pvfs2_time;
>         struct timespec *tspec = (struct timespec *)time_ptr;
>
>         pvfs2_time = (__u64) ((time_t) tspec->tv_sec);
>         return pvfs2_time;
> }
> Leaving aside the casts in (__u64) ((time_t) tspec->tv_sec), ->ia_mtime is
> struct timespec; what's the point of taking its address, casting to void *,
> passing to that sucker, only to cast to back to struct timespec * (and
> only reading from it, so const struct timespec * would do just fine)?
>         * wrappers must die.  It's not IOCCC, damn it.  While having
> pvfs2_inode_lock and pvfs2_lock_inode (both bogus) in the same header has
> a certain charm, please don't do it.
>         * in dir.c:
>         for (i = 0; i < rhandle.readdir_response.pvfs_dirent_outcount; i++) {
>                 len = rhandle.readdir_response.dirent_array[i].d_length;
>                 current_entry = rhandle.readdir_response.dirent_array[i].d_name;
>                 current_ino = pvfs2_khandle_to_ino(
>                         &(rhandle.readdir_response.dirent_array[i].khandle));
>
>                 gossip_debug(GOSSIP_DIR_DEBUG,
>                              "calling dir_emit for %s with len %d, pos %ld\n",
>                              current_entry,
>                              len,
>                              (unsigned long)pos);
>                 ret =
>                     dir_emit(ctx, current_entry, len, current_ino, DT_UNKNOWN);
>                 ctx->pos++;
>                 gossip_ldebug(GOSSIP_DIR_DEBUG,
>                               "%s: ctx->pos:%lld\n",
>                               __func__,
>                               lld(ctx->pos));
>
>                 pos++;
>         }
>
>         /* this means that all of the dir_emit calls succeeded */
>         if (i == rhandle.readdir_response.pvfs_dirent_outcount) {
>
> No, it doesn't.  You ignore the return value of dir_emit()...
>
>         * same file, decode_dirents():
>         readdir->dirent_array = kmalloc(readdir->pvfs_dirent_outcount *
>                                         sizeof(*readdir->dirent_array),
>                                         GFP_KERNEL);
> What's to prevent a wraparound in multiplication here?
>         * same file, same function:
>         for (i = 0; i < readdir->pvfs_dirent_outcount; i++) {
>                 dec_string(pptr, &readdir->dirent_array[i].d_name,
>                            &readdir->dirent_array[i].d_length);
> where
> #define dec_string(pptr, pbuf, plen) do { \
>         __u32 len = (*(__u32 *) *(pptr)); \
>         *pbuf = *(pptr) + 4; \
>         *(pptr) += roundup8(4 + len + 1); \
>         if (plen) \
>                 *plen = len;\
> } while (0)
>         Just what will happen if this 32bit length will be well beyond
> the size of response we are parsing here?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-07 21:10             ` Mike Marshall
@ 2015-09-07 23:22               ` Al Viro
  2015-09-08  0:47                 ` Theodore Ts'o
  2015-09-08 14:48                 ` Christoph Hellwig
  0 siblings, 2 replies; 26+ messages in thread
From: Al Viro @ 2015-09-07 23:22 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Christoph Hellwig, Linus Torvalds, linux-fsdevel, Andrew Morton,
	Stephen Rothwell, Boaz Harrosh, Greg Kroah-Hartman

On Mon, Sep 07, 2015 at 05:10:11PM -0400, Mike Marshall wrote:

> I have failed so far today to figure out what you mean by GrepTFS or GTFS...

grep the fucking source, of course...

> and https://lwn.net/lwn/kernel/LDD2/ch07.html... I don't yet see where we're
> doing GFP_KERNEL allocations within held locks... could you describe in more
> detail one of the ones you saw?

You don't want e.g. to have allocation request triggering an attempt to write
a dirty page on a shared mapping of a file from your fs while you are holding
a mutex that would block that attempt *and* waiting for a allocation to
succeed.

Same story as when a block driver needs to allocate something to process a
write request - we do _not_ want that to trigger writes.

That's what GFP_NOFS and GFP_NOIO are about; LDD2 is old and the analogue
back then used to be called GFP_BUFFER.

Which (if any) is needed depends on the locks you are holding; something like
->i_mutex on a directory is not a problem, but things like your request mutex
almost certainly *are*.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-07 23:22               ` Al Viro
@ 2015-09-08  0:47                 ` Theodore Ts'o
  2015-09-08  2:49                   ` Dave Chinner
  2015-09-08 14:48                 ` Christoph Hellwig
  1 sibling, 1 reply; 26+ messages in thread
From: Theodore Ts'o @ 2015-09-08  0:47 UTC (permalink / raw)
  To: Al Viro
  Cc: Mike Marshall, Christoph Hellwig, Linus Torvalds, linux-fsdevel,
	Andrew Morton, Stephen Rothwell, Boaz Harrosh,
	Greg Kroah-Hartman

On Tue, Sep 08, 2015 at 12:22:06AM +0100, Al Viro wrote:
> You don't want e.g. to have allocation request triggering an attempt to write
> a dirty page on a shared mapping of a file from your fs while you are holding
> a mutex that would block that attempt *and* waiting for a allocation to
> succeed.

Mike,

To be more explicit --- any code in your writepage/writepages
function, or code called from your writepage/writepages function has
to do all of its allocations using GFP_NOFS.  Otherwise, you can end
up in a recursion where an attempt to writeback a page can trigger the
VM system to try to writeback either the same page or another page,
and then Hilarity Ensues, with either the code self-deadlocking, or in
the best case, the kernel stack getting overrun.

Note that in some cases, you could be calling kernel code that has no
idea that it is being called a context which requires GFP_NOFS.  In
the past, I've had to change code in fs/buffer.c so that it would take
a gfp_t argument, so that when it is called from a GFP_NOFS context,
we can pass in a GFP so memory allocations won't result in a recursion
back into the fs code.

Similarly, if you have code which is not in the writepage/writepages
code path, but which holds a lock which would block
writepage()/writepages() from making forward progress, then you could
be holding the lock, have the memory allocation force the page cleaner
into action, which then tries calling your writepage()/writepages()
function, which then runs into the lock that was being held at the
time of the memory allocation, and once again, Hilarity Ensues[1].

Hope this helps,

						- Ted

[1] http://tvtropes.org/pmwiki/pmwiki.php/Main/HilarityEnsues
(Well, it may not be that hilarious if you're the poor sucker trying
to debug the deadlock, but....)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-08  0:47                 ` Theodore Ts'o
@ 2015-09-08  2:49                   ` Dave Chinner
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2015-09-08  2:49 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Al Viro, Mike Marshall, Christoph Hellwig, Linus Torvalds,
	linux-fsdevel, Andrew Morton, Stephen Rothwell, Boaz Harrosh,
	Greg Kroah-Hartman

On Mon, Sep 07, 2015 at 08:47:54PM -0400, Theodore Ts'o wrote:
> On Tue, Sep 08, 2015 at 12:22:06AM +0100, Al Viro wrote:
> > You don't want e.g. to have allocation request triggering an attempt to write
> > a dirty page on a shared mapping of a file from your fs while you are holding
> > a mutex that would block that attempt *and* waiting for a allocation to
> > succeed.
> 
> Mike,
> 
> To be more explicit --- any code in your writepage/writepages
> function, or code called from your writepage/writepages function has
> to do all of its allocations using GFP_NOFS.  Otherwise, you can end
> up in a recursion where an attempt to writeback a page can trigger the
> VM system to try to writeback either the same page or another page,
> and then Hilarity Ensues, with either the code self-deadlocking, or in
> the best case, the kernel stack getting overrun.

FWIW, lockdep will issue warnings on these potentially recursive
allocations, which can help you find them without having to first
deadlock the machine.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-07 23:22               ` Al Viro
  2015-09-08  0:47                 ` Theodore Ts'o
@ 2015-09-08 14:48                 ` Christoph Hellwig
  2015-09-08 18:21                   ` Mike Marshall
  1 sibling, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2015-09-08 14:48 UTC (permalink / raw)
  To: Al Viro
  Cc: Mike Marshall, Christoph Hellwig, Linus Torvalds, linux-fsdevel,
	Andrew Morton, Stephen Rothwell, Boaz Harrosh,
	Greg Kroah-Hartman

On Tue, Sep 08, 2015 at 12:22:06AM +0100, Al Viro wrote:
> You don't want e.g. to have allocation request triggering an attempt to write
> a dirty page on a shared mapping of a file from your fs while you are holding
> a mutex that would block that attempt *and* waiting for a allocation to
> succeed.

Unless something changed very recently there is no writeback in orangefs.
It's all non-cached I/O straight on to the server, with readpage only
there for read-only mmaps (for executable I suspect).

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-08 14:48                 ` Christoph Hellwig
@ 2015-09-08 18:21                   ` Mike Marshall
  2015-09-09 15:05                     ` Mike Marshall
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Marshall @ 2015-09-08 18:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Al Viro, Linus Torvalds, linux-fsdevel, Andrew Morton,
	Stephen Rothwell, Boaz Harrosh, Greg Kroah-Hartman

Thanks for all the lockdep feedback everyone...

I used to have some cool things turned on in my .config when
Christoph was working with us...

CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y

... and there used to be at thing like this:
CONFIG_LOCKDEP

... I guess it is this now?
CONFIG_LOCKDEP_SUPPORT

Anyhow... I (blush) didn't turn these things on in my VMs when I
got a new desktop recently, I've got them turned back on now...

-Mike

On Tue, Sep 8, 2015 at 10:48 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Tue, Sep 08, 2015 at 12:22:06AM +0100, Al Viro wrote:
>> You don't want e.g. to have allocation request triggering an attempt to write
>> a dirty page on a shared mapping of a file from your fs while you are holding
>> a mutex that would block that attempt *and* waiting for a allocation to
>> succeed.
>
> Unless something changed very recently there is no writeback in orangefs.
> It's all non-cached I/O straight on to the server, with readpage only
> there for read-only mmaps (for executable I suspect).

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-08 18:21                   ` Mike Marshall
@ 2015-09-09 15:05                     ` Mike Marshall
  2015-09-11 21:12                       ` Mike Marshall
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Marshall @ 2015-09-09 15:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Al Viro, Linus Torvalds, linux-fsdevel, Andrew Morton,
	Stephen Rothwell, Boaz Harrosh, Greg Kroah-Hartman

Christoph's right about the writeback.

I'm pretty sure I added the lockdep stuff back to my .config
correctly:

[    0.000000] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo
 Molnar
[    0.000000] ... MAX_LOCKDEP_SUBCLASSES:  8
[    0.000000] ... MAX_LOCK_DEPTH:          48
[    0.000000] ... MAX_LOCKDEP_KEYS:        8191
[    0.000000] ... CLASSHASH_SIZE:          4096
[    0.000000] ... MAX_LOCKDEP_ENTRIES:     32768
[    0.000000] ... MAX_LOCKDEP_CHAINS:      65536
[    0.000000] ... CHAINHASH_SIZE:          32768
[    0.000000]  memory used by lock dependency info: 8159 kB
[    0.000000]  per task-struct memory footprint: 1920 bytes


After that I ran the dbench setup that our tester guy uses, and then
before I left for home I fired off xfstests (Christoph made a xfstest patch
for us that at least allows us to run some of it in a meaningful way)...

No lockdep backtraces in my syslog this morning...

I'll keep looking, and also return to fixing the iov_iter code
that Linus pointed out... I hope he looks back in here in the
next few day, I guess the merge window ends at the end of
this week? Whatever else happens, Al Viro gave me plenty
to do <g>...

-Mike

On Tue, Sep 8, 2015 at 2:21 PM, Mike Marshall <hubcap@omnibond.com> wrote:
> Thanks for all the lockdep feedback everyone...
>
> I used to have some cool things turned on in my .config when
> Christoph was working with us...
>
> CONFIG_DEBUG_SPINLOCK=y
> CONFIG_DEBUG_MUTEXES=y
> CONFIG_DEBUG_LOCK_ALLOC=y
> CONFIG_PROVE_LOCKING=y
>
> ... and there used to be at thing like this:
> CONFIG_LOCKDEP
>
> ... I guess it is this now?
> CONFIG_LOCKDEP_SUPPORT
>
> Anyhow... I (blush) didn't turn these things on in my VMs when I
> got a new desktop recently, I've got them turned back on now...
>
> -Mike
>
> On Tue, Sep 8, 2015 at 10:48 AM, Christoph Hellwig <hch@lst.de> wrote:
>> On Tue, Sep 08, 2015 at 12:22:06AM +0100, Al Viro wrote:
>>> You don't want e.g. to have allocation request triggering an attempt to write
>>> a dirty page on a shared mapping of a file from your fs while you are holding
>>> a mutex that would block that attempt *and* waiting for a allocation to
>>> succeed.
>>
>> Unless something changed very recently there is no writeback in orangefs.
>> It's all non-cached I/O straight on to the server, with readpage only
>> there for read-only mmaps (for executable I suspect).

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-09 15:05                     ` Mike Marshall
@ 2015-09-11 21:12                       ` Mike Marshall
  2015-09-11 22:24                         ` Al Viro
  2015-09-11 23:20                         ` Linus Torvalds
  0 siblings, 2 replies; 26+ messages in thread
From: Mike Marshall @ 2015-09-11 21:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Al Viro, Linus Torvalds, linux-fsdevel, Andrew Morton,
	Stephen Rothwell, Boaz Harrosh, Greg Kroah-Hartman

I'm about to leave for the day...

I haven't found any problems with the GFP_KERNEL allocations that
Al warned me about... that doesn't mean there aren't any...

I'm using copy_page_to_iter in my new branch as Al suggested. I've
changed the code quite a bit from any samples I've posted, there were
regressions with it. I finally stole some code from
cifs/file.c/cifs_readdata_to_iov
and everything works... but I'm not happy with it yet... Linus once posted
a message to the effect that "you don't fix bugs by thrashing around until
stuff seems to work, you fix them by doing the right thing on purpose..."
and I'm working towards that end...

Anyhow, what's in Linux next continues to pass all our tests, though the
code needs improved in the ways mentioned by Al and Linus... I guess
the merge window ends today?

-Mike

On Wed, Sep 9, 2015 at 11:05 AM, Mike Marshall <hubcap@omnibond.com> wrote:
> Christoph's right about the writeback.
>
> I'm pretty sure I added the lockdep stuff back to my .config
> correctly:
>
> [    0.000000] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo
>  Molnar
> [    0.000000] ... MAX_LOCKDEP_SUBCLASSES:  8
> [    0.000000] ... MAX_LOCK_DEPTH:          48
> [    0.000000] ... MAX_LOCKDEP_KEYS:        8191
> [    0.000000] ... CLASSHASH_SIZE:          4096
> [    0.000000] ... MAX_LOCKDEP_ENTRIES:     32768
> [    0.000000] ... MAX_LOCKDEP_CHAINS:      65536
> [    0.000000] ... CHAINHASH_SIZE:          32768
> [    0.000000]  memory used by lock dependency info: 8159 kB
> [    0.000000]  per task-struct memory footprint: 1920 bytes
>
>
> After that I ran the dbench setup that our tester guy uses, and then
> before I left for home I fired off xfstests (Christoph made a xfstest patch
> for us that at least allows us to run some of it in a meaningful way)...
>
> No lockdep backtraces in my syslog this morning...
>
> I'll keep looking, and also return to fixing the iov_iter code
> that Linus pointed out... I hope he looks back in here in the
> next few day, I guess the merge window ends at the end of
> this week? Whatever else happens, Al Viro gave me plenty
> to do <g>...
>
> -Mike
>
> On Tue, Sep 8, 2015 at 2:21 PM, Mike Marshall <hubcap@omnibond.com> wrote:
>> Thanks for all the lockdep feedback everyone...
>>
>> I used to have some cool things turned on in my .config when
>> Christoph was working with us...
>>
>> CONFIG_DEBUG_SPINLOCK=y
>> CONFIG_DEBUG_MUTEXES=y
>> CONFIG_DEBUG_LOCK_ALLOC=y
>> CONFIG_PROVE_LOCKING=y
>>
>> ... and there used to be at thing like this:
>> CONFIG_LOCKDEP
>>
>> ... I guess it is this now?
>> CONFIG_LOCKDEP_SUPPORT
>>
>> Anyhow... I (blush) didn't turn these things on in my VMs when I
>> got a new desktop recently, I've got them turned back on now...
>>
>> -Mike
>>
>> On Tue, Sep 8, 2015 at 10:48 AM, Christoph Hellwig <hch@lst.de> wrote:
>>> On Tue, Sep 08, 2015 at 12:22:06AM +0100, Al Viro wrote:
>>>> You don't want e.g. to have allocation request triggering an attempt to write
>>>> a dirty page on a shared mapping of a file from your fs while you are holding
>>>> a mutex that would block that attempt *and* waiting for a allocation to
>>>> succeed.
>>>
>>> Unless something changed very recently there is no writeback in orangefs.
>>> It's all non-cached I/O straight on to the server, with readpage only
>>> there for read-only mmaps (for executable I suspect).

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-11 21:12                       ` Mike Marshall
@ 2015-09-11 22:24                         ` Al Viro
  2015-09-13 11:56                           ` Mike Marshall
  2015-09-11 23:20                         ` Linus Torvalds
  1 sibling, 1 reply; 26+ messages in thread
From: Al Viro @ 2015-09-11 22:24 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Christoph Hellwig, Linus Torvalds, linux-fsdevel, Andrew Morton,
	Stephen Rothwell, Boaz Harrosh, Greg Kroah-Hartman

On Fri, Sep 11, 2015 at 05:12:08PM -0400, Mike Marshall wrote:
> I'm about to leave for the day...
> 
> I haven't found any problems with the GFP_KERNEL allocations that
> Al warned me about... that doesn't mean there aren't any...

*IF* you have nothing that would require locks in any of the pathways
related to memory pressure (and can guarantee that no such thing will
appear), GFP_KERNEL should be OK.  Still, doing that under the system-wide
mutex taken whenever you need to send a request looks like a Bad Idea(tm) -
too easy to introduce such deadlocks on subsequent changes.

> I'm using copy_page_to_iter in my new branch as Al suggested. I've
> changed the code quite a bit from any samples I've posted, there were
> regressions with it. I finally stole some code from
> cifs/file.c/cifs_readdata_to_iov
> and everything works... but I'm not happy with it yet... Linus once posted
> a message to the effect that "you don't fix bugs by thrashing around until
> stuff seems to work, you fix them by doing the right thing on purpose..."
> and I'm working towards that end...

Could you tell where does the current code live?  What's in -next appears
to be unchanged...

BTW, as for passing all your tests...  Do those include fuzzing it by
misbehaving server?  And getdents() from a directory that has a bunch of
long names *and* a short one in the very end looks like it would misbehave
even on correctly working server...

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-11 21:12                       ` Mike Marshall
  2015-09-11 22:24                         ` Al Viro
@ 2015-09-11 23:20                         ` Linus Torvalds
  2015-09-13 11:59                           ` Mike Marshall
  1 sibling, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2015-09-11 23:20 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Christoph Hellwig, Al Viro, linux-fsdevel, Andrew Morton,
	Stephen Rothwell, Boaz Harrosh, Greg Kroah-Hartman

On Fri, Sep 11, 2015 at 2:12 PM, Mike Marshall <hubcap@omnibond.com> wrote:
>
> Anyhow, what's in Linux next continues to pass all our tests, though the
> code needs improved in the ways mentioned by Al and Linus... I guess
> the merge window ends today?

Well, technically it ends Sunday, since I don't tend to distinguish
weekdays from weekends much, but I tend to frown on last-minute
submissions and I may just decide that I'm done on Saturday.

Yours isn't a last-minute submission, but I suspect that by now the
sane thing to do is to just say "4.4 material". It sounds like you
clearly aren't twiddling your thumbs, and can usefully continue
development for at least next week with just on the iovec stuff, and I
don't think this is a "must hit 4.3" thing, so there's no huge reason
to not let it slide until the next merge window.

And if you have a grudging ack by then from Al, I'll have no issues
merging it early next merge window ;)

                     Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-11 22:24                         ` Al Viro
@ 2015-09-13 11:56                           ` Mike Marshall
  0 siblings, 0 replies; 26+ messages in thread
From: Mike Marshall @ 2015-09-13 11:56 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, Linus Torvalds, linux-fsdevel, Andrew Morton,
	Stephen Rothwell, Boaz Harrosh, Greg Kroah-Hartman

I don't think any of us can *guarantee* that our code is bug free.
And, for me, coherency locking in giant programs like the Kernel
is hard... I think some of you guys can just shut your eyes and
see in your head how every lock in the kernel interacts. Now
that I've been refreshed on lockdep, I remember finding problems
(not with memory allocations) when Christoph was working with us,
and it was a good thing we had his help to fix them. Lockdep doesn't
point out any problems to me now, and I don't see problems with our
GFP_KERNEL allocations when I study the code surrounding
them (now that you have pointed out that I should). That you
think there might be problems there make me think there
might be too, so I at least won't be blithe about it...

Here is the iov_iter code that causes no regressions... it is
not checked into even my github repository yet.

file.c:

if (to_user) {
    iov_iter_init(&iter, READ, vec, nr_segs, total_size);
    ret = pvfs_bufmap_copy_to_user_iovec2(bufmap,
    &iter,
    total_size,
    buffer_index,
    nr_segs);
    ret = 0;
}

pvfs2-bufmap.c:

int pvfs_bufmap_copy_to_user_iovec2(struct pvfs2_bufmap *bufmap,
                    struct iov_iter *iter,
                    size_t total_size,
                    int buffer_index,
                    int nr_segs)
{
    struct pvfs_bufmap_desc *from;
    struct page *page;
    size_t remaining = total_size;
    size_t written;
    size_t copy;
    int i;

    from = &bufmap->desc_array[buffer_index];

    for (i = 0; i < bufmap->page_count; i++) {
        page = from->page_array[i];
        copy = min_t(size_t, remaining, PAGE_SIZE);
        written = copy_page_to_iter(page, 0, copy, iter);
        remaining -= written;
        if (written < copy && iov_iter_count(iter) > 0)
            break;
    }
    return remaining ? -EFAULT : 0;
}

The loop is ridiculous, page_count is always 10240, and that
loop churns 10240 times on a file containing "hello"...

I studied on the code all day, and changed/simplified it in ways
that I thought were sane, but my changes always caused
regressions...

cat hello.file        always worked

cp client.txt (a giant dbench input file) from orangefs to /tmp always worked

diff /pvfsmnt/client.txt /tmp/client.txt       broken, even though
both files md5summed the same...

the dbench test I run broke.

It all works with the above code (and with our open-coded non-iov-iter code),
but the above code is "not right"... I expect to make it right.

I learned about fuzzing at Vault. I will use this to help me
start trying to add meaningful fuzzing to my tests:

https://lwn.net/Articles/637151/

I am open to fuzzing suggestions...

-Mike

On Fri, Sep 11, 2015 at 6:24 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Sep 11, 2015 at 05:12:08PM -0400, Mike Marshall wrote:
>> I'm about to leave for the day...
>>
>> I haven't found any problems with the GFP_KERNEL allocations that
>> Al warned me about... that doesn't mean there aren't any...
>
> *IF* you have nothing that would require locks in any of the pathways
> related to memory pressure (and can guarantee that no such thing will
> appear), GFP_KERNEL should be OK.  Still, doing that under the system-wide
> mutex taken whenever you need to send a request looks like a Bad Idea(tm) -
> too easy to introduce such deadlocks on subsequent changes.
>
>> I'm using copy_page_to_iter in my new branch as Al suggested. I've
>> changed the code quite a bit from any samples I've posted, there were
>> regressions with it. I finally stole some code from
>> cifs/file.c/cifs_readdata_to_iov
>> and everything works... but I'm not happy with it yet... Linus once posted
>> a message to the effect that "you don't fix bugs by thrashing around until
>> stuff seems to work, you fix them by doing the right thing on purpose..."
>> and I'm working towards that end...
>
> Could you tell where does the current code live?  What's in -next appears
> to be unchanged...
>
> BTW, as for passing all your tests...  Do those include fuzzing it by
> misbehaving server?  And getdents() from a directory that has a bunch of
> long names *and* a short one in the very end looks like it would misbehave
> even on correctly working server...

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [GIT PULL] Orangefs (text only resend)
  2015-09-11 23:20                         ` Linus Torvalds
@ 2015-09-13 11:59                           ` Mike Marshall
  0 siblings, 0 replies; 26+ messages in thread
From: Mike Marshall @ 2015-09-13 11:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Hellwig, Al Viro, linux-fsdevel, Andrew Morton,
	Stephen Rothwell, Boaz Harrosh, Greg Kroah-Hartman

> if you have a grudging ack by then from Al, I'll have no issues
> merging it early next merge window ;)

That's the kind of ack I'll be shooting for <g> ...

-Mike

On Fri, Sep 11, 2015 at 7:20 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Sep 11, 2015 at 2:12 PM, Mike Marshall <hubcap@omnibond.com> wrote:
>>
>> Anyhow, what's in Linux next continues to pass all our tests, though the
>> code needs improved in the ways mentioned by Al and Linus... I guess
>> the merge window ends today?
>
> Well, technically it ends Sunday, since I don't tend to distinguish
> weekdays from weekends much, but I tend to frown on last-minute
> submissions and I may just decide that I'm done on Saturday.
>
> Yours isn't a last-minute submission, but I suspect that by now the
> sane thing to do is to just say "4.4 material". It sounds like you
> clearly aren't twiddling your thumbs, and can usefully continue
> development for at least next week with just on the iovec stuff, and I
> don't think this is a "must hit 4.3" thing, so there's no huge reason
> to not let it slide until the next merge window.
>
> And if you have a grudging ack by then from Al, I'll have no issues
> merging it early next merge window ;)
>
>                      Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2015-09-13 11:59 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-01 15:42 [GIT PULL] Orangefs (text only resend) Mike Marshall
2015-09-02 23:34 ` Linus Torvalds
2015-09-03  1:13   ` Mike Marshall
2015-09-03  1:28     ` Linus Torvalds
2015-09-03 20:22       ` Linus Torvalds
2015-09-03 22:18         ` Mike Marshall
2015-09-03 22:44         ` Greg Kroah-Hartman
2015-09-06  6:35   ` Christoph Hellwig
2015-09-06  9:08     ` Al Viro
2015-09-06 14:52       ` Mike Marshall
2015-09-06 15:00         ` Mike Marshall
2015-09-06 20:20         ` Al Viro
2015-09-07  6:37           ` Al Viro
2015-09-07 21:10             ` Mike Marshall
2015-09-07 23:22               ` Al Viro
2015-09-08  0:47                 ` Theodore Ts'o
2015-09-08  2:49                   ` Dave Chinner
2015-09-08 14:48                 ` Christoph Hellwig
2015-09-08 18:21                   ` Mike Marshall
2015-09-09 15:05                     ` Mike Marshall
2015-09-11 21:12                       ` Mike Marshall
2015-09-11 22:24                         ` Al Viro
2015-09-13 11:56                           ` Mike Marshall
2015-09-11 23:20                         ` Linus Torvalds
2015-09-13 11:59                           ` Mike Marshall
2015-09-06 14:35     ` Mike Marshall

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.