* [LSF/MM TOPIC (expanded) 0/8] New API's for better exporting of VFS from user-mode daemons
@ 2013-04-08 10:19 Boaz Harrosh
  2013-04-08 10:22 ` [1/8] readdir-plus system call Boaz Harrosh
                   ` (9 more replies)
  0 siblings, 10 replies; 43+ messages in thread
From: Boaz Harrosh @ 2013-04-08 10:19 UTC (permalink / raw)
  To: Steven Whitehouse, Steve Dickson, Jeff Layton, lsf-pc,
	linux-fsdevel, Ganesha NFS List, Frank S Filz, J. Bruce Fields,
	Lieb, Jim, Venkateswararao Jujjuri, DENIEL Philippe

Hi

Steven suggested discussing a readdir-plus system call, for a better ls
but certainly for better FS daemons. I have hijacked his topic and expanded it
into a slew of requests that we, the FS daemon people, would like to see
addressed to make our lives better (and faster). It is up to Steven and the
committee whether these are grouped together or split into different talks.

I would like to send a set of emails, each topic below in its own mail.
Any interested party, please post to the topic of your heart so we can collect
all the information, header files, and status of each topic. The LSF talk should
not be for introducing the topics but for actual decision making.

Some of the topics already carry implementations, experimental or more,
and suggested APIs. Some of these topics are just a cry for help: we know what
we don't want, and maybe how we would like them to look, and what we need from
the community is a green light as to whether it is wanted at all.

This is certainly an exhaustive list; we might want to drop some topics if
they don't generate much interest. Also, if I forgot something, please post (JV)

Here is the list of topics in this set:
- [1/8] readdir-plus system call (By Steven Whitehouse <swhiteho@redhat.com>)
- [2/8] Sane locks (UNPOSIX locks) (frank)
- [3/8] File delegations, Usermode API of Bruce's pending patches.
- [4/8] PNFS ioctls/syscall 
- [5/8] syscall_cred() a system call that receives alternate creds (fsuid fsgid thread-groups)
- [6/8] Rich ACLs (continued, drive through this time)
- [7/8] Single call interface to getattr/setattr (Frank S Filz <ffilz@us.ibm.com>)
- [8/8] Fix fsnotify short comings (single fd with recursive notifications).

Thanks
Boaz

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [1/8] readdir-plus system call
  2013-04-08 10:19 [LSF/MM TOPIC (expanded) 0/8] New API's for better exporting of VFS from user-mode daemons Boaz Harrosh
@ 2013-04-08 10:22 ` Boaz Harrosh
  2013-04-08 10:26   ` Steven Whitehouse
                     ` (2 more replies)
  2013-04-08 10:25 ` [LSF/MM TOPIC (expanded) 0/8] New API's for better exporting of VFS from user-mode daemons Steven Whitehouse
                   ` (8 subsequent siblings)
  9 siblings, 3 replies; 43+ messages in thread
From: Boaz Harrosh @ 2013-04-08 10:22 UTC (permalink / raw)
  To: Steven Whitehouse, Steve Dickson, Jeff Layton, lsf-pc,
	linux-fsdevel, Ganesha NFS List, Frank S Filz, J. Bruce Fields,
	Lieb, Jim, Venkateswararao Jujjuri, DENIEL Philippe

By: Steven Whitehouse <swhiteho@redhat.com>

I repeat Steve's original mail below. Steve, you said you have
some experimental code; could you post a header and a git URL
so we can have a look?

Out of the corner of my eye I have seen a readdir-plus syscall in FreeBSD.
I will try to find it and post the interface header as a reply here.
We might as well make sure the two match somewhat.

Steve wrote ...
> As part of the work we've been doing in relation to integration between
> NFS/Samba and GFS2, Abhi has been working on a readdirplus system call
> in order to investigate the issues involved with creating such a call.
> It is still early days yet, but by April there should be some
> interesting results to present.
> 
> Please add Abhi to the attendee list as well as myself. 
> 
> Also, it has been some time since we had a NFS/Samba meeting to discuss
> the other issues which are pending, such as locking, ACLs, etc. So we
> could perhaps also allow some time to do that face to face rather than
> over the phone as we've been doing up til now,
> 
> Steve.

Thanks
Boaz


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [LSF/MM TOPIC (expanded) 0/8] New API's for better exporting of VFS from user-mode daemons
  2013-04-08 10:19 [LSF/MM TOPIC (expanded) 0/8] New API's for better exporting of VFS from user-mode daemons Boaz Harrosh
  2013-04-08 10:22 ` [1/8] readdir-plus system call Boaz Harrosh
@ 2013-04-08 10:25 ` Steven Whitehouse
  2013-04-08 10:25 ` [2/8] Sane locks (UNPOSIX locks) Boaz Harrosh
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 43+ messages in thread
From: Steven Whitehouse @ 2013-04-08 10:25 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Steve Dickson, Jeff Layton, lsf-pc, linux-fsdevel,
	Ganesha NFS List, Frank S Filz, J. Bruce Fields, Lieb, Jim,
	Venkateswararao Jujjuri, DENIEL Philippe, Abhi Das

Hi,

On Mon, 2013-04-08 at 13:19 +0300, Boaz Harrosh wrote:
> Hi
> 
> Steven suggested discussing a readdir-plus system call, for a better ls
> but certainly for better FS daemons. I have hijacked his topic and expanded it
> into a slew of requests that we, the FS daemon people, would like to see
> addressed to make our lives better (and faster). It is up to Steven and the
> committee whether these are grouped together or split into different talks.
> 
> I would like to send a set of emails, each topic below in its own mail.
> Any interested party, please post to the topic of your heart so we can collect
> all the information, header files, and status of each topic. The LSF talk should
> not be for introducing the topics but for actual decision making.
> 
> Some of the topics already carry implementations, experimental or more,
> and suggested APIs. Some of these topics are just a cry for help: we know what
> we don't want, and maybe how we would like them to look, and what we need from
> the community is a green light as to whether it is wanted at all.
> 
> This is certainly an exhaustive list; we might want to drop some topics if
> they don't generate much interest. Also, if I forgot something, please post (JV)
> 
> Here is the list of topics in this set:
> - [1/8] readdir-plus system call (By Steven Whitehouse <swhiteho@redhat.com>)

This is really Abhi's topic. I've cc'd him, but it is certainly something I'm
interested in seeing a solution to as well.

> - [2/8] Sane locks (UNPOSIX locks) (frank)
> - [3/8] File delegations, Usermode API of Bruce's pending patches.
> - [4/8] PNFS ioctls/syscall 
> - [5/8] syscall_cred() a system call that receives alternate creds (fsuid fsgid thread-groups)
> - [6/8] Rich ACLs (continued, drive through this time)
> - [7/8] Single call interface to getattr/setattr (Frank S Filz <ffilz@us.ibm.com>)
> - [8/8] Fix fsnotify short comings (single fd with recursive notifications).
> 
> Thanks
> Boaz

Looks like a useful set of discussion topics,

Steve.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* [2/8] Sane locks (UNPOSIX locks)
  2013-04-08 10:19 [LSF/MM TOPIC (expanded) 0/8] New API's for better exporting of VFS from user-mode daemons Boaz Harrosh
  2013-04-08 10:22 ` [1/8] readdir-plus system call Boaz Harrosh
  2013-04-08 10:25 ` [LSF/MM TOPIC (expanded) 0/8] New API's for better exporting of VFS from user-mode daemons Steven Whitehouse
@ 2013-04-08 10:25 ` Boaz Harrosh
  2013-04-08 12:02   ` [Lsf-pc] " Jeff Layton
  2013-04-08 10:28 ` [3/8] File delegations, Usermode API of Bruce's pending patches Boaz Harrosh
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 43+ messages in thread
From: Boaz Harrosh @ 2013-04-08 10:25 UTC (permalink / raw)
  To: Steven Whitehouse, Steve Dickson, Jeff Layton, lsf-pc,
	linux-fsdevel, Ganesha NFS List, Frank S Filz, J. Bruce Fields,
	Lieb, Jim, Venkateswararao Jujjuri, DENIEL Philippe


In this topic we do not actually know what it will look like,
but we know what we do not like.

The most troublesome is the POSIX crap. POSIX says that if
any fd of a process on an inode is closed, all of that process's
locks on the inode are lost, even if we used another fd with other
modes to acquire those locks. (They had good stuff to smoke when
that stuff was defined.)

This is real crap, because it completely kills our ability
to acquire some other resources on the file, and/or keep
correct access modes. As soon as we need a lock we must open
the fd in read/write mode, because if in the future a client
needs write access we cannot re-open the file: closing the old
fd would lose the locks. And if we open in RW, we immediately
lose our delegations; also, in PNFS a write-open means a
different thing than a read-open.

So what we urgently need is a new locks API that is strictly
per fd. When we open an fd for read and then acquire a read
lock, we can continue to serve delegations. Only the close
of that specific fd will lose the locks. Any other parallel
activity in the background will not affect anything.

We can craft an API that is very similar to today's API, only
with the semantic changes. But we should also consider a
completely new API that can cover all kinds of locks, including
a notification API. Perhaps also unite that API with the delegations
API we want in the next topic.
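
To make the complaint concrete, here is a minimal sketch of the behaviour
described above, using today's fcntl() record locks. The function names
and the /tmp template are made up for this illustration; only the
fcntl()/fork() calls are real API. A child process is used as the observer
because a process never conflicts with its own locks.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Take a whole-file write lock through one fd, optionally open and close a
 * second fd on the same file, then ask a child process whether the lock is
 * still there. Returns 1 if still held, 0 if dropped, -1 on error.
 * The temp-file template is an arbitrary choice for this demo.
 */
int lock_still_held(int close_other_fd)
{
    char path[] = "/tmp/posix-lock-demo-XXXXXX";
    int result = -1;
    int status;
    pid_t pid;
    int lock_fd = mkstemp(path);            /* opened read/write */

    if (lock_fd < 0)
        return -1;

    struct flock fl = {
        .l_type = F_WRLCK, .l_whence = SEEK_SET,
        .l_start = 0, .l_len = 0,           /* l_len 0: up to end of file */
    };
    if (fcntl(lock_fd, F_SETLK, &fl) < 0)
        goto out;

    if (close_other_fd) {
        int other = open(path, O_RDONLY);
        if (other < 0)
            goto out;
        close(other);       /* POSIX: drops all our locks on this inode */
    }

    pid = fork();
    if (pid < 0)
        goto out;
    if (pid == 0) {
        /* Child: fcntl() locks are not inherited across fork(), so F_GETLK
         * reports the parent's lock if it still exists. */
        struct flock probe = {
            .l_type = F_WRLCK, .l_whence = SEEK_SET,
            .l_start = 0, .l_len = 0,
        };
        int fd = open(path, O_RDWR);
        if (fd < 0 || fcntl(fd, F_GETLK, &probe) < 0)
            _exit(2);
        _exit(probe.l_type == F_UNLCK ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status) &&
        WEXITSTATUS(status) != 2)
        result = WEXITSTATUS(status);
out:
    close(lock_fd);
    unlink(path);
    return result;
}

/* Expected behaviour with today's POSIX semantics. */
void demo_posix_lock_loss(void)
{
    assert(lock_still_held(0) == 1);    /* nothing else closed: lock survives */
    assert(lock_still_held(1) == 0);    /* unrelated close(): lock is gone */
}
```

With a strictly per-fd API, the second case would be expected to return 1
as well, since the fd that took the lock is never closed.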

Thanks
Boaz


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [1/8] readdir-plus system call
  2013-04-08 10:22 ` [1/8] readdir-plus system call Boaz Harrosh
@ 2013-04-08 10:26   ` Steven Whitehouse
  2013-04-08 15:18     ` [Nfs-ganesha-devel] " Matt W. Benjamin
  2013-04-08 13:51   ` DENIEL Philippe
  2013-04-08 19:02   ` Abhijith Das
  2 siblings, 1 reply; 43+ messages in thread
From: Steven Whitehouse @ 2013-04-08 10:26 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Steve Dickson, Jeff Layton, lsf-pc, linux-fsdevel,
	Ganesha NFS List, Frank S Filz, J. Bruce Fields, Lieb, Jim,
	Venkateswararao Jujjuri, DENIEL Philippe, Abhi Das

Again, copying in Abhi,

Steve.

On Mon, 2013-04-08 at 13:22 +0300, Boaz Harrosh wrote:
> By: Steven Whitehouse <swhiteho@redhat.com>
> 
> I repeat Steve's original mail below. Steve, you said you have
> some experimental code; could you post a header and a git URL
> so we can have a look?
> 
> Out of the corner of my eye I have seen a readdir-plus syscall in FreeBSD.
> I will try to find it and post the interface header as a reply here.
> We might as well make sure the two match somewhat.
> 
> Steve wrote ...
> > As part of the work we've been doing in relation to integration between
> > NFS/Samba and GFS2, Abhi has been working on a readdirplus system call
> > in order to investigate the issues involved with creating such a call.
> > It is still early days yet, but by April there should be some
> > interesting results to present.
> > 
> > Please add Abhi to the attendee list as well as myself. 
> > 
> > Also, it has been some time since we had a NFS/Samba meeting to discuss
> > the other issues which are pending, such as locking, ACLs, etc. So we
> > could perhaps also allow some time to do that face to face rather than
> > over the phone as we've been doing up til now,
> > 
> > Steve.
> 
> Thanks
> Boaz
> 



^ permalink raw reply	[flat|nested] 43+ messages in thread

* [3/8] File delegations, Usermode API of Bruce's pending patches.
  2013-04-08 10:19 [LSF/MM TOPIC (expanded) 0/8] New API's for better exporting of VFS from user-mode daemons Boaz Harrosh
                   ` (2 preceding siblings ...)
  2013-04-08 10:25 ` [2/8] Sane locks (UNPOSIX locks) Boaz Harrosh
@ 2013-04-08 10:28 ` Boaz Harrosh
  2013-04-08 10:32 ` [4/8] PNFS ioctls/syscall Boaz Harrosh
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 43+ messages in thread
From: Boaz Harrosh @ 2013-04-08 10:28 UTC (permalink / raw)
  To: Steven Whitehouse, Steve Dickson, Jeff Layton, lsf-pc,
	linux-fsdevel, Ganesha NFS List, Frank S Filz, J. Bruce Fields,
	Lieb, Jim, Venkateswararao Jujjuri, DENIEL Philippe

Here too we have not yet considered what this will look like.

Bruce has a pending patchset that will enable delegations
in the kernel NFSD server. He has introduced a new delegation
object at the lock manager, and so on.

We in user mode would like to acquire such delegations and
be notified of their revocation synchronously, so we can
recall them from clients before new access is granted.

Again, perhaps this API can be united with the new locks
API we want.

fcntl or syscall?

Thanks
Boaz


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [4/8] PNFS ioctls/syscall
  2013-04-08 10:19 [LSF/MM TOPIC (expanded) 0/8] New API's for better exporting of VFS from user-mode daemons Boaz Harrosh
                   ` (3 preceding siblings ...)
  2013-04-08 10:28 ` [3/8] File delegations, Usermode API of Bruce's pending patches Boaz Harrosh
@ 2013-04-08 10:32 ` Boaz Harrosh
  2013-04-08 10:36 ` [5/8] syscall_cred() a system call that receives alternate CREDs Boaz Harrosh
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 43+ messages in thread
From: Boaz Harrosh @ 2013-04-08 10:32 UTC (permalink / raw)
  To: Steven Whitehouse, Steve Dickson, Jeff Layton, lsf-pc,
	linux-fsdevel, Ganesha NFS List, Frank S Filz, J. Bruce Fields,
	Lieb, Jim, Venkateswararao Jujjuri, DENIEL Philippe

In this topic I would like to present the API I use for exporting
PNFS layouts, and recalling layouts, from a kernel filesystem.

I use this API from the Ganesha server to enable pnfs export
both from the in-kernel exofs FS and from the proprietary
PanFS filesystem driver of the Panasas cluster. The exact same API
works well for both.

The API is currently IOCTL based; I understand it might need to
be moved to a syscall, with common code.

The API is based on all the same types and structures as the in-kernel
(out-of-tree) PNFS project and its APIs. 80% of the vectors implemented
for the in-kernel server are identical to the in-kernel implementation.
The remaining 20% can have common core code at the FS and two wrappers,
one for each server. The big difference is mainly in how the RECALLs are
made.

The two filesystems I coded also use a common library I call libpnfs_logic,
which FSs may or may not use to implement their internal logic. This
is because all this logic currently lives inside the KPNFSD server, but it
cannot when the server is in user mode. The in-kernel server could also
be converted to use libpnfs_logic, but I have not yet attempted such a conversion
and the code is currently duplicated.

I will post later the header I use for the IOCTL and a git URL for the
hard cores.

But we need not talk details; we should only discuss in principle whether it is
wanted, and how it should fit with the in-kernel PNFS project.

Thanks
Boaz


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [5/8] syscall_cred() a system call that receives alternate CREDs
  2013-04-08 10:19 [LSF/MM TOPIC (expanded) 0/8] New API's for better exporting of VFS from user-mode daemons Boaz Harrosh
                   ` (4 preceding siblings ...)
  2013-04-08 10:32 ` [4/8] PNFS ioctls/syscall Boaz Harrosh
@ 2013-04-08 10:36 ` Boaz Harrosh
  2013-04-08 13:54   ` DENIEL Philippe
  2013-04-08 14:42   ` J. Bruce Fields
  2013-04-08 10:42 ` [6/8] Rich ACLs (continued, drive through this time) Boaz Harrosh
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 43+ messages in thread
From: Boaz Harrosh @ 2013-04-08 10:36 UTC (permalink / raw)
  To: Steven Whitehouse, Steve Dickson, Jeff Layton, lsf-pc,
	linux-fsdevel, Ganesha NFS List, Frank S Filz, J. Bruce Fields,
	Lieb, Jim, Venkateswararao Jujjuri, DENIEL Philippe

From: Jim Lieb <jlieb@panasas.com>

In the current NFS server (Ganesha) many operations become 6 syscalls
(or is it 7?):

- setfsuid(), setfsgid(), thread_setgroups()
- The OP
- Revert setfsuid(), setfsgid() to root

This is because if we do all these file operations as root, the
FS will not charge the user's quota for created files,
data space, and so on.
(Note that permission checking is done by the Ganesha core, because
 we may cache open fd(s) and whatnot; that is another topic.)

With hard work we could maybe save the last two calls that revert
to root, but this would force us to audit lots of code, which we are
not prepared to do right now, and it would not save us much.

[thread_setgroups()]
thread_setgroups() is what we use in Ganesha, and what the Samba guys use,
for a per-thread setgroups() call. In the Linux kernel setgroups is
actually always per thread. It is only the POSIX (crap) pthread layer
in glibc that intercepts the setgroups() call (and others), iterates over
all threads that belong to the process, and calls the native kernel setgroups
on each of them. So thread_setgroups() is just the raw syscall, bypassing glibc's
processing. We will eventually push this API to glibc.
BTW: this is done exactly the same on FreeBSD, with the exact same glibc intervention.

[Proposed]
What Jim proposed is a syscall that receives a struct holding
the regular syscall's parameters plus a creds structure with fsuid/fsgid and
a groups array. The kernel will set these, call the original syscall, and revert.
This will be done only for a subset of the syscalls, those that, one,
are related to filesystems (setfsXid) and, two, are of interest to us servers.
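
As a starting point for discussion, here is a user-space sketch of the
set/op/revert sequence that such a syscall would fold into one call.
struct fs_creds, run_with_creds() and report_fsuid() are hypothetical names
invented for this sketch, not a proposed kernel interface; setfsuid(),
setfsgid() and the raw setgroups syscall are the existing calls. The two
read-back calls are extra safety in the sketch, since setfsuid()/setfsgid()
do not report failure.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/fsuid.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical creds block: fsuid, fsgid and a groups array. */
struct fs_creds {
    uid_t        fsuid;
    gid_t        fsgid;
    const gid_t *groups;    /* NULL: leave supplementary groups untouched */
    size_t       ngroups;
};

typedef long (*fs_op_fn)(void *arg);

/*
 * Switch the calling thread's filesystem identity, run op, switch back.
 * Returns op's result, or -1 with errno set if the switch failed.
 * Supplementary groups are not restored, matching the sequence above.
 */
long run_with_creds(const struct fs_creds *creds, fs_op_fn op, void *arg)
{
    long ret = -1;
    int saved_errno;

    assert(creds != NULL && op != NULL);

    /* Raw syscall: affects only the calling thread, unlike glibc's
     * setgroups() wrapper. (On some 32-bit ABIs the full-width variant
     * is SYS_setgroups32.) */
    if (creds->groups &&
        syscall(SYS_setgroups, creds->ngroups, creds->groups) < 0)
        return -1;

    /* Both calls return the previous id, which we keep for the revert. */
    uid_t old_uid = (uid_t)setfsuid(creds->fsuid);
    gid_t old_gid = (gid_t)setfsgid(creds->fsgid);

    /* Passing -1 changes nothing and returns the current id, so this
     * verifies that the switch really happened. */
    if ((uid_t)setfsuid((uid_t)-1) == creds->fsuid &&
        (gid_t)setfsgid((gid_t)-1) == creds->fsgid)
        ret = op(arg);
    else
        errno = EPERM;

    saved_errno = errno;
    setfsgid(old_gid);
    setfsuid(old_uid);
    errno = saved_errno;
    return ret;
}

/* Sample op: report the fsuid the operation runs under. */
long report_fsuid(void *arg)
{
    (void)arg;
    return (long)setfsuid((uid_t)-1);
}
```

A server thread would fill struct fs_creds from the client's credentials
and pass the real file operation as op; the proposed syscall would take
the same information plus the original syscall's arguments in one struct.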

Jim, care to scribble a structure definition?

Thanks
Boaz


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [6/8] Rich ACLs (continued, drive through this time)
  2013-04-08 10:19 [LSF/MM TOPIC (expanded) 0/8] New API's for better exporting of VFS from user-mode daemons Boaz Harrosh
                   ` (5 preceding siblings ...)
  2013-04-08 10:36 ` [5/8] syscall_cred() a system call that receives alternate CREDs Boaz Harrosh
@ 2013-04-08 10:42 ` Boaz Harrosh
  2013-04-08 11:12   ` Vyacheslav Dubeyko
  2013-04-08 14:27   ` Venkateswararao Jujjuri
  2013-04-08 10:43 ` [7/8] Single call interface to getattr/setattr Boaz Harrosh
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 43+ messages in thread
From: Boaz Harrosh @ 2013-04-08 10:42 UTC (permalink / raw)
  To: Steven Whitehouse, Steve Dickson, Jeff Layton, lsf-pc,
	linux-fsdevel, Ganesha NFS List, Frank S Filz, J. Bruce Fields,
	Lieb, Jim, Venkateswararao Jujjuri, DENIEL Philippe

Off the top of my head; please correct me where I'm wrong.

Anish Kumar from IBM has posted a patchset to support rich ACLs in the filesystems of
interest, notably the NFS client FS and, through KNFSD, the exported FSs of interest.

His approach was to have both the POSIX-ACLs and the RICH-ACLs coexist at the VFS and FS
layers, and to interact nicely together. But a supporting FS needs to support two APIs and
do its own code refactoring at the bottom.

What the VFS community wanted to see is that all FS APIs change to the bigger RICH-ACLs
API, and that the translation from rich to POSIX is done at the generic VFS layer, thus
doing the refactoring once at the VFS.

Anish, pressed between his IBM obligations and the huge amount of coding it will take
to convert all FSs, has put the project on the back burner, and we have not seen any
further changes from him.

JV, please find and post a git tree with the latest code. If there is any
documentation, post it here. Please also try to find the original ML thread that
is summarized above.

[Proposition]

There are more interested parties than Anish in this matter. I know for a fact that IBM,
Panasas, Red Hat, and any NFS and CIFS community member would like to see this work done.
I propose that we get a status update on what's there now, then talk about a strategy for
an incremental but complete kernel transformation, and finally submission. And I think
all parties above should invest time and resources together to drive this through. I would
like to see all this done under Anish's guidance if possible.

[JV, what is Anish's email?]

[user-mode API]
And we will need a good local API for get/set of rich ACLs; currently they are interfaced
through either a Windows CIFS client or through some NFS4 test applications. There should
be an API that can be used both at the local FS and via the NFS client.
(And the new readdir-plus should give us a flag if there are ACLs present at an entry.)

Thanks
Boaz


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [7/8] Single call interface to getattr/setattr
  2013-04-08 10:19 [LSF/MM TOPIC (expanded) 0/8] New API's for better exporting of VFS from user-mode daemons Boaz Harrosh
                   ` (6 preceding siblings ...)
  2013-04-08 10:42 ` [6/8] Rich ACLs (continued, drive through this time) Boaz Harrosh
@ 2013-04-08 10:43 ` Boaz Harrosh
       [not found]   ` <OF4A1A78E0.CB4DED3E-ON87257B47.00549E35-88257B47.005520A8@us.ibm.com>
  2013-04-08 10:45 ` [8/8] Fix fsnotify short comings (single fd with recursive notifications) Boaz Harrosh
  2013-04-08 14:31 ` [LSF/MM TOPIC (expanded) 0/8] New API's for better exporting of VFS from user-mode daemons Venkateswararao Jujjuri
  9 siblings, 1 reply; 43+ messages in thread
From: Boaz Harrosh @ 2013-04-08 10:43 UTC (permalink / raw)
  To: Steven Whitehouse, Steve Dickson, Jeff Layton, lsf-pc,
	linux-fsdevel, Ganesha NFS List, Frank S Filz, J. Bruce Fields,
	Lieb, Jim, Venkateswararao Jujjuri, DENIEL Philippe

From: Frank S Filz <ffilz@us.ibm.com>

Frank, this is your call; I'm not at all familiar with this subject.
Shoot ...


Thanks
Boaz


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [8/8] Fix fsnotify short comings (single fd with recursive notifications).
  2013-04-08 10:19 [LSF/MM TOPIC (expanded) 0/8] New API's for better exporting of VFS from user-mode daemons Boaz Harrosh
                   ` (7 preceding siblings ...)
  2013-04-08 10:43 ` [7/8] Single call interface to getattr/setattr Boaz Harrosh
@ 2013-04-08 10:45 ` Boaz Harrosh
  2013-04-08 13:59   ` DENIEL Philippe
  2013-04-08 14:31 ` [LSF/MM TOPIC (expanded) 0/8] New API's for better exporting of VFS from user-mode daemons Venkateswararao Jujjuri
  9 siblings, 1 reply; 43+ messages in thread
From: Boaz Harrosh @ 2013-04-08 10:45 UTC (permalink / raw)
  To: Steven Whitehouse, Steve Dickson, Jeff Layton, lsf-pc,
	linux-fsdevel, Ganesha NFS List, Frank S Filz, J. Bruce Fields,
	Lieb, Jim, Venkateswararao Jujjuri, DENIEL Philippe

From: DENIEL Philippe <philippe.deniel@cea.fr>

DENIEL has reported that watching directories through the new fsnotify API
might miss some events, such as deletes, and that a directory watch is not recursive, which
means that we need to open two fd(s) for each directory in the cache (which halves our
fd cache size).

Again, this is off the top of my head, and I know nothing of this subject.

DENIEL, please add any information here, so we can talk about it at LSF.
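
The non-recursive part can be seen with a small sketch. It assumes inotify
as the fsnotify-backed interface in question (the report above does not
name one); create_events_seen() and the /tmp template are made up for this
illustration. A watch on a directory reports a file created directly in
it, but not one created in a subdirectory.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/inotify.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Watch a fresh temp directory for IN_CREATE, create one file either
 * directly in it (nested == 0) or in a subdirectory (nested != 0), and
 * return how many events the watch delivered, or -1 on error.
 */
int create_events_seen(int nested)
{
    char top[] = "/tmp/inotify-demo-XXXXXX";
    char sub[64], file[96];
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    int count = -1;
    int ifd;

    if (!mkdtemp(top))
        return -1;
    snprintf(sub, sizeof(sub), "%s/sub", top);
    snprintf(file, sizeof(file), "%s/new-file", nested ? sub : top);

    ifd = inotify_init1(IN_NONBLOCK);
    /* Create the subdirectory BEFORE adding the watch so it is not counted. */
    if (ifd >= 0 && mkdir(sub, 0700) == 0 &&
        inotify_add_watch(ifd, top, IN_CREATE) >= 0) {
        int fd = open(file, O_CREAT | O_WRONLY, 0600);
        if (fd >= 0) {
            struct pollfd pfd = { .fd = ifd, .events = POLLIN };

            close(fd);
            count = 0;
            if (poll(&pfd, 1, 200) > 0) {       /* wait up to 200 ms */
                ssize_t len = read(ifd, buf, sizeof(buf));
                /* Walk the variable-length event records. */
                for (char *p = buf; len > 0 && p < buf + len; ) {
                    struct inotify_event *ev = (struct inotify_event *)p;
                    count++;
                    p += sizeof(*ev) + ev->len;
                }
            }
        }
    }

    if (ifd >= 0)
        close(ifd);
    unlink(file);
    rmdir(sub);
    rmdir(top);
    return count;
}
```

Covering a whole tree this way needs one watch per directory, which is the
scaling problem for a server that caches many directories.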

Thanks
Boaz


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [6/8] Rich ACLs (continued, drive through this time)
  2013-04-08 10:42 ` [6/8] Rich ACLs (continued, drive through this time) Boaz Harrosh
@ 2013-04-08 11:12   ` Vyacheslav Dubeyko
  2013-04-08 14:27   ` Venkateswararao Jujjuri
  1 sibling, 0 replies; 43+ messages in thread
From: Vyacheslav Dubeyko @ 2013-04-08 11:12 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Steven Whitehouse, Steve Dickson, Jeff Layton, lsf-pc,
	linux-fsdevel, Ganesha NFS List, Frank S Filz, J. Bruce Fields,
	Lieb, Jim, Venkateswararao Jujjuri, DENIEL Philippe

Hi Boaz,

On Mon, 2013-04-08 at 13:42 +0300, Boaz Harrosh wrote:
> Off the top of my head; please correct me where I'm wrong.
> 
> Anish Kumar from IBM has posted a patchset to support rich ACLs in the filesystems of
> interest, notably the NFS client FS and, through KNFSD, the exported FSs of interest.
> 
> His approach was to have both the POSIX-ACLs and the RICH-ACLs coexist at the VFS and FS
> layers, and to interact nicely together. But a supporting FS needs to support two APIs and
> do its own code refactoring at the bottom.
> 
> What the VFS community wanted to see is that all FS APIs change to the bigger RICH-ACLs
> API, and that the translation from rich to POSIX is done at the generic VFS layer, thus
> doing the refactoring once at the VFS.
> 

Yes, RICH-ACLs is an important topic. HFS+ uses the NFSv4 ACL model, but
with its own peculiarities. Not so long ago I published a patch set
implementing ACL support in the HFS+ driver. As a basis I used the mapping
code (NFSv4 <-> POSIX ACLs) implemented in nfsd. J. Bruce Fields
remarked that the mapping code should be generalized so that it can be
shared between drivers. So I am working on generalizing the mapping code
(NFSv4 <-> POSIX ACLs) for use in the HFS+ driver. I hope to finish this
work soon. But, anyway, the RICH-ACLs scheme is more natural for HFS+.

With the best regards,
Vyacheslav Dubeyko.

> Anish, pressed between his IBM obligations and the huge amount of coding it will take
> to convert all FSs, has put the project on the back burner, and we have not seen any
> further changes from him.
> 
> JV, please find and post a git tree with the latest code. If there is any
> documentation, post it here. Please also try to find the original ML thread that
> is summarized above.
> 
> [Proposition]
> 
> There are more interested parties than Anish in this matter. I know for a fact that IBM,
> Panasas, Red Hat, and any NFS and CIFS community member would like to see this work done.
> I propose that we get a status update on what's there now, then talk about a strategy for
> an incremental but complete kernel transformation, and finally submission. And I think
> all parties above should invest time and resources together to drive this through. I would
> like to see all this done under Anish's guidance if possible.
> 
> [JV, what is Anish's email?]
> 
> [user-mode API]
> And we will need a good local API for get/set of rich ACLs; currently they are interfaced
> through either a Windows CIFS client or through some NFS4 test applications. There should
> be an API that can be used both at the local FS and via the NFS client.
> (And the new readdir-plus should give us a flag if there are ACLs present at an entry.)
> 
> Thanks
> Boaz
> 



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf-pc] [2/8] Sane locks (UNPOSIX locks)
  2013-04-08 10:25 ` [2/8] Sane locks (UNPOSIX locks) Boaz Harrosh
@ 2013-04-08 12:02   ` Jeff Layton
  0 siblings, 0 replies; 43+ messages in thread
From: Jeff Layton @ 2013-04-08 12:02 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Steven Whitehouse, Steve Dickson, lsf-pc, linux-fsdevel,
	Ganesha NFS List, Frank S Filz, J. Bruce Fields, Lieb, Jim,
	Venkateswararao Jujjuri, DENIEL Philippe

On Mon, 8 Apr 2013 13:25:43 +0300
Boaz Harrosh <bharrosh@panasas.com> wrote:

> 
> In this topic we do not actually know what it will look like,
> but we know what we do not like.
> 
> The most troublesome is the POSIX crap. POSIX says that if
> any fd of a process on an inode is closed, all of that process's
> locks on the inode are lost, even if we used another fd with other
> modes to acquire those locks. (They had good stuff to smoke when
> that stuff was defined.)
> 

Jeremy Allison did some detective work on why this is:

    http://www.samba.org/samba/news/articles/low_point/tale_two_stds_os2.html

See the section on "First Implementation Past the Post".

> This is real crap, because it completely kills our ability
> to acquire some other resources on the file, and/or keep
> correct access modes. As soon as we need a lock we must open
> the fd in read/write mode, because if in the future a client
> needs write access we cannot re-open the file: closing the old
> fd would lose the locks. And if we open in RW, we immediately
> lose our delegations; also, in PNFS a write-open means a
> different thing than a read-open.
> 
> So what we urgently need is a new locks API that is strictly
> per fd. When we open an fd for read and then acquire a read
> lock, we can continue to serve delegations. Only the close
> of that specific fd will lose the locks. Any other parallel
> activity in the background will not affect anything.
> 
> We can craft an API that is very similar to today's API, only
> with the semantic changes. But we should also consider a
> completely new API that can cover all kinds of locks, including
> a notification API. Perhaps also unite that API with the delegations
> API we want in the next topic.
> 
> Thanks
> Boaz
> 

Perhaps we can simply add a new clone() flag (CLONE_SANELOCKS?) that
indicates that the tasks in question want locks that are not affected
by close() from other tasks?

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [1/8] readdir-plus system call
  2013-04-08 10:22 ` [1/8] readdir-plus system call Boaz Harrosh
  2013-04-08 10:26   ` Steven Whitehouse
@ 2013-04-08 13:51   ` DENIEL Philippe
  2013-04-08 19:02   ` Abhijith Das
  2 siblings, 0 replies; 43+ messages in thread
From: DENIEL Philippe @ 2013-04-08 13:51 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Steven Whitehouse, Steve Dickson, Jeff Layton, lsf-pc,
	linux-fsdevel, Ganesha NFS List, Frank S Filz, J. Bruce Fields,
	Lieb, Jim, Venkateswararao Jujjuri

Hi,

From a Lustre user's point of view, having a way to readdirplus (getting
dirents + related attrs) would really be helpful.
Beyond the scope of Ganesha, it would have plenty of cool use cases.

     Philippe

On 04/08/13 12:22, Boaz Harrosh wrote:
> By: Steven Whitehouse <swhiteho@redhat.com>
>
> I repeat Steve's original mail below. Steve, you said you have
> some experimental code; could you post a header and a git URL
> so we can have a look?
>
> Out of the corner of my eye I have seen a readdir-plus syscall in FreeBSD.
> I will try to find it and post the interface header as a reply here.
> We might as well make sure the two match somewhat.
>
> Steve wrote ...
>> As part of the work we've been doing in relation to integration between
>> NFS/Samba and GFS2, Abhi has been working on a readdirplus system call
>> in order to investigate the issues involved with creating such a call.
>> It is still early days yet, but by April there should be some
>> interesting results to present.
>>
>> Please add Abhi to the attendee list as well as myself.
>>
>> Also, it has been some time since we had a NFS/Samba meeting to discuss
>> the other issues which are pending, such as locking, ACLs, etc. So we
>> could perhaps also allow some time to do that face to face rather than
>> over the phone as we've been doing up til now,
>>
>> Steve.
> Thanks
> Boaz
>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [5/8] syscall_cred() a system call that receives alternate CREDs
  2013-04-08 10:36 ` [5/8] syscall_cred() a system call that receives alternate CREDs Boaz Harrosh
@ 2013-04-08 13:54   ` DENIEL Philippe
  2013-04-08 14:42   ` J. Bruce Fields
  1 sibling, 0 replies; 43+ messages in thread
From: DENIEL Philippe @ 2013-04-08 13:54 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Steven Whitehouse, Steve Dickson, Jeff Layton, lsf-pc,
	linux-fsdevel, Ganesha NFS List, Frank S Filz, J. Bruce Fields,
	Lieb, Jim, Venkateswararao Jujjuri

I do agree with Boaz and Jim. Being capable of "masquerading" a syscall
with someone's credentials would be really useful. In particular, in
Ganesha's scope, it is required to properly manage quotas (you need to
create inodes and write to files as the user if you want those inodes
and blocks to be added to the right user's bill).

     Philippe

On 04/08/13 12:36, Boaz Harrosh wrote:
> From: Jim Lieb <jlieb@panasas.com>
>
> In the current NFS server (Ganesha) many operations become 6 syscalls
> (or is it 7?):
>
> - setfsuid(), setfsgid(), thread_setgroups()
> - The OP
> - Revert setfsuid(), setfsgid() to root
>
> This is because if we do all these file operations as root, the
> FS will not charge the user's quota for created files,
> data space, and so on.
> (Note that permission checking is done by the Ganesha core, because
>   we may cache open fd(s) and whatnot; that is another topic.)
>
> With hard work we could maybe save the last two calls that revert
> to root, but this would force us to audit lots of code, which we are
> not prepared to do right now, and it would not save us much.
>
> [thread_setgroups()]
> thread_setgroups() is what we use at Ganesha and what Samaba guys use
> for a per-thread setgroups() call. In the Linux Kernel the setgroups is
> actually always per thread. It is only the POSIX (crap) pthread layer
> at glibc that intercepts the setgroups() call (and others), Iterates on
> all threads that belong to a process, and calls the native Kernel setgroups
> on them. So thread_setgroups() is just the raw syscall bypassing glibc's
> processing. We will eventually push this API to glibc.
> BTW: this is done exactly the same on FreeBSD, with same exact glibc intervention.
>
> [Proposed]
> What Jim proposed is a syscall that receives a struct that has
> the regular syscalls parameters plus the creds structure with fsuid/fsgid and
> groups array. Kernel will set these in, call the original syscall, and revert.
> This will be done on only an interested subset of the syscalls that are one -
> are related to filesystems (setfsXid) and two - are of interest to us Servers.
>
> Jim care to scribble a structure definition?
>
> Thanks
> Boaz
>



* Re: [8/8] Fix fsnotify short comings (single fd with recursive notifications).
  2013-04-08 10:45 ` [8/8] Fix fsnotify short comings (single fd with recursive notifications) Boaz Harrosh
@ 2013-04-08 13:59   ` DENIEL Philippe
  2013-04-08 15:22     ` Al Viro
  2013-04-08 15:36     ` J. Bruce Fields
  0 siblings, 2 replies; 43+ messages in thread
From: DENIEL Philippe @ 2013-04-08 13:59 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Steven Whitehouse, Steve Dickson, Jeff Layton, lsf-pc,
	linux-fsdevel, Ganesha NFS List, Frank S Filz, J. Bruce Fields,
	Lieb, Jim, Venkateswararao Jujjuri

On 04/08/13 12:45, Boaz Harrosh wrote:
> From: DENIEL Philippe <philippe.deniel@cea.fr>
>
> DENIEL Has reported that watching directories through the new fsnotify API,
> might miss some events like deletes and that a directory watch is not recursive, which
> means that we need to open two fd(s) for each directory in the cache. (Which halves our
> fd cache size).
>
> Again from the top of my head, and I know nothing of this subject.
>
> DENIEL please add any information here, so we can talk about it at LSF.
>
What I have seen was this:
     - if dnotify is used, it gets every event. If used on a 
directory, it will get every creation and deletion. But it is not 
recursive, so we need to use dnotify on every inode we manage.
     - if fanotify is used, it is recursive, and by using it on the root 
of a filesystem you get events from the whole underlying tree. 
The trouble is that deletions seem not to be caught.

I do have tests written in C for that.

     regards

         Philippe


* Re: [6/8] Rich ACLs (continued, drive through this time)
  2013-04-08 10:42 ` [6/8] Rich ACLs (continued, drive through this time) Boaz Harrosh
  2013-04-08 11:12   ` Vyacheslav Dubeyko
@ 2013-04-08 14:27   ` Venkateswararao Jujjuri
  1 sibling, 0 replies; 43+ messages in thread
From: Venkateswararao Jujjuri @ 2013-04-08 14:27 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Steven Whitehouse, Steve Dickson, Jeff Layton, lsf-pc,
	linux-fsdevel, Ganesha NFS List, Frank S Filz, J. Bruce Fields,
	Lieb, Jim, DENIEL Philippe, Aneesh Kumar K.V

On 04/08/2013 03:42 AM, Boaz Harrosh wrote:
>  From the top of my head please correct me where I'm wrong.
>
> Anish Kumar from IBM has posted a patchset to support rich ACLs at Filesystems of interest
> Specially also by the NFS client FS, and through the KNFSD by the exported FSs of interest.
>
> His approach was to have both the POSIX-ACLs and the RICH-ACLs, co exist at the VFS and FS
> layer, and to interact nicely together. But a supporting FS needs to support two APIs and
> do its own code refactoring at the bottom.
>
> What the VFS community wanted to see is that all FS APIs change to the bigger RICH-ACLs
> API, and the translation from rich to POSIX is done at the generic VFS layer, thoughts
> doing the refactoring once at the VFS.
>
> Anish pressed between his IBM obligations and the huge amount of coding it will take
> to convert all FSs, has put the project on a back burner, and we have not seen any
> farther changes from him.
>
> JV Please find and post a git tree with the latest code. And if there are some
> documentation post them here. Please also try to find the original ML thread that
> is captured above.
>
> [Proposition]
>
> There are more interested parties then Anish in this matter. I know for a fact that IBM,
> Panasas RedHat, and any NFS and CIFS community member would like to see this work done.
> I propose that we get a status update, on what's there now. Talk about a strategy for
> an incremental but complete Kernel transformation, and finally submission. And I think
> All parties above should invest in time and resources together to drive this through. I would
> like to see all this done under Anish's guide if possible.
>
> [JV what is Anish's email ?]
Included Aneesh in the discussion.
>
> [user-mode API]
> And we will need a good local API for get/set of rich ACLs, currently they are interfaced
> through either a Windows CIFS client or through some NFS4 test applications. There should
> be an API that can be used both at the local FS as well as via the NFS-client.
> (And the new readdir-plus should give us a flag if there are ACLs present at an entry)
>
> Thanks
> Boaz
>



* Re: [LSF/MM TOPIC (expanded) 0/8] New API's for better exporting of VFS from user-mode daemons
  2013-04-08 10:19 [LSF/MM TOPIC (expanded) 0/8] New API's for better exporting of VFS from user-mode daemons Boaz Harrosh
                   ` (8 preceding siblings ...)
  2013-04-08 10:45 ` [8/8] Fix fsnotify short comings (single fd with recursive notifications) Boaz Harrosh
@ 2013-04-08 14:31 ` Venkateswararao Jujjuri
  9 siblings, 0 replies; 43+ messages in thread
From: Venkateswararao Jujjuri @ 2013-04-08 14:31 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Steven Whitehouse, Steve Dickson, Jeff Layton, lsf-pc,
	linux-fsdevel, Ganesha NFS List, Frank S Filz, J. Bruce Fields,
	Lieb, Jim, DENIEL Philippe

On 04/08/2013 03:19 AM, Boaz Harrosh wrote:
> Hi
>
> Steven suggested to discuss a readdir-plus system call. For a better LS
> but certainly better FS daemons. I have hijacked his topic to expand it
> to a slew of requests we, the FS daemons people, would like to see extended,
> so to make our lives better (and faster). It should be up to Steven and the
> comity if these are grouped together or separated into different talks.
>
> I would like to send a set of emails. Each below topic in it's own mail.
> Any interested party, please post to the topic of your heart so we can collect
> all the information,, header files, status of each topic. The LSF talk should be
> not for introduction of the topics but for actual decision making.
>
> Some of the topics already carry, experimental or more, implementations
> and suggested APIs. Some of these topics are just a cry for help, we know what
> we don't want, and maybe how we would like them, and what we need is from the
> community to give us a green light as to if it is wanted at all.
>
> This is certainly an exhaustive list, we might want to drop some of them, if
> they don't generate much interest. Also if I forgot something please post (JV)
>
> Here is the list of topics in this set:
> - [1/8] readdir-plus system call (By Steven Whitehouse <swhiteho@redhat.com>)
> - [2/8] Sane locks (UNPOSIX locks) (frank)
> - [3/8] File delegations, Usermode API of Bruce's pending patches.
> - [4/8] PNFS ioctls/syscall
> - [5/8] syscall_cred() a system call that receives alternate creds (fsuid fsgid thread-groups)
> - [6/8] Rich ACLs (continued, drive through this time)
> - [7/8] Single call interface to getattr/setattr (Frank S Filz <ffilz@us.ibm.com>)
> - [8/8] Fix fsnotify short comings (single fd with recursive notifications).

WCC (weak cache consistency) provision. A user-level server needs to:

getattr()
perform the operation (write/read/something)
getattr()

Breaking this into 3 calls opens up all kinds of races and 
makes WCC useless.
Having one single interface that returns all this info atomically would be a 
great help.
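
To make the race concrete, here is a minimal C sketch of what a user-level
server has to do today to fill in pre- and post-operation attributes around a
write. The names struct wcc_sketch, write_with_wcc() and wcc_demo() are made up
for illustration and are not taken from Ganesha.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Pre- and post-operation attributes, as a WCC reply wants them. */
struct wcc_sketch {
    struct stat before;
    struct stat after;
};

/*
 * Write len bytes at off and capture attributes on both sides.
 * These are three separate syscalls: another writer can slip in
 * between any two of them, so before/after may not bracket only
 * our own change.  Returns bytes written, or -1 with errno set.
 */
static ssize_t write_with_wcc(int fd, const void *buf, size_t len, off_t off,
                              struct wcc_sketch *wcc)
{
    assert(buf != NULL && wcc != NULL);

    if (fstat(fd, &wcc->before) < 0)        /* call 1: pre-op attrs  */
        return -1;
    ssize_t n = pwrite(fd, buf, len, off);  /* call 2: the operation */
    if (n < 0)
        return -1;
    if (fstat(fd, &wcc->after) < 0)         /* call 3: post-op attrs */
        return -1;
    return n;
}

/* Demo on a private temp file; returns 0 if the sizes bracket the write. */
static int wcc_demo(void)
{
    char path[] = "/tmp/wcc-sketch-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    struct wcc_sketch wcc;
    ssize_t n = write_with_wcc(fd, "hello", 5, 0, &wcc);
    int ok = (n == 5 && wcc.before.st_size == 0 && wcc.after.st_size == 5);
    close(fd);
    return ok ? 0 : -1;
}
```

On a private temp file nothing interferes, so the demo passes; on a shared
export there is no such guarantee, which is the point of asking for one call.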


>
> Thanks
> Boaz
>



* Re: [5/8] syscall_cred() a system call that receives alternate CREDs
  2013-04-08 10:36 ` [5/8] syscall_cred() a system call that receives alternate CREDs Boaz Harrosh
  2013-04-08 13:54   ` DENIEL Philippe
@ 2013-04-08 14:42   ` J. Bruce Fields
  2013-04-08 14:58     ` Boaz Harrosh
  2013-04-08 18:23     ` Jim Lieb
  1 sibling, 2 replies; 43+ messages in thread
From: J. Bruce Fields @ 2013-04-08 14:42 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Steven Whitehouse, Steve Dickson, Jeff Layton, lsf-pc,
	linux-fsdevel, Ganesha NFS List, Frank S Filz, Lieb, Jim,
	Venkateswararao Jujjuri, DENIEL Philippe

On Mon, Apr 08, 2013 at 01:36:46PM +0300, Boaz Harrosh wrote:
> From: Jim Lieb <jlieb@panasas.com>
> 
> In current NFS Server (Ganesha) lots of operation becomes 6 syscalls
> (Or is it 7?)
> 
> - setfsuid(), setfsgid(), thread_setgroups()
> - The OP
> - Revert setfsuid(), setfsgid() to root
> 
> This is because if we do all these file operations as root then
> FS will not account for the quota a user have on create files,
> data space, and so on.

To make sure I understand, you're saying that:

	- the behavior you get out of those 6 syscalls is correct, 
	- you just want to be able to do exactly the same thing, but
	  with 1 syscall.  (For performance?)

Or is there some other issue?

> (Note that permission checking is done by Ganesha core, because
>  We may cache open fd(s) and such not, another topic)

Is there anything we could do to make it possible for you to depend on
the kernel's permissions checking instead?

--b.

> 
> We could maybe with hard work save the last two calls for reverting
> to root, but this will force us to audit lots of code that we are
> not prepared to do right now. And will not save us much.
> 
> [thread_setgroups()]
> thread_setgroups() is what we use at Ganesha and what Samaba guys use
> for a per-thread setgroups() call. In the Linux Kernel the setgroups is
> actually always per thread. It is only the POSIX (crap) pthread layer
> at glibc that intercepts the setgroups() call (and others), Iterates on
> all threads that belong to a process, and calls the native Kernel setgroups
> on them. So thread_setgroups() is just the raw syscall bypassing glibc's
> processing. We will eventually push this API to glibc.
> BTW: this is done exactly the same on FreeBSD, with same exact glibc intervention.
> 
> [Proposed]
> What Jim proposed is a syscall that receives a struct that has
> the regular syscalls parameters plus the creds structure with fsuid/fsgid and
> groups array. Kernel will set these in, call the original syscall, and revert.
> This will be done on only an interested subset of the syscalls that are one -
> are related to filesystems (setfsXid) and two - are of interest to us Servers.
> 
> Jim care to scribble a structure definition?
> 
> Thanks
> Boaz
> 


* Re: [5/8] syscall_cred() a system call that receives alternate CREDs
  2013-04-08 14:42   ` J. Bruce Fields
@ 2013-04-08 14:58     ` Boaz Harrosh
  2013-04-08 18:23     ` Jim Lieb
  1 sibling, 0 replies; 43+ messages in thread
From: Boaz Harrosh @ 2013-04-08 14:58 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Steven Whitehouse, Steve Dickson, Jeff Layton, lsf-pc,
	linux-fsdevel, Ganesha NFS List, Frank S Filz, Lieb, Jim,
	Venkateswararao Jujjuri, DENIEL Philippe

On 08/04/13 17:42, J. Bruce Fields wrote:
> On Mon, Apr 08, 2013 at 01:36:46PM +0300, Boaz Harrosh wrote:
>> From: Jim Lieb <jlieb@panasas.com>
>>
>> In current NFS Server (Ganesha) lots of operation becomes 6 syscalls
>> (Or is it 7?)
>>
>> - setfsuid(), setfsgid(), thread_setgroups()
>> - The OP
>> - Revert setfsuid(), setfsgid() to root
>>
>> This is because if we do all these file operations as root then
>> FS will not account for the quota a user have on create files,
>> data space, and so on.
> 
> To make sure I understand, you're saying that:
> 
> 	- the behavior you get out of those 6 syscalls is correct, 
> 	- you just want to be able to do exactly the same thing, but
> 	  with 1 syscall.  (For performance?)
> 

Yes, performance.

> Or is there some other issue?
> 
>> (Note that permission checking is done by Ganesha core, because
>>  We may cache open fd(s) and such not, another topic)
> 
> Is there anything we could do to make it possible for you to depend on
> the kernel's permissions checking instead?
> 

That one is a different topic. Like you, I thought that we should let the
FS have the final disposition. But the guys convinced me that it is not possible,
both because of caching and because there are places where NFS3/4/4.1
will allow or deny differently than POSIX.

Some of the other guys on the list have more details than I do. Frank?

> --b.
> 

Thanks
Boaz


>>
>> We could maybe with hard work save the last two calls for reverting
>> to root, but this will force us to audit lots of code that we are
>> not prepared to do right now. And will not save us much.
>>
>> [thread_setgroups()]
>> thread_setgroups() is what we use at Ganesha and what Samaba guys use
>> for a per-thread setgroups() call. In the Linux Kernel the setgroups is
>> actually always per thread. It is only the POSIX (crap) pthread layer
>> at glibc that intercepts the setgroups() call (and others), Iterates on
>> all threads that belong to a process, and calls the native Kernel setgroups
>> on them. So thread_setgroups() is just the raw syscall bypassing glibc's
>> processing. We will eventually push this API to glibc.
>> BTW: this is done exactly the same on FreeBSD, with same exact glibc intervention.
>>
>> [Proposed]
>> What Jim proposed is a syscall that receives a struct that has
>> the regular syscalls parameters plus the creds structure with fsuid/fsgid and
>> groups array. Kernel will set these in, call the original syscall, and revert.
>> This will be done on only an interested subset of the syscalls that are one -
>> are related to filesystems (setfsXid) and two - are of interest to us Servers.
>>
>> Jim care to scribble a structure definition?
>>
>> Thanks
>> Boaz
>>



* Re: [Nfs-ganesha-devel] [1/8] readdir-plus system call
  2013-04-08 10:26   ` Steven Whitehouse
@ 2013-04-08 15:18     ` Matt W. Benjamin
  0 siblings, 0 replies; 43+ messages in thread
From: Matt W. Benjamin @ 2013-04-08 15:18 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Ganesha NFS List, J. Bruce Fields, Jeff Layton, Abhi Das,
	Steve Dickson, linux-fsdevel, lsf-pc, Boaz Harrosh,
	Trond Myklebust

All (CC Trond),

This seems timely.  Has this effort looked at the issues distributed file
systems have with cookies/offsets, and whether a new interface can offer future
relief there?

Thanks,

Matt

----- "Steven Whitehouse" <swhiteho@redhat.com> wrote:

> Again, copying in Abhi,
> 
> Steve.
> 
> On Mon, 2013-04-08 at 13:22 +0300, Boaz Harrosh wrote:
> > By: Steven Whitehouse <swhiteho@redhat.com>)
> > 
> > I repeat below Steve's original mail. Steve you said you have
> > some experimental code, could you post an header and a git URL
> > so we can have a look?
> > 
> > I have seen in the Corner of my eye a readdir-plus syscall in
> FreeBSD
> > I will try to find it and post the Interface header as reply to
> here.
> > Might as well make sure they should match somewhat.
> > 
> > Steve wrote ...
> > > As part of the work we've been doing in relation to integration
> between
> > > NFS/Samba and GFS2, Abhi has been working on a readdirplus system
> call
> > > in order to investigate the issues involved with creating such a
> call.
> > > It is still early days yet, but by April there should be some
> > > interesting results to present.
> > > 
> > > Please add Abhi to the attendee list as well as myself. 
> > > 
> > > Also, it has been some time since we had a NFS/Samba meeting to
> discuss
> > > the other issues which are pending, such as locking, ACLs, etc. So
> we
> > > could perhaps also allow some time to do that face to face rather
> than
> > > over the phone as we've been doing up til now,
> > > 
> > > Steve.
> > 
> > Thanks
> > Boaz
> > 
> 

-- 
Matt Benjamin
The Linux Box
206 South Fifth Ave. Suite 150
Ann Arbor, MI  48104

http://linuxbox.com

tel.  734-761-4689 
fax.  734-769-8938 
cel.  734-216-5309 


* Re: [8/8] Fix fsnotify short comings (single fd with recursive notifications).
  2013-04-08 13:59   ` DENIEL Philippe
@ 2013-04-08 15:22     ` Al Viro
  2013-04-08 15:36     ` J. Bruce Fields
  1 sibling, 0 replies; 43+ messages in thread
From: Al Viro @ 2013-04-08 15:22 UTC (permalink / raw)
  To: DENIEL Philippe
  Cc: Boaz Harrosh, Steven Whitehouse, Steve Dickson, Jeff Layton,
	lsf-pc, linux-fsdevel, Ganesha NFS List, Frank S Filz,
	J. Bruce Fields, Lieb, Jim, Venkateswararao Jujjuri

On Mon, Apr 08, 2013 at 03:59:49PM +0200, DENIEL Philippe wrote:

> What I have seen was this:
>     - if dnotify() is used, it gets every events. If used on a
> directory, it will get every creation and deletion. But it is not
> recursive, so we need to use dnotify on every inode we manage.
>     - if fanotify is used, it is recursive and by using it on the
> root of a filesystem, then you have events from the whole underlying
> tree. The trouble is that deletion seem not to be caught.

All *notify APIs are broken by design, film at 11...  See also: "Doctor,
it hurts when I do it"


* Re: [8/8] Fix fsnotify short comings (single fd with recursive notifications).
  2013-04-08 13:59   ` DENIEL Philippe
  2013-04-08 15:22     ` Al Viro
@ 2013-04-08 15:36     ` J. Bruce Fields
  1 sibling, 0 replies; 43+ messages in thread
From: J. Bruce Fields @ 2013-04-08 15:36 UTC (permalink / raw)
  To: DENIEL Philippe
  Cc: Boaz Harrosh, Steven Whitehouse, Steve Dickson, Jeff Layton,
	lsf-pc, linux-fsdevel, Ganesha NFS List, Frank S Filz, Lieb, Jim,
	Venkateswararao Jujjuri

On Mon, Apr 08, 2013 at 03:59:49PM +0200, DENIEL Philippe wrote:
> On 04/08/13 12:45, Boaz Harrosh wrote:
> >From: DENIEL Philippe <philippe.deniel@cea.fr>
> >
> >DENIEL Has reported that watching directories through the new fsnotify API,
> >might miss some events like deletes and that a directory watch is not recursive, which
> >means that we need to open two fd(s) for each directory in the cache. (Which halves our
> >fd cache size).
> >
> >Again from the top of my head, and I know nothing of this subject.
> >
> >DENIEL please add any information here, so we can talk about it at LSF.
> >
> What I have seen was this:
>     - if dnotify() is used, it gets every events. If used on a
> directory, it will get every creation and deletion. But it is not
> recursive, so we need to use dnotify on every inode we manage.
>     - if fanotify is used, it is recursive and by using it on the
> root of a filesystem, then you have events from the whole underlying
> tree. The trouble is that deletion seem not to be caught.
> 
> I do have tests written in C for that.

What are you actually using notifications for?  If you had adequate
leases/delegations, for example, would you still need them?

--b.


* Re: [7/8] Single call interface to getattr/setattr
       [not found]   ` <OF4A1A78E0.CB4DED3E-ON87257B47.00549E35-88257B47.005520A8@us.ibm.com>
@ 2013-04-08 16:41     ` Boaz Harrosh
  0 siblings, 0 replies; 43+ messages in thread
From: Boaz Harrosh @ 2013-04-08 16:41 UTC (permalink / raw)
  To: Frank S Filz
  Cc: J. Bruce Fields, Jeff Layton, Lieb, Jim, Venkateswararao Jujjuri,
	linux-fsdevel, lsf-pc, Ganesha NFS List, DENIEL Philippe,
	Steve Dickson, Steven Whitehouse

On 08/04/13 18:29, Frank S Filz wrote:
> Boaz Harrosh <bharrosh@panasas.com> wrote on 04/08/2013 03:43:23 AM:
>> [7/8] Single call interface to getattr/setattr
>>
>> From: Frank S Filz <ffilz@us.ibm.com>
>>
>> Frank this is your call, I'm not at all familiar with this subject.
>> Shoot ...
> 
> This is mostly on the setattr side, being able to do the following in a single call (with each item optional):
> 
> set mode
> set acl
> set owner
> set owner_group
> set atime
> set mtime
> set size
> 
> on the getattr side, the same, which would be a stat+getacl merged into one call.
> 
> Would be nice to be able to do an create and open(O_CREAT) with all
> those attr also (again, each attr being optional).
> 

Just adding a comment. If we are going into new territory here then these should be
RICH-ACLs and not POSIX ACLs, kill two birds with one stone.
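
For reference, this is roughly what a server has to do today for the setattr
list above, sketched in C as a userspace emulation. The request layout
(struct setattr_req, the SA_* bits) and the function names are hypothetical,
only meant to show the "each item optional" idea; the ACL is left out because
there is no single agreed local API for it yet.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical request: one bit per optional attribute. */
enum {
    SA_MODE  = 1 << 0,
    SA_OWNER = 1 << 1,
    SA_GROUP = 1 << 2,
    SA_ATIME = 1 << 3,
    SA_MTIME = 1 << 4,
    SA_SIZE  = 1 << 5,
};

struct setattr_req {
    unsigned        valid;   /* which fields below are set */
    mode_t          mode;
    uid_t           owner;
    gid_t           group;
    struct timespec atime, mtime;
    off_t           size;
};

/* Up to four syscalls today, and not atomic as a group. */
static int setattr_emul(int fd, const struct setattr_req *r)
{
    assert(r != NULL);

    /* Size first: truncation updates mtime, which may be overridden below. */
    if ((r->valid & SA_SIZE) && ftruncate(fd, r->size) < 0)
        return -1;
    /* Ownership before mode: chown may clear setuid/setgid bits. */
    if (r->valid & (SA_OWNER | SA_GROUP)) {
        uid_t u = (r->valid & SA_OWNER) ? r->owner : (uid_t)-1;
        gid_t g = (r->valid & SA_GROUP) ? r->group : (gid_t)-1;
        if (fchown(fd, u, g) < 0)
            return -1;
    }
    if ((r->valid & SA_MODE) && fchmod(fd, r->mode) < 0)
        return -1;
    if (r->valid & (SA_ATIME | SA_MTIME)) {
        struct timespec ts[2] = { r->atime, r->mtime };
        if (!(r->valid & SA_ATIME))
            ts[0].tv_nsec = UTIME_OMIT;   /* leave atime untouched */
        if (!(r->valid & SA_MTIME))
            ts[1].tv_nsec = UTIME_OMIT;   /* leave mtime untouched */
        if (futimens(fd, ts) < 0)
            return -1;
    }
    return 0;
}

/* Demo on a private temp file; returns 0 if all requested attrs stuck. */
static int setattr_demo(void)
{
    char path[] = "/tmp/setattr-sketch-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    struct setattr_req r = {0};
    r.valid = SA_MODE | SA_SIZE | SA_MTIME;
    r.mode = 0640;
    r.size = 4096;
    r.mtime.tv_sec = 1000000000;

    struct stat st;
    int ok = setattr_emul(fd, &r) == 0 && fstat(fd, &st) == 0 &&
             (st.st_mode & 07777) == 0640 && st.st_size == 4096 &&
             st.st_mtime == 1000000000;
    close(fd);
    return ok ? 0 : -1;
}
```

If any step fails halfway the file is left partially updated, which a single
kernel-side call could avoid.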

> Frank
> 

Thanks Frank
Boaz



* Re: Re: [5/8] syscall_cred() a system call that receives alternate CREDs
  2013-04-08 14:42   ` J. Bruce Fields
  2013-04-08 14:58     ` Boaz Harrosh
@ 2013-04-08 18:23     ` Jim Lieb
  2013-04-08 18:31       ` J. Bruce Fields
  1 sibling, 1 reply; 43+ messages in thread
From: Jim Lieb @ 2013-04-08 18:23 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Boaz Harrosh, Steven Whitehouse, Steve Dickson, Jeff Layton,
	lsf-pc, linux-fsdevel, Ganesha NFS List, Frank S Filz,
	Venkateswararao Jujjuri, DENIEL Philippe

[-- Attachment #1: Type: text/plain, Size: 3444 bytes --]

On Monday, April 08, 2013 10:42:02 J. Bruce Fields wrote:
> On Mon, Apr 08, 2013 at 01:36:46PM +0300, Boaz Harrosh wrote:
> > From: Jim Lieb <jlieb@panasas.com>
> > 
> > In current NFS Server (Ganesha) lots of operation becomes 6 syscalls
> > (Or is it 7?)
> > 
> > - setfsuid(), setfsgid(), thread_setgroups()
> > - The OP
> > - Revert setfsuid(), setfsgid() to root
> > 
> > This is because if we do all these file operations as root then
> > FS will not account for the quota a user have on create files,
> > data space, and so on.
> 
> To make sure I understand, you're saying that:
> 
> 	- the behavior you get out of those 6 syscalls is correct,
> 	- you just want to be able to do exactly the same thing, but
> 	  with 1 syscall.  (For performance?)
> 
> Or is there some other issue?

I have attached the email I sent around on the nfs-ganesha list with a model 
api so we know the details.

Boaz replied "performance", but there are also race conditions to consider.  If 
we get signals or ??? somewhere in the sequence, what is our state?  Yes, the 
setfsuid call back to root can still be done, but with masquerading any signals 
etc. are in the context of that user/group, and there is one syscall to deal 
with, not a stream.

There may be selinux/apparmor issues to deal with too.  If we first masquerade 
the thread and then apply all these access checks, as far as the kernel is 
concerned, it is the masqueraded user.

> 
> > (Note that permission checking is done by Ganesha core, because
> > 
> >  We may cache open fd(s) and such not, another topic)
> 
> Is there anything we could do to make it possible for you to depend on
> the kernel's permissions checking instead?
> 
I concur with Frank's assessment here.  There are more instances where 
nfs-ganesha is doing a syscall as the server than as the masqueraded user.  In the 
pNFS case, this hardly happens at all.  We looked at having the kernel do it 
but found that we also had to do it and mixing gets seriously messy.  For 
starters, we really do want to share fd's.

> --b.
> 
> > We could maybe with hard work save the last two calls for reverting
> > to root, but this will force us to audit lots of code that we are
> > not prepared to do right now. And will not save us much.
> > 
> > [thread_setgroups()]
> > thread_setgroups() is what we use at Ganesha and what Samaba guys use
> > for a per-thread setgroups() call. In the Linux Kernel the setgroups is
> > actually always per thread. It is only the POSIX (crap) pthread layer
> > at glibc that intercepts the setgroups() call (and others), Iterates on
> > all threads that belong to a process, and calls the native Kernel
> > setgroups
> > on them. So thread_setgroups() is just the raw syscall bypassing glibc's
> > processing. We will eventually push this API to glibc.
> > BTW: this is done exactly the same on FreeBSD, with same exact glibc
> > intervention.
> > 
> > [Proposed]
> > What Jim proposed is a syscall that receives a struct that has
> > the regular syscalls parameters plus the creds structure with fsuid/fsgid
> > and groups array. Kernel will set these in, call the original syscall,
> > and revert. This will be done on only an interested subset of the
> > syscalls that are one - are related to filesystems (setfsXid) and two -
> > are of interest to us Servers.
> > 
> > Jim care to scribble a structure definition?
> > 
> > Thanks
> > Boaz
-- 
Jim Lieb
Linux Systems Engineer
Panasas Inc.

[-- Attachment #2: Jim Lieb <jlieb@panasas.com>: filesystem summit idea --]
[-- Type: message/rfc822, Size: 2657 bytes --]

From: Jim Lieb <jlieb@panasas.com>
To: <bharrosh@panasas.com>
Cc: <nfs-ganesha-devel@lists.sourceforge.net>
Subject: filesystem summit idea
Date: Thu, 31 Jan 2013 13:44:30 -0800
Message-ID: <9007132.QKd3F9o8Qa@jlieb-e6410>

In replying to the creds RFC branch, an idea came to me.  What we need is a 
syscall for server syscalls.  At first, I thought of doing something like what 
was done for the *at calls.  That got pretty silly with some calls only 
needing an extra flag and others needing extra args.  All of the glibc and abi 
pain was a mess I'd rather not repeat.

How about this idea:

/**
 * @brief Syscall entry point for servers that need to masquerade as others
 *
 * This is a privileged syscall.
 *
 * @param syscall_number [IN] syscall number from syscall.h
 * @param syscall_args   [IN] the arguments for that syscall, in a vector
 *                            mimicking the syscall prototype
 * @param creds          [IN] credentials to use.  See definition in fsal_types.h
 */

int server_syscall(int syscall_number, void *syscall_args, struct creds *creds);

This syscall would have its own matching vector of the kernel calls it supports.  
Maybe this is a bit in the syscall vector.  The point being that not all calls would be 
supported, only a small set.

The syscall args would be packaged and managed like ioctl does it now.  This 
is an extra dereference in the syscall processing to validate the struct and 
copy the args in/out.  The same applies to creds only instead of applying them 
to the specific syscall's stack frame, they would go into the "effective" 
uid/gid for the thread.
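
For discussion, here is one way a userspace emulation of this call might look
in C. Everything named *_sketch or *_emul, and the particular list of supported
calls, is an assumption for illustration; the proposal above does not define
them. The sketch also assumes a 64-bit Linux ABI, where the raw SYS_setgroups
call takes 32-bit gids. It does the same six round trips Ganesha does today,
which is exactly what the real syscall would collapse into one.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/fsuid.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical layout; the mail only points at fsal_types.h. */
struct creds_sketch {
    uid_t        fsuid;
    gid_t        fsgid;
    size_t       ngroups;
    const gid_t *groups;
};

/* Arguments packed as a vector, in the order of the syscall prototype. */
struct syscall_args_sketch {
    long arg[6];
};

/* Only a small, filesystem-related set of calls would be accepted. */
static int nr_is_supported(long nr)
{
    switch (nr) {
    case SYS_openat:
    case SYS_mkdirat:
    case SYS_unlinkat:
    case SYS_pwrite64:
        return 1;
    default:
        return 0;
    }
}

static long server_syscall_emul(long nr, const struct syscall_args_sketch *a,
                                const struct creds_sketch *c)
{
    assert(a != NULL && c != NULL);

    if (!nr_is_supported(nr)) {
        errno = ENOSYS;
        return -1;
    }

    /* Raw syscall: affects this thread only, unlike glibc's setgroups(). */
    if (syscall(SYS_setgroups, (long)c->ngroups, c->groups) < 0)
        return -1;                      /* e.g. EPERM when not privileged */
    int old_gid = setfsgid(c->fsgid);   /* returns the previous fsgid */
    int old_uid = setfsuid(c->fsuid);   /* returns the previous fsuid */

    long ret = syscall(nr, a->arg[0], a->arg[1], a->arg[2],
                       a->arg[3], a->arg[4], a->arg[5]);
    int saved_errno = errno;

    /* Revert, as the server does today (groups are not restored). */
    setfsuid(old_uid);
    setfsgid(old_gid);
    errno = saved_errno;
    return ret;
}
```

A caller would fill arg[] exactly as for syscall(2), e.g. dirfd, pathname,
flags and mode for SYS_openat. Anything outside the supported set is refused
with ENOSYS before any credential is touched.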

We save the back and forth across the syscall barrier with slightly more 
overhead per affected call which is less than the multiple roundtrips for 
setfsuid/gid.  As a priv'd syscall, it becomes outside the set of "posix" 
compliance so we can also bypass things like posix lock behavior.  It is also 
expandable without breaking the bank on syscalls or moving ABIs.

Further rationale for this is that the *at calls and handle calls do have more 
general use and therefore fit in the set of general syscalls.  This is an 
enabler for servers that can take over in user space tasks that once were 
mandated into the kernel because of these user masquerading issues.

Last point, No, I haven't researched what the Samba team has lobbied for but I 
suspect that if they are asking for variant syscalls like the *at case, this 
has lower impact.

Jim
-- 
Jim Lieb
Linux Systems Engineer
Panasas Inc.


* Re: Re: [5/8] syscall_cred() a system call that receives alternate CREDs
  2013-04-08 18:23     ` Jim Lieb
@ 2013-04-08 18:31       ` J. Bruce Fields
  2013-04-08 19:45         ` Jim Lieb
  0 siblings, 1 reply; 43+ messages in thread
From: J. Bruce Fields @ 2013-04-08 18:31 UTC (permalink / raw)
  To: Jim Lieb
  Cc: Boaz Harrosh, Steven Whitehouse, Steve Dickson, Jeff Layton,
	lsf-pc, linux-fsdevel, Ganesha NFS List, Frank S Filz,
	Venkateswararao Jujjuri, DENIEL Philippe

On Mon, Apr 08, 2013 at 11:23:14AM -0700, Jim Lieb wrote:
> On Monday, April 08, 2013 10:42:02 J. Bruce Fields wrote:
> > On Mon, Apr 08, 2013 at 01:36:46PM +0300, Boaz Harrosh wrote:
> > > From: Jim Lieb <jlieb@panasas.com>
> > > 
> > > In current NFS Server (Ganesha) lots of operation becomes 6 syscalls
> > > (Or is it 7?)
> > > 
> > > - setfsuid(), setfsgid(), thread_setgroups()
> > > - The OP
> > > - Revert setfsuid(), setfsgid() to root
> > > 
> > > This is because if we do all these file operations as root then
> > > FS will not account for the quota a user have on create files,
> > > data space, and so on.
> > 
> > To make sure I understand, you're saying that:
> > 
> > 	- the behavior you get out of those 6 syscalls is correct,
> > 	- you just want to be able to do exactly the same thing, but
> > 	  with 1 syscall.  (For performance?)
> > 
> > Or is there some other issue?
> 
> I have attached the email I sent around on the nfs-ganesha list with a model 
> api so we know the details.
> 
> Boaz replied "performance" but there are also race conditions to consider.  If 
> we get signals or ??? somewhere in the sequence, what is our state?  Yes, the 
> setfsuid call back to root can still be done but masquerading has any signals 
> etc. be in the context of that user/group and there is one syscall to deal 
> with, not a stream.

Sorry, I don't understand what you're saying here.  Could you give an
example showing a sequence of events with the wrong result?

> There may be selinux/apparmor issues to deal with too.  If we first
> masquerade the thread and then apply all these access checks, as far
> as the kernel is concerned, it is the masqueraded user.

I don't understand here either.

--b.


* Re: [1/8] readdir-plus system call
  2013-04-08 10:22 ` [1/8] readdir-plus system call Boaz Harrosh
  2013-04-08 10:26   ` Steven Whitehouse
  2013-04-08 13:51   ` DENIEL Philippe
@ 2013-04-08 19:02   ` Abhijith Das
  2013-04-10 20:31     ` Andreas Dilger
  2013-05-24 16:14     ` [1/8] readdir-plus system call - LSF/MM follow up Abhijith Das
  2 siblings, 2 replies; 43+ messages in thread
From: Abhijith Das @ 2013-04-08 19:02 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Steven Whitehouse, Steve Dickson, Jeff Layton, lsf-pc,
	linux-fsdevel, Ganesha NFS List, Frank S Filz, J. Bruce Fields,
	Jim Lieb, Venkateswararao Jujjuri, DENIEL Philippe

Hi Boaz/All,

----- Original Message -----
> From: "Boaz Harrosh" <bharrosh@panasas.com>
> To: "Steven Whitehouse" <swhiteho@redhat.com>, "Steve Dickson" <steved@redhat.com>, "Jeff Layton"
> <jlayton@redhat.com>, lsf-pc@lists.linux-foundation.org, "linux-fsdevel" <linux-fsdevel@vger.kernel.org>, "Ganesha
> NFS List" <nfs-ganesha-devel@lists.sourceforge.net>, "Frank S Filz" <ffilz@us.ibm.com>, "J. Bruce Fields"
> <bfields@redhat.com>, "Jim Lieb" <jlieb@panasas.com>, "Venkateswararao Jujjuri" <jvrao@linux.vnet.ibm.com>, "DENIEL
> Philippe" <philippe.deniel@cea.fr>
> Sent: Monday, April 8, 2013 5:22:46 AM
> Subject: [1/8] readdir-plus system call
> 
> By: Steven Whitehouse <swhiteho@redhat.com>)
> 
> I repeat below Steve's original mail. Steve you said you have
> some experimental code, could you post an header and a git URL
> so we can have a look?

The patchset I'm working on is in a local tree, but the latest bits are available in this Red Hat Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=850426#c14

From a GFS2 perspective, the need for such a system call arose from our talks with Samba folks to better support clustered samba over GFS2. The system call simply collects dirents along with stat and extended attributes and copies the info out to the user buffer. This patchset is a first-attempt at tackling this problem from a GFS2 perspective and is mainly a way to get us talking about possible implementations.

As the patches stand right now, the VFS bits are just hooks and all the real work is done in the GFS2 filesystem. However, there are some bits that could be moved into the VFS so other filesystems can utilize them.

For obtaining stat info, I'm making use of VFS bits of the xstat and fxstat system calls that David Howells proposed here : https://lists.samba.org/archive/samba-technical/2012-April/082906.html

There are 4 parts to my readdirplus (xgetdents()) patches:

Patch 1of4 adds the xgetdents() syscall interface, xreaddir() f_op and the linux_xdirent structure that specifies how the collected data is packaged to the user. From the caller's perspective, it behaves very much like the getdents() syscall except for the -EAGAIN return code. This would require the caller to re-issue the syscall with the same parameters.

Patch 2of4 is a gfs2 patch that adds a data structure that is a resizeable buffer backed by a vector of pages. This is used to collect all the intermediate data before writing it out to the user buffer.
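
To make the idea concrete, here is a minimal userspace sketch of a growable buffer backed by a vector of pages. It is not the code from the patch: the names (pagevec_buf, pv_append, ...) and the 4096-byte page size are assumptions for illustration, and malloc() stands in for whatever page allocator the kernel side would use. The point it shows is that growing the buffer only reallocates the small array of page pointers, never the data already stored.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PV_PAGE_SIZE 4096u  /* assumed page size for this sketch */

/* Hypothetical growable buffer: data lives in fixed-size pages, and only
 * the small array of page pointers is reallocated when it grows. */
struct pagevec_buf {
	unsigned char **pages;  /* vector of page pointers */
	size_t npages;          /* pages currently allocated */
	size_t len;             /* bytes stored so far */
};

/* Make sure the buffer can hold at least `need` bytes; 0 on success. */
static int pv_reserve(struct pagevec_buf *b, size_t need)
{
	size_t want = (need + PV_PAGE_SIZE - 1) / PV_PAGE_SIZE;
	unsigned char **v;

	if (want <= b->npages)
		return 0;
	v = realloc(b->pages, want * sizeof(*v));
	if (!v)
		return -1;
	b->pages = v;
	while (b->npages < want) {
		unsigned char *p = malloc(PV_PAGE_SIZE);

		if (!p)
			return -1;
		b->pages[b->npages++] = p;
	}
	return 0;
}

/* Append `n` bytes, splitting the copy at page boundaries. */
static int pv_append(struct pagevec_buf *b, const void *src, size_t n)
{
	const unsigned char *s = src;

	assert(b != NULL && (src != NULL || n == 0));
	if (pv_reserve(b, b->len + n))
		return -1;
	while (n) {
		size_t off = b->len % PV_PAGE_SIZE;
		size_t chunk = PV_PAGE_SIZE - off;

		if (chunk > n)
			chunk = n;
		memcpy(b->pages[b->len / PV_PAGE_SIZE] + off, s, chunk);
		b->len += chunk;
		s += chunk;
		n -= chunk;
	}
	return 0;
}

/* Read one byte back by logical offset. */
static unsigned char pv_byte(const struct pagevec_buf *b, size_t pos)
{
	assert(pos < b->len);
	return b->pages[pos / PV_PAGE_SIZE][pos % PV_PAGE_SIZE];
}

/* Release every page and the pointer vector, leaving an empty buffer. */
static void pv_free(struct pagevec_buf *b)
{
	for (size_t i = 0; i < b->npages; i++)
		free(b->pages[i]);
	free(b->pages);
	memset(b, 0, sizeof(*b));
}
```

A zero-initialized struct pagevec_buf is a valid empty buffer, so a caller can declare one on the stack, call pv_append() repeatedly, and finish with pv_free().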

Patch 3of4 is a simple port of the sort() function from lib/sort.c called ctx_sort(). Only difference is that it takes an additional (void *) opaque context pointer and passes it to the compare() and swap() functions. I needed this to be able to sort pointers stored in the vector of pages buffer.
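
As a rough illustration of the interface shape (not the patch itself), the sketch below sorts an array while handing an opaque context pointer through to the compare and swap callbacks. The names ctx_sort_sketch and cmp_by_key are made up here, and the sketch uses a plain insertion sort for brevity rather than the algorithm in lib/sort.c.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*ctx_cmp_fn)(const void *a, const void *b, void *ctx);
typedef void (*ctx_swap_fn)(void *a, void *b, size_t size, void *ctx);

/* Default byte-wise swap, used when the caller passes no swap callback. */
static void ctx_default_swap(void *a, void *b, size_t size, void *ctx)
{
	unsigned char *x = a, *y = b;

	(void)ctx;
	while (size--) {
		unsigned char t = *x;

		*x++ = *y;
		*y++ = t;
	}
}

/* Sort `num` elements of `size` bytes; `ctx` is handed through untouched
 * to every compare and swap call. Insertion sort, for brevity only. */
static void ctx_sort_sketch(void *base, size_t num, size_t size,
			    ctx_cmp_fn cmp, ctx_swap_fn swap, void *ctx)
{
	unsigned char *p = base;

	assert(base != NULL && cmp != NULL && size > 0);
	if (!swap)
		swap = ctx_default_swap;
	for (size_t i = 1; i < num; i++)
		for (size_t j = i; j > 0 &&
		     cmp(p + (j - 1) * size, p + j * size, ctx) > 0; j--)
			swap(p + (j - 1) * size, p + j * size, size, ctx);
}

/* Example comparator: elements are indices into a key table held in ctx. */
static int cmp_by_key(const void *a, const void *b, void *ctx)
{
	const unsigned long *keys = ctx;
	unsigned long ka = keys[*(const unsigned int *)a];
	unsigned long kb = keys[*(const unsigned int *)b];

	return (ka > kb) - (ka < kb);
}
```

The comparator shows why the context pointer matters: the array being sorted holds only small handles, and the data needed to order them lives elsewhere and is reached through ctx.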

Patch 4of4 has GFS2's implementation of the xreaddir() f_op and all its supporting functions. gfs2_xreaddir() tries to collect the requested data efficiently by ordering disk block accesses based on the filesystem's on-disk layout and also by adjusting the resizeable buffer as needed.

In my quick testing with a 50,000 file directory, xgetdents() is at least twice as fast as getdents()+stat()+getxattr() with a cold cache and nearly thrice as fast when the disk blocks have been cached.

Cheers!
--Abhi

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Re: Re: [5/8] syscall_cred() a system call that receives alternate CREDs
  2013-04-08 18:31       ` J. Bruce Fields
@ 2013-04-08 19:45         ` Jim Lieb
  2013-04-08 21:33           ` Boaz Harrosh
  0 siblings, 1 reply; 43+ messages in thread
From: Jim Lieb @ 2013-04-08 19:45 UTC (permalink / raw)
  To: J. Bruce Fields, Frank S Filz
  Cc: Boaz Harrosh, Steven Whitehouse, Steve Dickson, Jeff Layton,
	lsf-pc, linux-fsdevel, Venkateswararao Jujjuri, DENIEL Philippe

On Monday, April 08, 2013 14:31:20 J. Bruce Fields wrote:
> On Mon, Apr 08, 2013 at 11:23:14AM -0700, Jim Lieb wrote:
> > On Monday, April 08, 2013 10:42:02 J. Bruce Fields wrote:
> > > On Mon, Apr 08, 2013 at 01:36:46PM +0300, Boaz Harrosh wrote:
> > > > From: Jim Lieb <jlieb@panasas.com>
> > > > 
> > > > In current NFS Server (Ganesha) lots of operation becomes 6 syscalls
> > > > (Or is it 7?)
> > > > 
> > > > - setfsuid(), setfsgid(), thread_setgroups()
> > > > - The OP
> > > > - Revert setfsuid(), setfsgid() to root
> > > > 
> > > > This is because if we do all these file operations as root then
> > > > FS will not account for the quota a user have on create files,
> > > > data space, and so on.
> > > 
> > > To make sure I understand, you're saying that:
> > > 	- the behavior you get out of those 6 syscalls is correct,
> > > 	- you just want to be able to do exactly the same thing, but
> > > 	
> > > 	  with 1 syscall.  (For performance?)
> > > 
> > > Or is there some other issue?
> > 
> > I have attached the email I sent around on the nfs-ganesha list with a
> > model api so we know the details.
> > 
> > Boaz replied "performance" but there are also race conditions to consider.
> >  If we get signals or ??? somewhere in the sequence, what is our state? 
> > Yes, the setfsuid call back to root can still be done but masquerading
> > has any signals etc. be in the context of that user/group and there is
> > one syscall to deal with, not a stream.
> 
> Sorry, I don't understand what you're saying here.  Could you give an
> example showing a sequence of events with the wrong result?

We are setting user, primary group, and alt groups in sequence before we do 
the actual work (read/write/...).  This is a potential TOCTOU race.  Granted, 
there is little/no real atomic guarantee but implied in the syscall model is 
that creds don't change for the duration of a syscall.  We go back to 
userspace multiple times with creds in intermediate state(s).  Signals can 
happen anytime but are only checked on the way back out of the syscall or we 
can hold them off at critical times within a single syscall.  Which syscall is 
the one where the signal occurred?  In our case, we minimally use signals 
(do no i/o etc.) but they are still there.  If it is one syscall, we know.
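
A minimal sketch of the multi-call sequence described above, for Linux only. It is not the nfs-ganesha code: become_client, revert_identity and struct fs_ident are hypothetical names, and the order of the steps here is an arbitrary choice. It uses the raw setgroups syscall (which changes only the calling thread, while the glibc wrapper applies the change to all threads) together with setfsgid() and setfsuid(). On some 32-bit ABIs the raw syscall with full-width gids is SYS_setgroups32 instead.

```c
#include <assert.h>
#include <sys/fsuid.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Saved filesystem identity, so the caller can revert afterwards. */
struct fs_ident {
	uid_t fsuid;
	gid_t fsgid;
};

/*
 * Switch the calling thread's filesystem identity. Each step is a separate
 * syscall, so the thread returns to userspace with partially updated creds
 * between them. `groups` may be NULL to leave supplementary groups alone.
 */
static int become_client(uid_t uid, gid_t gid, const gid_t *groups,
			 size_t ngroups, struct fs_ident *saved)
{
	assert(saved != NULL);
	/* Raw syscall: affects only this thread, unlike glibc's setgroups(). */
	if (groups && syscall(SYS_setgroups, ngroups, groups) != 0)
		return -1;
	saved->fsgid = (gid_t)setfsgid(gid);
	saved->fsuid = (uid_t)setfsuid(uid);
	/* setfsuid() reports no errors; passing -1 only reads the value back. */
	if ((uid_t)setfsuid((uid_t)-1) != uid ||
	    (gid_t)setfsgid((gid_t)-1) != gid)
		return -1;
	return 0;
}

/* Undo become_client(); again two separate syscalls. */
static void revert_identity(const struct fs_ident *saved)
{
	setfsuid(saved->fsuid);
	setfsgid(saved->fsgid);
}
```

Counting the read-back checks, that is up to seven syscalls around the one operation that matters, and a signal can be delivered between any two of them.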

We currently have an RFC implementation of a "creds wrapper" but it is still 
in flux and the coding of all these calls to "get it right" is ugly.  One 
call, done right, would be much better.

We also have a problem with the setgroups.  We escape in Linux because the 
kernel doesn't do it process wide and glibc fakes it.  I don't want to depend 
on that.  In FreeBSD, we can't do it at all since the creds are shared at the 
proc level.  Note that I am constrained to think about portability and it's 
easier to sell a new syscall than to hack fundamental kernel structures which 
is why the "do to all" bit is in glibc...

> 
> > There may be selinux/apparmor issues to deal with too.  If we first
> > masquerade the thread and then apply all these access checks, as far
> > as the kernel is concerned, it is the masqueraded user.
> 
> I don't understand here either.

There is the security context nfs-ganesha would live in but actions on behalf 
of clients are (or will be in 4.2+) in the context of the client.  This is 
outside my expertise but I'd like to have a "masquerading" framework in place 
where it could be added in a known way, or at least we are thinking about it.

Capabilities have also been thrown into the mix.  I will be the first to defer 
to the selinux/apparmor heavies but I'd like to have all that capability 
constricted down to one syscall that can be controlled, i.e. selinux says only 
real samba and real nfs-ganesha can do this call.
> 
> --b.
-- 
Jim Lieb
Linux Systems Engineer
Panasas Inc.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [5/8] syscall_cred() a system call that receives alternate CREDs
  2013-04-08 19:45         ` Jim Lieb
@ 2013-04-08 21:33           ` Boaz Harrosh
  2013-04-09 16:40             ` Jim Lieb
  0 siblings, 1 reply; 43+ messages in thread
From: Boaz Harrosh @ 2013-04-08 21:33 UTC (permalink / raw)
  To: Jim Lieb
  Cc: J. Bruce Fields, Frank S Filz, Steven Whitehouse, Steve Dickson,
	Jeff Layton, lsf-pc, linux-fsdevel, Venkateswararao Jujjuri,
	DENIEL Philippe

On 08/04/13 22:45, Jim Lieb wrote:
> We are setting user, primary group, and alt groups in sequence before we do 
> the actual work (read/write/...).  This is a potential TOCTOU race.  Granted, 
> there is little/no real atomic guarantee but implied in the syscall model is 
> that creds don't change for the duration of a syscall.  We go back to 
> userspace multiple times with creds in intermediate state(s).  Signals can 
> happen anytime but are only checked on the way back out of the syscall or we 
> can hold them off at critical times within a single syscall.  Which syscall is 
> is the one where the signal occurred?  In our case, we minimally use signals 
> (do no i/o etc.) but they are still there.  If it is one syscall, we know.
> 

signals are crap long before our case. I would not really care about signals.
The performance argument is good enough for me.

> We currently have an RFC implementation of a "creds wrapper" but it is still 
> in flux and the codiing of all these calls to "get it right" is ugly.  One 
> call, done right would be much better.
> 
> We also have a problem with the setgroups.  We escape in Linux because the 
> kernel doesn't do it process wide and glibc fakes it.  I don't want to depend 
> on that.  In FreeBSD, we can't do it at all since the creds are shared at the 

Sachin found that FBSD might be exactly the same as Linux here. Please talk to
him to make sure?

> proc level.  Note that I am constrained to think about portability and it's 
> easier to sell a new syscall than to hack fundamental kernel structures which 
> is why the "do to all" bit is in glibc...
> 

But yes a new syscall with new defined semantics is definitely a better way
to go. Just for the sake of thread_setgroups such a call is well worth it.
Completely bypass POSIX and be done with it. So strongly yes here.

Bruce, got the point? Current code with the undocumented thread_setgroups is a
POSIX hack. And will break at any given moment in time. Only a new syscall with
defined semantics will ever be correct.

Thanks
Boaz


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Re: [5/8] syscall_cred() a system call that receives alternate CREDs
  2013-04-08 21:33           ` Boaz Harrosh
@ 2013-04-09 16:40             ` Jim Lieb
  0 siblings, 0 replies; 43+ messages in thread
From: Jim Lieb @ 2013-04-09 16:40 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: J. Bruce Fields, Frank S Filz, Steven Whitehouse, Steve Dickson,
	Jeff Layton, lsf-pc, linux-fsdevel, Venkateswararao Jujjuri,
	DENIEL Philippe

On Tuesday, April 09, 2013 00:33:32 Boaz Harrosh wrote:
> On 08/04/13 22:45, Jim Lieb wrote:
> > We are setting user, primary group, and alt groups in sequence before we
> > do the actual work (read/write/...).  This is a potential TOCTOU race.
> > Granted, there is little/no real atomic guarantee but implied in the
> > syscall model is that creds don't change for the duration of a syscall.
> > We go back to userspace multiple times with creds in intermediate
> > state(s).  Signals can happen anytime but are only checked on the way
> > back out of the syscall or we can hold them off at critical times within
> > a single syscall.  Which syscall is the one where the signal occurred?
> > In our case, we minimally use signals (do no i/o etc.) but they are
> > still there.  If it is one syscall, we know.
> 
> signals are crap long before our case. I would not really care about
> signals. The performance argument is good enough for me.

In our case, true.  But this is a syscall and we will be stuck with it 
forever.  The reason for this syscall request, along with the locking request, 
is that they were too narrow in scope.  This is, or can be, an issue for 
non-threaded apps that do use polling, signaling, and all the rest.

> 
> > We currently have an RFC implementation of a "creds wrapper" but it is
> > still in flux and the codiing of all these calls to "get it right" is
> > ugly.  One call, done right would be much better.
> > 
> > We also have a problem with the setgroups.  We escape in Linux because the
> > kernel doesn't do it process wide and glibc fakes it.  I don't want to
> > depend on that.  In FreeBSD, we can't do it at all since the creds are
> > shared at the
> Sachin found that FBSD might be exactly the same as Linux here. Please talk
> to him to make sure?

I checked with Herb et al and they have a hack for CIFS (which scares me a 
bit...) but they confirmed that stock FreeBSD has the creds as a property of 
the proc, not a thread.  Their structure sounds similar to Mach, aka OSF/1, 
aka Digital UNIX which had thread structs only contain the thread specific 
state and left things like creds and open fds in the proc struct.  To handle 
their CIFS need, they added a creds_like (emphasis on the "like") struct that 
gets allocated on a CIFS related event (not sure on details) but this is 
specific to their kernel.  Stock FreeBSD doesn't have it and it doesn't sound 
like we can use it.

> 
> > proc level.  Note that I am constrained to think about portability and
> > it's easier to sell a new syscall than to hack fundamental kernel
> > structures which is why the "do to all" bit is in glibc...
> 
> But yes a new syscall with new defined semantics is definitely a better way
> to go. Just for the sake of thread_setgroups such a call is well worth it.
> Completely bypass POSIX and be done with it. So strongly yes here.
> 
> Bruce, got the point? Current code with the undocumented thread_setgroups is
> a POSIX hack. And will break at any given moment in time. Only a new
> syscall with defined semantics will ever be correct.

And my comments about selinux are to explore/define/get_alarmed about that 
dimension as well given that selinux and friends are another layer/alternate 
universe of access control.  Hmm.  Might need a flags arg in the api...

> 
> Thanks
> Boaz
-- 
Jim Lieb
Linux Systems Engineer
Panasas Inc.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [1/8] readdir-plus system call
  2013-04-08 19:02   ` Abhijith Das
@ 2013-04-10 20:31     ` Andreas Dilger
  2013-05-24 16:14     ` [1/8] readdir-plus system call - LSF/MM follow up Abhijith Das
  1 sibling, 0 replies; 43+ messages in thread
From: Andreas Dilger @ 2013-04-10 20:31 UTC (permalink / raw)
  To: Abhijith Das
  Cc: Boaz Harrosh, Steven Whitehouse, Steve Dickson, Jeff Layton,
	lsf-pc, linux-fsdevel, Ganesha NFS List, Frank S Filz,
	J. Bruce Fields, Jim Lieb, Venkateswararao Jujjuri,
	DENIEL Philippe

On 2013-04-08, at 1:02 PM, Abhijith Das wrote:
> For obtaining stat info, I'm making use of VFS bits of the xstat and fxstat system calls that David Howells proposed here : https://lists.samba.org/archive/samba-technical/2012-April/082906.html

The fxstat_at() system calls probably deserve discussion by themselves.
I think the last time they were discussed we were pretty close to
agreement on what they should (and shouldn't) do.

Cheers, Andreas






^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [1/8] readdir-plus system call - LSF/MM follow up
  2013-04-08 19:02   ` Abhijith Das
  2013-04-10 20:31     ` Andreas Dilger
@ 2013-05-24 16:14     ` Abhijith Das
  2013-05-24 19:41       ` Zach Brown
  1 sibling, 1 reply; 43+ messages in thread
From: Abhijith Das @ 2013-05-24 16:14 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Steven Whitehouse, Steve Dickson, Jeff Layton, lsf-pc,
	linux-fsdevel, Ganesha NFS List, Frank S Filz, J. Bruce Fields,
	Jim Lieb, Venkateswararao Jujjuri, DENIEL Philippe, Dave Chinner

Hi all,

Hoping to revive the discussion for $SUBJECT since we ran out of time when Boaz brought it up at LSF.
Summary of what was discussed:

- readdirplus syscall can be modeled after NFS' internal readdirplus implementation.
- Need for a directory version counter (change count)
- Need for each entry to have an opaque resume key - The linux_dirent.d_off in getdents(2) does this somewhat.
- Header at top of the returned data with bits to signify what's inside.
- What data to return? entries + stat + xattrs/acls?

The fs/kernel guys were opposed to tossing xattrs/acls into the mix - I tend to agree, after having worked on a draft readdirplus syscall on GFS2 that does xattrs in addition to stat.

The potentially large amount of variable length data to handle and the alloc/realloc/dealloc of said data makes the code quite complicated and hence, difficult to maintain. I had to write a new page-backed resizeable buffer to make this worthwhile (performance was actually worse with kmalloc & friends and kmap/kunmap compared to simply doing getdents()+stat()+getxattr()).

For those who are interested, here are the patches (description in previous email below): https://bugzilla.redhat.com/show_bug.cgi?id=850426#c14
There's an interesting seekwatcher graph on there too that compares the two cases. With a cold cache, almost all the speedup obtained by readdirplus is by being able to order all the disk reads. I've seen a 2x speedup (cold cache) with my test directories, but not much more. When the relevant disk blocks are in cache, readdirplus is about 3x faster - I attribute it to the minimal allocing and user/kernel mode switching that goes on.

We might also get decent performance by simply having a system call that takes the directory as argument and goes off and pre-fetches all the relevant blocks required to do subsequent getdents()+stat()+getxattr() efficiently.

Thoughts?

Cheers!
--Abhi

----- Original Message -----
> From: "Abhijith Das" <adas@redhat.com>
> To: "Boaz Harrosh" <bharrosh@panasas.com>
> Cc: "Steven Whitehouse" <swhiteho@redhat.com>, "Steve Dickson" <steved@redhat.com>, "Jeff Layton"
> <jlayton@redhat.com>, lsf-pc@lists.linux-foundation.org, "linux-fsdevel" <linux-fsdevel@vger.kernel.org>, "Ganesha
> NFS List" <nfs-ganesha-devel@lists.sourceforge.net>, "Frank S Filz" <ffilz@us.ibm.com>, "J. Bruce Fields"
> <bfields@redhat.com>, "Jim Lieb" <jlieb@panasas.com>, "Venkateswararao Jujjuri" <jvrao@linux.vnet.ibm.com>, "DENIEL
> Philippe" <philippe.deniel@cea.fr>
> Sent: Monday, April 8, 2013 2:02:40 PM
> Subject: Re: [1/8] readdir-plus system call
> 
> Hi Boaz/All,
> 
> ----- Original Message -----
> > From: "Boaz Harrosh" <bharrosh@panasas.com>
> > To: "Steven Whitehouse" <swhiteho@redhat.com>, "Steve Dickson"
> > <steved@redhat.com>, "Jeff Layton"
> > <jlayton@redhat.com>, lsf-pc@lists.linux-foundation.org, "linux-fsdevel"
> > <linux-fsdevel@vger.kernel.org>, "Ganesha
> > NFS List" <nfs-ganesha-devel@lists.sourceforge.net>, "Frank S Filz"
> > <ffilz@us.ibm.com>, "J. Bruce Fields"
> > <bfields@redhat.com>, "Jim Lieb" <jlieb@panasas.com>, "Venkateswararao
> > Jujjuri" <jvrao@linux.vnet.ibm.com>, "DENIEL
> > Philippe" <philippe.deniel@cea.fr>
> > Sent: Monday, April 8, 2013 5:22:46 AM
> > Subject: [1/8] readdir-plus system call
> > 
> > By: Steven Whitehouse <swhiteho@redhat.com>)
> > 
> > I repeat below Steve's original mail. Steve you said you have
> > some experimental code, could you post an header and a git URL
> > so we can have a look?
> 
> The patchset I'm working on is in a local tree, but the latest bits are
> available in this Red Hat Bugzilla:
> https://bugzilla.redhat.com/show_bug.cgi?id=850426#c14
> 
> From a GFS2 perspective, the need for such a system call arose from our talks
> with Samba folks to better support clustered samba over GFS2. The system
> call simply collects dirents along with stat and extended attributes and
> copies the info out to the user buffer. This patchset is a first-attempt at
> tackling this problem from a GFS2 perspective and is mainly a way to get us
> talking about possible implementations.
> 
> As the patches stand right now, the VFS bits are just hooks and all the real
> work is done in the GFS2 filesystem. However, there are some bits that could
> be moved into the VFS so other filesystems can utilize them.
> 
> For obtaining stat info, I'm making use of VFS bits of the xstat and fxstat
> system calls that David Howells proposed here :
> https://lists.samba.org/archive/samba-technical/2012-April/082906.html
> 
> There are 4 parts to my readdirplus (xgetdents()) patches:
> 
> Patch 1of4 adds the xgetdents() syscall interface, xreaddir() f_op and the
> linux_xdirent structure that specifies how the collected data is packaged to
> the user. From the caller's perspective, it behaves very much like the
> getdents() syscall except for the -EAGAIN return code. This would require
> the caller to re-issue the syscall with the same parameters.
> 
> Patch 2of4 is a gfs2 patch that adds a data structure that is a resizeable
> buffer backed by a vector of pages. This is used to collect all the
> intermediate data before writing it out to the user buffer.
> 
> Patch 3of4 is a simple port of the sort() function from lib/sort.c called
> ctx_sort(). Only difference is that it takes an additional (void *) opaque
> context pointer and passes it to the compare() and swap() functions. I
> needed this to be able to sort pointers stored in the vector of pages
> buffer.
> 
> Patch 4of4 has GFS2's implementation of the xreaddir() f_op and all its
> supporting functions. gfs2_xreaddir() tries to collect the requested data
> efficiently by ordering disk block accesses based on the filesystem's
> on-disk layout and also by adjusting the resizeable buffer as needed.
> 
> In my quick testing with a 50,000 file directory, xgetdents() is at least
> twice as fast as getdents()+stat()+getxattr() with a cold cache and nearly
> thrice as fast when the disk blocks have been cached.
> 
> Cheers!
> --Abhi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [1/8] readdir-plus system call - LSF/MM follow up
  2013-05-24 16:14     ` [1/8] readdir-plus system call - LSF/MM follow up Abhijith Das
@ 2013-05-24 19:41       ` Zach Brown
  2013-05-28 14:49         ` Abhijith Das
  0 siblings, 1 reply; 43+ messages in thread
From: Zach Brown @ 2013-05-24 19:41 UTC (permalink / raw)
  To: Abhijith Das
  Cc: Boaz Harrosh, Steven Whitehouse, Steve Dickson, Jeff Layton,
	lsf-pc, linux-fsdevel, Ganesha NFS List, Frank S Filz,
	J. Bruce Fields, Jim Lieb, Venkateswararao Jujjuri,
	DENIEL Philippe, Dave Chinner

On Fri, May 24, 2013 at 12:14:59PM -0400, Abhijith Das wrote:
> 
> Hoping to revive the discussion for $SUBJECT since we ran out of time
> when Boaz brought it up at LSF.
> [ ... ]
> For those who are interested, here are the patches (description in
> previous email below):
> https://bugzilla.redhat.com/show_bug.cgi?id=850426#c14

Some quick things that struck me as I glanced through the patches:

- Please post the patch series, don't make us go digging through
  bugzilla.

- Don't use variable size types in the ABI or you'll have to add compat_
  wrappers to fix it all up on the stack when going between 32bit
  userspace and 64bit kernelspace.  This is going to be especially nasty
  if this is a giant sequence of variable length blobs. 

  +struct linux_xdirent {
  +	unsigned long        xd_ino;
  +	char                 xd_type;
  +	unsigned long        xd_off;
  +	struct xstat         xd_stat;
  +	unsigned long        xd_reclen;
  +	struct xdirent_blob  xd_blob;
  +};

  Notice how, in contrast, David was careful to use naturally aligned
  fixed-width types in his xstat patch.
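
  As a rough sketch of what that could look like (a hypothetical layout,
  not taken from any posted patch; the xstat and blob members are left
  out, and the names and field order are assumptions):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical fixed-width record header (not from any posted patch).
 * Every field has an explicit size and sits on its natural alignment, so
 * 32-bit and 64-bit userspace see the same layout and no compat_ wrapper
 * is needed. A variable-length payload would follow, padded to 8 bytes.
 */
struct xdirent_fixed {
	uint64_t xd_ino;     /* inode number */
	uint64_t xd_off;     /* opaque resume offset */
	uint32_t xd_reclen;  /* total record length, including payload */
	uint8_t  xd_type;    /* DT_* file type */
	uint8_t  xd_pad[3];  /* explicit padding, must be zero */
};

/* Compile-time checks that the layout is what the comments claim. */
static_assert(offsetof(struct xdirent_fixed, xd_off) == 8, "xd_off");
static_assert(offsetof(struct xdirent_fixed, xd_reclen) == 16, "xd_reclen");
static_assert(offsetof(struct xdirent_fixed, xd_type) == 20, "xd_type");
static_assert(sizeof(struct xdirent_fixed) == 24, "record header size");

/* Round a payload length up so the next record stays 8-byte aligned. */
static inline uint32_t xd_record_len(uint32_t payload_len)
{
	return (uint32_t)((sizeof(struct xdirent_fixed) + payload_len + 7) &
			  ~(size_t)7);
}
```

  With "unsigned long" the same struct would be 4 bytes per field on a
  32-bit build and 8 on a 64-bit one, which is exactly what forces the
  compat_ translation.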

- z

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [1/8] readdir-plus system call - LSF/MM follow up
  2013-05-24 19:41       ` Zach Brown
@ 2013-05-28 14:49         ` Abhijith Das
  2013-05-28 15:13           ` Jim Lieb
  2013-05-28 20:00           ` Andreas Dilger
  0 siblings, 2 replies; 43+ messages in thread
From: Abhijith Das @ 2013-05-28 14:49 UTC (permalink / raw)
  To: Zach Brown
  Cc: Boaz Harrosh, Steven Whitehouse, Steve Dickson, Jeff Layton,
	lsf-pc, linux-fsdevel, Ganesha NFS List, Frank S Filz,
	J. Bruce Fields, Jim Lieb, Venkateswararao Jujjuri,
	DENIEL Philippe, Dave Chinner

Zach, thanks for taking a peek at the patches.

> 
> Some quick things that struck me as I glanced through the patches:
> 
> - Please post the patch series, don't make us go digging through
>   bugzilla.

Duly noted. I wasn't hoping for my patches to be a serious submission, just something I wrote up as a POC. I was mainly looking to (re)start a conversation about readdirplus to see what's the best way to go about doing this. Your point applies nonetheless; I'll post my patchset again, properly.

> 
> - Don't use variable size types in the ABI or you'll have to add compat_
>   wrappers to fix it all up on the stack when going between 32bit
>   userspace and 64bit kernelspace.  This is going to be especially nasty
>   if this is a giant sequence of variable length blobs.
> 
>   +struct linux_xdirent {
>   +	unsigned long        xd_ino;
>   +	char                 xd_type;
>   +	unsigned long        xd_off;
>   +	struct xstat         xd_stat;
>   +	unsigned long        xd_reclen;
>   +	struct xdirent_blob  xd_blob;
>   +};
> 
>   Notice how, in contrast, David was careful to use naturally aligned
>   fixed-width types in his xstat patch.
> 

Yes, you're right. I'll fix this.

> - z

Cheers!
--Abhi

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Re: [1/8] readdir-plus system call - LSF/MM follow up
  2013-05-28 14:49         ` Abhijith Das
@ 2013-05-28 15:13           ` Jim Lieb
       [not found]             ` <OF27E1911F.3FBABA22-ON87257B79.005C087F-88257B79.005C320B@us.ibm.com>
  2013-05-28 20:00           ` Andreas Dilger
  1 sibling, 1 reply; 43+ messages in thread
From: Jim Lieb @ 2013-05-28 15:13 UTC (permalink / raw)
  To: Abhijith Das
  Cc: Zach Brown, Boaz Harrosh, Steven Whitehouse, Steve Dickson,
	Jeff Layton, lsf-pc, linux-fsdevel, Ganesha NFS List,
	Frank S Filz, J. Bruce Fields, Venkateswararao Jujjuri,
	DENIEL Philippe, Dave Chinner

On Tuesday, May 28, 2013 10:49:31 Abhijith Das wrote:
> Zack, thanks for taking a peek at the patches.
> 
> > Some quick things that struck me as I glanced through the patches:
> > 
> > - Please post the patch series, don't make us go digging through
> >   bugzilla.
> 
> Duly noted. I wasn't hoping for my patches to be a serious submission, just
> something I wrote up as a POC. I was mainly looking to (re)start a
> conversation about readdirplus to see what's the best way to go about doing
> this. Your point applies nonetheless; I'll post my patchset again,
> properly.

Speaking for the nfs-ganesha project, one of the issues that we couldn't sort 
out at lsf was whether to include xattrs.  As far as ganesha is concerned, 
the xstat struct is sufficient.  At readdir time, we pretty much just want to 
build our cache entries and get basic stat info.  The only case where we'd 
really need xattrs would be for acls but that is usually later in the protocol 
op sequence.  I'd just like to get the path from readdir+ thru the callbacks 
to the xdr into the reply as simple as possible.  That and get enough in the 
cache entry to be ready for the next step.

> > - Don't use variable size types in the ABI or you'll have to add compat_
> >   wrappers to fix it all up on the stack when going between 32bit
> >   userspace and 64bit kernelspace.  This is going to be especially nasty
> >   if this is a giant sequence of variable length blobs.
> >   
> >   +struct linux_xdirent {
> >   +	unsigned long        xd_ino;
> >   +	char                 xd_type;
> >   +	unsigned long        xd_off;
> >   +	struct xstat         xd_stat;
> >   +	unsigned long        xd_reclen;
> >   +	struct xdirent_blob  xd_blob;
> >   +};
> >   
> >   Notice how, in contrast, David was careful to use naturally aligned
> >   fixed-width types in his xstat patch.
> 
> Yes, you're right. I'll fix this.
> 
> > - z
> 
> Cheers!
> --Abhi
-- 
Jim Lieb
Linux Systems Engineer
Panasas Inc.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [1/8] readdir-plus system call - LSF/MM follow up
  2013-05-28 14:49         ` Abhijith Das
  2013-05-28 15:13           ` Jim Lieb
@ 2013-05-28 20:00           ` Andreas Dilger
  2013-05-28 20:11             ` Abhijith Das
  1 sibling, 1 reply; 43+ messages in thread
From: Andreas Dilger @ 2013-05-28 20:00 UTC (permalink / raw)
  To: Abhijith Das
  Cc: Zach Brown, Boaz Harrosh, Steven Whitehouse, Steve Dickson,
	David Howells, Jeff Layton, lsf-pc, linux-fsdevel,
	Ganesha NFS List, Frank S Filz, J. Bruce Fields, Jim Lieb,
	Venkateswararao Jujjuri, DENIEL Philippe, Dave Chinner

On 2013-05-28, at 8:49 AM, Abhijith Das wrote:
> Zack, thanks for taking a peek at the patches.

It wasn't clear to me which version of the xstat patches you were
basing your work on?  Are these the latest ones from David, or ones
from a mailing list thread?  I've CC'd David, since he might have
newer (and hopefully minimally contentious?) versions of the patches.

>> Some quick things that struck me as I glanced through the patches:
>> 
>> - Please post the patch series, don't make us go digging through
>>  bugzilla.
> 
> Duly noted. I wasn't hoping for my patches to be a serious submission, just something I wrote up as a POC. I was mainly looking to (re)start a conversation about readdirplus to see what's the best way to go about doing this. Your point applies nonetheless; I'll post my patchset again, properly.

It definitely makes sense to start with just the xstat patch, since
if we can't get that agreed upon and landed, there is no hope for
readdirplus to get consensus.

Could you or David submit the latest xstat patch to the list?

Cheers, Andreas

>> - Don't use variable size types in the ABI or you'll have to add compat_
>>  wrappers to fix it all up on the stack when going between 32bit
>>  userspace and 64bit kernelspace.  This is going to be especially nasty
>>  if this is a giant sequence of variable length blobs.
>> 
>>  +struct linux_xdirent {
>>  +	unsigned long        xd_ino;
>>  +	char                 xd_type;
>>  +	unsigned long        xd_off;
>>  +	struct xstat         xd_stat;
>>  +	unsigned long        xd_reclen;
>>  +	struct xdirent_blob  xd_blob;
>>  +};
>> 
>>  Notice how, in contrast, David was careful to use naturally aligned
>>  fixed-width types in his xstat patch.
>> 
> 
> Yes, you're right. I'll fix this.
> 
>> - z
> 
> Cheers!
> --Abhi


Cheers, Andreas






^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [1/8] readdir-plus system call - LSF/MM follow up
  2013-05-28 20:00           ` Andreas Dilger
@ 2013-05-28 20:11             ` Abhijith Das
  0 siblings, 0 replies; 43+ messages in thread
From: Abhijith Das @ 2013-05-28 20:11 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Zach Brown, Boaz Harrosh, Steven Whitehouse, Steve Dickson,
	David Howells, Jeff Layton, lsf-pc, linux-fsdevel,
	Ganesha NFS List, Frank S Filz, J. Bruce Fields, Jim Lieb,
	Venkateswararao Jujjuri, DENIEL Philippe, Dave Chinner


----- Original Message -----
> From: "Andreas Dilger" <adilger@dilger.ca>
> To: "Abhijith Das" <adas@redhat.com>
> Cc: "Zach Brown" <zab@redhat.com>, "Boaz Harrosh" <bharrosh@panasas.com>, "Steven Whitehouse" <swhiteho@redhat.com>,
> "Steve Dickson" <steved@redhat.com>, "David Howells" <dhowells@redhat.com>, "Jeff Layton" <jlayton@redhat.com>,
> lsf-pc@lists.linux-foundation.org, "linux-fsdevel" <linux-fsdevel@vger.kernel.org>, "Ganesha NFS List"
> <nfs-ganesha-devel@lists.sourceforge.net>, "Frank S Filz" <ffilz@us.ibm.com>, "J. Bruce Fields"
> <bfields@redhat.com>, "Jim Lieb" <jlieb@panasas.com>, "Venkateswararao Jujjuri" <jvrao@linux.vnet.ibm.com>, "DENIEL
> Philippe" <philippe.deniel@cea.fr>, "Dave Chinner" <dchinner@redhat.com>
> Sent: Tuesday, May 28, 2013 3:00:39 PM
> Subject: Re: [1/8] readdir-plus system call - LSF/MM follow up
> 
> On 2013-05-28, at 8:49 AM, Abhijith Das wrote:
> > Zack, thanks for taking a peek at the patches.
> 
> It wasn't clear to me which version of the xstat patches you were
> basing your work on?  Are these the latest ones from David, or ones
> from a mailing list thread?  I've CC'd David, since he might have
> newer (and hopefully minimally contentious?) versions of the patches.

I've based my patches on an older mailing-list version of David's xstat patches. I'll defer to David to post/point to a newer version.

Cheers!
--Abhi

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Re: Re: [1/8] readdir-plus system call - LSF/MM follow up
       [not found]             ` <OF27E1911F.3FBABA22-ON87257B79.005C087F-88257B79.005C320B@us.ibm.com>
@ 2013-05-29  0:57               ` Jim Lieb
       [not found]                 ` <OF067A3B49.F63109B6-ON87257B7A.00137A60-88257B7A.00140BC7@us.ibm.com>
  0 siblings, 1 reply; 43+ messages in thread
From: Jim Lieb @ 2013-05-29  0:57 UTC (permalink / raw)
  To: Frank S Filz
  Cc: Abhijith Das, J. Bruce Fields, Boaz Harrosh, Dave Chinner,
	Jeff Layton, Venkateswararao Jujjuri, linux-fsdevel, lsf-pc,
	Ganesha NFS List, DENIEL Philippe, Steve Dickson,
	Steven Whitehouse, Zach Brown

On Tuesday, May 28, 2013 09:47:01 Frank S Filz wrote:
> Jim Lieb <jlieb@panasas.com> wrote on 05/28/2013 08:13:02 AM:
> > On Tuesday, May 28, 2013 10:49:31 Abhijith Das wrote:
> > > Zack, thanks for taking a peek at the patches.
> > > 
> > > > Some quick things that struck me as I glanced through the patches:
> > > > 
> > > > - Please post the patch series, don't make us go digging through
> > > >   bugzilla.
> > > 
> > > Duly noted. I wasn't hoping for my patches to be a serious submission,
> > > just something I wrote up as a POC. I was mainly looking to (re)start a
> > > conversation about readdirplus to see what's the best way to go about
> > > doing this. Your point applies nonetheless; I'll post my patchset again,
> > > properly.
> > 
> > Speaking for the nfs-ganesha project, one of the issues that we couldn't
> > sort out at lsf was whether to include xattrs.  As far as ganesha is
> > concerned, the xstat struct is sufficient.  At readdir time, we pretty
> > much just want to build our cache entries and get basic stat info.  The
> > only case where we'd really need xattrs would be for acls but that is
> > usually later in the protocol op sequence.  I'd just like to get the path
> > from readdir+ thru the callbacks to the xdr into the reply as simple as
> > possible.  That and get enough in the cache entry to be ready for the
> > next step.
> 
> Actually, ACLs are critical for Ganesha. Unless we decide to have separate
> attr validity bits for "stat" attributes and ACLs, Ganesha will have a
> difficult time knowing if the ACL attribute is up to date (or even
> available).

True enough.  But one of the pushbacks was the amount of work needed to get to
xattrs where acls live.  One thing I heard that made not having acls on the
readdir+ pass acceptable was a status of some kind that indicated "I have
acls..."  The readdir is a dir op and so 10k+ entries need to be minimal
overhead.  We already have the acls of the dir from the lookup.  We don't need
an entry's acls until we do the lookup on it.  At that time we can grab the
acls.  That was the argument as I remember and I'm willing to accept it.  IIRC,
the client is going to send us a getattrs later.  We can do it then.  Is this
reasonable?
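
As a rough sketch of that deferral combined with Frank's separate validity
bits: readdir+ fills in only the stat part and records an "I have acls" hint,
and the ACL is fetched the first time a lookup or getattr asks for it.  All
names here (the ATTR_* bits, struct cache_entry, both functions) are
hypothetical and are not taken from Ganesha or from the xstat patches.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical validity bits for a server-side cache entry. */
#define ATTR_VALID_STAT   (1u << 0)  /* basic stat fields are current */
#define ATTR_VALID_ACL    (1u << 1)  /* ACL state is known and current */
#define ATTR_HINT_HAS_ACL (1u << 2)  /* readdir+ reported "I have acls" */

struct cache_entry {
    uint32_t valid;        /* combination of the ATTR_* bits above */
    uint64_t ino;
    uint32_t mode;
    int      acl_fetches;  /* counts how often the slow path ran */
};

/* Fill an entry from a readdir+ result: stat only, ACL left invalid. */
void fill_from_readdirplus(struct cache_entry *e, uint64_t ino,
                           uint32_t mode, bool has_acl)
{
    e->ino = ino;
    e->mode = mode;
    e->acl_fetches = 0;
    e->valid = ATTR_VALID_STAT;
    if (has_acl)
        e->valid |= ATTR_HINT_HAS_ACL;
}

/* Called at lookup/getattr time; does the expensive fetch at most once. */
void ensure_acl(struct cache_entry *e)
{
    if (e->valid & ATTR_VALID_ACL)
        return;                    /* already current */
    if (e->valid & ATTR_HINT_HAS_ACL)
        e->acl_fetches++;          /* stand-in for the real xattr read */
    e->valid |= ATTR_VALID_ACL;    /* "no ACL" is also a known state */
}
```

With the two bits kept apart, an entry built by readdir+ can be told apart
from one whose ACL has actually been read.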

> 
> > > > - Don't use variable size types in the ABI or you'll have to add
> > > >   compat_ wrappers to fix it all up on the stack when going between
> > > >   32bit userspace and 64bit kernelspace.  This is going to be
> > > >   especially nasty if this is a giant sequence of variable length
> > > >   blobs.
> > > >   
> > > >   +struct linux_xdirent {
> > > >   +   unsigned long        xd_ino;
> > > >   +   char                 xd_type;
> > > >   +   unsigned long        xd_off;
> > > >   +   struct xstat         xd_stat;
> > > >   +   unsigned long        xd_reclen;
> > > >   +   struct xdirent_blob  xd_blob;
> > > >   +};
> > > >   
> > > >   Notice how, in contrast, David was careful to use naturally aligned
> > > >   fixed-width types in his xstat patch.
> > > 
> > > Yes, you're right. I'll fix this.
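
For illustration only, the fixed-width variant Zach is asking for might look
roughly like the sketch below.  The field names follow the quoted struct, but
the widths, the ordering and the explicit padding are assumptions, and the
xd_stat and xd_blob members are left out because their layouts are not shown
in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-width header for one directory record.  Every
 * field has the same size and offset for 32-bit and 64-bit userspace,
 * so no compat_ wrapper would be needed for this part. */
struct xdirent_fixed {
    uint64_t xd_ino;     /* inode number */
    uint64_t xd_off;     /* offset cookie of the next entry */
    uint32_t xd_reclen;  /* total record length in bytes */
    uint8_t  xd_type;    /* file type of the entry */
    uint8_t  xd_pad[3];  /* explicit padding up to a multiple of 8 */
};

/* Compile-time checks that the layout is what the comments claim. */
_Static_assert(sizeof(struct xdirent_fixed) == 24,
               "record header must be 24 bytes on every ABI");
_Static_assert(offsetof(struct xdirent_fixed, xd_reclen) == 16,
               "xd_reclen must sit at offset 16");
```

Ordering the 64-bit fields first and spelling out the padding means the
compiler has no room to insert ABI-dependent holes.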
> > > 
> > > > - z
> > > 
> > > Cheers!
> > > --Abhi
> > 
> > --
> > Jim Lieb
> > Linux Systems Engineer
> > Panasas Inc.
-- 
Jim Lieb
Linux Systems Engineer
Panasas Inc.


* Re: [1/8] readdir-plus system call - LSF/MM follow up
       [not found]                 ` <OF067A3B49.F63109B6-ON87257B7A.00137A60-88257B7A.00140BC7@us.ibm.com>
@ 2013-05-29 10:06                   ` Jeff Layton
  2013-05-29 14:04                     ` J. Bruce Fields
  2013-05-29 16:52                   ` Re: Re: " Jim Lieb
  1 sibling, 1 reply; 43+ messages in thread
From: Jeff Layton @ 2013-05-29 10:06 UTC (permalink / raw)
  To: Frank S Filz
  Cc: Jim Lieb, Abhijith Das, J. Bruce Fields, Boaz Harrosh,
	Dave Chinner, Venkateswararao Jujjuri, linux-fsdevel, lsf-pc,
	Ganesha NFS List, DENIEL Philippe, Steve Dickson,
	Steven Whitehouse, Zach Brown

On Tue, 28 May 2013 20:38:57 -0700
Frank S Filz <ffilz@us.ibm.com> wrote:

> 
> Jim Lieb <jlieb@panasas.com> wrote on 05/28/2013 05:57:31 PM:
> > > Actually, ACLs are critical for Ganesha. Unless we decide to have separate
> > > attr validity bits for "stat" attributes and ACLs, Ganesha will have a
> > > difficult time knowing if the ACL attribute is up to date (or even
> > > available).
> >
> > True enough.  But one of the pushbacks was the amount of work needed to get to
> > xattrs where acls live.  One thing I heard that made not having acls on the
> > readdir+ pass was a status of some kind that indicated "I have acls..."  The
> > readdir is a dir op and so 10k+ entries need to be minimal overhead.  we
> > already have the acls of the dir from the lookup.  we don't need an entry's
> > acls until we do the lookup on it.  at that time we can grab the acls.  That
> > was the argument as I remember and I'm willing to accept it.  IIRC, the client
> > is going to send us a getattrs later.  we can do it then.  Is this reasonable?
> 
> The ACL COULD be required on READDIR, though I would not expect any clients
> to ask for ACL on READDIR (though it sure would be handy if Ganesha's PROXY
> client could do so...).
> 
> Fortunately we don't enforce ACE4_READ_ATTR, otherwise we WOULD need ACL on
> any READDIR...
> 
> If there are times when we get attrs without getting ACL, then we will need
> a separate validity bit for ACL, otherwise we won't be able to tell if we
> have current ACL for an entry or not.
> 
> What would actually be helpful though, and make Ganesha a lot more
> efficient is if we could actually get all the ACLs for a directory in one
> fell swoop with some sort of "compression". Given that a large percentage
> of files actually have the same ACL, we could get a the 1-4 ACLs that
> apply, and then a bunch of entries, each indicating which of the 4 ACLs
> they have.
> 
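
For what it's worth, the "compression" Frank describes amounts to a small
table of distinct ACLs plus one index per directory entry.  A minimal sketch,
assuming ACLs can be compared as opaque strings; struct acl_table,
acl_intern() and the limit of 4 are hypothetical and only mirror the "1-4
ACLs" figure above.

```c
#include <stddef.h>
#include <string.h>

#define MAX_DISTINCT_ACLS 4  /* the "1-4 ACLs" above; an assumption */

/* Table of the distinct ACLs seen in one directory. */
struct acl_table {
    const char *acls[MAX_DISTINCT_ACLS];  /* borrowed pointers, not copies */
    size_t count;
};

/* Return the table index for `acl`, adding it if it is new.
 * Returns -1 when the table is full and the ACL is not in it. */
int acl_intern(struct acl_table *t, const char *acl)
{
    for (size_t i = 0; i < t->count; i++)
        if (strcmp(t->acls[i], acl) == 0)
            return (int)i;
    if (t->count == MAX_DISTINCT_ACLS)
        return -1;
    t->acls[t->count] = acl;
    return (int)t->count++;
}
```

Each returned entry would then carry only the small index, and the table
itself would be sent once per directory.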

Most NFS clients aren't going to need ACLs during a READDIR operation.
I'll go so far as to say that most NFS clients don't care *at all*
about ACLs. Those are things that are enforced by the server and the
client doesn't really care to know about them.

The exception is when a client gets an explicit request to either view
or change the ACL. For Linux clients (and most other POSIX-y ones),
that's never done in any sort of batch form. It's always an operation
done against a single dentry.

So, I'm not sure I understand the argument for adding ACLs here. It's
not likely to be something you're going to end up stuffing into a
READDIR reply.

-- 
Jeff Layton <jlayton@redhat.com>


* Re: [1/8] readdir-plus system call - LSF/MM follow up
  2013-05-29 10:06                   ` Jeff Layton
@ 2013-05-29 14:04                     ` J. Bruce Fields
  2013-06-04 15:38                       ` [Lsf-pc] " Christoph Hellwig
  0 siblings, 1 reply; 43+ messages in thread
From: J. Bruce Fields @ 2013-05-29 14:04 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Frank S Filz, Jim Lieb, Abhijith Das, Boaz Harrosh, Dave Chinner,
	Venkateswararao Jujjuri, linux-fsdevel, lsf-pc, Ganesha NFS List,
	DENIEL Philippe, Steve Dickson, Steven Whitehouse, Zach Brown

On Wed, May 29, 2013 at 06:06:09AM -0400, Jeff Layton wrote:
> On Tue, 28 May 2013 20:38:57 -0700
> Frank S Filz <ffilz@us.ibm.com> wrote:
> 
> > 
> > Jim Lieb <jlieb@panasas.com> wrote on 05/28/2013 05:57:31 PM:
> > > > Actually, ACLs are critical for Ganesha. Unless we decide to have separate
> > > > attr validity bits for "stat" attributes and ACLs, Ganesha will have a
> > > > difficult time knowing if the ACL attribute is up to date (or even
> > > > available).
> > >
> > > True enough.  But one of the pushbacks was the amount of work needed to get to
> > > xattrs where acls live.  One thing I heard that made not having acls on the
> > > readdir+ pass was a status of some kind that indicated "I have acls..."  The
> > > readdir is a dir op and so 10k+ entries need to be minimal overhead.  we
> > > already have the acls of the dir from the lookup.  we don't need an entry's
> > > acls until we do the lookup on it.  at that time we can grab the acls.  That
> > > was the argument as I remember and I'm willing to accept it.  IIRC, the client
> > > is going to send us a getattrs later.  we can do it then.  Is this reasonable?
> > 
> > The ACL COULD be required on READDIR, though I would not expect any clients
> > to ask for ACL on READDIR (though it sure would be handy if Ganesha's PROXY
> > client could do so...).
> > 
> > Fortunately we don't enforce ACE4_READ_ATTR, otherwise we WOULD need ACL on
> > any READDIR...
> > 
> > If there are times when we get attrs without getting ACL, then we will need
> > a separate validity bit for ACL, otherwise we won't be able to tell if we
> > have current ACL for an entry or not.
> > 
> > What would actually be helpful though, and make Ganesha a lot more
> > efficient is if we could actually get all the ACLs for a directory in one
> > fell swoop with some sort of "compression". Given that a large percentage
> > of files actually have the same ACL, we could get a the 1-4 ACLs that
> > apply, and then a bunch of entries, each indicating which of the 4 ACLs
> > they have.
> > 
> 
> Most NFS clients aren't going to need ACLs during a READDIR operation.
> I'll go so far as to say that most NFS clients don't care *at all*
> about ACLs. Those are things that are enforced by the server and the
> client doesn't really care to know about them.
> 
> The exception is when a client gets an explicit request to either view
> or change the ACL. For Linux clients (and most other POSIX-y ones),
> that's never done in any sort of batch form. It's always an operation
> done against a single dentry.

An odd exception: in the presence of "posix" acls, "ls -l" requests an
acl for every entry, so it can decide whether or not to add a "+" after
the mode (which indicates the presence of a non-trivial acl.) Judging
from http://www.bestbits.at/richacl/example.html, the same is intended
(but not yet implemented) for richacls.

Maybe if that case were common, there'd be some advantage to ls being
able to do a readdir plus to the nfs client that the nfs client could
translate into a single readdir to the server?

But I hope it doesn't come to that.

--b.

> So, I'm not sure I understand the argument for adding ACLs here. It's
> not likely to be something you're going to end up stuffing into a
> READDIR reply.
> 
> -- 
> Jeff Layton <jlayton@redhat.com>


* Re: Re: Re: Re: [1/8] readdir-plus system call - LSF/MM follow up
       [not found]                 ` <OF067A3B49.F63109B6-ON87257B7A.00137A60-88257B7A.00140BC7@us.ibm.com>
  2013-05-29 10:06                   ` Jeff Layton
@ 2013-05-29 16:52                   ` Jim Lieb
  1 sibling, 0 replies; 43+ messages in thread
From: Jim Lieb @ 2013-05-29 16:52 UTC (permalink / raw)
  To: Frank S Filz
  Cc: Abhijith Das, J. Bruce Fields, Boaz Harrosh, Dave Chinner,
	Jeff Layton, Venkateswararao Jujjuri, linux-fsdevel, lsf-pc,
	Ganesha NFS List, DENIEL Philippe, Steve Dickson,
	Steven Whitehouse, Zach Brown

On Tuesday, May 28, 2013 20:38:57 Frank S Filz wrote:
> Jim Lieb <jlieb@panasas.com> wrote on 05/28/2013 05:57:31 PM:
> > > Actually, ACLs are critical for Ganesha. Unless we decide to have separate
> > > attr validity bits for "stat" attributes and ACLs, Ganesha will have a
> > > difficult time knowing if the ACL attribute is up to date (or even
> > > available).
> > 
> > True enough.  But one of the pushbacks was the amount of work needed to get to
> > xattrs where acls live.  One thing I heard that made not having acls on the
> > readdir+ pass was a status of some kind that indicated "I have acls..."  The
> > readdir is a dir op and so 10k+ entries need to be minimal overhead.  we
> > already have the acls of the dir from the lookup.  we don't need an entry's
> > acls until we do the lookup on it.  at that time we can grab the acls.  That
> > was the argument as I remember and I'm willing to accept it.  IIRC, the client
> > is going to send us a getattrs later.  we can do it then.  Is this reasonable?
> 
> The ACL COULD be required on READDIR, though I would not expect any clients
> to ask for ACL on READDIR (though it sure would be handy if Ganesha's PROXY
> client could do so...).
> 
> Fortunately we don't enforce ACE4_READ_ATTR, otherwise we WOULD need ACL on
> any READDIR...

That's an acl on the dir itself.  If we carry this further, a recursive ls 
would readdir+ the top level dir, then lookup each dir in its list in order to 
get a handle for the next level of readdir+.  At the time of the lookup is 
when we need the acl for that dir.  If anywhere in the tree we have 5 dirs and 
10k "other" files, regulars, symlinks etc., we only really need the acl above 
for the 5 dirs.  We take the getattr (+get acls) hit 5 times at lookup time 
for that dir.  The 10k other stuff is happy with name, ls -l stuff.
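
To put a number on that, here is a toy count of how many getattr (+get acls)
calls the recursive walk would need if ACLs are fetched only for directories
at lookup time.  The enum and the function are hypothetical stand-ins, not
Ganesha code.

```c
#include <stddef.h>

/* Only the distinction "directory or not" matters for this count. */
enum entry_type { ENTRY_DIR, ENTRY_OTHER };

/* Number of entries that need an ACL fetch when ACLs are read only
 * for directories, at the time each directory is looked up. */
size_t acl_fetches_needed(const enum entry_type *types, size_t n)
{
    size_t fetches = 0;

    for (size_t i = 0; i < n; i++)
        if (types[i] == ENTRY_DIR)
            fetches++;
    return fetches;
}
```

For the 5 dirs and 10k other files above this comes to 5 fetches, against
10005 if every entry's ACL were read during the readdir+ pass.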

> 
> If there are times when we get attrs without getting ACL, then we will need
> a separate validity bit for ACL, otherwise we won't be able to tell if we
> have current ACL for an entry or not.
> 
> What would actually be helpful though, and make Ganesha a lot more
> efficient is if we could actually get all the ACLs for a directory in one
> fell swoop with some sort of "compression". Given that a large percentage
> of files actually have the same ACL, we could get a the 1-4 ACLs that
> apply, and then a bunch of entries, each indicating which of the 4 ACLs
> they have.
> 
> Frank
-- 
Jim Lieb
Linux Systems Engineer
Panasas Inc.


* Re: [Lsf-pc] [1/8] readdir-plus system call - LSF/MM follow up
  2013-05-29 14:04                     ` J. Bruce Fields
@ 2013-06-04 15:38                       ` Christoph Hellwig
  2013-06-04 15:52                         ` J. Bruce Fields
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Hellwig @ 2013-06-04 15:38 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Jeff Layton, lsf-pc, Zach Brown, Abhijith Das, Dave Chinner,
	Ganesha NFS List, Steve Dickson, linux-fsdevel, DENIEL Philippe,
	Boaz Harrosh, Frank S Filz, Jim Lieb, Venkateswararao Jujjuri,
	Steven Whitehouse

On Wed, May 29, 2013 at 10:04:56AM -0400, J. Bruce Fields wrote:
> An odd exception: in the presence of "posix" acls, "ls -l" requests an
> acl for every entry, so it can decide whether or not to add a "+" after
> the mode (which indicates the presence of a non-trivial acl.) Judging
> from http://www.bestbits.at/richacl/example.html, the same is intended
> (but not yet implemented) for richacls.
> 
> Maybe if that case were common, there'd be some advantage to ls being
> able to do a readdir plus to the nfs client that the nfs client could
> translate into a single readdir to the server?
> 
> But I hope it doesn't come to that.

Having a xstat flag that a filesystem can set meaning there is no
additional security information would be way more efficient for the
common case.
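
A sketch of how "ls -l" could use such a flag when deciding whether to print
the "+".  XSTAT_INFO_NO_EXTRA_SECURITY and both helpers are made-up names for
illustration; nothing like them exists in the posted xstat patches as far as
this thread shows.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag: the filesystem asserts there is no ACL or other
 * security information beyond the mode bits.  Not a real kernel name. */
#define XSTAT_INFO_NO_EXTRA_SECURITY (1u << 0)

/* True when the caller would still have to probe the ACL xattr. */
bool needs_acl_probe(uint32_t info_flags)
{
    return (info_flags & XSTAT_INFO_NO_EXTRA_SECURITY) == 0;
}

/* Character printed after the mode string.  acl_is_nontrivial is the
 * result of the probe and is ignored when the flag made it unnecessary. */
char mode_suffix(uint32_t info_flags, bool acl_is_nontrivial)
{
    if (!needs_acl_probe(info_flags))
        return ' ';
    return acl_is_nontrivial ? '+' : ' ';
}
```

In the common case where the flag is set, the per-entry getxattr call is
skipped altogether.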



* Re: [Lsf-pc] [1/8] readdir-plus system call - LSF/MM follow up
  2013-06-04 15:38                       ` [Lsf-pc] " Christoph Hellwig
@ 2013-06-04 15:52                         ` J. Bruce Fields
  0 siblings, 0 replies; 43+ messages in thread
From: J. Bruce Fields @ 2013-06-04 15:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeff Layton, lsf-pc, Zach Brown, Abhijith Das, Dave Chinner,
	Ganesha NFS List, Steve Dickson, linux-fsdevel, DENIEL Philippe,
	Boaz Harrosh, Frank S Filz, Jim Lieb, Venkateswararao Jujjuri,
	Steven Whitehouse

On Tue, Jun 04, 2013 at 08:38:48AM -0700, Christoph Hellwig wrote:
> On Wed, May 29, 2013 at 10:04:56AM -0400, J. Bruce Fields wrote:
> > An odd exception: in the presence of "posix" acls, "ls -l" requests an
> > acl for every entry, so it can decide whether or not to add a "+" after
> > the mode (which indicates the presence of a non-trivial acl.) Judging
> > from http://www.bestbits.at/richacl/example.html, the same is intended
> > (but not yet implemented) for richacls.
> > 
> > Maybe if that case were common, there'd be some advantage to ls being
> > able to do a readdir plus to the nfs client that the nfs client could
> > translate into a single readdir to the server?
> > 
> > But I hope it doesn't come to that.
> 
> Having a xstat flag that a filesystem can set meaning there is no
> additional security information would be way more efficient for the
> common case.

I think that might make sense.

(Though I can't claim any evidence of an actual problem here.  Just that
if people are counting the stat calls on "ls -l" then they may find that
they run into those extra getxattrs too.)

--b.


