All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [patch] epoll use a single inode ...
       [not found] <45ED1A3C.6030707@argo.co.il>
@ 2007-03-07  0:37 ` Davide Libenzi
  2007-03-07  0:51   ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Davide Libenzi @ 2007-03-07  0:37 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Linux Kernel Mailing List, Linus Torvalds, Andrew Morton

On Tue, 6 Mar 2007, Avi Kivity wrote:

> Right.  For kvmfs this isn't a problem as there's a 1:1 relationship 
> between synthetic inodes and dentries.  Perhaps d_alloc_anon() could be 
> extended to avoid the scan if it's a problem.

I currently have the dentry to carry the name of the file* class it is 
linked to. I'd prefer to keep it that way, unless there are huge factors 
against.



- Davide



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-07  0:37 ` [patch] epoll use a single inode Davide Libenzi
@ 2007-03-07  0:51   ` Linus Torvalds
  2007-03-07  1:01     ` Davide Libenzi
  2007-03-07  6:52     ` Eric Dumazet
  0 siblings, 2 replies; 45+ messages in thread
From: Linus Torvalds @ 2007-03-07  0:51 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Avi Kivity, Linux Kernel Mailing List, Andrew Morton, Al Viro


[ Al Viro added to Cc: as the arbiter of good taste in the VFS layer. He 
  has veto powers even over my proposals ;^]

On Tue, 6 Mar 2007, Davide Libenzi wrote:
> 
> I currently have the dentry to carry the name of the file* class it is 
> linked to. I'd prefer to keep it that way, unless there are huge factors 
> against.

I assume that the *only* reason for having multiple dentries is really 
just the output in /proc/<pid>/fd/, right? Or is there any other reason to 
have separate dentries for these pseudo-files?

It's a bit sad to waste that much memory (and time) on something like 
that. I bet that the dentry setup is a noticeable part of the whole 
sigfd()/timerfd() setup. It's likely also a big part of any memory 
footprint if you have lots of them.

So how about just doing:
 - do a single dentry
 - make a "struct file_operations" member function that prints out the 
   name of the thing in /proc/<pid>/fd/, and which *defaults* to just 
   doing the d_path() on the dentry, but special filesystems like this 
   could do something else (like print out a fake inode number from the 
   "file->f_private_data" information)

There seems to really be no downsides to that approach. No existing 
filesystem will even notice (they'll all have NULL in the new f_op 
member), and it would allow pipes etc to be sped up and use less memory.

			Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-07  0:51   ` Linus Torvalds
@ 2007-03-07  1:01     ` Davide Libenzi
  2007-03-07  1:27       ` Linus Torvalds
  2007-03-07  6:52     ` Eric Dumazet
  1 sibling, 1 reply; 45+ messages in thread
From: Davide Libenzi @ 2007-03-07  1:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Avi Kivity, Linux Kernel Mailing List, Andrew Morton, Al Viro

On Tue, 6 Mar 2007, Linus Torvalds wrote:

> 
> [ Al Viro added to Cc: as the arbiter of good taste in the VFS layer. He 
>   has veto powers even over my proposals ;^]
> 
> On Tue, 6 Mar 2007, Davide Libenzi wrote:
> > 
> > I currently have the dentry to carry the name of the file* class it is 
> > linked to. I'd prefer to keep it that way, unless there are huge factors 
> > against.
> 
> I assume that the *only* reason for having multiple dentries is really 
> just the output in /proc/<pid>/fd/, right? Or is there any other reason to 
> have separate dentries for these pseudo-files?
> 
> It's a bit sad to waste that much memory (and time) on something like 
> that. I bet that the dentry setup is a noticeable part of the whole 
> sigfd()/timerfd() setup. It's likely also a big part of any memory 
> footprint if you have lots of them.
> 
> So how about just doing:
>  - do a single dentry
>  - make a "struct file_operations" member function that prints out the 
>    name of the thing in /proc/<pid>/fd/, and which *defaults* to just 
>    doing the d_path() on the dentry, but special filesystems like this 
>    could do something else (like print out a fake inode number from the 
>    "file->f_private_data" information)
> 
> There seems to really be no downsides to that approach. No existing 
> filesystem will even notice (they'll all have NULL in the new f_op 
> member), and it would allow pipes etc to be sped up and use less memory.

I'm OK with everything that avoid code duplication due to those fake 
inodes. The change can be localized inside the existing API, so it doesn't 
really affect me externally.



- Davide



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-07  1:01     ` Davide Libenzi
@ 2007-03-07  1:27       ` Linus Torvalds
  2007-03-07  1:47         ` Davide Libenzi
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2007-03-07  1:27 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Avi Kivity, Linux Kernel Mailing List, Andrew Morton, Al Viro



On Tue, 6 Mar 2007, Davide Libenzi wrote:
> 
> I'm OK with everything that avoid code duplication due to those fake 
> inodes. The change can be localized inside the existing API, so it doesn't 
> really affect me externally.

Can you try with the first patch version that doesn't do anything special 
at all, and just uses a single dentry.

Yeah, the dentry name will be identical, and so you would see something 
like "7 -> signalfd:signalfd" when you do "ls -l /proc/<pid>/fd/" on a 
task that has such a special file descriptor (with no way to tell 
different timerfd's and signalfd's apart), but I think it's better to 
start off simple than to overdesign things.

And trying to tell them apart sounds a bit like overdesign, if only 
because I really don't see why anybody would really *care*. So it's a 
timer for poll/select/epoll - why care about anything else?

If it really really turns out that people care, we know how we can do it. 
We'd hook into "proc_fd_link()" and we'd allow a per-file mntget/dget that 
we could use to let special filesystems create fake entries on demand. So 
it's not impossible, it's just likely simply not needed.

			Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-07  1:27       ` Linus Torvalds
@ 2007-03-07  1:47         ` Davide Libenzi
  0 siblings, 0 replies; 45+ messages in thread
From: Davide Libenzi @ 2007-03-07  1:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Avi Kivity, Linux Kernel Mailing List, Andrew Morton, Al Viro

On Tue, 6 Mar 2007, Linus Torvalds wrote:

> 
> 
> On Tue, 6 Mar 2007, Davide Libenzi wrote:
> > 
> > I'm OK with everything that avoid code duplication due to those fake 
> > inodes. The change can be localized inside the existing API, so it doesn't 
> > really affect me externally.
> 
> Can you try with the first patch version that doesn't do anything special 
> at all, and just uses a single dentry.
> 
> Yeah, the dentry name will be identical, and so you would see something 
> like "7 -> signalfd:signalfd" when you do "ls -l /proc/<pid>/fd/" on a 
> task that has such a special file descriptor (with no way to tell 
> different timerfd's and signalfd's apart), but I think it's better to 
> start off simple than to overdesign things.
> 
> And trying to tell them apart sounds a bit like overdesign, if only 
> because I really don't see why anybody would really *care*. So it's a 
> timer for poll/select/epoll - why care about anything else?

The code would not change/shrink much with the single dentry. We'd save 
memory, and we'd lose the capability of seeing aino:[CLASS].
Both ways are fine with me.



- Davide



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-07  0:51   ` Linus Torvalds
  2007-03-07  1:01     ` Davide Libenzi
@ 2007-03-07  6:52     ` Eric Dumazet
  2007-03-07  7:15       ` Davide Libenzi
                         ` (2 more replies)
  1 sibling, 3 replies; 45+ messages in thread
From: Eric Dumazet @ 2007-03-07  6:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davide Libenzi, Avi Kivity, Linux Kernel Mailing List,
	Andrew Morton, Al Viro

Linus Torvalds a écrit :
> 
> I assume that the *only* reason for having multiple dentries is really 
> just the output in /proc/<pid>/fd/, right? Or is there any other reason to 
> have separate dentries for these pseudo-files?
> 
> It's a bit sad to waste that much memory (and time) on something like 
> that. I bet that the dentry setup is a noticeable part of the whole 
> sigfd()/timerfd() setup. It's likely also a big part of any memory 
> footprint if you have lots of them.
> 
> So how about just doing:
>  - do a single dentry
>  - make a "struct file_operations" member function that prints out the 
>    name of the thing in /proc/<pid>/fd/, and which *defaults* to just 
>    doing the d_path() on the dentry, but special filesystems like this 
>    could do something else (like print out a fake inode number from the 
>    "file->f_private_data" information)
> 
> There seems to really be no downsides to that approach. No existing 
> filesystem will even notice (they'll all have NULL in the new f_op 
> member), and it would allow pipes etc to be sped up and use less memory.
> 

I would definitly *love* saving dentries for pipes (and sockets too), but how 
are you going to get the inode ?

pipes()/sockets() can use read()/write()/rw_verify_area() and thus need 
file->f_path.dentry->d_inode (so each pipe needs a separate dentry)

Are you suggesting adding a new "struct file_operations" member to get the inode ?
Or re-intoducing an inode pointer in struct file ?


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-07  6:52     ` Eric Dumazet
@ 2007-03-07  7:15       ` Davide Libenzi
  2007-03-07  7:16       ` Eric Dumazet
  2007-03-07 17:02       ` Linus Torvalds
  2 siblings, 0 replies; 45+ messages in thread
From: Davide Libenzi @ 2007-03-07  7:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linus Torvalds, Avi Kivity, Linux Kernel Mailing List,
	Andrew Morton, Al Viro

On Wed, 7 Mar 2007, Eric Dumazet wrote:

> I would definitly *love* saving dentries for pipes (and sockets too), but how
> are you going to get the inode ?

I was not planning to touch anything but epoll, signalfd and timerfd 
files.


> pipes()/sockets() can use read()/write()/rw_verify_area() and thus need
> file->f_path.dentry->d_inode (so each pipe needs a separate dentry)

Currently, they use a single inode, and multiple dentries (to give the 
name of the class). But this could be changed to a single dentry like 
Linus was suggesting. I'll wait for Al's reply before doing anything.
Memory saving can be something, on top of the already big one of avoiding 
code duplication.



- Davide



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-07  6:52     ` Eric Dumazet
  2007-03-07  7:15       ` Davide Libenzi
@ 2007-03-07  7:16       ` Eric Dumazet
  2007-03-07  8:56         ` Christoph Hellwig
  2007-03-07 17:21         ` Linus Torvalds
  2007-03-07 17:02       ` Linus Torvalds
  2 siblings, 2 replies; 45+ messages in thread
From: Eric Dumazet @ 2007-03-07  7:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linus Torvalds, Davide Libenzi, Avi Kivity,
	Linux Kernel Mailing List, Andrew Morton, Al Viro

Eric Dumazet a écrit :
> 
> I would definitly *love* saving dentries for pipes (and sockets too), 
> but how are you going to get the inode ?
> 
> pipes()/sockets() can use read()/write()/rw_verify_area() and thus need 
> file->f_path.dentry->d_inode (so each pipe needs a separate dentry)
> 
> Are you suggesting adding a new "struct file_operations" member to get 
> the inode ?
> Or re-intoducing an inode pointer in struct file ?

Crazy ideas : (some readers are going to kill me)

1) Use the low order bit of f_path.dentry to say : this pointer is not a 
pointer to a dentry but the inode pointer (with the low order bit set to 1)

OR

2) file->f_path.dentry set to NULL for this special files (so that we dont 
need to dput() and cache line ping pong the common dentry each time we 
__fput() a pipe/socket.

Same trick could be used for file->f_path.mnt, because there is a big SMP 
cache line ping/pong to maintain a mnt_count on pipe/sockets mountpoint while 
these file systems cannot be un-mounted)

If dentry is NULL, we get the inode pointer from an overlay of struct 
file_ra_state    f_ra; (because for this special files readahead is unused)

This adds some conditional branches of course, but being able to save ram and 
better use cpu caches might be worth them.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-07  7:16       ` Eric Dumazet
@ 2007-03-07  8:56         ` Christoph Hellwig
  2007-03-07 17:21         ` Linus Torvalds
  1 sibling, 0 replies; 45+ messages in thread
From: Christoph Hellwig @ 2007-03-07  8:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linus Torvalds, Davide Libenzi, Avi Kivity,
	Linux Kernel Mailing List, Andrew Morton, Al Viro

On Wed, Mar 07, 2007 at 08:16:14AM +0100, Eric Dumazet wrote:
> Crazy ideas : (some readers are going to kill me)
> 
> 1) Use the low order bit of f_path.dentry to say : this pointer is not a 
> pointer to a dentry but the inode pointer (with the low order bit set to 1)
> 
> OR
> 
> 2) file->f_path.dentry set to NULL for this special files (so that we dont 
> need to dput() and cache line ping pong the common dentry each time we 
> __fput() a pipe/socket.

No way on either one.  f_path.dentry always beeing there is an assumption
we make all over the place, and changing that would be a big regression
for code qualityand reliability all over the place.

Face it folks, memory is generally cheap, and we're not going to uglify
huge amounts of code to shave of a little bit.

[and that is only in reply to this one, the single dentry optimizations
 for epoll and friends are perfecltly fine from the high level view]

> Same trick could be used for file->f_path.mnt, because there is a big SMP 
> cache line ping/pong to maintain a mnt_count on pipe/sockets mountpoint 
> while these file systems cannot be un-mounted)

Same thing as above.  We might do a hack to not refcount these vfsmounts,
but we definitively want to keep the invariant of f_path.mnt never
beeing NULL.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-07  6:52     ` Eric Dumazet
  2007-03-07  7:15       ` Davide Libenzi
  2007-03-07  7:16       ` Eric Dumazet
@ 2007-03-07 17:02       ` Linus Torvalds
  2007-03-07 17:31         ` Eric Dumazet
  2 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2007-03-07 17:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Davide Libenzi, Avi Kivity, Linux Kernel Mailing List,
	Andrew Morton, Al Viro



On Wed, 7 Mar 2007, Eric Dumazet wrote:
> 
> I would definitly *love* saving dentries for pipes (and sockets too), but how
> are you going to get the inode ?

Don't use an inode at all.

> pipes()/sockets() can use read()/write()/rw_verify_area() and thus need
> file->f_path.dentry->d_inode (so each pipe needs a separate dentry)

No, at least pipes could easily just use "file->f_private_data" instead.

Now, sockets really do want the inode (or it would be really really big 
changes), but pipes really just want a "struct pipe_inode_info" pointer, 
which we could hide away directly in the file descriptor itself.

That's what Davide already did (on my suggestion) for signalfd - there's a 
*single* inode, and the real data is in the per-fd f_private_data.

		Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-07  7:16       ` Eric Dumazet
  2007-03-07  8:56         ` Christoph Hellwig
@ 2007-03-07 17:21         ` Linus Torvalds
  1 sibling, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2007-03-07 17:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Davide Libenzi, Avi Kivity, Linux Kernel Mailing List,
	Andrew Morton, Al Viro



On Wed, 7 Mar 2007, Eric Dumazet wrote:
> 
> Crazy ideas : (some readers are going to kill me)

First off, as noted earlier, you don't need crazy ideas. 

But:

> 1) Use the low order bit of f_path.dentry to say : this pointer is not a
> pointer to a dentry but the inode pointer (with the low order bit set to 1)

No, we don't want to do that. There's a lot of code that wants to just 
follow the dentry/inode chain unconditionally, and we want to have people 
able to just do

	i_size_read(file->f_path.dentry->d_inode);

kind of thing, without having to check anything in between. It's costly 
enough that we have a chain of pointers, it's much *worse* if we then have 
to make conditionals etc (and forgetting is an oops).

> 2) file->f_path.dentry set to NULL for this special files (so that we dont
> need to dput() and cache line ping pong the common dentry each time we
> __fput() a pipe/socket.

I agree abot the ping pong, but I suspect it's a lot less than the current 
pingpong on dcache_lock, which is a lot hotter than a pipe-dentry counter 
would be.

If it becomes a big issue, we could just have a fixed number of dentries, 
and hash them out some way (to dilute the ping-ponging). So making things 
NULL would just be horrible for everybody, since we have lots of generic 
code that just looks up the dentry and inode, and we don't want to make 
that worse.

(I'm sure using "filp->f_private_data" has some downsides too, but it 
seems fairly simple at least for pipes. The biggest problem is likely that 
we currently use the mutex in the inode in "filp->f_dentry->d_inode" as a 
way to protect the inode->i_pipe pointer itself, and there is no 
equivalent locking at the "struct file *" level at all. We might have to 
make the "f_ep_lock" spinlock be unconditional)

		Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-07 17:02       ` Linus Torvalds
@ 2007-03-07 17:31         ` Eric Dumazet
  2007-03-07 17:36           ` Eric Dumazet
                             ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Eric Dumazet @ 2007-03-07 17:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davide Libenzi, Avi Kivity, Linux Kernel Mailing List,
	Andrew Morton, Al Viro

[-- Attachment #1: Type: text/plain, Size: 2357 bytes --]

On Wednesday 07 March 2007 18:02, Linus Torvalds wrote:
> On Wed, 7 Mar 2007, Eric Dumazet wrote:
> > I would definitly *love* saving dentries for pipes (and sockets too), but
> > how are you going to get the inode ?
>
> Don't use an inode at all.

Lovely :)

>
> > pipes()/sockets() can use read()/write()/rw_verify_area() and thus need
> > file->f_path.dentry->d_inode (so each pipe needs a separate dentry)
>
> No, at least pipes could easily just use "file->f_private_data" instead.
>
> Now, sockets really do want the inode (or it would be really really big
> changes), but pipes really just want a "struct pipe_inode_info" pointer,
> which we could hide away directly in the file descriptor itself.

sockets already uses file->private_data.

But calls to read()/write() (not send()/recv()) still need to go through the 
dentry, before entering socket land.

>
> That's what Davide already did (on my suggestion) for signalfd - there's a
> *single* inode, and the real data is in the per-fd f_private_data.


Please find enclosed the following patch, to prepare this path.

[PATCH] Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to be able to provide a dentry name 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all pipes) may 
need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen)


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...
We also avoid mntput()/mntget() on sock_mnt

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...
We also avoid mntput()/mntget() on pipe_mnt

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

 fs/dcache.c            |    3 +++
 fs/pipe.c              |   16 +++++++++++-----
 include/linux/dcache.h |    1 +
 net/socket.c           |   15 +++++++++++----
 4 files changed, 26 insertions(+), 9 deletions(-)


[-- Attachment #2: introduce_d_dname.patch --]
[-- Type: text/x-diff, Size: 3784 bytes --]

--- linux-2.6.21-rc3/include/linux/dcache.h	2007-03-07 17:23:55.000000000 +0100
+++ linux-2.6.21-rc3-ed/include/linux/dcache.h	2007-03-07 17:27:39.000000000 +0100
@@ -133,6 +133,7 @@ struct dentry_operations {
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
+	char * (*d_dname)(struct dentry *, char *, int);
 };
 
 /* the dentry parameter passed to d_hash and d_compare is the parent
--- linux-2.6.21-rc3/fs/dcache.c	2007-03-07 17:23:55.000000000 +0100
+++ linux-2.6.21-rc3-ed/fs/dcache.c	2007-03-07 17:28:46.000000000 +0100
@@ -1823,6 +1823,9 @@ char * d_path(struct dentry *dentry, str
 	struct vfsmount *rootmnt;
 	struct dentry *root;
 
+	if (dentry->d_op && dentry->d_op->d_dname)
+		return (dentry->d_op->d_dname)(dentry, buf, buflen);
+
 	read_lock(&current->fs->lock);
 	rootmnt = mntget(current->fs->rootmnt);
 	root = dget(current->fs->root);
--- linux-2.6.21-rc3/fs/pipe.c	2007-03-07 17:42:36.000000000 +0100
+++ linux-2.6.21-rc3-ed/fs/pipe.c	2007-03-07 18:01:40.000000000 +0100
@@ -841,8 +841,15 @@ static int pipefs_delete_dentry(struct d
 	return 0;
 }
 
+static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	snprintf(buffer, buflen, "pipe:[%lu]", dentry->d_inode->i_ino);
+	return buffer;
+}
+
 static struct dentry_operations pipefs_dentry_operations = {
 	.d_delete	= pipefs_delete_dentry,
+	.d_dname	= pipefs_dname,
 };
 
 static struct inode * get_pipe_inode(void)
@@ -888,7 +895,6 @@ struct file *create_write_pipe(void)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	char name[32];
 	struct qstr this;
 
 	f = get_empty_filp();
@@ -899,8 +905,8 @@ struct file *create_write_pipe(void)
 	if (!inode)
 		goto err_file;
 
-	this.len = sprintf(name, "[%lu]", inode->i_ino);
-	this.name = name;
+	this.len = 0;
+	this.name = NULL;
 	this.hash = 0;
 	err = -ENOMEM;
 	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &this);
@@ -915,7 +921,7 @@ struct file *create_write_pipe(void)
 	 */
 	dentry->d_flags &= ~DCACHE_UNHASHED;
 	d_instantiate(dentry, inode);
-	f->f_path.mnt = mntget(pipe_mnt);
+	f->f_path.mnt = NULL;
 	f->f_path.dentry = dentry;
 	f->f_mapping = inode->i_mapping;
 
@@ -949,7 +955,7 @@ struct file *create_read_pipe(struct fil
 		return ERR_PTR(-ENFILE);
 
 	/* Grab pipe from the writer */
-	f->f_path.mnt = mntget(wrf->f_path.mnt);
+	f->f_path.mnt = NULL;
 	f->f_path.dentry = dget(wrf->f_path.dentry);
 	f->f_mapping = wrf->f_path.dentry->d_inode->i_mapping;
 
--- linux-2.6.21-rc3/net/socket.c	2007-03-07 17:37:56.000000000 +0100
+++ linux-2.6.21-rc3-ed/net/socket.c	2007-03-07 17:54:53.000000000 +0100
@@ -314,8 +314,16 @@ static int sockfs_delete_dentry(struct d
 	dentry->d_flags |= DCACHE_UNHASHED;
 	return 0;
 }
+
+static char * sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	snprintf(buffer, buflen, "socket:[%lu]", dentry->d_inode->i_ino);
+	return buffer;
+}
+
 static struct dentry_operations sockfs_dentry_operations = {
 	.d_delete = sockfs_delete_dentry,
+	.d_dname  = sockfs_dname,
 };
 
 /*
@@ -356,10 +364,9 @@ static int sock_alloc_fd(struct file **f
 static int sock_attach_fd(struct socket *sock, struct file *file)
 {
 	struct qstr this;
-	char name[32];
 
-	this.len = sprintf(name, "[%lu]", SOCK_INODE(sock)->i_ino);
-	this.name = name;
+	this.len = 0;
+	this.name = NULL;
 	this.hash = 0;
 
 	file->f_path.dentry = d_alloc(sock_mnt->mnt_sb->s_root, &this);
@@ -374,7 +381,7 @@ static int sock_attach_fd(struct socket 
 	 */
 	file->f_path.dentry->d_flags &= ~DCACHE_UNHASHED;
 	d_instantiate(file->f_path.dentry, SOCK_INODE(sock));
-	file->f_path.mnt = mntget(sock_mnt);
+	file->f_path.mnt = NULL;
 	file->f_mapping = file->f_path.dentry->d_inode->i_mapping;
 
 	sock->file = file;

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-07 17:31         ` Eric Dumazet
@ 2007-03-07 17:36           ` Eric Dumazet
  2007-03-07 17:45           ` Linus Torvalds
  2007-03-08  8:56           ` Christoph Hellwig
  2 siblings, 0 replies; 45+ messages in thread
From: Eric Dumazet @ 2007-03-07 17:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Avi Kivity, Linux Kernel Mailing List, Andrew Morton, Al Viro

[-- Attachment #1: Type: text/plain, Size: 1453 bytes --]

(resending with a more convenient attachment)

Please find enclosed the following patch, to prepare this path.

[PATCH] Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to be able to provide a dentry name 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all pipes) may 
need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen)


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...
We also avoid mntput()/mntget() on sock_mnt

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...
We also avoid mntput()/mntget() on pipe_mnt

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

 fs/dcache.c            |    3 +++
 fs/pipe.c              |   16 +++++++++++-----
 include/linux/dcache.h |    1 +
 net/socket.c           |   15 +++++++++++----
 4 files changed, 26 insertions(+), 9 deletions(-)

[-- Attachment #2: introduce_d_dname.patch --]
[-- Type: text/plain, Size: 3784 bytes --]

--- linux-2.6.21-rc3/include/linux/dcache.h	2007-03-07 17:23:55.000000000 +0100
+++ linux-2.6.21-rc3-ed/include/linux/dcache.h	2007-03-07 17:27:39.000000000 +0100
@@ -133,6 +133,7 @@ struct dentry_operations {
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
+	char * (*d_dname)(struct dentry *, char *, int);
 };
 
 /* the dentry parameter passed to d_hash and d_compare is the parent
--- linux-2.6.21-rc3/fs/dcache.c	2007-03-07 17:23:55.000000000 +0100
+++ linux-2.6.21-rc3-ed/fs/dcache.c	2007-03-07 17:28:46.000000000 +0100
@@ -1823,6 +1823,9 @@ char * d_path(struct dentry *dentry, str
 	struct vfsmount *rootmnt;
 	struct dentry *root;
 
+	if (dentry->d_op && dentry->d_op->d_dname)
+		return (dentry->d_op->d_dname)(dentry, buf, buflen);
+
 	read_lock(&current->fs->lock);
 	rootmnt = mntget(current->fs->rootmnt);
 	root = dget(current->fs->root);
--- linux-2.6.21-rc3/fs/pipe.c	2007-03-07 17:42:36.000000000 +0100
+++ linux-2.6.21-rc3-ed/fs/pipe.c	2007-03-07 18:01:40.000000000 +0100
@@ -841,8 +841,15 @@ static int pipefs_delete_dentry(struct d
 	return 0;
 }
 
+static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	snprintf(buffer, buflen, "pipe:[%lu]", dentry->d_inode->i_ino);
+	return buffer;
+}
+
 static struct dentry_operations pipefs_dentry_operations = {
 	.d_delete	= pipefs_delete_dentry,
+	.d_dname	= pipefs_dname,
 };
 
 static struct inode * get_pipe_inode(void)
@@ -888,7 +895,6 @@ struct file *create_write_pipe(void)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	char name[32];
 	struct qstr this;
 
 	f = get_empty_filp();
@@ -899,8 +905,8 @@ struct file *create_write_pipe(void)
 	if (!inode)
 		goto err_file;
 
-	this.len = sprintf(name, "[%lu]", inode->i_ino);
-	this.name = name;
+	this.len = 0;
+	this.name = NULL;
 	this.hash = 0;
 	err = -ENOMEM;
 	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &this);
@@ -915,7 +921,7 @@ struct file *create_write_pipe(void)
 	 */
 	dentry->d_flags &= ~DCACHE_UNHASHED;
 	d_instantiate(dentry, inode);
-	f->f_path.mnt = mntget(pipe_mnt);
+	f->f_path.mnt = NULL;
 	f->f_path.dentry = dentry;
 	f->f_mapping = inode->i_mapping;
 
@@ -949,7 +955,7 @@ struct file *create_read_pipe(struct fil
 		return ERR_PTR(-ENFILE);
 
 	/* Grab pipe from the writer */
-	f->f_path.mnt = mntget(wrf->f_path.mnt);
+	f->f_path.mnt = NULL;
 	f->f_path.dentry = dget(wrf->f_path.dentry);
 	f->f_mapping = wrf->f_path.dentry->d_inode->i_mapping;
 
--- linux-2.6.21-rc3/net/socket.c	2007-03-07 17:37:56.000000000 +0100
+++ linux-2.6.21-rc3-ed/net/socket.c	2007-03-07 17:54:53.000000000 +0100
@@ -314,8 +314,16 @@ static int sockfs_delete_dentry(struct d
 	dentry->d_flags |= DCACHE_UNHASHED;
 	return 0;
 }
+
+static char * sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	snprintf(buffer, buflen, "socket:[%lu]", dentry->d_inode->i_ino);
+	return buffer;
+}
+
 static struct dentry_operations sockfs_dentry_operations = {
 	.d_delete = sockfs_delete_dentry,
+	.d_dname  = sockfs_dname,
 };
 
 /*
@@ -356,10 +364,9 @@ static int sock_alloc_fd(struct file **f
 static int sock_attach_fd(struct socket *sock, struct file *file)
 {
 	struct qstr this;
-	char name[32];
 
-	this.len = sprintf(name, "[%lu]", SOCK_INODE(sock)->i_ino);
-	this.name = name;
+	this.len = 0;
+	this.name = NULL;
 	this.hash = 0;
 
 	file->f_path.dentry = d_alloc(sock_mnt->mnt_sb->s_root, &this);
@@ -374,7 +381,7 @@ static int sock_attach_fd(struct socket 
 	 */
 	file->f_path.dentry->d_flags &= ~DCACHE_UNHASHED;
 	d_instantiate(file->f_path.dentry, SOCK_INODE(sock));
-	file->f_path.mnt = mntget(sock_mnt);
+	file->f_path.mnt = NULL;
 	file->f_mapping = file->f_path.dentry->d_inode->i_mapping;
 
 	sock->file = file;

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-07 17:31         ` Eric Dumazet
  2007-03-07 17:36           ` Eric Dumazet
@ 2007-03-07 17:45           ` Linus Torvalds
  2007-03-07 18:06             ` Eric Dumazet
  2007-03-08  8:56           ` Christoph Hellwig
  2 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2007-03-07 17:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Davide Libenzi, Avi Kivity, Linux Kernel Mailing List,
	Andrew Morton, Al Viro



On Wed, 7 Mar 2007, Eric Dumazet wrote:
> 
> sockets already uses file->private_data.
> 
> But calls to read()/write() (not send()/recv()) still need to go through the 
> dentry, before entering socket land.

Sure. The dentry and the inode need to *exist*, but they can be one single 
static dentry/inode per "file descriptor type".

We always pass in the "struct file *" to read/write too, since we need it 
anyway for things like file control information (eg "is it a nonblocking 
read or write" kinds of things).

So I'm not suggesting a NULL dentry/inode, I'm suggesting a single static 
one per type.

And yeah, it may be harder than it looks. Some things "know" that all the 
relevant info is in the inode, so they just pass in the inode. In the pipe 
layer, for example, you'd need to change free_pipe_info() and 
alloc_pipe_info() to pass in the file descriptor instead, same goes for 
pipe_release(). But the "struct file *" is always available, it's just 
that since the code was originally written to have all the info in the 
inode, some of the code isn't set up to use it or pass it on..

But your patch is independent of that, and looks fine. Except I don't like 
this part:

-       file->f_path.mnt = mntget(sock_mnt);
+       file->f_path.mnt = NULL;

since I'd be much happer with always having f_path.mnt available, the same 
way we should always have f_path.dentry there.

(Btw, your patch is *not* going to work with the file->f_private_data 
approach, because d_path() is not passed down the "file *" thing. So we'd 
need to do that, and that's more intrusive (it can be NULL, since for 
things like cwd/pwd we don't have a "struct file").

But I like your patch as a totally independent thing. "It just makes 
sense". 

(Apart from the f_path.mnt thing, which I think was something else ;)

			Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-07 17:45           ` Linus Torvalds
@ 2007-03-07 18:06             ` Eric Dumazet
  2007-03-07 18:30               ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2007-03-07 18:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davide Libenzi, Avi Kivity, Linux Kernel Mailing List,
	Andrew Morton, Al Viro

[-- Attachment #1: Type: text/plain, Size: 3599 bytes --]

On Wednesday 07 March 2007 18:45, Linus Torvalds wrote:
> On Wed, 7 Mar 2007, Eric Dumazet wrote:
> > sockets already uses file->private_data.
> >
> > But calls to read()/write() (not send()/recv()) still need to go through
> > the dentry, before entering socket land.
>
> Sure. The dentry and the inode need to *exist*, but they can be one single
> static dentry/inode per "file descriptor type".
>
> We always pass in the "struct file *" to read/write too, since we need it
> anyway for things like file control information (eg "is it a nonblocking
> read or write" kinds of things).
>
> So I'm not suggesting a NULL dentry/inode, I'm suggesting a single static
> one per type.
>
> And yeah, it may be harder than it looks. Some things "know" that all the
> relevant info is in the inode, so they just pass in the inode. In the pipe
> layer, for example, you'd need to change free_pipe_info() and
> alloc_pipe_info() to pass in the file descriptor instead, same goes for
> pipe_release(). But the "struct file *" is always available, it's just
> that since the code was originally written to have all the info in the
> inode, some of the code isn't set up to use it or pass it on..
>
> But your patch is independent of that, and looks fine. Except I don't like
> this part:
>
> -       file->f_path.mnt = mntget(sock_mnt);
> +       file->f_path.mnt = NULL;
>
> since I'd be much happer with always having f_path.mnt available, the same
> way we should always have f_path.dentry there.

Yes, but mntget()/mntput() are protected against NULL.
I was quite happy to remove two locked operations :)
I didnt found a way to crash (yet) my patched machine :)

>
> (Btw, your patch is *not* going to work with the file->f_private_data
> approach, because d_path() is not passed down the "file *" thing. So we'd
> need to do that, and that's more intrusive (it can be NULL, since for
> things like cwd/pwd we don't have a "struct file").

I tried this path today and failed...
Too many changes to do (nameidata) to propagate a 'struct file *' 
appropriately...

>
> But I like your patch as a totally independent thing. "It just makes
> sense".
>
> (Apart from the f_path.mnt thing, which I think was something else ;)

OK no problem here is the patch without messing f_path.mnt 

(benchmark results not really different on my little machine, SMP kernel but 
one CPU only... maybe because lock suffix is changed by a nop)


[PATCH] Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to be able to provide a dentry name 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all pipes) may 
need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen)


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
 fs/dcache.c            |    3 +++
 fs/pipe.c              |   12 +++++++++---
 include/linux/dcache.h |    1 +
 net/socket.c           |   13 ++++++++++---
 4 files changed, 23 insertions(+), 6 deletions(-)

[-- Attachment #2: introduce_d_dname.patch --]
[-- Type: text/plain, Size: 2901 bytes --]

--- linux-2.6.21-rc3/include/linux/dcache.h	2007-03-07 17:23:55.000000000 +0100
+++ linux-2.6.21-rc3-ed/include/linux/dcache.h	2007-03-07 17:27:39.000000000 +0100
@@ -133,6 +133,7 @@ struct dentry_operations {
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
+	char * (*d_dname)(struct dentry *, char *, int);
 };
 
 /* the dentry parameter passed to d_hash and d_compare is the parent
--- linux-2.6.21-rc3/fs/dcache.c	2007-03-07 17:23:55.000000000 +0100
+++ linux-2.6.21-rc3-ed/fs/dcache.c	2007-03-07 17:28:46.000000000 +0100
@@ -1823,6 +1823,9 @@ char * d_path(struct dentry *dentry, str
 	struct vfsmount *rootmnt;
 	struct dentry *root;
 
+	if (dentry->d_op && dentry->d_op->d_dname)
+		return (dentry->d_op->d_dname)(dentry, buf, buflen);
+
 	read_lock(&current->fs->lock);
 	rootmnt = mntget(current->fs->rootmnt);
 	root = dget(current->fs->root);
--- linux-2.6.21-rc3/fs/pipe.c	2007-03-07 17:42:36.000000000 +0100
+++ linux-2.6.21-rc3-ed/fs/pipe.c	2007-03-07 18:54:33.000000000 +0100
@@ -841,8 +841,15 @@ static int pipefs_delete_dentry(struct d
 	return 0;
 }
 
+static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	snprintf(buffer, buflen, "pipe:[%lu]", dentry->d_inode->i_ino);
+	return buffer;
+}
+
 static struct dentry_operations pipefs_dentry_operations = {
 	.d_delete	= pipefs_delete_dentry,
+	.d_dname	= pipefs_dname,
 };
 
 static struct inode * get_pipe_inode(void)
@@ -888,7 +895,6 @@ struct file *create_write_pipe(void)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	char name[32];
 	struct qstr this;
 
 	f = get_empty_filp();
@@ -899,8 +905,8 @@ struct file *create_write_pipe(void)
 	if (!inode)
 		goto err_file;
 
-	this.len = sprintf(name, "[%lu]", inode->i_ino);
-	this.name = name;
+	this.len = 0;
+	this.name = NULL;
 	this.hash = 0;
 	err = -ENOMEM;
 	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &this);
--- linux-2.6.21-rc3/net/socket.c	2007-03-07 17:37:56.000000000 +0100
+++ linux-2.6.21-rc3-ed/net/socket.c	2007-03-07 18:54:33.000000000 +0100
@@ -314,8 +314,16 @@ static int sockfs_delete_dentry(struct d
 	dentry->d_flags |= DCACHE_UNHASHED;
 	return 0;
 }
+
+static char * sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	snprintf(buffer, buflen, "socket:[%lu]", dentry->d_inode->i_ino);
+	return buffer;
+}
+
 static struct dentry_operations sockfs_dentry_operations = {
 	.d_delete = sockfs_delete_dentry,
+	.d_dname  = sockfs_dname,
 };
 
 /*
@@ -356,10 +364,9 @@ static int sock_alloc_fd(struct file **f
 static int sock_attach_fd(struct socket *sock, struct file *file)
 {
 	struct qstr this;
-	char name[32];
 
-	this.len = sprintf(name, "[%lu]", SOCK_INODE(sock)->i_ino);
-	this.name = name;
+	this.len = 0;
+	this.name = NULL;
 	this.hash = 0;
 
 	file->f_path.dentry = d_alloc(sock_mnt->mnt_sb->s_root, &this);

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-07 18:06             ` Eric Dumazet
@ 2007-03-07 18:30               ` Linus Torvalds
  2007-03-07 18:52                 ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2007-03-07 18:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Davide Libenzi, Avi Kivity, Linux Kernel Mailing List,
	Andrew Morton, Al Viro



On Wed, 7 Mar 2007, Eric Dumazet wrote:
> 
> OK no problem here is the patch without messing f_path.mnt 

All right. I like it. Definitely worth putting into -mm, or just 
re-sending to me after 2.6.21 is out (I'll forget all about it otherwise).

I have one more worry, namely this::

	-       char name[32];
	-       this.len = sprintf(name, "[%lu]", SOCK_INODE(sock)->i_ino);
	-       this.name = name;
	+       this.len = 0;
	+       this.name = NULL;

I think that's fine, and it *happens* to work, but it happens to work just 
because then d_alloc() will do:

	memcpy(dname, name->name, name->len);
	dname[name->len] = 0;

and passing in NULL to memcpy() is generally ok when len is 0.

HOWEVER. 

Not only might memcpy() do a "prefetch for read" on the source for some 
architectures (which in turn may end up being slow for an address that 
isn't in the TLB, like NULL), but you depend on a very much internal 
detail, since it *could* have been using something like

	memcpy(dname, name->name, name->len+1);

instead, and expected to get the final '\0' character from the source 
string.

So I would actually much prefer it to be written as

	this.len = 0;
	this.name = "";

just because it's safer.

But other than that small detail, I think this is not only an 
optimization, it's an actual cleanup, and we migth some day want to use 
something like this for some other things too (eg maybe this kind of 
approach is usable for /proc/<pid> entries too, to avoid instantiating 
them).

As to avoiding the mntget(), I'm not violently opposed to it, but I do 
think that it's a totally unrelated matter, so even if it's decided it's 
worth it, it should be done as a separate patch regardless.

			Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-07 18:30               ` Linus Torvalds
@ 2007-03-07 18:52                 ` Eric Dumazet
  2007-03-07 19:07                   ` Linus Torvalds
  2007-03-07 22:14                   ` Anton Blanchard
  0 siblings, 2 replies; 45+ messages in thread
From: Eric Dumazet @ 2007-03-07 18:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davide Libenzi, Avi Kivity, Linux Kernel Mailing List,
	Andrew Morton, Al Viro


> Not only might memcpy() do a "prefetch for read" on the source for some
> architectures (which in turn may end up being slow for an address that
> isn't in the TLB, like NULL), but you depend on a very much internal

Well, I hope a prefetch(NULL) is OK because we are doing millions of them (see 
include/linux/list.h :) )

> detail, since it *could* have been using something like
>
> 	memcpy(dname, name->name, name->len+1);
>

Yes very true, I will change that and push to Andrew for mm testing

I was thinking about being able to cache the name into the dentry, do you 
think it's worth the pain ? (its not SMP safe for example...)

static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
{
        if (!dentry->d_iname[0])
                snprintf(dentry->d_iname,
			DNAME_INLINE_LEN_MIN,
			"pipe:[%lu]",
			dentry->d_inode->i_ino);
        strlcpy(buffer, dentry->d_iname, buflen);
        return buffer;
}

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-07 18:52                 ` Eric Dumazet
@ 2007-03-07 19:07                   ` Linus Torvalds
  2007-03-07 22:14                   ` Anton Blanchard
  1 sibling, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2007-03-07 19:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Davide Libenzi, Avi Kivity, Linux Kernel Mailing List,
	Andrew Morton, Al Viro



On Wed, 7 Mar 2007, Eric Dumazet wrote:
> 
> I was thinking about being able to cache the name into the dentry, do you 
> think it's worth the pain ? (its not SMP safe for example...)

Actually, it *can* be SMP-safe, if you do it right. Something like

	len = dentry->d_name.len;

	if (!len) {
		len = snprintf(dentry->d_iname,
			DNAME_INLINE_LEN_MIN,
			"pipe:[%lu]",
			dentry->d_inode->i_ino);
		smp_wmb();
		dentry->d_name.len = len;
	}
	if (len >= buflen)
		len = buflen-1;
	memcpy(buffer, dentry->d_iname, len);
	buffer[len] = 0;
	return buffer;

should work, although it depends on the fact that our snprintf() 
implementation should be "stable" (ie if snprintf() modifies the buffer 
temporarily as it goes along, that would break, but I think our 
vsnprintf is good in that respect).

So you could have two different CPU's doing the snprintf() on the same 
buffer at the same time (and assigning the length at the same time), but 
since they'll write the same thing, you don't really care.

It's a bit subtle, though. And probably not really worth it.

		Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-07 18:52                 ` Eric Dumazet
  2007-03-07 19:07                   ` Linus Torvalds
@ 2007-03-07 22:14                   ` Anton Blanchard
  2007-03-07 22:57                     ` Linus Torvalds
  1 sibling, 1 reply; 45+ messages in thread
From: Anton Blanchard @ 2007-03-07 22:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linus Torvalds, Davide Libenzi, Avi Kivity,
	Linux Kernel Mailing List, Andrew Morton, Al Viro


Hi,

> Well, I hope a prefetch(NULL) is OK because we are doing millions of
> them (see include/linux/list.h :) )

Funny you mention this. We found some noticeable ppc64 regressions when
moving the dcache to standard list macros and had to do this to fix it
up:

static inline void prefetch(const void *x)
{
        if (unlikely(!x))
                return;

        __asm__ __volatile__ ("dcbt 0,%0" : : "r" (x));
}

Urgh :)

Anton

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-07 22:14                   ` Anton Blanchard
@ 2007-03-07 22:57                     ` Linus Torvalds
  2007-03-08  1:25                       ` Michael K. Edwards
  2007-03-09  2:46                       ` Anton Blanchard
  0 siblings, 2 replies; 45+ messages in thread
From: Linus Torvalds @ 2007-03-07 22:57 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Eric Dumazet, Davide Libenzi, Avi Kivity,
	Linux Kernel Mailing List, Andrew Morton, Al Viro



On Wed, 7 Mar 2007, Anton Blanchard wrote:
> 
> Funny you mention this. We found some noticeable ppc64 regressions when
> moving the dcache to standard list macros and had to do this to fix it
> up:
> 
> static inline void prefetch(const void *x)
> {
>         if (unlikely(!x))
>                 return;
> 
>         __asm__ __volatile__ ("dcbt 0,%0" : : "r" (x));
> }
> 
> Urgh :)

Yeah, I'm not at all surprised. Any implementation of "prefetch" that 
doesn't just turn into a no-op if the TLB entry doesn't exist (which makes 
them weaker for *actual* prefetching) will generally have a hard time with 
a NULL pointer. Exactly because it will try to do a totally unnecessary 
TLB fill - and since most CPU's will not cache negative TLB entries, that 
unnecessary TLB fill will be done over and over and over again..

In general, using software prefetching is just a stupid idea, unless

 - the prefetch really is very strict (ie for a linked list you do exactly 
   the above kinds of things to make sure that you don't try to prefetch 
   the non-existent end entry)
AND
 - the CPU is stupid (in-order in particular).

I think Intel even suggests in their optimization manuals to *not* do 
software prefetching, because hw can usually simply do better without it.

		Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-07 22:57                     ` Linus Torvalds
@ 2007-03-08  1:25                       ` Michael K. Edwards
  2007-03-08  2:48                         ` Kyle Moffett
  2007-03-08  3:20                         ` Linus Torvalds
  2007-03-09  2:46                       ` Anton Blanchard
  1 sibling, 2 replies; 45+ messages in thread
From: Michael K. Edwards @ 2007-03-08  1:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Anton Blanchard, Eric Dumazet, Davide Libenzi, Avi Kivity,
	Linux Kernel Mailing List, Andrew Morton, Al Viro

On 3/7/07, Linus Torvalds <torvalds@linux-foundation.org> wrote
> Yeah, I'm not at all surprised. Any implementation of "prefetch" that
> doesn't just turn into a no-op if the TLB entry doesn't exist (which makes
> them weaker for *actual* prefetching) will generally have a hard time with
> a NULL pointer. Exactly because it will try to do a totally unnecessary
> TLB fill - and since most CPU's will not cache negative TLB entries, that
> unnecessary TLB fill will be done over and over and over again..

Data prefetch instructions should indeed avoid page table walks.
(Instruction prefetch mechanisms often do induce table walks on ITLB
miss.)  Not just because of the null pointer case, but because it's
quite normal to run off the end of an array in a loop with an embedded
prefetch instruction.  If you have an extra instruction issue unit
that shares the same DTLB, and you know you will really want that
data, you can sometimes use it to force DTLB preloads by doing an
actual data fetch from the foreseeable page.  This is potentially one
of the best uses of chip multi-threading on an architecture like Sun's
Niagara.

(I don't think Intel's hyper-threading works for this purpose; the
DTLB is shared but the entries are marked as owned by one thread or
the other.  HT can be used for L2 cache prefetching, although the
results so far seem to be mixed:
http://www.cgo.org/cgo2004/papers/02_80_Kim_D_REVISED.pdf)

> In general, using software prefetching is just a stupid idea, unless
>
>  - the prefetch really is very strict (ie for a linked list you do exactly
>    the above kinds of things to make sure that you don't try to prefetch
>    the non-existent end entry)
> AND
>  - the CPU is stupid (in-order in particular).
>
> I think Intel even suggests in their optimization manuals to *not* do
> software prefetching, because hw can usually simply do better without it.

Not the XScale -- it performs quite poorly without prefetch, as people
who have run ARMv5-optimized binaries on it can testify.  From the
XScale Core Developer's Manual:

<quote>
The Intel XScale(r) core has a true prefetch load instruction (PLD).
The purpose of this instruction is to preload data into the data and
mini-data caches. Data prefetching allows hiding of memory transfer
latency while the processor continues to execute instructions. The
prefetch is important to compiler and assembly code because judicious
use of the prefetch instruction can enormously improve throughput
performance of the core. Data prefetch can be applied not only to
loops but also to any data references within a block of code. Prefetch
also applies to data writing when the memory type is enabled as write
allocate

The Intel XScale(r) core prefetch load instruction is a true prefetch
instruction because the load destination is the data or mini-data
cache and not a register. Compilers for processors which have data
caches, but do not support prefetch, sometimes use a load instruction
to preload the data cache. This technique has the disadvantages of
using a register to load data and requiring additional registers for
subsequent preloads and thus increasing register pressure. By
contrast, the prefetch can be used to reduce register pressure instead
of increasing it.

The prefetch load is a hint instruction and does not guarantee that
the data will be loaded. Whenever the load would cause a fault or a
table walk, then the processor will ignore the prefetch instruction,
the fault or table walk, and continue processing the next instruction.
This is particularly advantageous in the case where a linked list or
recursive data structure is terminated by a NULL pointer. Prefetching
the NULL pointer will not fault program flow.
</quote>

People's prejudices against prefetch instructions are sometimes
traceable to the 3DNow! prefetch(w) botch, which some processors
"support" as no-ops and others are too aggressive about (Opteron
prefetches are reputed to be "strong", i. e., not dropped on DTLB
miss).  XScale gets it right.  So do most Pentium 4's using the SSE
prefetches, according to the IA-32 optimization manual.  (Oddly,
Prescott seems to have initiated a page table walk on DTLB miss during
software prefetch -- just one of many weird Prescott flaws.)  I'm
guessing Pentium M and its descendants (Core Solo and Duo) get it
right but I'm having a hell of a time finding out for sure.  Can any
of the x86 experts answer this?

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-08  1:25                       ` Michael K. Edwards
@ 2007-03-08  2:48                         ` Kyle Moffett
  2007-03-08  7:24                           ` Eric Dumazet
  2007-03-08  3:20                         ` Linus Torvalds
  1 sibling, 1 reply; 45+ messages in thread
From: Kyle Moffett @ 2007-03-08  2:48 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Linus Torvalds, Anton Blanchard, Eric Dumazet, Davide Libenzi,
	Avi Kivity, Linux Kernel Mailing List, Andrew Morton, Al Viro

On Mar 07, 2007, at 20:25:14, Michael K. Edwards wrote:
> On 3/7/07, Linus Torvalds <torvalds@linux-foundation.org> wrote
>> In general, using software prefetching is just a stupid idea, unless
>>
>>  - the prefetch really is very strict (ie for a linked list you do  
>> exactly the above kinds of things to make sure that you don't try  
>> to prefetch the non-existent end entry)
>> AND
>>  - the CPU is stupid (in-order in particular).
>>
>> I think Intel even suggests in their optimization manuals to *not*  
>> do software prefetching, because hw can usually simply do better  
>> without it.
>
> Not the XScale -- it performs quite poorly without prefetch, as  
> people who have run ARMv5-optimized binaries on it can testify.
>
> The Intel XScale(r) core prefetch load instruction is a true  
> prefetch instruction because the load destination is the data or  
> mini-data cache and not a register. Compilers for processors which  
> have data caches, but do not support prefetch, sometimes use a load  
> instruction to preload the data cache. This technique has the  
> disadvantages of using a register to load data and requiring  
> additional registers for
> subsequent preloads and thus increasing register pressure. By  
> contrast, the prefetch can be used to reduce register pressure  
> instead of increasing it.
>
> The prefetch load is a hint instruction and does not guarantee that  
> the data will be loaded. Whenever the load would cause a fault or a  
> table walk, then the processor will ignore the prefetch  
> instruction, the fault or table walk, and continue processing the  
> next instruction. This is particularly advantageous in the case  
> where a linked list or recursive data structure is terminated by a  
> NULL pointer. Prefetching the NULL pointer will not fault program  
> flow.

Prefetching is also fairly critical on a Power4 or G5 PowerPC system  
as they have a long memory latency; an L2-cache miss can cost 200+  
cycles.  On such systems the "dcbt" prefetch instruction brings in a  
single 128-byte cacheline and has no serializing effects whatsoever,  
making it ideal for use in a linked-list-traversal inner loop.

Cheers,
Kyle Moffett


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-08  1:25                       ` Michael K. Edwards
  2007-03-08  2:48                         ` Kyle Moffett
@ 2007-03-08  3:20                         ` Linus Torvalds
  2007-03-08  8:37                           ` Michael K. Edwards
  1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2007-03-08  3:20 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Anton Blanchard, Eric Dumazet, Davide Libenzi, Avi Kivity,
	Linux Kernel Mailing List, Andrew Morton, Al Viro



On Wed, 7 Mar 2007, Michael K. Edwards wrote:
> 
> People's prejudices against prefetch instructions are sometimes
> traceable to the 3DNow! prefetch(w) botch, which some processors
> "support" as no-ops and others are too aggressive about (Opteron
> prefetches are reputed to be "strong", i. e., not dropped on DTLB
> miss).

No, I just checked, and Intel's own optimization manual makes it clear 
that you should be careful. They talk about performance penalties due to 
resource constraints - which makes tons of sense with a core that is good 
at handling its own resources and could quite possibly use those resources 
better to actually execute the loads and stores deeper down the 
instruction pipeline.

So it's not just 3DNow! making AMD look bad, or Intel would obviously 
suggest people use it out of the wazoo ;)

> XScale gets it right.

Blah. XScale isn't even an OoO CPU, *of*course* it needs prefetching. 
Calling that "getting it right" is ludicrous. If anything, it gets things 
so wrong that prefetching is *required* for good performance.

I'm talking about real CPU's with real memory pipelines that already do 
prefetching in hardware. The better the core is, the less the prefetch 
helps (and often the more it hurts in comparison to how much it helps).

But if you mean "doesn't try to fill the TLB on data prefetches", then 
yes, that's generally the right thing to do.

> (Oddly, Prescott seems to have initiated a page table walk on DTLB miss 
> during software prefetch -- just one of many weird Prescott flaws.)  

Netburst in general is *very* happy to do speculative TLB fills, I think.

> I'm guessing Pentium M and its descendants (Core Solo and Duo) get it 
> right but I'm having a hell of a time finding out for sure.  Can any of 
> the x86 experts answer this?

I just suspect that the upside for Core 2 Due is likely fairly low. The L2 
cache is good, the memory re-ordering is working.. I doubt "prefetch" 
helps in generic code that much for things like linked list following, you 
should probably limit it to code that has *known* access patterns and you 
know it's not going to be in the cache.

(In other words, I bet prefetching can help a lot with MMX/media kind of 
code, I doubt it's a huge win for "for_each_entry()")

		Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-08  2:48                         ` Kyle Moffett
@ 2007-03-08  7:24                           ` Eric Dumazet
  2007-03-08 16:57                             ` Valdis.Kletnieks
                                               ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Eric Dumazet @ 2007-03-08  7:24 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Michael K. Edwards, Linus Torvalds, Anton Blanchard,
	Davide Libenzi, Avi Kivity, Linux Kernel Mailing List,
	Andrew Morton, Al Viro

Kyle Moffett a écrit :
> 
> Prefetching is also fairly critical on a Power4 or G5 PowerPC system as 
> they have a long memory latency; an L2-cache miss can cost 200+ cycles.  
> On such systems the "dcbt" prefetch instruction brings in a single 
> 128-byte cacheline and has no serializing effects whatsoever, making it 
> ideal for use in a linked-list-traversal inner loop.

OK, 200 cycles...

But what is the cost of the conditional branch you added in prefetch(x) ?

if (!x) return;

(correctly predicted or not, but do powerPC have a BTB ?)

About the NULL 'potential problem', maybe we could use a dummy nil (but 
mapped) object, and use its address in lists, ie compare for &nil instead of 
NULL. This would avoid :

- The conditional test in some prefetch() implementations
- The potential TLB problem with the NULL value.





^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-08  3:20                         ` Linus Torvalds
@ 2007-03-08  8:37                           ` Michael K. Edwards
  0 siblings, 0 replies; 45+ messages in thread
From: Michael K. Edwards @ 2007-03-08  8:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Anton Blanchard, Eric Dumazet, Davide Libenzi, Avi Kivity,
	Linux Kernel Mailing List, Andrew Morton, Al Viro

On 3/7/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> No, I just checked, and Intel's own optimization manual makes it clear
> that you should be careful. They talk about performance penalties due to
> resource constraints - which makes tons of sense with a core that is good
> at handling its own resources and could quite possibly use those resources
> better to actually execute the loads and stores deeper down the
> instruction pipeline.

Certainly you should be careful -- and that usually means leaving it
up to the compiler.  But hinting to the compiler can help; there may
be an analogue of the (un)likely macros waiting to be implemented for
loop prefetch.  And while out-of-order execution and fancy hardware
prefetch streams greatly reduce the need for explicit prefetch in
general code, there's no substitute for the cache _bypassing_
instructions when trying to avoid excessive cache eviction in DDoS
situations.

For instance, if I wind up working on a splay-tree variant of Robert
Olsson's trie/hash work, I'll try to measure the effect of using SSE2
non-temporal stores to write half-open connections to the leaves of
the tree.  That may give some additional improvement in the ability to
keep servicing real load during a SYN flood.

> So it's not just 3DNow! making AMD look bad, or Intel would obviously
> suggest people use it out of the wazoo ;)

Intel puts a lot of effort into educating compiler writers about the
optimal prefetch insertion strategies for particular cache
architectures.  At the same time, they put out the orange cones to
warn people off of hand-tuning prefetch placement using
micro-benchmarks.  People did that when 3DNow! first came out, with
predictable consequences.

> > XScale gets it right.
>
> Blah. XScale isn't even an OoO CPU, *of*course* it needs prefetching.
> Calling that "getting it right" is ludicrous. If anything, it gets things
> so wrong that prefetching is *required* for good performance.

That's not an accident.  Hardware prefetch units cost a lot in power
consumption.  Omitting the hardware prefetch unit and drastically
simplifying the pipeline is how they got a design whose clock they
could crank into the stratosphere and still run on battery power.  And
in the network processor space, they can bolt a myriad of on-chip
microengines and still have some prayer of accurately simulating the
patterns of internal bus cycles.  Errors in simulation can still be
fixed up with prefetch instruction placement to put memory accesses
from the XScale core into phases where the data path processors aren't
working so hard.

Moreover, because they're embedded targets and rarely have to run
third-party binaries originally compiled for older cores, it didn't
really cost them anything to say, "Sorry, this chip's performance is
going to suck if your compiler's prefetch insertion isn't properly
tuned."  The _only_ cost is a slightly less dense instruction stream.
That's not trivial but it's manageable; you budget for it, and the
increase in I-cache power consumption is more than made up for by the
reduction in erroneous data prefetches (hardware prefetch gets it
wrong a substantial fraction of the time).

> I'm talking about real CPU's with real memory pipelines that already do
> prefetching in hardware. The better the core is, the less the prefetch
> helps (and often the more it hurts in comparison to how much it helps).

The more sophisticated the core is, the less software prefetch
instructions help.  But more sophisticated isn't always better; it
depends on your target applications.

> But if you mean "doesn't try to fill the TLB on data prefetches", then
> yes, that's generally the right thing to do.

AOL.

> > (Oddly, Prescott seems to have initiated a page table walk on DTLB miss
> > during software prefetch -- just one of many weird Prescott flaws.)
>
> Netburst in general is *very* happy to do speculative TLB fills, I think.

Design by micro-benchmark.  :-)  They set out to push the headline MHz
and the real memory bandwidth to the limit in Prescott, and they
succeeded (data at
http://www.digit-life.com/articles2/rmma/rmma-p4.html).  At a horrible
cost in power per clock, and no gain in real application performance.
So NetBurst died a horrible death, and now we have "Intel Core" -- P6
warmed over, with caches sized such that for most applications the
second core ought to be used solely to soak up control path overheads.

Windows runs better on dual-core machines because the NT kernel will
happily eat an entire core doing memory bookkeeping.  Linux could take
a hint here and use the second core largely for interrupt handling and
force-priming the L2 cache on task switch.  (Prefetch instructions
aren't much use here, precisely because they give up on DTLB miss.)
Any kernel code paths that are known to stall a lot because of
usually-cold-cache access patterns (TCP connection establishment, for
instance) can also be punted over to the second core.  If you're
feeling industrious, use non-temporal memory accesses judiciously in
these code paths to reduce cache pollution; that core's CPU cycles are
going to be undersubscribed and you can afford to let it stall.

> > I'm guessing Pentium M and its descendants (Core Solo and Duo) get it
> > right but I'm having a hell of a time finding out for sure.  Can any of
> > the x86 experts answer this?
>
> I just suspect that the upside for Core 2 Due is likely fairly low. The L2
> cache is good, the memory re-ordering is working.. I doubt "prefetch"
> helps in generic code that much for things like linked list following, you
> should probably limit it to code that has *known* access patterns and you
> know it's not going to be in the cache.
>
> (In other words, I bet prefetching can help a lot with MMX/media kind of
> code, I doubt it's a huge win for "for_each_entry()")

If I understand the Intel Core microarchitecture correctly, it's more
accurate to say that for pointer-chasing code, the instruction decoder
is so good at injecting prefetch instructions into the micro-op stream
during I-Cache prefetch that additional hinting from the compiler
isn't needed.  For array-traversing code, the hardware stride prefetch
kicks in, which saves you from having to inject prefetch instructions
into hand-coded assembly (and tight into inner loops in general).

This leaves one important role for in-line software prefetch
instructions:  improving worst-case latency bounds when handling data
structures that may bloat under DDoS or other unusual loads.  It's the
next best thing to having multiple memory windows with different
hardware cache eviction strategies.  But that's another discussion,
over on netdev.

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-07 17:31         ` Eric Dumazet
  2007-03-07 17:36           ` Eric Dumazet
  2007-03-07 17:45           ` Linus Torvalds
@ 2007-03-08  8:56           ` Christoph Hellwig
  2007-03-08  9:42             ` Eric Dumazet
  2 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2007-03-08  8:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linus Torvalds, Davide Libenzi, Avi Kivity,
	Linux Kernel Mailing List, Andrew Morton, Al Viro

This patch needs a lot more documentation.  It needs some really big
comments on why this should never ever be used for a real filesystem
(real as in user mountable), and probably add an assert for that
invariant somewhere.  Please also update Documentation/filesystems/Locking
and Documentation/filesystems/vfs.txt for it.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-08  8:56           ` Christoph Hellwig
@ 2007-03-08  9:42             ` Eric Dumazet
  2007-03-08 10:21               ` Christoph Hellwig
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2007-03-08  9:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linus Torvalds, Davide Libenzi, Avi Kivity,
	Linux Kernel Mailing List, Andrew Morton, Al Viro

[-- Attachment #1: Type: text/plain, Size: 2058 bytes --]

On Thursday 08 March 2007 09:56, Christoph Hellwig wrote:
> This patch needs a lot more documentation.  It needs some really big
> comments on why this should never ever be used for a real filesystem
> (real as in user mountable), and probably add an assert for that
> invariant somewhere.  Please also update Documentation/filesystems/Locking
> and Documentation/filesystems/vfs.txt for it.

Thank you for your feedback Christoph

Here is the version I was about to push to -mm, so if nobody complains, I ask 
Andrew to push it to mm so that it can reach 2.6.22 target

[PATCH] Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to build a pathname 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all pipes) may 
need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen)


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Davide Libenzi <davidel@xmailserver.org>
CC: Al Viro <viro@zeniv.linux.org.uk>

 Documentation/filesystems/Locking |    2 ++
 Documentation/filesystems/vfs.txt |   12 +++++++++++-
 fs/dcache.c                       |    3 +++
 fs/pipe.c                         |   12 +++++++++---
 include/linux/dcache.h            |    1 +
 net/socket.c                      |   13 ++++++++++---
 6 files changed, 36 insertions(+), 7 deletions(-)

[-- Attachment #2: introduce_d_dname.patch --]
[-- Type: text/plain, Size: 5325 bytes --]

--- linux-2.6.21-rc3/include/linux/dcache.h	2007-03-07 17:23:55.000000000 +0100
+++ linux-2.6.21-rc3-ed/include/linux/dcache.h	2007-03-08 10:14:38.000000000 +0100
@@ -133,6 +133,7 @@ struct dentry_operations {
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
+	char * (*d_dname)(struct dentry *, char *, int);
 };
 
 /* the dentry parameter passed to d_hash and d_compare is the parent
--- linux-2.6.21-rc3/Documentation/filesystems/vfs.txt	2007-03-08 10:14:38.000000000 +0100
+++ linux-2.6.21-rc3-ed/Documentation/filesystems/vfs.txt	2007-03-08 10:34:34.000000000 +0100
@@ -827,7 +827,7 @@ This describes how a filesystem can over
 operations. Dentries and the dcache are the domain of the VFS and the
 individual filesystem implementations. Device drivers have no business
 here. These methods may be set to NULL, as they are either optional or
-the VFS uses a default. As of kernel 2.6.13, the following members are
+the VFS uses a default. As of kernel 2.6.22, the following members are
 defined:
 
 struct dentry_operations {
@@ -837,6 +837,7 @@ struct dentry_operations {
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
+	char * (*d_dname)(struct dentry *, char *, int);
 };
 
   d_revalidate: called when the VFS needs to revalidate a dentry. This
@@ -859,6 +860,15 @@ struct dentry_operations {
 	VFS calls iput(). If you define this method, you must call
 	iput() yourself
 
+  d_dname: called when the pathname of a dentry should be generated.
+	Usefull for some pseudo filesystems (sockfs, pipefs, ...) to delay
+	pathname generation. (Instead of doing it when dentry is created,
+	its done only when the path is needed.). Real filesystems probably
+	dont want to use it, because their dentries are present in global
+	dcache hash, so their hash should be an invariant. As no lock is
+	held, d_dname() should not try to modify the dentry itself, unless
+	appropriate SMP safety is used.
+
 Each dentry has a pointer to its parent dentry, as well as a hash list
 of child dentries. Child dentries are basically like files in a
 directory.
--- linux-2.6.21-rc3/Documentation/filesystems/Locking	2007-03-08 10:29:04.000000000 +0100
+++ linux-2.6.21-rc3-ed/Documentation/filesystems/Locking	2007-03-08 10:29:04.000000000 +0100
@@ -15,6 +15,7 @@ prototypes:
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
+	char * (*d_dname)((struct dentry *dentry, char *buffer, int buflen);
 
 locking rules:
 	none have BKL
@@ -25,6 +26,7 @@ d_compare:	no		yes		no		no 
 d_delete:	yes		no		yes		no
 d_release:	no		no		no		yes
 d_iput:		no		no		no		yes
+d_dname:	no		no		no		no
 
 --------------------------- inode_operations --------------------------- 
 prototypes:
--- linux-2.6.21-rc3/fs/dcache.c	2007-03-07 17:23:55.000000000 +0100
+++ linux-2.6.21-rc3-ed/fs/dcache.c	2007-03-07 17:28:46.000000000 +0100
@@ -1823,6 +1823,9 @@ char * d_path(struct dentry *dentry, str
 	struct vfsmount *rootmnt;
 	struct dentry *root;
 
+	if (dentry->d_op && dentry->d_op->d_dname)
+		return (dentry->d_op->d_dname)(dentry, buf, buflen);
+
 	read_lock(&current->fs->lock);
 	rootmnt = mntget(current->fs->rootmnt);
 	root = dget(current->fs->root);
--- linux-2.6.21-rc3/fs/pipe.c	2007-03-07 17:42:36.000000000 +0100
+++ linux-2.6.21-rc3-ed/fs/pipe.c	2007-03-08 10:16:25.000000000 +0100
@@ -841,8 +841,15 @@ static int pipefs_delete_dentry(struct d
 	return 0;
 }
 
+static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	snprintf(buffer, buflen, "pipe:[%lu]", dentry->d_inode->i_ino);
+	return buffer;
+}
+
 static struct dentry_operations pipefs_dentry_operations = {
 	.d_delete	= pipefs_delete_dentry,
+	.d_dname	= pipefs_dname,
 };
 
 static struct inode * get_pipe_inode(void)
@@ -888,7 +895,6 @@ struct file *create_write_pipe(void)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	char name[32];
 	struct qstr this;
 
 	f = get_empty_filp();
@@ -899,8 +905,8 @@ struct file *create_write_pipe(void)
 	if (!inode)
 		goto err_file;
 
-	this.len = sprintf(name, "[%lu]", inode->i_ino);
-	this.name = name;
+	this.len = 0;
+	this.name = "";
 	this.hash = 0;
 	err = -ENOMEM;
 	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &this);
--- linux-2.6.21-rc3/net/socket.c	2007-03-07 17:37:56.000000000 +0100
+++ linux-2.6.21-rc3-ed/net/socket.c	2007-03-07 20:00:40.000000000 +0100
@@ -314,8 +314,16 @@ static int sockfs_delete_dentry(struct d
 	dentry->d_flags |= DCACHE_UNHASHED;
 	return 0;
 }
+
+static char * sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	snprintf(buffer, buflen, "socket:[%lu]", dentry->d_inode->i_ino);
+	return buffer;
+}
+
 static struct dentry_operations sockfs_dentry_operations = {
 	.d_delete = sockfs_delete_dentry,
+	.d_dname  = sockfs_dname,
 };
 
 /*
@@ -356,10 +364,9 @@ static int sock_alloc_fd(struct file **f
 static int sock_attach_fd(struct socket *sock, struct file *file)
 {
 	struct qstr this;
-	char name[32];
 
-	this.len = sprintf(name, "[%lu]", SOCK_INODE(sock)->i_ino);
-	this.name = name;
+	this.len = 0;
+	this.name = "";
 	this.hash = 0;
 
 	file->f_path.dentry = d_alloc(sock_mnt->mnt_sb->s_root, &this);

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-08  9:42             ` Eric Dumazet
@ 2007-03-08 10:21               ` Christoph Hellwig
  2007-03-08 11:11                 ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2007-03-08 10:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Hellwig, Linus Torvalds, Davide Libenzi, Avi Kivity,
	Linux Kernel Mailing List, Andrew Morton, Al Viro

On Thu, Mar 08, 2007 at 10:42:21AM +0100, Eric Dumazet wrote:
> @@ -1823,6 +1823,9 @@ char * d_path(struct dentry *dentry, str
>  	struct vfsmount *rootmnt;
>  	struct dentry *root;
>  
> +	if (dentry->d_op && dentry->d_op->d_dname)
> +		return (dentry->d_op->d_dname)(dentry, buf, buflen);

Please don't put braces around the function pointer call.  Also I
think we really want a comment here why we need this call.

> +static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen)

normally this would be:

	
static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)

> +static char * sockfs_dname(struct dentry *dentry, char *buffer, int buflen)

same here.

> -	char name[32];
>  
> -	this.len = sprintf(name, "[%lu]", SOCK_INODE(sock)->i_ino);
> -	this.name = name;
> +	this.len = 0;
> +	this.name = "";
>  	this.hash = 0;

Can you add a helper to init this one?


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-08 10:21               ` Christoph Hellwig
@ 2007-03-08 11:11                 ` Eric Dumazet
  2007-03-08 11:18                   ` Christoph Hellwig
                                     ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Eric Dumazet @ 2007-03-08 11:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linus Torvalds, Davide Libenzi, Avi Kivity,
	Linux Kernel Mailing List, Andrew Morton, Al Viro

[-- Attachment #1: Type: text/plain, Size: 1595 bytes --]

Thanks again for your feedback Christoph :)

I think I addressed all your remarks.

Thank you

[PATCH] Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to build a pathname 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all pipes) may 
need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen)


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Davide Libenzi <davidel@xmailserver.org>
CC: Al Viro <viro@zeniv.linux.org.uk>

 Documentation/filesystems/Locking |    2 ++
 Documentation/filesystems/vfs.txt |   12 +++++++++++-
 fs/dcache.c                       |    7 +++++++
 fs/pipe.c                         |   15 +++++++++------
 include/linux/dcache.h            |    1 +
 net/socket.c                      |   17 ++++++++++-------
 6 files changed, 40 insertions(+), 14 deletions(-)


[-- Attachment #2: introduce_d_dname.patch --]
[-- Type: text/plain, Size: 5686 bytes --]

--- linux-2.6.21-rc3/include/linux/dcache.h	2007-03-07 17:23:55.000000000 +0100
+++ linux-2.6.21-rc3-ed/include/linux/dcache.h	2007-03-08 11:57:41.000000000 +0100
@@ -133,6 +133,7 @@ struct dentry_operations {
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
+	char *(*d_dname)(struct dentry *, char *, int);
 };
 
 /* the dentry parameter passed to d_hash and d_compare is the parent
--- linux-2.6.21-rc3/Documentation/filesystems/vfs.txt	2007-03-08 10:14:38.000000000 +0100
+++ linux-2.6.21-rc3-ed/Documentation/filesystems/vfs.txt	2007-03-08 10:34:34.000000000 +0100
@@ -827,7 +827,7 @@ This describes how a filesystem can over
 operations. Dentries and the dcache are the domain of the VFS and the
 individual filesystem implementations. Device drivers have no business
 here. These methods may be set to NULL, as they are either optional or
-the VFS uses a default. As of kernel 2.6.13, the following members are
+the VFS uses a default. As of kernel 2.6.22, the following members are
 defined:
 
 struct dentry_operations {
@@ -837,6 +837,7 @@ struct dentry_operations {
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
+	char *(*d_dname)(struct dentry *, char *, int);
 };
 
   d_revalidate: called when the VFS needs to revalidate a dentry. This
@@ -859,6 +860,15 @@ struct dentry_operations {
 	VFS calls iput(). If you define this method, you must call
 	iput() yourself
 
+  d_dname: called when the pathname of a dentry should be generated.
+	Usefull for some pseudo filesystems (sockfs, pipefs, ...) to delay
+	pathname generation. (Instead of doing it when dentry is created,
+	its done only when the path is needed.). Real filesystems probably
+	dont want to use it, because their dentries are present in global
+	dcache hash, so their hash should be an invariant. As no lock is
+	held, d_dname() should not try to modify the dentry itself, unless
+	appropriate SMP safety is used.
+
 Each dentry has a pointer to its parent dentry, as well as a hash list
 of child dentries. Child dentries are basically like files in a
 directory.
--- linux-2.6.21-rc3/Documentation/filesystems/Locking	2007-03-08 10:29:04.000000000 +0100
+++ linux-2.6.21-rc3-ed/Documentation/filesystems/Locking	2007-03-08 10:29:04.000000000 +0100
@@ -15,6 +15,7 @@ prototypes:
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
+	char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen);
 
 locking rules:
 	none have BKL
@@ -25,6 +26,7 @@ d_compare:	no		yes		no		no 
 d_delete:	yes		no		yes		no
 d_release:	no		no		no		yes
 d_iput:		no		no		no		yes
+d_dname:	no		no		no		no
 
 --------------------------- inode_operations --------------------------- 
 prototypes:
--- linux-2.6.21-rc3/fs/dcache.c	2007-03-07 17:23:55.000000000 +0100
+++ linux-2.6.21-rc3-ed/fs/dcache.c	2007-03-08 11:57:41.000000000 +0100
@@ -1823,6 +1823,13 @@ char * d_path(struct dentry *dentry, str
 	struct vfsmount *rootmnt;
 	struct dentry *root;
 
+	/*
+	 * Some filesystems want to provide dentry's pathname themselves,
+	 * instead of pre-building names at dentry creation.
+	 */
+	if (dentry->d_op && dentry->d_op->d_dname)
+		return dentry->d_op->d_dname(dentry, buf, buflen);
+
 	read_lock(&current->fs->lock);
 	rootmnt = mntget(current->fs->rootmnt);
 	root = dget(current->fs->root);
--- linux-2.6.21-rc3/fs/pipe.c	2007-03-07 17:42:36.000000000 +0100
+++ linux-2.6.21-rc3-ed/fs/pipe.c	2007-03-08 11:57:41.000000000 +0100
@@ -841,8 +841,15 @@ static int pipefs_delete_dentry(struct d
 	return 0;
 }
 
+static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	snprintf(buffer, buflen, "pipe:[%lu]", dentry->d_inode->i_ino);
+	return buffer;
+}
+
 static struct dentry_operations pipefs_dentry_operations = {
 	.d_delete	= pipefs_delete_dentry,
+	.d_dname	= pipefs_dname,
 };
 
 static struct inode * get_pipe_inode(void)
@@ -888,8 +895,7 @@ struct file *create_write_pipe(void)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	char name[32];
-	struct qstr this;
+	struct qstr name = { .name = "" };
 
 	f = get_empty_filp();
 	if (!f)
@@ -899,11 +905,8 @@ struct file *create_write_pipe(void)
 	if (!inode)
 		goto err_file;
 
-	this.len = sprintf(name, "[%lu]", inode->i_ino);
-	this.name = name;
-	this.hash = 0;
 	err = -ENOMEM;
-	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &this);
+	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
 	if (!dentry)
 		goto err_inode;
 
--- linux-2.6.21-rc3/net/socket.c	2007-03-07 17:37:56.000000000 +0100
+++ linux-2.6.21-rc3-ed/net/socket.c	2007-03-08 11:57:41.000000000 +0100
@@ -314,8 +314,16 @@ static int sockfs_delete_dentry(struct d
 	dentry->d_flags |= DCACHE_UNHASHED;
 	return 0;
 }
+
+static char *sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	snprintf(buffer, buflen, "socket:[%lu]", dentry->d_inode->i_ino);
+	return buffer;
+}
+
 static struct dentry_operations sockfs_dentry_operations = {
 	.d_delete = sockfs_delete_dentry,
+	.d_dname  = sockfs_dname,
 };
 
 /*
@@ -355,14 +363,9 @@ static int sock_alloc_fd(struct file **f
 
 static int sock_attach_fd(struct socket *sock, struct file *file)
 {
-	struct qstr this;
-	char name[32];
-
-	this.len = sprintf(name, "[%lu]", SOCK_INODE(sock)->i_ino);
-	this.name = name;
-	this.hash = 0;
+	struct qstr name = { .name = "" };
 
-	file->f_path.dentry = d_alloc(sock_mnt->mnt_sb->s_root, &this);
+	file->f_path.dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
 	if (unlikely(!file->f_path.dentry))
 		return -ENOMEM;
 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-08 11:11                 ` Eric Dumazet
@ 2007-03-08 11:18                   ` Christoph Hellwig
  2007-03-08 15:52                   ` Linus Torvalds
  2007-03-08 20:00                   ` [patch] epoll use a single inode Bob Copeland
  2 siblings, 0 replies; 45+ messages in thread
From: Christoph Hellwig @ 2007-03-08 11:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Hellwig, Linus Torvalds, Davide Libenzi, Avi Kivity,
	Linux Kernel Mailing List, Andrew Morton, Al Viro


I'm sorry about complaining again, but..

>  
> +	/*
> +	 * Some filesystems want to provide dentry's pathname themselves,
> +	 * instead of pre-building names at dentry creation.
> +	 */

It's not _some_ filesystems.  If real filesystem did this we'd be in
horrible trouble.  It's "synthetic, non-mountable" filesystem.

Except for this the patch looks fine to me, thanks Eric!

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-08 11:11                 ` Eric Dumazet
  2007-03-08 11:18                   ` Christoph Hellwig
@ 2007-03-08 15:52                   ` Linus Torvalds
  2007-03-08 16:58                     ` [PATCH] VFS : Delay the dentry name generation on sockets and pipes Eric Dumazet
  2007-03-08 20:00                   ` [patch] epoll use a single inode Bob Copeland
  2 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2007-03-08 15:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Hellwig, Davide Libenzi, Avi Kivity,
	Linux Kernel Mailing List, Andrew Morton, Al Viro



On Thu, 8 Mar 2007, Eric Dumazet wrote:
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> CC: Christoph Hellwig <hch@infradead.org>
> CC: Linus Torvalds <torvalds@linux-foundation.org>

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>

Except you should fix the subject line when you send it out to Andrew ;)

		Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-08  7:24                           ` Eric Dumazet
@ 2007-03-08 16:57                             ` Valdis.Kletnieks
  2007-03-09  1:34                             ` Kyle Moffett
  2007-03-09  2:51                             ` Anton Blanchard
  2 siblings, 0 replies; 45+ messages in thread
From: Valdis.Kletnieks @ 2007-03-08 16:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Kyle Moffett, Michael K. Edwards, Linus Torvalds,
	Anton Blanchard, Davide Libenzi, Avi Kivity,
	Linux Kernel Mailing List, Andrew Morton, Al Viro

[-- Attachment #1: Type: text/plain, Size: 1057 bytes --]

On Thu, 08 Mar 2007 08:24:04 +0100, Eric Dumazet said:
>
> But what is the cost of the conditional branch you added in prefetch(x) ?
> 
> if (!x) return;
> 
> (correctly predicted or not, but do powerPC have a BTB ?)
> 
> About the NULL 'potential problem', maybe we could use a dummy nil (but 
> mapped) object, and use its address in lists, ie compare for &nil instead of 
> NULL. This would avoid :
> 
> - The conditional test in some prefetch() implementations
> - The potential TLB problem with the NULL value.

You avoid those two, but add the very high likelyhood that a statement of the
form 'if (!x) {....}' will creep back in and bug out.  I doubt that changing
the form of a basic C idiom to save a few cycles is worth it, especially when
the (!x) form can be tested without a memory fetch, but (x != nil) may require
fetching 'nil' (remember - the x86 is a very popular chipset, but register
starved - if the loop is any size at all, 'nil' may require a reload each
time around, costing you the win you thought you had....

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH] VFS : Delay the dentry name generation on sockets and pipes.
  2007-03-08 15:52                   ` Linus Torvalds
@ 2007-03-08 16:58                     ` Eric Dumazet
  2007-03-08 18:29                       ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2007-03-08 16:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Christoph Hellwig, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 1549 bytes --]

Hi Andrew

Could you please put this final version in mm for testing ?

Thank's to all contributors.

[PATCH] VFS : Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to build a pathname 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all 
pipes/sockets) may need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen);


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>

 Documentation/filesystems/Locking |    2 ++
 Documentation/filesystems/vfs.txt |   12 +++++++++++-
 fs/dcache.c                       |   10 ++++++++++
 fs/pipe.c                         |   15 +++++++++------
 include/linux/dcache.h            |    1 +
 net/socket.c                      |   17 ++++++++++-------
 6 files changed, 43 insertions(+), 14 deletions(-)


[-- Attachment #2: introduce_d_dname.patch --]
[-- Type: text/plain, Size: 5919 bytes --]

--- linux-2.6.21-rc3/include/linux/dcache.h	2007-03-07 17:23:55.000000000 +0100
+++ linux-2.6.21-rc3-ed/include/linux/dcache.h	2007-03-08 11:57:41.000000000 +0100
@@ -133,6 +133,7 @@ struct dentry_operations {
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
+	char *(*d_dname)(struct dentry *, char *, int);
 };
 
 /* the dentry parameter passed to d_hash and d_compare is the parent
--- linux-2.6.21-rc3/Documentation/filesystems/vfs.txt	2007-03-08 10:14:38.000000000 +0100
+++ linux-2.6.21-rc3-ed/Documentation/filesystems/vfs.txt	2007-03-08 12:08:56.000000000 +0100
@@ -827,7 +827,7 @@ This describes how a filesystem can over
 operations. Dentries and the dcache are the domain of the VFS and the
 individual filesystem implementations. Device drivers have no business
 here. These methods may be set to NULL, as they are either optional or
-the VFS uses a default. As of kernel 2.6.13, the following members are
+the VFS uses a default. As of kernel 2.6.22, the following members are
 defined:
 
 struct dentry_operations {
@@ -837,6 +837,7 @@ struct dentry_operations {
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
+	char *(*d_dname)(struct dentry *, char *, int);
 };
 
   d_revalidate: called when the VFS needs to revalidate a dentry. This
@@ -859,6 +860,15 @@ struct dentry_operations {
 	VFS calls iput(). If you define this method, you must call
 	iput() yourself
 
+  d_dname: called when the pathname of a dentry should be generated.
+	Usefull for some pseudo filesystems (sockfs, pipefs, ...) to delay
+	pathname generation. (Instead of doing it when dentry is created,
+	its done only when the path is needed.). Real filesystems probably
+	dont want to use it, because their dentries are present in global
+	dcache hash, so their hash should be an invariant. As no lock is
+	held, d_dname() should not try to modify the dentry itself, unless
+	appropriate SMP safety is used.
+
 Each dentry has a pointer to its parent dentry, as well as a hash list
 of child dentries. Child dentries are basically like files in a
 directory.
--- linux-2.6.21-rc3/Documentation/filesystems/Locking	2007-03-08 10:29:04.000000000 +0100
+++ linux-2.6.21-rc3-ed/Documentation/filesystems/Locking	2007-03-08 12:08:56.000000000 +0100
@@ -15,6 +15,7 @@ prototypes:
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
+	char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen);
 
 locking rules:
 	none have BKL
@@ -25,6 +26,7 @@ d_compare:	no		yes		no		no 
 d_delete:	yes		no		yes		no
 d_release:	no		no		no		yes
 d_iput:		no		no		no		yes
+d_dname:	no		no		no		no
 
 --------------------------- inode_operations --------------------------- 
 prototypes:
--- linux-2.6.21-rc3/fs/dcache.c	2007-03-07 17:23:55.000000000 +0100
+++ linux-2.6.21-rc3-ed/fs/dcache.c	2007-03-08 17:47:04.000000000 +0100
@@ -1823,6 +1823,16 @@ char * d_path(struct dentry *dentry, str
 	struct vfsmount *rootmnt;
 	struct dentry *root;
 
+	/*
+	 * We have various synthetic filesystems that never get mounted.  On
+	 * these filesystems dentries are never used for lookup purposes, and
+	 * thus don't need to be hashed.  They also don't need a name until a
+	 * user wants to identify the object in /proc/*/fd/.  The little hack
+	 * below allows us to generate a name for these objects on demand:
+	 */
+	if (dentry->d_op && dentry->d_op->d_dname)
+		return dentry->d_op->d_dname(dentry, buf, buflen);
+
 	read_lock(&current->fs->lock);
 	rootmnt = mntget(current->fs->rootmnt);
 	root = dget(current->fs->root);
--- linux-2.6.21-rc3/fs/pipe.c	2007-03-07 17:42:36.000000000 +0100
+++ linux-2.6.21-rc3-ed/fs/pipe.c	2007-03-08 17:44:06.000000000 +0100
@@ -841,8 +841,15 @@ static int pipefs_delete_dentry(struct d
 	return 0;
 }
 
+static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	snprintf(buffer, buflen, "pipe:[%lu]", dentry->d_inode->i_ino);
+	return buffer;
+}
+
 static struct dentry_operations pipefs_dentry_operations = {
 	.d_delete	= pipefs_delete_dentry,
+	.d_dname	= pipefs_dname,
 };
 
 static struct inode * get_pipe_inode(void)
@@ -888,8 +895,7 @@ struct file *create_write_pipe(void)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	char name[32];
-	struct qstr this;
+	struct qstr name = { .name = "" };
 
 	f = get_empty_filp();
 	if (!f)
@@ -899,11 +905,8 @@ struct file *create_write_pipe(void)
 	if (!inode)
 		goto err_file;
 
-	this.len = sprintf(name, "[%lu]", inode->i_ino);
-	this.name = name;
-	this.hash = 0;
 	err = -ENOMEM;
-	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &this);
+	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
 	if (!dentry)
 		goto err_inode;
 
--- linux-2.6.21-rc3/net/socket.c	2007-03-07 17:37:56.000000000 +0100
+++ linux-2.6.21-rc3-ed/net/socket.c	2007-03-08 11:57:41.000000000 +0100
@@ -314,8 +314,16 @@ static int sockfs_delete_dentry(struct d
 	dentry->d_flags |= DCACHE_UNHASHED;
 	return 0;
 }
+
+static char *sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	snprintf(buffer, buflen, "socket:[%lu]", dentry->d_inode->i_ino);
+	return buffer;
+}
+
 static struct dentry_operations sockfs_dentry_operations = {
 	.d_delete = sockfs_delete_dentry,
+	.d_dname  = sockfs_dname,
 };
 
 /*
@@ -355,14 +363,9 @@ static int sock_alloc_fd(struct file **f
 
 static int sock_attach_fd(struct socket *sock, struct file *file)
 {
-	struct qstr this;
-	char name[32];
-
-	this.len = sprintf(name, "[%lu]", SOCK_INODE(sock)->i_ino);
-	this.name = name;
-	this.hash = 0;
+	struct qstr name = { .name = "" };
 
-	file->f_path.dentry = d_alloc(sock_mnt->mnt_sb->s_root, &this);
+	file->f_path.dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
 	if (unlikely(!file->f_path.dentry))
 		return -ENOMEM;
 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH] VFS : Delay the dentry name generation on sockets and pipes.
  2007-03-08 16:58                     ` [PATCH] VFS : Delay the dentry name generation on sockets and pipes Eric Dumazet
@ 2007-03-08 18:29                       ` Eric Dumazet
  2007-03-09 10:18                         ` [PATCH, take2] " Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2007-03-08 18:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Christoph Hellwig, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 1802 bytes --]

Hi Andrew

I am sorry, my previous patch had a /proc/*/fd/ in a comment, so the */ closed 
the comment and fs/dcache.c could not compile.

Could you please put this 'final-final' version in mm for testing ?

Thank's to all contributors, sorry for the noise.

[PATCH] VFS : Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to build a pathname 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all 
pipes/sockets) may need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen);


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>

 Documentation/filesystems/Locking |    2 ++
 Documentation/filesystems/vfs.txt |   12 +++++++++++-
 fs/dcache.c                       |   10 ++++++++++
 fs/pipe.c                         |   15 +++++++++------
 include/linux/dcache.h            |    1 +
 net/socket.c                      |   17 ++++++++++-------
 6 files changed, 43 insertions(+), 14 deletions(-)

[-- Attachment #2: introduce_d_dname.patch --]
[-- Type: text/plain, Size: 5920 bytes --]

--- linux-2.6.21-rc3/include/linux/dcache.h	2007-03-07 17:23:55.000000000 +0100
+++ linux-2.6.21-rc3-ed/include/linux/dcache.h	2007-03-08 11:57:41.000000000 +0100
@@ -133,6 +133,7 @@ struct dentry_operations {
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
+	char *(*d_dname)(struct dentry *, char *, int);
 };
 
 /* the dentry parameter passed to d_hash and d_compare is the parent
--- linux-2.6.21-rc3/Documentation/filesystems/vfs.txt	2007-03-08 10:14:38.000000000 +0100
+++ linux-2.6.21-rc3-ed/Documentation/filesystems/vfs.txt	2007-03-08 12:08:56.000000000 +0100
@@ -827,7 +827,7 @@ This describes how a filesystem can over
 operations. Dentries and the dcache are the domain of the VFS and the
 individual filesystem implementations. Device drivers have no business
 here. These methods may be set to NULL, as they are either optional or
-the VFS uses a default. As of kernel 2.6.13, the following members are
+the VFS uses a default. As of kernel 2.6.22, the following members are
 defined:
 
 struct dentry_operations {
@@ -837,6 +837,7 @@ struct dentry_operations {
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
+	char *(*d_dname)(struct dentry *, char *, int);
 };
 
   d_revalidate: called when the VFS needs to revalidate a dentry. This
@@ -859,6 +860,15 @@ struct dentry_operations {
 	VFS calls iput(). If you define this method, you must call
 	iput() yourself
 
+  d_dname: called when the pathname of a dentry should be generated.
+	Usefull for some pseudo filesystems (sockfs, pipefs, ...) to delay
+	pathname generation. (Instead of doing it when dentry is created,
+	its done only when the path is needed.). Real filesystems probably
+	dont want to use it, because their dentries are present in global
+	dcache hash, so their hash should be an invariant. As no lock is
+	held, d_dname() should not try to modify the dentry itself, unless
+	appropriate SMP safety is used.
+
 Each dentry has a pointer to its parent dentry, as well as a hash list
 of child dentries. Child dentries are basically like files in a
 directory.
--- linux-2.6.21-rc3/Documentation/filesystems/Locking	2007-03-08 10:29:04.000000000 +0100
+++ linux-2.6.21-rc3-ed/Documentation/filesystems/Locking	2007-03-08 12:08:56.000000000 +0100
@@ -15,6 +15,7 @@ prototypes:
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
+	char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen);
 
 locking rules:
 	none have BKL
@@ -25,6 +26,7 @@ d_compare:	no		yes		no		no 
 d_delete:	yes		no		yes		no
 d_release:	no		no		no		yes
 d_iput:		no		no		no		yes
+d_dname:	no		no		no		no
 
 --------------------------- inode_operations --------------------------- 
 prototypes:
--- linux-2.6.21-rc3/fs/dcache.c	2007-03-07 17:23:55.000000000 +0100
+++ linux-2.6.21-rc3-ed/fs/dcache.c	2007-03-08 17:47:04.000000000 +0100
@@ -1823,6 +1823,16 @@ char * d_path(struct dentry *dentry, str
 	struct vfsmount *rootmnt;
 	struct dentry *root;
 
+	/*
+	 * We have various synthetic filesystems that never get mounted.  On
+	 * these filesystems dentries are never used for lookup purposes, and
+	 * thus don't need to be hashed.  They also don't need a name until a
+	 * user wants to identify the object in /proc/pid/fd/. The little hack
+	 * below allows us to generate a name for these objects on demand:
+	 */
+	if (dentry->d_op && dentry->d_op->d_dname)
+		return dentry->d_op->d_dname(dentry, buf, buflen);
+
 	read_lock(&current->fs->lock);
 	rootmnt = mntget(current->fs->rootmnt);
 	root = dget(current->fs->root);
--- linux-2.6.21-rc3/fs/pipe.c	2007-03-07 17:42:36.000000000 +0100
+++ linux-2.6.21-rc3-ed/fs/pipe.c	2007-03-08 17:44:06.000000000 +0100
@@ -841,8 +841,15 @@ static int pipefs_delete_dentry(struct d
 	return 0;
 }
 
+static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	snprintf(buffer, buflen, "pipe:[%lu]", dentry->d_inode->i_ino);
+	return buffer;
+}
+
 static struct dentry_operations pipefs_dentry_operations = {
 	.d_delete	= pipefs_delete_dentry,
+	.d_dname	= pipefs_dname,
 };
 
 static struct inode * get_pipe_inode(void)
@@ -888,8 +895,7 @@ struct file *create_write_pipe(void)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	char name[32];
-	struct qstr this;
+	struct qstr name = { .name = "" };
 
 	f = get_empty_filp();
 	if (!f)
@@ -899,11 +905,8 @@ struct file *create_write_pipe(void)
 	if (!inode)
 		goto err_file;
 
-	this.len = sprintf(name, "[%lu]", inode->i_ino);
-	this.name = name;
-	this.hash = 0;
 	err = -ENOMEM;
-	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &this);
+	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
 	if (!dentry)
 		goto err_inode;
 
--- linux-2.6.21-rc3/net/socket.c	2007-03-07 17:37:56.000000000 +0100
+++ linux-2.6.21-rc3-ed/net/socket.c	2007-03-08 11:57:41.000000000 +0100
@@ -314,8 +314,16 @@ static int sockfs_delete_dentry(struct d
 	dentry->d_flags |= DCACHE_UNHASHED;
 	return 0;
 }
+
+static char *sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	snprintf(buffer, buflen, "socket:[%lu]", dentry->d_inode->i_ino);
+	return buffer;
+}
+
 static struct dentry_operations sockfs_dentry_operations = {
 	.d_delete = sockfs_delete_dentry,
+	.d_dname  = sockfs_dname,
 };
 
 /*
@@ -355,14 +363,9 @@ static int sock_alloc_fd(struct file **f
 
 static int sock_attach_fd(struct socket *sock, struct file *file)
 {
-	struct qstr this;
-	char name[32];
-
-	this.len = sprintf(name, "[%lu]", SOCK_INODE(sock)->i_ino);
-	this.name = name;
-	this.hash = 0;
+	struct qstr name = { .name = "" };
 
-	file->f_path.dentry = d_alloc(sock_mnt->mnt_sb->s_root, &this);
+	file->f_path.dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
 	if (unlikely(!file->f_path.dentry))
 		return -ENOMEM;
 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-08 11:11                 ` Eric Dumazet
  2007-03-08 11:18                   ` Christoph Hellwig
  2007-03-08 15:52                   ` Linus Torvalds
@ 2007-03-08 20:00                   ` Bob Copeland
  2 siblings, 0 replies; 45+ messages in thread
From: Bob Copeland @ 2007-03-08 20:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Hellwig, Linus Torvalds, Davide Libenzi, Avi Kivity,
	Linux Kernel Mailing List, Andrew Morton, Al Viro

> +       its done only when the path is needed.). Real filesystems probably

> +       dont want to use it, because their dentries are present in global

Pedantry: it's and don't?

-Bob

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-08  7:24                           ` Eric Dumazet
  2007-03-08 16:57                             ` Valdis.Kletnieks
@ 2007-03-09  1:34                             ` Kyle Moffett
  2007-03-09  2:52                               ` Anton Blanchard
  2007-03-09  2:51                             ` Anton Blanchard
  2 siblings, 1 reply; 45+ messages in thread
From: Kyle Moffett @ 2007-03-09  1:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michael K. Edwards, Linus Torvalds, Anton Blanchard,
	Davide Libenzi, Avi Kivity, Linux Kernel Mailing List,
	Andrew Morton, Al Viro

On Mar 08, 2007, at 02:24:04, Eric Dumazet wrote:
> Kyle Moffett a écrit :
>> Prefetching is also fairly critical on a Power4 or G5 PowerPC  
>> system as they have a long memory latency; an L2-cache miss can  
>> cost 200+ cycles.  On such systems the "dcbt" prefetch instruction  
>> brings in a single 128-byte cacheline and has no serializing  
>> effects whatsoever, making it ideal for use in a linked-list- 
>> traversal inner loop.
>
> OK, 200 cycles...
>
> But what is the cost of the conditional branch you added in prefetch 
> (x) ?
>
> if (!x) return;
>
> (correctly predicted or not, but do powerPC have a BTB ?)

Well, PowerPC "dcbt" does prefetch() correctly, it doesn't ever raise  
exceptions, doesn't have any side effects, takes only enough CPU to  
decode the address, and is ignored if it would have to do anything  
other than load the cacheline from RAM.  Prefetch streams are halted  
when they reach the end of a page boundary (no trapping to the MMU)  
and if the TLB entry isn't present then they would asynchronously  
abort.  This is a suitable prefetch implementation for G5:

   #define prefetch(x) __asm__ __volatile__ ("dcbt 0, %0"::"r"(x));

On GCC4+ the "__builtin_prefetch" built-in functions is probably  
mildly better tuned for most architectures:

   #define PREFETCH_RD 0
   #define PREFETCH_WR 1
   #define PREFETCH_LOCALITY_NONE 0
   #define PREFETCH_LOCALITY_LOW  1
   #define PREFETCH_LOCALITY_MED  2
   #define PREFETCH_LOCALITY_HIGH 3
   #define prefetch(x, args...) __builtin_prefetch(x ,##args)

The pertinent GCC docs:
> This function is used to minimize cache-miss latency by moving data  
> into a cache before it is accessed. You can insert calls to  
> __builtin_prefetch into code for which you know addresses of data  
> in memory that is likely to be accessed soon. If the target  
> supports them, data prefetch instructions will be generated. If the  
> prefetch is done early enough before the access then the data will  
> be in the cache by the time it is accessed.
>
> The value of addr is the address of the memory to prefetch. There  
> are two optional arguments, rw and locality. The value of rw is a  
> compile-time constant one or zero; one means that the prefetch is  
> preparing for a write to the memory address and zero, the default,  
> means that the prefetch is preparing for a read. The value locality  
> must be a compile-time constant integer between zero and three. A  
> value of zero means that the data has no temporal locality, so it  
> need not be left in the cache after the access. A value of three  
> means that the data has a high degree of temporal locality and  
> should be left in all levels of cache possible. Values of one and  
> two mean, respectively, a low or moderate degree of temporal  
> locality. The default is three.
> 	for (i = 0; i < n; i++) {
> 		a[i] = a[i] + b[i];
> 		__builtin_prefetch (&a[i+j], 1, 1);
> 		__builtin_prefetch (&b[i+j], 0, 1);
> 		/* ... */
> 	}
>
> Data prefetch does not generate faults if addr is invalid, but the  
> address expression itself must be valid. For example, a prefetch of  
> p->next will not fault if p->next is not a valid address, but  
> evaluation will fault if p is not a valid address.
>
> If the target does not support data prefetch, the address  
> expression is evaluated if it includes side effects but no other  
> code is generated and GCC does not issue a warning.

Cheers,
Kyle Moffett


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-07 22:57                     ` Linus Torvalds
  2007-03-08  1:25                       ` Michael K. Edwards
@ 2007-03-09  2:46                       ` Anton Blanchard
  1 sibling, 0 replies; 45+ messages in thread
From: Anton Blanchard @ 2007-03-09  2:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Davide Libenzi, Avi Kivity,
	Linux Kernel Mailing List, Andrew Morton, Al Viro


> Yeah, I'm not at all surprised. Any implementation of "prefetch" that 
> doesn't just turn into a no-op if the TLB entry doesn't exist (which makes 
> them weaker for *actual* prefetching) will generally have a hard time with 
> a NULL pointer. Exactly because it will try to do a totally unnecessary 
> TLB fill - and since most CPU's will not cache negative TLB entries, that 
> unnecessary TLB fill will be done over and over and over again..

Yeah this is exactly what we were seeing :)

Anton

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-08  7:24                           ` Eric Dumazet
  2007-03-08 16:57                             ` Valdis.Kletnieks
  2007-03-09  1:34                             ` Kyle Moffett
@ 2007-03-09  2:51                             ` Anton Blanchard
  2 siblings, 0 replies; 45+ messages in thread
From: Anton Blanchard @ 2007-03-09  2:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Kyle Moffett, Michael K. Edwards, Linus Torvalds, Davide Libenzi,
	Avi Kivity, Linux Kernel Mailing List, Andrew Morton, Al Viro


> OK, 200 cycles...
> 
> But what is the cost of the conditional branch you added in prefetch(x) ?

Much less than the tablewalk. On ppc64 a tablewalk of an address that is
not populated in the hashtable will incur 2 cacheline lookups (primary
and secondary buckets). This plus the MMU state machine overhead adds up.

Cue Linus rant about PowerPC MMU :)

Anton

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-09  1:34                             ` Kyle Moffett
@ 2007-03-09  2:52                               ` Anton Blanchard
  0 siblings, 0 replies; 45+ messages in thread
From: Anton Blanchard @ 2007-03-09  2:52 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Eric Dumazet, Michael K. Edwards, Linus Torvalds, Davide Libenzi,
	Avi Kivity, Linux Kernel Mailing List, Andrew Morton, Al Viro


Hi,

> Well, PowerPC "dcbt" does prefetch() correctly, it doesn't ever raise  
> exceptions, doesn't have any side effects, takes only enough CPU to  
> decode the address, and is ignored if it would have to do anything  
> other than load the cacheline from RAM.  Prefetch streams are halted  
> when they reach the end of a page boundary (no trapping to the MMU)  
> and if the TLB entry isn't present then they would asynchronously  
> abort.  

It depends on the implementation and the HID bit settings. Some do walk
the MMU hashtable if it isnt in the TLB.

Anton

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH, take2] VFS : Delay the dentry name generation on sockets and pipes.
  2007-03-08 18:29                       ` Eric Dumazet
@ 2007-03-09 10:18                         ` Eric Dumazet
  2007-03-09 18:22                           ` Linus Torvalds
  2007-03-10  1:21                           ` [PATCH, take3] " Eric Dumazet
  0 siblings, 2 replies; 45+ messages in thread
From: Eric Dumazet @ 2007-03-09 10:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Christoph Hellwig, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 2359 bytes --]

Hi Andrew

Please find a new version of this patch : I realized d_path() has very 
uncommon semantic (it seems nobody caught the point in previous patches), and 
had to change the documentation and pipefs_dname() / sockfs_dname() 
accordingly.

Now, readlink("/proc/pid/fd/xx", buffer, 4096) returns the exact number of 
bytes, not the whole 4095 bytes :)

Extract of new Documentation:

	CAUTION : d_path() logic is quite  tricky. 
	The correct way to return for example "Hello" is to put it
	at the end of the buffer, and returns a pointer to the first char.

       Example :

static char *somefs_dname(struct dentry *dent, char *buffer, int buflen)
{
       char *string = "Hello";
       int sz = strlen(string) + 1;
       if (sz > buflen)
               return ERR_PTR(-ENAMETOOLONG);
       buffer += (buflen - sz);
       return memcpy(buffer, string, sz);
}



Thank you

[PATCH] VFS : Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to build a pathname 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all 
pipes/sockets) may need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen);


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>

 Documentation/filesystems/Locking |    2 ++
 Documentation/filesystems/vfs.txt |   26 +++++++++++++++++++++++++-
 fs/dcache.c                       |   10 ++++++++++
 fs/pipe.c                         |   23 +++++++++++++++++------
 include/linux/dcache.h            |    1 +
 net/socket.c                      |   25 ++++++++++++++++++-------
 6 files changed, 73 insertions(+), 14 deletions(-)


[-- Attachment #2: introduce_d_dname_2.patch --]
[-- Type: text/plain, Size: 6947 bytes --]

--- linux-2.6.21-rc3/include/linux/dcache.h	2007-03-07 17:23:55.000000000 +0100
+++ linux-2.6.21-rc3-ed/include/linux/dcache.h	2007-03-08 11:57:41.000000000 +0100
@@ -133,6 +133,7 @@ struct dentry_operations {
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
+	char *(*d_dname)(struct dentry *, char *, int);
 };
 
 /* the dentry parameter passed to d_hash and d_compare is the parent
--- linux-2.6.21-rc3/Documentation/filesystems/vfs.txt	2007-03-08 10:14:38.000000000 +0100
+++ linux-2.6.21-rc3-ed/Documentation/filesystems/vfs.txt	2007-03-09 10:25:44.000000000 +0100
@@ -827,7 +827,7 @@ This describes how a filesystem can over
 operations. Dentries and the dcache are the domain of the VFS and the
 individual filesystem implementations. Device drivers have no business
 here. These methods may be set to NULL, as they are either optional or
-the VFS uses a default. As of kernel 2.6.13, the following members are
+the VFS uses a default. As of kernel 2.6.22, the following members are
 defined:
 
 struct dentry_operations {
@@ -837,6 +837,7 @@ struct dentry_operations {
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
+	char *(*d_dname)(struct dentry *, char *, int);
 };
 
   d_revalidate: called when the VFS needs to revalidate a dentry. This
@@ -859,6 +860,29 @@ struct dentry_operations {
 	VFS calls iput(). If you define this method, you must call
 	iput() yourself
 
+  d_dname: called when the pathname of a dentry should be generated.
+	Usefull for some pseudo filesystems (sockfs, pipefs, ...) to delay
+	pathname generation. (Instead of doing it when dentry is created,
+	its done only when the path is needed.). Real filesystems probably
+	dont want to use it, because their dentries are present in global
+	dcache hash, so their hash should be an invariant. As no lock is
+	held, d_dname() should not try to modify the dentry itself, unless
+	appropriate SMP safety is used. CAUTION : d_path() logic is quite
+	tricky. The correct way to return for example "Hello" is to put it
+	at the end of the buffer, and returns a pointer to the first char.
+
+	Example :
+
+static char *somefs_dname(struct dentry *dent, char *buffer, int buflen)
+{
+	char *string = "Hello";
+	int sz = strlen(string) + 1;
+	if (sz > buflen)
+		return ERR_PTR(-ENAMETOOLONG);
+	buffer += (buflen - sz);
+	return memcpy(buffer, string, sz);
+}
+
 Each dentry has a pointer to its parent dentry, as well as a hash list
 of child dentries. Child dentries are basically like files in a
 directory.
--- linux-2.6.21-rc3/Documentation/filesystems/Locking	2007-03-08 10:29:04.000000000 +0100
+++ linux-2.6.21-rc3-ed/Documentation/filesystems/Locking	2007-03-08 12:08:56.000000000 +0100
@@ -15,6 +15,7 @@ prototypes:
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
+	char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen);
 
 locking rules:
 	none have BKL
@@ -25,6 +26,7 @@ d_compare:	no		yes		no		no 
 d_delete:	yes		no		yes		no
 d_release:	no		no		no		yes
 d_iput:		no		no		no		yes
+d_dname:	no		no		no		no
 
 --------------------------- inode_operations --------------------------- 
 prototypes:
--- linux-2.6.21-rc3/fs/dcache.c	2007-03-07 17:23:55.000000000 +0100
+++ linux-2.6.21-rc3-ed/fs/dcache.c	2007-03-08 19:22:59.000000000 +0100
@@ -1823,6 +1823,16 @@ char * d_path(struct dentry *dentry, str
 	struct vfsmount *rootmnt;
 	struct dentry *root;
 
+	/*
+	 * We have various synthetic filesystems that never get mounted.  On
+	 * these filesystems dentries are never used for lookup purposes, and
+	 * thus don't need to be hashed.  They also don't need a name until a
+	 * user wants to identify the object in /proc/pid/fd/.  The little hack
+	 * below allows us to generate a name for these objects on demand:
+	 */
+	if (dentry->d_op && dentry->d_op->d_dname)
+		return dentry->d_op->d_dname(dentry, buf, buflen);
+
 	read_lock(&current->fs->lock);
 	rootmnt = mntget(current->fs->rootmnt);
 	root = dget(current->fs->root);
--- linux-2.6.21-rc3/fs/pipe.c	2007-03-07 17:42:36.000000000 +0100
+++ linux-2.6.21-rc3-ed/fs/pipe.c	2007-03-09 10:25:44.000000000 +0100
@@ -841,8 +841,23 @@ static int pipefs_delete_dentry(struct d
 	return 0;
 }
 
+/*
+ * pipefs_dname() is called from d_path().
+ * We must build the pathname of this pipe and put it at buffer end
+ */
+static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	char temp[7 /*strlen("pipe:[]")*/ + 3*sizeof(unsigned long) + 1];
+	int sz = sprintf(temp, "pipe:[%lu]", dentry->d_inode->i_ino) + 1;
+	if (sz > buflen)
+		return ERR_PTR(-ENAMETOOLONG);
+	buffer += (buflen - sz);
+	return memcpy(buffer, temp, sz);
+}
+
 static struct dentry_operations pipefs_dentry_operations = {
 	.d_delete	= pipefs_delete_dentry,
+	.d_dname	= pipefs_dname,
 };
 
 static struct inode * get_pipe_inode(void)
@@ -888,8 +903,7 @@ struct file *create_write_pipe(void)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	char name[32];
-	struct qstr this;
+	struct qstr name = { .name = "" };
 
 	f = get_empty_filp();
 	if (!f)
@@ -899,11 +913,8 @@ struct file *create_write_pipe(void)
 	if (!inode)
 		goto err_file;
 
-	this.len = sprintf(name, "[%lu]", inode->i_ino);
-	this.name = name;
-	this.hash = 0;
 	err = -ENOMEM;
-	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &this);
+	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
 	if (!dentry)
 		goto err_inode;
 
--- linux-2.6.21-rc3/net/socket.c	2007-03-07 17:37:56.000000000 +0100
+++ linux-2.6.21-rc3-ed/net/socket.c	2007-03-09 10:25:44.000000000 +0100
@@ -314,8 +314,24 @@ static int sockfs_delete_dentry(struct d
 	dentry->d_flags |= DCACHE_UNHASHED;
 	return 0;
 }
+
+/*
+ * sockfs_dname() is called from d_path().
+ * We must build the pathname of this socket and put it at buffer end
+ */
+static char *sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	char temp[9 /*strlen("socket:[]")*/ + 3*sizeof(unsigned long) + 1];
+	int sz = sprintf(temp, "socket:[%lu]", dentry->d_inode->i_ino) + 1;
+	if (sz > buflen)
+		return ERR_PTR(-ENAMETOOLONG);
+	buffer += (buflen - sz);
+	return memcpy(buffer, temp, sz);
+}
+
 static struct dentry_operations sockfs_dentry_operations = {
 	.d_delete = sockfs_delete_dentry,
+	.d_dname  = sockfs_dname,
 };
 
 /*
@@ -355,14 +371,9 @@ static int sock_alloc_fd(struct file **f
 
 static int sock_attach_fd(struct socket *sock, struct file *file)
 {
-	struct qstr this;
-	char name[32];
-
-	this.len = sprintf(name, "[%lu]", SOCK_INODE(sock)->i_ino);
-	this.name = name;
-	this.hash = 0;
+	struct qstr name = { .name = "" };
 
-	file->f_path.dentry = d_alloc(sock_mnt->mnt_sb->s_root, &this);
+	file->f_path.dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
 	if (unlikely(!file->f_path.dentry))
 		return -ENOMEM;
 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH, take2] VFS : Delay the dentry name generation on sockets and pipes.
  2007-03-09 10:18                         ` [PATCH, take2] " Eric Dumazet
@ 2007-03-09 18:22                           ` Linus Torvalds
  2007-03-10  1:21                           ` [PATCH, take3] " Eric Dumazet
  1 sibling, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2007-03-09 18:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andrew Morton, Christoph Hellwig, Linux Kernel Mailing List



On Fri, 9 Mar 2007, Eric Dumazet wrote:
> 
> 	CAUTION : d_path() logic is quite  tricky. 
> 	The correct way to return for example "Hello" is to put it
> 	at the end of the buffer, and returns a pointer to the first char.

Yeah, it's subtle, since it wants to use a single buffer, and not copy 
things around too much.

But can I ask you to do a take3, and simply have a helper function like

	char *dynamic_dname(struct dentry *dentry, char *buffer, int len,
		const char *fmt, ...)
	{
		va_list args;
		char temp[64];
		int i;

		va_start(args, fmt);
		i = vsnprintf(tmp,sizeof(tmp),fmt,args) + 1;
		va_end(args);

		if (i > len)
			return ERR_PTR(-ENAMETOOLONG);

		buffer += len - i;
		memcpy(buffer, tmp, i);
		return buffer;
	}

and just require that everybody use that function.

Then the pipe code would just become

	static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
	{
		return dynamic_dname(dentry, buffer, buflen, ""pipe:[%lu]", dentry->d_inode->i_ino);
	}

and you're done, and you have only *one* place in the VFS layer 
(preferably right next to d_path() itself) that cares about the
subtle issues that we have.

		Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH, take3] VFS : Delay the dentry name generation on sockets and pipes.
  2007-03-09 10:18                         ` [PATCH, take2] " Eric Dumazet
  2007-03-09 18:22                           ` Linus Torvalds
@ 2007-03-10  1:21                           ` Eric Dumazet
  1 sibling, 0 replies; 45+ messages in thread
From: Eric Dumazet @ 2007-03-10  1:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Christoph Hellwig, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 1325 bytes --]

Hi Andrew

Please find take3 of this patch : Linus suggested to introduce a helper 
function to factorize work done by most d_dname() implementations.

Thank you

[PATCH] VFS : Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to build a pathname 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all 
pipes/sockets) may need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen);

2) Adds a dynamic_dname() helper function that eases d_dname() implementations

3) Defines d_dname method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...

4) Defines d_dname method for pipes : No more sprintf() at pipe creation.
This is delayed up to the moment someone does an access to /proc/pid/fd/...

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>

[-- Attachment #2: introduce_d_dname_3.patch --]
[-- Type: text/plain, Size: 7559 bytes --]

--- linux-2.6.21-rc3/include/linux/dcache.h	2007-03-07 17:23:55.000000000 +0100
+++ linux-2.6.21-rc3-ed/include/linux/dcache.h	2007-03-09 20:08:36.000000000 +0100
@@ -133,6 +133,7 @@ struct dentry_operations {
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
+	char *(*d_dname)(struct dentry *, char *, int);
 };
 
 /* the dentry parameter passed to d_hash and d_compare is the parent
@@ -293,6 +294,11 @@ extern struct dentry * d_hash_and_lookup
 /* validate "insecure" dentry pointer */
 extern int d_validate(struct dentry *, struct dentry *);
 
+/*
+ * helper function for dentry_operations.d_dname() members
+ */
+extern char *dynamic_dname(struct dentry *, char *, int, const char *, ...);
+
 extern char * d_path(struct dentry *, struct vfsmount *, char *, int);
   
 /* Allocation counts.. */
--- linux-2.6.21-rc3/Documentation/filesystems/vfs.txt	2007-03-08 10:14:38.000000000 +0100
+++ linux-2.6.21-rc3-ed/Documentation/filesystems/vfs.txt	2007-03-09 20:08:36.000000000 +0100
@@ -827,7 +827,7 @@ This describes how a filesystem can over
 operations. Dentries and the dcache are the domain of the VFS and the
 individual filesystem implementations. Device drivers have no business
 here. These methods may be set to NULL, as they are either optional or
-the VFS uses a default. As of kernel 2.6.13, the following members are
+the VFS uses a default. As of kernel 2.6.22, the following members are
 defined:
 
 struct dentry_operations {
@@ -837,6 +837,7 @@ struct dentry_operations {
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
+	char *(*d_dname)(struct dentry *, char *, int);
 };
 
   d_revalidate: called when the VFS needs to revalidate a dentry. This
@@ -859,6 +860,26 @@ struct dentry_operations {
 	VFS calls iput(). If you define this method, you must call
 	iput() yourself
 
+  d_dname: called when the pathname of a dentry should be generated.
+	Usefull for some pseudo filesystems (sockfs, pipefs, ...) to delay
+	pathname generation. (Instead of doing it when dentry is created,
+	its done only when the path is needed.). Real filesystems probably
+	dont want to use it, because their dentries are present in global
+	dcache hash, so their hash should be an invariant. As no lock is
+	held, d_dname() should not try to modify the dentry itself, unless
+	appropriate SMP safety is used. CAUTION : d_path() logic is quite
+	tricky. The correct way to return for example "Hello" is to put it
+	at the end of the buffer, and returns a pointer to the first char.
+	dynamic_dname() helper function is provided to take care of this.
+
+Example :
+
+static char *pipefs_dname(struct dentry *dent, char *buffer, int buflen)
+{
+	return dynamic_dname(dentry, buffer, buflen, "pipe:[%lu]",
+				dentry->d_inode->i_ino);
+}
+
 Each dentry has a pointer to its parent dentry, as well as a hash list
 of child dentries. Child dentries are basically like files in a
 directory.
--- linux-2.6.21-rc3/Documentation/filesystems/Locking	2007-03-08 10:29:04.000000000 +0100
+++ linux-2.6.21-rc3-ed/Documentation/filesystems/Locking	2007-03-08 12:08:56.000000000 +0100
@@ -15,6 +15,7 @@ prototypes:
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
+	char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen);
 
 locking rules:
 	none have BKL
@@ -25,6 +26,7 @@ d_compare:	no		yes		no		no 
 d_delete:	yes		no		yes		no
 d_release:	no		no		no		yes
 d_iput:		no		no		no		yes
+d_dname:	no		no		no		no
 
 --------------------------- inode_operations --------------------------- 
 prototypes:
--- linux-2.6.21-rc3/fs/dcache.c	2007-03-07 17:23:55.000000000 +0100
+++ linux-2.6.21-rc3-ed/fs/dcache.c	2007-03-09 20:12:26.000000000 +0100
@@ -1823,6 +1823,16 @@ char * d_path(struct dentry *dentry, str
 	struct vfsmount *rootmnt;
 	struct dentry *root;
 
+	/*
+	 * We have various synthetic filesystems that never get mounted.  On
+	 * these filesystems dentries are never used for lookup purposes, and
+	 * thus don't need to be hashed.  They also don't need a name until a
+	 * user wants to identify the object in /proc/pid/fd/.  The little hack
+	 * below allows us to generate a name for these objects on demand:
+	 */
+	if (dentry->d_op && dentry->d_op->d_dname)
+		return dentry->d_op->d_dname(dentry, buf, buflen);
+
 	read_lock(&current->fs->lock);
 	rootmnt = mntget(current->fs->rootmnt);
 	root = dget(current->fs->root);
@@ -1836,6 +1846,27 @@ char * d_path(struct dentry *dentry, str
 }
 
 /*
+ * Helper function for dentry_operations.d_dname() members
+ */
+char *dynamic_dname(struct dentry *dentry, char *buffer, int buflen,
+			const char *fmt, ...)
+{
+	va_list args;
+	char temp[64];
+	int sz;
+
+	va_start(args, fmt);
+	sz = vsnprintf(temp, sizeof(temp), fmt, args) + 1;
+	va_end(args);
+
+	if (sz > sizeof(temp) || sz > buflen)
+		return ERR_PTR(-ENAMETOOLONG);
+
+	buffer += buflen - sz;
+	return memcpy(buffer, temp, sz);
+}
+
+/*
  * NOTE! The user-level library version returns a
  * character pointer. The kernel system call just
  * returns the length of the buffer filled (which
--- linux-2.6.21-rc3/fs/pipe.c	2007-03-07 17:42:36.000000000 +0100
+++ linux-2.6.21-rc3-ed/fs/pipe.c	2007-03-09 19:58:22.000000000 +0100
@@ -841,8 +841,18 @@ static int pipefs_delete_dentry(struct d
 	return 0;
 }
 
+/*
+ * pipefs_dname() is called from d_path().
+ */
+static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	return dynamic_dname(dentry, buffer, buflen, "pipe:[%lu]",
+				dentry->d_inode->i_ino);
+}
+
 static struct dentry_operations pipefs_dentry_operations = {
 	.d_delete	= pipefs_delete_dentry,
+	.d_dname	= pipefs_dname,
 };
 
 static struct inode * get_pipe_inode(void)
@@ -888,8 +898,7 @@ struct file *create_write_pipe(void)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	char name[32];
-	struct qstr this;
+	struct qstr name = { .name = "" };
 
 	f = get_empty_filp();
 	if (!f)
@@ -899,11 +908,8 @@ struct file *create_write_pipe(void)
 	if (!inode)
 		goto err_file;
 
-	this.len = sprintf(name, "[%lu]", inode->i_ino);
-	this.name = name;
-	this.hash = 0;
 	err = -ENOMEM;
-	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &this);
+	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
 	if (!dentry)
 		goto err_inode;
 
--- linux-2.6.21-rc3/net/socket.c	2007-03-07 17:37:56.000000000 +0100
+++ linux-2.6.21-rc3-ed/net/socket.c	2007-03-09 19:58:22.000000000 +0100
@@ -314,8 +314,19 @@ static int sockfs_delete_dentry(struct d
 	dentry->d_flags |= DCACHE_UNHASHED;
 	return 0;
 }
+
+/*
+ * sockfs_dname() is called from d_path().
+ */
+static char *sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	return dynamic_dname(dentry, buffer, buflen, "socket:[%lu]",
+				dentry->d_inode->i_ino);
+}
+
 static struct dentry_operations sockfs_dentry_operations = {
 	.d_delete = sockfs_delete_dentry,
+	.d_dname  = sockfs_dname,
 };
 
 /*
@@ -355,14 +366,9 @@ static int sock_alloc_fd(struct file **f
 
 static int sock_attach_fd(struct socket *sock, struct file *file)
 {
-	struct qstr this;
-	char name[32];
-
-	this.len = sprintf(name, "[%lu]", SOCK_INODE(sock)->i_ino);
-	this.name = name;
-	this.hash = 0;
+	struct qstr name = { .name = "" };
 
-	file->f_path.dentry = d_alloc(sock_mnt->mnt_sb->s_root, &this);
+	file->f_path.dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
 	if (unlikely(!file->f_path.dentry))
 		return -ENOMEM;
 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
       [not found] <45ED046A.5010508@argo.co.il>
@ 2007-03-06  7:27 ` Davide Libenzi
  0 siblings, 0 replies; 45+ messages in thread
From: Davide Libenzi @ 2007-03-06  7:27 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Linux Kernel Mailing List, Linus Torvalds, Andrew Morton

On Tue, 6 Mar 2007, Avi Kivity wrote:

> >  /* File callbacks that implement the eventpoll file behaviour */
> >  static const struct file_operations eventpoll_fops = {
> >  	.release	= ep_eventpoll_close,
> > @@ -763,15 +767,18 @@
> >  	 * using the inode number.
> >  	 */
> >  	error = -ENOMEM;
> > -	sprintf(name, "[%lu]", inode->i_ino);
> > +	sprintf(name, "[%p]", ep);
> >  	this.name = name;
> >  	this.len = strlen(name);
> > -	this.hash = inode->i_ino;
> > +	this.hash = 0;
> >  	dentry = d_alloc(eventpoll_mnt->mnt_sb->s_root, &this);
> >  	if (!dentry)
> >  		goto eexit_4;
> >  	dentry->d_op = &eventpollfs_dentry_operations;
> > -	d_add(dentry, inode);
> > +	/* Do not publish this dentry inside the global dentry hash table */
> > +	dentry->d_flags &= ~DCACHE_UNHASHED;
> > +	d_instantiate(dentry, inode);
> > +
> >   
> 
> I've used d_alloc_anon() in similar code in kvmfs.  You lose the ep 
> pointer in the name, but as a kernel address it isn't particularly 
> useful to userspace.

That ends up calling two times the find alias for an unhashed dentry, no?
And this linearly scans the inode's dentry list, no?



- Davide



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch] epoll use a single inode ...
  2007-03-05 21:25 Davide Libenzi
@ 2007-03-05 21:52 ` Eric Dumazet
  0 siblings, 0 replies; 45+ messages in thread
From: Eric Dumazet @ 2007-03-05 21:52 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Linux Kernel Mailing List, Linus Torvalds, Andrew Morton

Davide Libenzi a écrit :
>  	 * using the inode number.
>  	 */
>  	error = -ENOMEM;
> -	sprintf(name, "[%lu]", inode->i_ino);
> +	sprintf(name, "[%p]", ep);
>  	this.name = name;
>  	this.len = strlen(name);

Small remark : you can avoid strlen(), since sprintf gives you the value :)

this.len = sprintf(name, "[%p]", ep);

Also, your patch description is not complete : you forgot to say that epoll 
dentries are not anymore hashed into global dcache hashtable :)


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [patch] epoll use a single inode ...
@ 2007-03-05 21:25 Davide Libenzi
  2007-03-05 21:52 ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Davide Libenzi @ 2007-03-05 21:25 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Linus Torvalds, Andrew Morton


Epoll does not keep any private data attached to its inode, so there'd be 
no need to allocate one inode per fd. For epoll, the inode is just a 
placeholder for the file operations and could be shared by all instances.
I'd like to use the same optimization even for the upcoming file-based 
objects, so if you see problems let me know.
One that Al was pointing out was that an fstat(2) over an epoll fd would 
show the same st_ino. IMO that should be fine since an fstat(2) over an 
epoll fd is not something you want to do in any case and expecting 
meaningfull results.



Signed-off-by: Davide Libenzi <davidel@xmailserver.org>


- Davide



eventpoll.c |   36 ++++++++++++++++++++++++++++++++----
1 file changed, 32 insertions(+), 4 deletions(-)



Index: linux-2.6.20.ep2/fs/eventpoll.c
===================================================================
--- linux-2.6.20.ep2.orig/fs/eventpoll.c	2007-03-04 14:40:01.000000000 -0800
+++ linux-2.6.20.ep2/fs/eventpoll.c	2007-03-05 13:03:52.000000000 -0800
@@ -258,6 +258,7 @@
 		   int maxevents, long timeout);
 static int eventpollfs_delete_dentry(struct dentry *dentry);
 static struct inode *ep_eventpoll_inode(void);
+static struct inode *ep_create_inode(void);
 static int eventpollfs_get_sb(struct file_system_type *fs_type,
 			      int flags, const char *dev_name,
 			      void *data, struct vfsmount *mnt);
@@ -279,6 +280,9 @@
 /* Virtual fs used to allocate inodes for eventpoll files */
 static struct vfsmount *eventpoll_mnt __read_mostly;
 
+/* Placeholder inode for eventpoll fds */
+static struct inode *eventpoll_inode;
+
 /* File callbacks that implement the eventpoll file behaviour */
 static const struct file_operations eventpoll_fops = {
 	.release	= ep_eventpoll_close,
@@ -763,15 +767,18 @@
 	 * using the inode number.
 	 */
 	error = -ENOMEM;
-	sprintf(name, "[%lu]", inode->i_ino);
+	sprintf(name, "[%p]", ep);
 	this.name = name;
 	this.len = strlen(name);
-	this.hash = inode->i_ino;
+	this.hash = 0;
 	dentry = d_alloc(eventpoll_mnt->mnt_sb->s_root, &this);
 	if (!dentry)
 		goto eexit_4;
 	dentry->d_op = &eventpollfs_dentry_operations;
-	d_add(dentry, inode);
+	/* Do not publish this dentry inside the global dentry hash table */
+	dentry->d_flags &= ~DCACHE_UNHASHED;
+	d_instantiate(dentry, inode);
+
 	file->f_path.mnt = mntget(eventpoll_mnt);
 	file->f_path.dentry = dentry;
 	file->f_mapping = inode->i_mapping;
@@ -1555,6 +1562,11 @@
 
 static int eventpollfs_delete_dentry(struct dentry *dentry)
 {
+	/*
+	 * We faked vfs to believe the dentry was hashed when we created it.
+	 * Now we restore the flag so that dput() will work correctly.
+	 */
+	dentry->d_flags |= DCACHE_UNHASHED;
 
 	return 1;
 }
@@ -1562,6 +1574,17 @@
 
 static struct inode *ep_eventpoll_inode(void)
 {
+
+	return igrab(eventpoll_inode);
+}
+
+/*
+ * A single inode exist for all eventpoll files. On the contrary of pipes,
+ * eventpoll inodes has no per-instance data associated, so we can avoid
+ * the allocation of multiple of them.
+ */
+static struct inode *ep_create_inode(void)
+{
 	int error = -ENOMEM;
 	struct inode *inode = new_inode(eventpoll_mnt->mnt_sb);
 
@@ -1626,10 +1649,14 @@
 
 	/* Mount the above commented virtual file system */
 	eventpoll_mnt = kern_mount(&eventpoll_fs_type);
-	error = PTR_ERR(eventpoll_mnt);
 	if (IS_ERR(eventpoll_mnt))
 		goto epanic;
 
+	/* Create the single instance of inode for all eventpoll fds */
+	eventpoll_inode = ep_create_inode();
+	if (IS_ERR(eventpoll_inode))
+		goto epanic;
+
 	DNPRINTK(3, (KERN_INFO "[%p] eventpoll: successfully initialized.\n",
 			current));
 	return 0;
@@ -1642,6 +1669,7 @@
 static void __exit eventpoll_exit(void)
 {
 	/* Undo all operations done inside eventpoll_init() */
+	iput(eventpoll_inode);
 	unregister_filesystem(&eventpoll_fs_type);
 	mntput(eventpoll_mnt);
 	kmem_cache_destroy(pwq_cache);

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2007-03-10  1:21 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <45ED1A3C.6030707@argo.co.il>
2007-03-07  0:37 ` [patch] epoll use a single inode Davide Libenzi
2007-03-07  0:51   ` Linus Torvalds
2007-03-07  1:01     ` Davide Libenzi
2007-03-07  1:27       ` Linus Torvalds
2007-03-07  1:47         ` Davide Libenzi
2007-03-07  6:52     ` Eric Dumazet
2007-03-07  7:15       ` Davide Libenzi
2007-03-07  7:16       ` Eric Dumazet
2007-03-07  8:56         ` Christoph Hellwig
2007-03-07 17:21         ` Linus Torvalds
2007-03-07 17:02       ` Linus Torvalds
2007-03-07 17:31         ` Eric Dumazet
2007-03-07 17:36           ` Eric Dumazet
2007-03-07 17:45           ` Linus Torvalds
2007-03-07 18:06             ` Eric Dumazet
2007-03-07 18:30               ` Linus Torvalds
2007-03-07 18:52                 ` Eric Dumazet
2007-03-07 19:07                   ` Linus Torvalds
2007-03-07 22:14                   ` Anton Blanchard
2007-03-07 22:57                     ` Linus Torvalds
2007-03-08  1:25                       ` Michael K. Edwards
2007-03-08  2:48                         ` Kyle Moffett
2007-03-08  7:24                           ` Eric Dumazet
2007-03-08 16:57                             ` Valdis.Kletnieks
2007-03-09  1:34                             ` Kyle Moffett
2007-03-09  2:52                               ` Anton Blanchard
2007-03-09  2:51                             ` Anton Blanchard
2007-03-08  3:20                         ` Linus Torvalds
2007-03-08  8:37                           ` Michael K. Edwards
2007-03-09  2:46                       ` Anton Blanchard
2007-03-08  8:56           ` Christoph Hellwig
2007-03-08  9:42             ` Eric Dumazet
2007-03-08 10:21               ` Christoph Hellwig
2007-03-08 11:11                 ` Eric Dumazet
2007-03-08 11:18                   ` Christoph Hellwig
2007-03-08 15:52                   ` Linus Torvalds
2007-03-08 16:58                     ` [PATCH] VFS : Delay the dentry name generation on sockets and pipes Eric Dumazet
2007-03-08 18:29                       ` Eric Dumazet
2007-03-09 10:18                         ` [PATCH, take2] " Eric Dumazet
2007-03-09 18:22                           ` Linus Torvalds
2007-03-10  1:21                           ` [PATCH, take3] " Eric Dumazet
2007-03-08 20:00                   ` [patch] epoll use a single inode Bob Copeland
     [not found] <45ED046A.5010508@argo.co.il>
2007-03-06  7:27 ` Davide Libenzi
2007-03-05 21:25 Davide Libenzi
2007-03-05 21:52 ` Eric Dumazet

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.