linux-kernel.vger.kernel.org archive mirror
* udevd is killing file write performance.
@ 2006-02-22 13:42 Robin Holt
  2006-02-22 13:55 ` Andrew Morton
  2006-02-22 16:48 ` John McCutchan
  0 siblings, 2 replies; 27+ messages in thread
From: Robin Holt @ 2006-02-22 13:42 UTC (permalink / raw)
  To: linux-kernel; +Cc: rml, arnd, ttb, hch, akpm


I have a customer application which sees a significant performance
degradation when run with udevd running.  This appears to be due to:


void inotify_dentry_parent_queue_event(struct dentry *dentry, u32 mask,
                                       u32 cookie, const char *name)
{
...
        if (!atomic_read (&inotify_watches))
                return;

        spin_lock(&dentry->d_lock);

The particular application is a 512 thread application.  When run with
udevd turned off the application finishes in about 6 seconds.  With it
turned on, the application takes minutes (I think we timed it to around
22 minutes, but we have not been patient enough to wait it through to
the end in some time).  This is being run on a 512-CPU system, but there
is a noticeable performance hit on smaller systems as well.

As far as I can tell, the application has multiple threads writing
different portions of the same file, but the customer is still working
on providing a simplified test case.

I know _VERY_ little about filesystems.  udevd appears to be looking
at /etc/udev/rules.d.  This bumps inotify_watches to 1.  The file
being written is on an xfs filesystem mounted at a different mountpoint.
Could the inotify_watches count be moved from a global to a per-sb (or
something finer) count and thereby avoid taking the dentry->d_lock when
there is no possibility of a watch event being queued?

Thanks,
Robin Holt

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-22 13:42 udevd is killing file write performance Robin Holt
@ 2006-02-22 13:55 ` Andrew Morton
  2006-02-22 16:48 ` John McCutchan
  1 sibling, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2006-02-22 13:55 UTC (permalink / raw)
  To: Robin Holt; +Cc: linux-kernel, rml, arnd, ttb, hch

Robin Holt <holt@sgi.com> wrote:
>
> 
> I have a customer application which sees a significant performance
> degradation when run with udevd running.  This appears to be due to:
> 
> 
> void inotify_dentry_parent_queue_event(struct dentry *dentry, u32 mask,
>                                        u32 cookie, const char *name)
> {
> ...
>         if (!atomic_read (&inotify_watches))
>                 return;
> 
>         spin_lock(&dentry->d_lock);
> 
> The particular application is a 512 thread application.  When run with
> udevd turned off the application finishes in about 6 seconds.  With it
> turned on, the application takes minutes (I think we timed it to around
> 22 minutes, but we have not been patient enough to wait it through to
> the end in some time).  This is being run on a 512-CPU system, but there
> is a noticeable performance hit on smaller systems as well.
> 
> As far as I can tell, the application has multiple threads writing
> different portions of the same file, but the customer is still working
> on providing a simplified test case.

Are you able to work out whether inotify_inode_watched(inode) is returning
true?

Probably it isn't, and everyone is hammering dentry->d_lock.  You'll
probably find that if all those processes were accessing the shared file
via separate filenames (all hardlinked to the same file), things would
improve.
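
As a rough userspace sketch of that workaround (hypothetical helper and
naming scheme, assuming the application can be taught to open the file
through a per-thread name):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Open the shared data file through a private, per-thread hard link so
 * that each open ends up with its own dentry (and its own d_lock).
 */
static int open_private_link(const char *shared, int tid)
{
        char name[256];
        int fd;

        snprintf(name, sizeof(name), "%s.thread%d", shared, tid);
        if (link(shared, name) != 0 && errno != EEXIST)
                return open(shared, O_RDWR);    /* fall back to the shared name */
        fd = open(name, O_RDWR);
        unlink(name);   /* the link only needs to exist long enough to open it */
        return fd;
}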

I get a screwed feeling about this.  We have to take d_lock so we can get
at the parent dentry.  Otherwise we have obscure races with rename and
unlink.
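
For reference, the full function being discussed looks roughly like this
in 2.6.16-rc (paraphrased, not copied verbatim from the tree); the d_lock
is what stabilises d_parent long enough to dget() it before queueing:

void inotify_dentry_parent_queue_event(struct dentry *dentry, u32 mask,
                                       u32 cookie, const char *name)
{
        struct dentry *parent;
        struct inode *inode;

        if (!atomic_read (&inotify_watches))
                return;

        spin_lock(&dentry->d_lock);
        parent = dentry->d_parent;
        inode = parent->d_inode;

        if (inotify_inode_watched(inode)) {
                /* pin the parent before dropping d_lock, then queue */
                dget(parent);
                spin_unlock(&dentry->d_lock);
                inotify_inode_queue_event(inode, mask, cookie, name);
                dput(parent);
        } else
                spin_unlock(&dentry->d_lock);
}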

> I know _VERY_ little about filesystems.  udevd appears to be looking
> at /etc/udev/rules.d.  This bumps inotify_watches to 1.  The file
> being written is on an xfs filesystem mounted at a different mountpoint.
> Could the inotify_watches count be moved from a global to a per-sb (or
> something finer) count and thereby avoid taking the dentry->d_lock when
> there is no possibility of a watch event being queued?

umm, yes, that's a bit of a palliative, but we could probably move
inotify_watches into dentry->d_inode->i_sb.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-22 13:42 udevd is killing file write performance Robin Holt
  2006-02-22 13:55 ` Andrew Morton
@ 2006-02-22 16:48 ` John McCutchan
  2006-02-22 17:50   ` Robin Holt
  1 sibling, 1 reply; 27+ messages in thread
From: John McCutchan @ 2006-02-22 16:48 UTC (permalink / raw)
  To: Robin Holt; +Cc: linux-kernel, rml, arnd, hch, akpm

On Wed, 2006-22-02 at 07:42 -0600, Robin Holt wrote:
> 
> I know _VERY_ little about filesystems.  udevd appears to be looking
> at /etc/udev/rules.d.  This bumps inotify_watches to 1.  The file
> being written is on an xfs filesystem mounted at a different mountpoint.
> Could the inotify_watches count be moved from a global to a per-sb (or
> something finer) count and thereby avoid taking the dentry->d_lock when
> there is no possibility of a watch event being queued?

We could do this, and avoid the problem, but only in this specific
scenario. The file being written is on a different mountpoint, but what's
to stop a different app from running inotify on that mount point?
Perhaps the program could be altered instead? 

-- 
John McCutchan <john@johnmccutchan.com>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-22 16:48 ` John McCutchan
@ 2006-02-22 17:50   ` Robin Holt
  2006-02-22 20:05     ` Andrew Morton
  2006-02-22 22:52     ` John McCutchan
  0 siblings, 2 replies; 27+ messages in thread
From: Robin Holt @ 2006-02-22 17:50 UTC (permalink / raw)
  To: John McCutchan; +Cc: Robin Holt, linux-kernel, rml, arnd, hch, akpm

On Wed, Feb 22, 2006 at 11:48:23AM -0500, John McCutchan wrote:
> On Wed, 2006-22-02 at 07:42 -0600, Robin Holt wrote:
> > 
> > I know _VERY_ little about filesystems.  udevd appears to be looking
> > at /etc/udev/rules.d.  This bumps inotify_watches to 1.  The file
> > being written is on an xfs filesystem mounted at a different mountpoint.
> > Could the inotify_watches count be moved from a global to a per-sb (or
> > something finer) count and thereby avoid taking the dentry->d_lock when
> > there is no possibility of a watch event being queued?
> 
> We could do this, and avoid the problem, but only in this specific
> scenario. The file being written is on a different mountpoint, but what's
> to stop a different app from running inotify on that mount point?
> Perhaps the program could be altered instead? 

Looking at fsnotify_access() I think we could hit the same scenario.
Are you proposing we alter any application where multiple threads read
a single data file to first make a hard link to that data file and each
read from their private copy?  I don't think that is a very reasonable
suggestion.

Let me reiterate, I know _VERY_ little about filesystems.  Can the
dentry->d_lock be changed to a read/write lock?

Thanks,
Robin

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-22 17:50   ` Robin Holt
@ 2006-02-22 20:05     ` Andrew Morton
  2006-02-22 21:50       ` Jeff V. Merkey
  2006-02-23 12:56       ` Robin Holt
  2006-02-22 22:52     ` John McCutchan
  1 sibling, 2 replies; 27+ messages in thread
From: Andrew Morton @ 2006-02-22 20:05 UTC (permalink / raw)
  To: Robin Holt; +Cc: john, holt, linux-kernel, rml, arnd, hch

Robin Holt <holt@sgi.com> wrote:
>
> Let me reiterate, I know _VERY_ little about filesystems.  Can the
>  dentry->d_lock be changed to a read/write lock?

Well, it could, but I suspect that won't help - the hold times in there
will be very short so the problem is more likely acquisition frequency.

However it's a bit strange that this function is the bottleneck.  If their
workload is doing large numbers of reads or writes from large numbers of
processes against the same file then they should be hitting heavy
contention on other locks, such as i_sem and/or tree_lock and/or lru_lock
and others.

Can you tell us more about the kernel-visible behaviour of this app?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-22 20:05     ` Andrew Morton
@ 2006-02-22 21:50       ` Jeff V. Merkey
  2006-02-23 12:56       ` Robin Holt
  1 sibling, 0 replies; 27+ messages in thread
From: Jeff V. Merkey @ 2006-02-22 21:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Robin Holt, john, linux-kernel, rml, arnd, hch

Andrew Morton wrote:

>Robin Holt <holt@sgi.com> wrote:
>  
>
>>Let me reiterate, I know _VERY_ little about filesystems.  Can the
>> dentry->d_lock be changed to a read/write lock?
>>    
>>
>
>Well, it could, but I suspect that won't help - the hold times in there
>will be very short so the problem is more likely acquisition frequency.
>
>However it's a bit strange that this function is the bottleneck.  If their
>workload is doing large numbers of reads or writes from large numbers of
>processes against the same file then they should be hitting heavy
>contention on other locks, such as i_sem and/or tree_lock and/or lru_lock
>and others.
>
>Can you tell us more about the kernel-visible behaviour of this app?
>  
>

I have also seen this problem, and it's hard to reproduce.  What you will
see is udev getting spawned multiple times as reported by top.  I have
found it's related to intermittent failures of the hard drive and the
hotplugger for some reason invoking udev multiple times in response to
this.  I saw it on a Compaq laptop right before my hard drive croaked,
and it seems BIOS-specific as well, since I have never seen it again or
been able to reproduce it reliably.

Jeff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-22 17:50   ` Robin Holt
  2006-02-22 20:05     ` Andrew Morton
@ 2006-02-22 22:52     ` John McCutchan
  2006-02-22 23:12       ` Andrew Morton
  1 sibling, 1 reply; 27+ messages in thread
From: John McCutchan @ 2006-02-22 22:52 UTC (permalink / raw)
  To: Robin Holt; +Cc: linux-kernel, rml, arnd, hch, akpm

On Wed, 2006-22-02 at 11:50 -0600, Robin Holt wrote:
> On Wed, Feb 22, 2006 at 11:48:23AM -0500, John McCutchan wrote:
> > On Wed, 2006-22-02 at 07:42 -0600, Robin Holt wrote:
> > > 
> > > I know _VERY_ little about filesystems.  udevd appears to be looking
> > > at /etc/udev/rules.d.  This bumps inotify_watches to 1.  The file
> > > being written is on an xfs filesystem mounted at a different mountpoint.
> > > Could the inotify_watches count be moved from a global to a per-sb (or
> > > something finer) count and thereby avoid taking the dentry->d_lock when
> > > there is no possibility of a watch event being queued?
> > 
> > We could do this, and avoid the problem, but only in this specific
> > scenario. The file being written is on a different mountpoint, but what's
> > to stop a different app from running inotify on that mount point?
> > Perhaps the program could be altered instead? 
> 
> Looking at fsnotify_access() I think we could hit the same scenario.
> Are you proposing we alter any application where multiple threads read
> a single data file to first make a hard link to that data file and each
> read from their private copy?  I don't think that is a very reasonable
> suggestion.

Listen, what I'm saying is that your suggested change will only help in
one specific scenario, and simply having inotify used on the 'wrong'
mountpoint will get you back to square one. So, your suggestion isn't
really a solution, but a way of avoiding the real problem. What I *am*
suggesting is that a real fix be found, instead of your hack.

-- 
John McCutchan <john@johnmccutchan.com>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-22 22:52     ` John McCutchan
@ 2006-02-22 23:12       ` Andrew Morton
  2006-02-22 23:41         ` John McCutchan
  2006-02-23 20:38         ` Benjamin LaHaise
  0 siblings, 2 replies; 27+ messages in thread
From: Andrew Morton @ 2006-02-22 23:12 UTC (permalink / raw)
  To: john; +Cc: holt, linux-kernel, rml, arnd, hch

John McCutchan <john@johnmccutchan.com> wrote:
>
> On Wed, 2006-22-02 at 11:50 -0600, Robin Holt wrote:
> > On Wed, Feb 22, 2006 at 11:48:23AM -0500, John McCutchan wrote:
> > > On Wed, 2006-22-02 at 07:42 -0600, Robin Holt wrote:
> > > > 
> > > > I know _VERY_ little about filesystems.  udevd appears to be looking
> > > > at /etc/udev/rules.d.  This bumps inotify_watches to 1.  The file
> > > > being written is on an xfs filesystem mounted at a different mountpoint.
> > > > Could the inotify_watches count be moved from a global to a per-sb (or
> > > > something finer) count and thereby avoid taking the dentry->d_lock when
> > > > there is no possibility of a watch event being queued?
> > > 
> > > We could do this, and avoid the problem, but only in this specific
> > > scenario. The file being written is on a different mountpoint, but what's
> > > to stop a different app from running inotify on that mount point?
> > > Perhaps the program could be altered instead? 
> > 
> > Looking at fsnotify_access() I think we could hit the same scenario.
> > Are you proposing we alter any application where multiple threads read
> > a single data file to first make a hard link to that data file and each
> > read from their private copy?  I don't think that is a very reasonable
> > suggestion.
> 
> Listen, what I'm saying is that your suggested change will only help in
> one specific scenario, and simply having inotify used on the 'wrong'
> mountpoint will get you back to square one. So, your suggestion isn't
> really a solution, but a way of avoiding the real problem. What I *am*
> suggesting is that a real fix be found,

I have a bad feeling about this one.  It'd be nice to have an exact
understanding of the problem source, but if it's just lots of traffic on
->d_lock we're kinda stuck.  I don't expect we'll run off and RCUify
d_parent or turn d_lock into a seq_lock or anything like that.

Then again, maybe making d_lock an rwlock _will_ help - if this workload is
also hitting tree_lock (Robin?) and we're not seeing suckiness due to that
then perhaps the rwlock is magically helping.


> instead of your hack.

It's not a terribly bad hack - it's just poor-man's hashing, and it's
reasonably well-suited to the sorts of machines and workloads which we
expect will hit this problem.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-22 23:12       ` Andrew Morton
@ 2006-02-22 23:41         ` John McCutchan
  2006-02-24  0:14           ` Andrew Morton
  2006-02-24  0:14           ` Andrew Morton
  2006-02-23 20:38         ` Benjamin LaHaise
  1 sibling, 2 replies; 27+ messages in thread
From: John McCutchan @ 2006-02-22 23:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: holt, linux-kernel, rml, arnd, hch

On Wed, 2006-22-02 at 15:12 -0800, Andrew Morton wrote:
> John McCutchan <john@johnmccutchan.com> wrote:
> >
> > On Wed, 2006-22-02 at 11:50 -0600, Robin Holt wrote:
> > > On Wed, Feb 22, 2006 at 11:48:23AM -0500, John McCutchan wrote:
> > > > On Wed, 2006-22-02 at 07:42 -0600, Robin Holt wrote:
> > > > > 
> > > > > I know _VERY_ little about filesystems.  udevd appears to be looking
> > > > > at /etc/udev/rules.d.  This bumps inotify_watches to 1.  The file
> > > > > being written is on an xfs filesystem mounted at a different mountpoint.
> > > > > Could the inotify_watches count be moved from a global to a per-sb (or
> > > > > something finer) count and thereby avoid taking the dentry->d_lock when
> > > > > there is no possibility of a watch event being queued?
> > > > 
> > > > We could do this, and avoid the problem, but only in this specific
> > > > scenario. The file being written is on a different mountpoint, but what's
> > > > to stop a different app from running inotify on that mount point?
> > > > Perhaps the program could be altered instead? 
> > > 
> > > Looking at fsnotify_access() I think we could hit the same scenario.
> > > Are you proposing we alter any application where multiple threads read
> > > a single data file to first make a hard link to that data file and each
> > > read from their private copy?  I don't think that is a very reasonable
> > > suggestion.
> > 
> > Listen, what I'm saying is that your suggested change will only help in
> > one specific scenario, and simply having inotify used on the 'wrong'
> > mountpoint will get you back to square one. So, your suggestion isn't
> > really a solution, but a way of avoiding the real problem. What I *am*
> > suggesting is that a real fix be found,
> 
> I have a bad feeling about this one.  It'd be nice to have an exact
> understanding of the problem source, but if it's just lots of traffic on
> ->d_lock we're kinda stuck.  I don't expect we'll run off and RCUify
> d_parent or turn d_lock into a seq_lock or anything like that.
> 
> Then again, maybe making d_lock an rwlock _will_ help - if this workload is
> also hitting tree_lock (Robin?) and we're not seeing suckiness due to that
> then perhaps the rwlock is magically helping.
> 
> 
> > instead of your hack.
> 
> It's not a terribly bad hack - it's just poor-man's hashing, and it's
> reasonably well-suited to the sorts of machines and workloads which we
> expect will hit this problem.
> 

If this is as good as it gets, here is a patch (totally untested).

Signed-off-by: John McCutchan <john@johnmccutchan.com>

Index: linux-2.6.16-rc4/fs/inotify.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/inotify.c	2006-02-17 17:23:45.000000000 -0500
+++ linux-2.6.16-rc4/fs/inotify.c	2006-02-22 18:36:29.000000000 -0500
@@ -38,7 +38,6 @@
 #include <asm/ioctls.h>
 
 static atomic_t inotify_cookie;
-static atomic_t inotify_watches;
 
 static kmem_cache_t *watch_cachep;
 static kmem_cache_t *event_cachep;
@@ -426,7 +425,7 @@
 	get_inotify_watch(watch);
 
 	atomic_inc(&dev->user->inotify_watches);
-	atomic_inc(&inotify_watches);
+	atomic_inc(&inode->i_sb->s_inotify_watches);
 
 	return watch;
 }
@@ -459,7 +458,7 @@
 	list_del(&watch->d_list);
 
 	atomic_dec(&dev->user->inotify_watches);
-	atomic_dec(&inotify_watches);
+	atomic_dec(&watch->inode->i_sb->s_inotify_watches);
 	idr_remove(&dev->idr, watch->wd);
 	put_inotify_watch(watch);
 }
@@ -538,7 +537,7 @@
 	struct dentry *parent;
 	struct inode *inode;
 
-	if (!atomic_read (&inotify_watches))
+	if (!atomic_read (&dentry->d_sb->s_inotify_watches))
 		return;
 
 	spin_lock(&dentry->d_lock);
@@ -1065,7 +1064,6 @@
 	inotify_max_user_watches = 8192;
 
 	atomic_set(&inotify_cookie, 0);
-	atomic_set(&inotify_watches, 0);
 
 	watch_cachep = kmem_cache_create("inotify_watch_cache",
 					 sizeof(struct inotify_watch),
Index: linux-2.6.16-rc4/fs/super.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/super.c	2006-02-17 17:23:45.000000000 -0500
+++ linux-2.6.16-rc4/fs/super.c	2006-02-22 18:34:27.000000000 -0500
@@ -86,6 +86,9 @@
 		s->s_qcop = sb_quotactl_ops;
 		s->s_op = &default_op;
 		s->s_time_gran = 1000000000;
+#ifdef CONFIG_INOTIFY
+		atomic_set(&s->s_inotify_watches, 0);
+#endif
 	}
 out:
 	return s;
Index: linux-2.6.16-rc4/include/linux/fs.h
===================================================================
--- linux-2.6.16-rc4.orig/include/linux/fs.h	2006-02-22 18:30:14.000000000 -0500
+++ linux-2.6.16-rc4/include/linux/fs.h	2006-02-22 18:32:37.000000000 -0500
@@ -843,7 +843,7 @@
 	void 			*s_fs_info;	/* Filesystem private info */
 
+#ifdef CONFIG_INOTIFY
+	atomic_t		s_inotify_watches; /* Number of inotify watches */
+#endif
 
 	/*

-- 
John McCutchan <john@johnmccutchan.com>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-22 20:05     ` Andrew Morton
  2006-02-22 21:50       ` Jeff V. Merkey
@ 2006-02-23 12:56       ` Robin Holt
  2006-02-23 13:42         ` David Chinner
  1 sibling, 1 reply; 27+ messages in thread
From: Robin Holt @ 2006-02-23 12:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Robin Holt, john, linux-kernel, rml, arnd, hch

On Wed, Feb 22, 2006 at 12:05:47PM -0800, Andrew Morton wrote:
> Robin Holt <holt@sgi.com> wrote:
> >
> > Let me reiterate, I know _VERY_ little about filesystems.  Can the
> >  dentry->d_lock be changed to a read/write lock?
> 
> Well, it could, but I suspect that won't help - the hold times in there
> will be very short so the problem is more likely acquisition frequency.
> 
> However it's a bit strange that this function is the bottleneck.  If their
> workload is doing large numbers of reads or writes from large numbers of
> processes against the same file then they should be hitting heavy
> contention on other locks, such as i_sem and/or tree_lock and/or lru_lock
> and others.
> 
> Can you tell us more about the kernel-visible behaviour of this app?

I looked at a little more of the output.  All I have to go on is a few
back traces generated by kdb of the entire system a few seconds apart.

In all of the traces, the first chunk of cpus are the only ones doing
writes.  I have not counted exactly, but I think it is around 32.  There
may be more or less, but that is the feeling (sometimes they are doing
reads as well).

The remainder of the cpus seem to be doing seeks to various parts of
the file and doing reads.  The seeks+read seem to overlap as I see
a nearly uniform read of 64k, but the seeks seem to be 4k aligned.
Also, at the time the snapshots were taken, the reads are all fairly
early in the file (between about 15% and 35% of the file), while
the writes are happening toward the early part of the second half
(50% - 70%).

I no longer have access to the test machine.  Since we told them
to turn off udevd, they no longer see a problem.  I will try to
convince them to allow some more testing.  I will also try to
recreate on our machine as well.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-23 12:56       ` Robin Holt
@ 2006-02-23 13:42         ` David Chinner
  0 siblings, 0 replies; 27+ messages in thread
From: David Chinner @ 2006-02-23 13:42 UTC (permalink / raw)
  To: Robin Holt; +Cc: Andrew Morton, john, linux-kernel, rml, arnd, hch

On Thu, Feb 23, 2006 at 06:56:39AM -0600, Robin Holt wrote:
> On Wed, Feb 22, 2006 at 12:05:47PM -0800, Andrew Morton wrote:
> > Robin Holt <holt@sgi.com> wrote:
> > >
> > > Let me reiterate, I know _VERY_ little about filesystems.  Can the
> > >  dentry->d_lock be changed to a read/write lock?
> > 
> > Well, it could, but I suspect that won't help - the hold times in there
> > will be very short so the problem is more likely acquisition frequency.
> > 
> > However it's a bit strange that this function is the bottleneck.  If their
> > workload is doing large numbers of reads or writes from large numbers of
> > processes against the same file then they should be hitting heavy
> > contention on other locks, such as i_sem and/or tree_lock and/or lru_lock
> > and others.
> > 
> > Can you tell us more about the kernel-visible behaviour of this app?
> 
> I looked at a little more of the output.  All I have to go on is a few
> back traces generated by kdb of the entire system a few seconds apart.
> 
> In all of the traces, the first chunk of cpus are the only ones doing
> writes.  I have not counted exactly, but I think it is around 32.  There
> may be more or less, but that is the feeling (sometimes they are doing
> reads as well).

Robin, is the app doing direct or buffered I/O?

Cheers,

Dave.
-- 
Dave Chinner
R&D Software Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-22 23:12       ` Andrew Morton
  2006-02-22 23:41         ` John McCutchan
@ 2006-02-23 20:38         ` Benjamin LaHaise
  1 sibling, 0 replies; 27+ messages in thread
From: Benjamin LaHaise @ 2006-02-23 20:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: john, holt, linux-kernel, rml, arnd, hch

On Wed, Feb 22, 2006 at 03:12:23PM -0800, Andrew Morton wrote:
> It's not a terribly bad hack - it's just poor-man's hashing, and it's
> reasonably well-suited to the sorts of machines and workloads which we
> expect will hit this problem.

The dnotify/inotify wakeups are a problem, namely because the implementation 
is braindead: it makes the wrong part of the interface fast (setting up 
notify entries) at the expense of making the rest of the kernel slow (adding 
locks to read()/write()).  read() and write() are incredibly hot paths in 
the kernel and should be optimized at the expense of dnotify and inotify, 
which are uncommon operations.

		-ben
-- 
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here 
and they've asked us to stop the party."  Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-22 23:41         ` John McCutchan
@ 2006-02-24  0:14           ` Andrew Morton
  2006-02-24  0:14           ` Andrew Morton
  1 sibling, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2006-02-24  0:14 UTC (permalink / raw)
  To: john; +Cc: holt, linux-kernel, rml, arnd, hch

John McCutchan <john@johnmccutchan.com> wrote:
>
> ...
> > 
> > I have a bad feeling about this one.  It'd be nice to have an exact
> > understanding of the problem source, but if it's just lots of traffic on
> > ->d_lock we're kinda stuck.  I don't expect we'll run off and RCUify
> > d_parent or turn d_lock into a seq_lock or anything like that.
> > 
> > Then again, maybe making d_lock an rwlock _will_ help - if this workload is
> > also hitting tree_lock (Robin?) and we're not seeing suckiness due to that
> > then perhaps the rwlock is magically helping.
> > 
> > 
> > > instead of your hack.
> > 
> > It's not a terribly bad hack - it's just poor-man's hashing, and it's
> > reasonably well-suited to the sorts of machines and workloads which we
> > expect will hit this problem.
> > 
> 
> If this is as good as it gets, here is a patch (totally untested).
> 
> ...
> @@ -538,7 +537,7 @@
>  	struct dentry *parent;
>  	struct inode *inode;
>  
> -	if (!atomic_read (&inotify_watches))
> +	if (!atomic_read (&dentry->d_sb->s_inotify_watches))
>  		return;
>  

What happens here if we're watching a mountpoint - the parent is on a
different fs?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-22 23:41         ` John McCutchan
  2006-02-24  0:14           ` Andrew Morton
@ 2006-02-24  0:14           ` Andrew Morton
  2006-02-24  5:47             ` John McCutchan
  1 sibling, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2006-02-24  0:14 UTC (permalink / raw)
  To: john; +Cc: holt, linux-kernel, rml, arnd, hch

John McCutchan <john@johnmccutchan.com> wrote:
>
> ...
> > 
> > I have a bad feeling about this one.  It'd be nice to have an exact
> > understanding of the problem source, but if it's just lots of traffic on
> > ->d_lock we're kinda stuck.  I don't expect we'll run off and RCUify
> > d_parent or turn d_lock into a seq_lock or anything like that.
> > 
> > Then again, maybe making d_lock an rwlock _will_ help - if this workload is
> > also hitting tree_lock (Robin?) and we're not seeing suckiness due to that
> > then perhaps the rwlock is magically helping.
> > 
> > 
> > > instead of your hack.
> > 
> > It's not a terribly bad hack - it's just poor-man's hashing, and it's
> > reasonably well-suited to the sorts of machines and workloads which we
> > expect will hit this problem.
> > 
> 
> If this is as good as it gets, here is a patch (totally untested).
> 
> ...
> @@ -538,7 +537,7 @@
>  	struct dentry *parent;
>  	struct inode *inode;
>  
> -	if (!atomic_read (&inotify_watches))
> +	if (!atomic_read (&dentry->d_sb->s_inotify_watches))
>  		return;
>  

What happens here if we're watching a mountpoint - the parent is on a
different fs?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-24  0:14           ` Andrew Morton
@ 2006-02-24  5:47             ` John McCutchan
  2006-02-24  6:00               ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: John McCutchan @ 2006-02-24  5:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: john, holt, linux-kernel, rml, arnd, hch

On Thu, Feb 23, 2006 at 04:14:25PM -0800, Andrew Morton wrote:
> John McCutchan <john@johnmccutchan.com> wrote:
> >
> > ...
> > > 
> > > I have a bad feeling about this one.  It'd be nice to have an exact
> > > understanding of the problem source, but if it's just lots of traffic on
> > > ->d_lock we're kinda stuck.  I don't expect we'll run off and RCUify
> > > d_parent or turn d_lock into a seq_lock or anything like that.
> > > 
> > > Then again, maybe making d_lock an rwlock _will_ help - if this workload is
> > > also hitting tree_lock (Robin?) and we're not seeing suckiness due to that
> > > then perhaps the rwlock is magically helping.
> > > 
> > > 
> > > > instead of your hack.
> > > 
> > > It's not a terribly bad hack - it's just poor-man's hashing, and it's
> > > reasonably well-suited to the sorts of machines and workloads which we
> > > expect will hit this problem.
> > > 
> > 
> > If this is as good as it gets, here is a patch (totally untested).
> > 
> > ...
> > @@ -538,7 +537,7 @@
> >  	struct dentry *parent;
> >  	struct inode *inode;
> >  
> > -	if (!atomic_read (&inotify_watches))
> > +	if (!atomic_read (&dentry->d_sb->s_inotify_watches))
> >  		return;
> >  
> 
> What happens here if we're watching a mountpoint - the parent is on a
> different fs?

There are four cases to consider here.

Case 1: parent fs watched and child fs watched
	correct results
Case 2: parent fs watched and child fs not watched
	We may not deliver an event that should be delivered.
Case 3: parent fs not watched and child fs watched
	We take d_lock when we don't need to
Case 4: parent fs not watched and child fs not watched
	correct results

Case 2 screws us. We have to take the lock to even look at the parent's
dentry->d_sb->s_inotify_watches. I don't know of a way around this one.

John

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-24  5:47             ` John McCutchan
@ 2006-02-24  6:00               ` Andrew Morton
  2006-02-24  7:07                 ` Nick Piggin
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2006-02-24  6:00 UTC (permalink / raw)
  To: John McCutchan; +Cc: john, holt, linux-kernel, rml, arnd, hch, Dipankar Sarma

John McCutchan <john@johnmccutchan.com> wrote:
>
>  > > @@ -538,7 +537,7 @@
>  > >  	struct dentry *parent;
>  > >  	struct inode *inode;
>  > >  
>  > > -	if (!atomic_read (&inotify_watches))
>  > > +	if (!atomic_read (&dentry->d_sb->s_inotify_watches))
>  > >  		return;
>  > >  
>  > 
>  > What happens here if we're watching a mountpoint - the parent is on a
>  > different fs?
> 
>  There are four cases to consider here.
> 
>  Case 1: parent fs watched and child fs watched
>  	correct results
>  Case 2: parent fs watched and child fs not watched
>  	We may not deliver an event that should be delivered.
>  Case 3: parent fs not watched and child fs watched
>  	We take d_lock when we don't need to
>  Case 4: parent fs not watched and child fs not watched
>  	correct results
> 
>  Case 2 screws us. We have to take the lock to even look at the parent's
>  dentry->d_sb->s_inotify_watches. I don't know of a way around this one.

Yeah.  There are a lot of "screw"s in this thread.

I wonder if RCU can save us - if we do an rcu_read_lock() we at least know
that the dentries won't get deallocated.  Then we can take a look at
d_parent (which might not be the parent any more).  Once in a million years
we might send a false event or miss sending an event, depending on where
our dentry suddenly got moved to.  Not very nice, but at least it won't
oops.
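
Spelled out, that idea might look something like this - an untested
illustration only, where the lockless peek replaces the global counter
check, and whether the parent inode can really be inspected here without
pinning it is exactly one of those obscure races:

void inotify_dentry_parent_queue_event(struct dentry *dentry, u32 mask,
                                       u32 cookie, const char *name)
{
        struct dentry *parent;
        struct inode *inode;
        int watched;

        /*
         * Lockless peek: dentries are freed via RCU, so d_parent stays
         * valid to dereference here even if it goes stale across a
         * concurrent rename.
         */
        rcu_read_lock();
        parent = dentry->d_parent;
        inode = parent->d_inode;
        watched = inode && inotify_inode_watched(inode);
        rcu_read_unlock();
        if (!watched)
                return;

        /* Fall back to the existing locked path to pin the real parent. */
        spin_lock(&dentry->d_lock);
        parent = dentry->d_parent;
        inode = parent->d_inode;
        if (inotify_inode_watched(inode)) {
                dget(parent);
                spin_unlock(&dentry->d_lock);
                inotify_inode_queue_event(inode, mask, cookie, name);
                dput(parent);
        } else
                spin_unlock(&dentry->d_lock);
}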

(hopefully cc's Dipankar)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-24  6:00               ` Andrew Morton
@ 2006-02-24  7:07                 ` Nick Piggin
  2006-02-24  7:16                   ` Andrew Morton
                                     ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Nick Piggin @ 2006-02-24  7:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: John McCutchan, holt, linux-kernel, rml, arnd, hch, Dipankar Sarma

[-- Attachment #1: Type: text/plain, Size: 1813 bytes --]

Andrew Morton wrote:
> John McCutchan <john@johnmccutchan.com> wrote:
> 
>> > > @@ -538,7 +537,7 @@
>> > >  	struct dentry *parent;
>> > >  	struct inode *inode;
>> > >  
>> > > -	if (!atomic_read (&inotify_watches))
>> > > +	if (!atomic_read (&dentry->d_sb->s_inotify_watches))
>> > >  		return;
>> > >  
>> > 
>> > What happens here if we're watching a mountpoint - the parent is on a
>> > different fs?
>>
>> There are four cases to consider here.
>>
>> Case 1: parent fs watched and child fs watched
>> 	correct results
>> Case 2: parent fs watched and child fs not watched
>> 	We may not deliver an event that should be delivered.
>> Case 3: parent fs not watched and child fs watched
>> 	We take d_lock when we don't need to
>> Case 4: parent fs not watched and child fs not watched
>> 	correct results
>>
>> Case 2 screws us. We have to take the lock to even look at the parent's
>> dentry->d_sb->s_inotify_watches. I don't know of a way around this one.
> 
> 
> Yeah.  There are a lot of "screw"s in this thread.
> 
> I wonder if RCU can save us - if we do an rcu_read_lock() we at least know
> that the dentries won't get deallocated.  Then we can take a look at
> d_parent (which might not be the parent any more).  Once in a million years
> we might send a false event or miss sending an event, depending on where
> our dentry suddenly got moved to.  Not very nice, but at least it won't
> oops.
> 
> (hopefully cc's Dipankar)

I saw this problem when testing my lockless pagecache a while back.

Attached is a first implementation of what was my idea then of how
to solve it... note it is pretty rough and I never got around to doing
much testing of it.

Basically: moves work out of inotify event time and to inotify attach
/detach time while staying out of the core VFS.

-- 
SUSE Labs, Novell Inc.

[-- Attachment #2: inotify-dentry-flag.patch --]
[-- Type: text/plain, Size: 5880 bytes --]

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -799,6 +799,7 @@ void d_instantiate(struct dentry *entry,
 	if (inode)
 		list_add(&entry->d_alias, &inode->i_dentry);
 	entry->d_inode = inode;
+	fsnotify_d_instantiate(entry, inode);
 	spin_unlock(&dcache_lock);
 	security_d_instantiate(entry, inode);
 }
@@ -850,6 +851,7 @@ struct dentry *d_instantiate_unique(stru
 	list_add(&entry->d_alias, &inode->i_dentry);
 do_negative:
 	entry->d_inode = inode;
+	fsnotify_d_instantiate(entry, inode);
 	spin_unlock(&dcache_lock);
 	security_d_instantiate(entry, inode);
 	return NULL;
@@ -980,6 +982,7 @@ struct dentry *d_splice_alias(struct ino
 		new = __d_find_alias(inode, 1);
 		if (new) {
 			BUG_ON(!(new->d_flags & DCACHE_DISCONNECTED));
+			fsnotify_d_instantiate(new, inode);
 			spin_unlock(&dcache_lock);
 			security_d_instantiate(new, inode);
 			d_rehash(dentry);
@@ -989,6 +992,7 @@ struct dentry *d_splice_alias(struct ino
 			/* d_instantiate takes dcache_lock, so we do it by hand */
 			list_add(&dentry->d_alias, &inode->i_dentry);
 			dentry->d_inode = inode;
+			fsnotify_d_instantiate(dentry, inode);
 			spin_unlock(&dcache_lock);
 			security_d_instantiate(dentry, inode);
 			d_rehash(dentry);
Index: linux-2.6/fs/inotify.c
===================================================================
--- linux-2.6.orig/fs/inotify.c
+++ linux-2.6/fs/inotify.c
@@ -381,6 +381,41 @@ static int find_inode(const char __user 
 }
 
 /*
+ * inotify_inode_watched - returns nonzero if there are watches on this inode
+ * and zero otherwise.  We call this lockless, we do not care if we race.
+ */
+static inline int inotify_inode_watched(struct inode *inode)
+{
+	return !list_empty(&inode->inotify_watches);
+}
+
+static void set_dentry_child_flags(struct inode *inode, int new_watch)
+{
+	struct dentry *alias;
+
+	if (inotify_inode_watched(inode))
+		return;
+
+	spin_lock(&dcache_lock);
+	list_for_each_entry(alias, &inode->i_dentry, d_alias) {
+		struct dentry *child;
+
+		list_for_each_entry(child, &alias->d_subdirs, d_u.d_child) {
+			spin_lock(&child->d_lock);
+			if (new_watch) {
+				BUG_ON(child->d_flags & DCACHE_INOTIFY_PARENT_WATCHED);
+				child->d_flags |= DCACHE_INOTIFY_PARENT_WATCHED;
+			} else {
+				BUG_ON(!(child->d_flags & DCACHE_INOTIFY_PARENT_WATCHED));
+				child->d_flags &= ~DCACHE_INOTIFY_PARENT_WATCHED;
+			}
+			spin_unlock(&child->d_lock);
+		}
+	}
+	spin_unlock(&dcache_lock);
+}
+
+/*
  * create_watch - creates a watch on the given device.
  *
  * Callers must hold dev->sem.  Calls inotify_dev_get_wd() so may sleep.
@@ -406,6 +441,8 @@ static struct inotify_watch *create_watc
 		return ERR_PTR(ret);
 	}
 
+	set_dentry_child_flags(inode, 1);
+
 	dev->last_wd = watch->wd;
 	watch->mask = mask;
 	atomic_set(&watch->count, 0);
@@ -462,6 +499,8 @@ static void remove_watch_no_event(struct
 	atomic_dec(&inotify_watches);
 	idr_remove(&dev->idr, watch->wd);
 	put_inotify_watch(watch);
+
+	set_dentry_child_flags(watch->inode, 0);
 }
 
 /*
@@ -481,16 +520,18 @@ static void remove_watch(struct inotify_
 	remove_watch_no_event(watch, dev);
 }
 
-/*
- * inotify_inode_watched - returns nonzero if there are watches on this inode
- * and zero otherwise.  We call this lockless, we do not care if we race.
- */
-static inline int inotify_inode_watched(struct inode *inode)
+/* Kernel API */
+
+void inotify_d_instantiate(struct dentry *entry, struct inode *inode)
 {
-	return !list_empty(&inode->inotify_watches);
-}
+	struct dentry *parent;
 
-/* Kernel API */
+	spin_lock(&entry->d_lock);
+	parent = entry->d_parent;
+	if (inotify_inode_watched(parent->d_inode))
+		entry->d_flags |= DCACHE_INOTIFY_PARENT_WATCHED;
+	spin_unlock(&entry->d_lock);
+}
 
 /**
  * inotify_inode_queue_event - queue an event to all watches on this inode
@@ -538,7 +579,7 @@ void inotify_dentry_parent_queue_event(s
 	struct dentry *parent;
 	struct inode *inode;
 
-	if (!atomic_read (&inotify_watches))
+	if (!(dentry->d_flags & DCACHE_INOTIFY_PARENT_WATCHED))
 		return;
 
 	spin_lock(&dentry->d_lock);
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -162,6 +162,8 @@ d_iput:		no		no		no       yes
 #define DCACHE_REFERENCED	0x0008  /* Recently used, don't discard. */
 #define DCACHE_UNHASHED		0x0010	
 
+#define DCACHE_INOTIFY_PARENT_WATCHED	0x0020 /* Parent inode is watched */
+
 extern spinlock_t dcache_lock;
 
 /**
Index: linux-2.6/include/linux/fsnotify.h
===================================================================
--- linux-2.6.orig/include/linux/fsnotify.h
+++ linux-2.6/include/linux/fsnotify.h
@@ -16,6 +16,12 @@
 #include <linux/dnotify.h>
 #include <linux/inotify.h>
 
+static inline void fsnotify_d_instantiate(struct dentry *entry,
+						struct inode *inode)
+{
+	inotify_d_instantiate(entry, inode);
+}
+
 /*
  * fsnotify_move - file old_name at old_dir was moved to new_name at new_dir
  */
Index: linux-2.6/include/linux/inotify.h
===================================================================
--- linux-2.6.orig/include/linux/inotify.h
+++ linux-2.6/include/linux/inotify.h
@@ -71,6 +71,7 @@ struct inotify_event {
 
 #ifdef CONFIG_INOTIFY
 
+extern void inotify_d_instantiate(struct dentry *, struct inode *);
 extern void inotify_inode_queue_event(struct inode *, __u32, __u32,
 				      const char *);
 extern void inotify_dentry_parent_queue_event(struct dentry *, __u32, __u32,
@@ -81,6 +82,10 @@ extern u32 inotify_get_cookie(void);
 
 #else
 
+static inline void inotify_d_instantiate(struct dentry *entry, struct inode *inode)
+{
+}
+
 static inline void inotify_inode_queue_event(struct inode *inode,
 					     __u32 mask, __u32 cookie,
 					     const char *filename)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-24  7:07                 ` Nick Piggin
@ 2006-02-24  7:16                   ` Andrew Morton
  2006-02-24  7:19                     ` Nick Piggin
  2006-02-24 18:56                   ` Robin Holt
  2006-02-26 16:55                   ` udevd is killing file write performance John McCutchan
  2 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2006-02-24  7:16 UTC (permalink / raw)
  To: Nick Piggin; +Cc: john, holt, linux-kernel, rml, arnd, hch, dipankar

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> @@ -538,7 +579,7 @@ void inotify_dentry_parent_queue_event(s
>   	struct dentry *parent;
>   	struct inode *inode;
>   
>  -	if (!atomic_read (&inotify_watches))
>  +	if (!(dentry->d_flags & DCACHE_INOTIFY_PARENT_WATCHED))

Yeah, I think that would work.  One would need to wire up d_move() also -
for when a file is moved from a watched to non-watched directory and
vice-versa.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-24  7:16                   ` Andrew Morton
@ 2006-02-24  7:19                     ` Nick Piggin
  2006-02-26 16:58                       ` John McCutchan
  0 siblings, 1 reply; 27+ messages in thread
From: Nick Piggin @ 2006-02-24  7:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: john, holt, linux-kernel, rml, arnd, hch, dipankar

Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
>>@@ -538,7 +579,7 @@ void inotify_dentry_parent_queue_event(s
>>  	struct dentry *parent;
>>  	struct inode *inode;
>>  
>> -	if (!atomic_read (&inotify_watches))
>> +	if (!(dentry->d_flags & DCACHE_INOTIFY_PARENT_WATCHED))
> 
> 
> Yeah, I think that would work.  One would need to wire up d_move() also -
> for when a file is moved from a watched to non-watched directory and
> vice-versa.
> 
> 

Oh yeah of course, good catch. Are there any other cases missing?

... I'll let the others on this thread digest before spitting out
another patch.

John or Robert, is there some kind of inotify test suite around?

Thanks,

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-24  7:07                 ` Nick Piggin
  2006-02-24  7:16                   ` Andrew Morton
@ 2006-02-24 18:56                   ` Robin Holt
  2006-02-25  2:44                     ` Nick Piggin
  2006-02-25 15:53                     ` [patch] inotify: lock avoidance with parent watch status in dentry Nick Piggin
  2006-02-26 16:55                   ` udevd is killing file write performance John McCutchan
  2 siblings, 2 replies; 27+ messages in thread
From: Robin Holt @ 2006-02-24 18:56 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, John McCutchan, holt, linux-kernel, rml, arnd,
	hch, Dipankar Sarma

On Fri, Feb 24, 2006 at 06:07:43PM +1100, Nick Piggin wrote:
> Attached is a first implementation of what was my idea then of how
> to solve it... note it is pretty rough and I never got around to doing
> much testing of it.
> 
> Basically: moves work out of inotify event time and to inotify attach
> /detach time while staying out of the core VFS.

The customer called and said they had tried with udevd running and this
patch applied.  They said the problem is gone.

Thanks,
Robin Holt

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-24 18:56                   ` Robin Holt
@ 2006-02-25  2:44                     ` Nick Piggin
  2006-02-25 15:53                     ` [patch] inotify: lock avoidance with parent watch status in dentry Nick Piggin
  1 sibling, 0 replies; 27+ messages in thread
From: Nick Piggin @ 2006-02-25  2:44 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrew Morton, John McCutchan, linux-kernel, rml, arnd, hch,
	Dipankar Sarma

Robin Holt wrote:
> On Fri, Feb 24, 2006 at 06:07:43PM +1100, Nick Piggin wrote:
> 
>>Attached is a first implementation of what was my idea then of how
>>to solve it... note it is pretty rough and I never got around to doing
>>much testing of it.
>>
>>Basically: moves work out of inotify event time and to inotify attach
>>/detach time while staying out of the core VFS.
> 
> 
> The customer called and said they had tried with udevd running and this
> patch applied.  They said the problem is gone.
> 

OK, well tell them not to run it in production of course...

I'll spend a bit of time to implement Andrew's suggestion, and
do some testing on it.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch] inotify: lock avoidance with parent watch status in dentry
  2006-02-24 18:56                   ` Robin Holt
  2006-02-25  2:44                     ` Nick Piggin
@ 2006-02-25 15:53                     ` Nick Piggin
  2006-02-28  0:48                       ` Andrew Morton
  1 sibling, 1 reply; 27+ messages in thread
From: Nick Piggin @ 2006-02-25 15:53 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrew Morton, John McCutchan, linux-kernel, rml, hch, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 495 bytes --]

This is about as far as my inotify/vfs knowledge takes me. Comments
would be appreciated, in particular whether locking and core vfs
parts look OK (eg. inotify_d_instantiate takes ->d_lock - might
that be possible to avoid? is d_move the best place for the move hook?)

The patch is tested with a basic inotify tester, moving files in and
out of a watched directory, creating, deleting, overwriting, etc.
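
For anyone who wants to poke at this, a minimal exerciser along those
lines might look like the program below (hypothetical - not the tester
used here). Point it at a directory, then create, write, move and delete
files in that directory from another shell and watch the events stream
out:

#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        /* Keep the buffer aligned for struct inotify_event; read()
         * returns a packed sequence of variable-length event records. */
        char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
        ssize_t len, i;
        int fd, wd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <directory>\n", argv[0]);
                return 1;
        }

        fd = inotify_init();
        if (fd < 0) {
                perror("inotify_init");
                return 1;
        }

        wd = inotify_add_watch(fd, argv[1], IN_ALL_EVENTS);
        if (wd < 0) {
                perror("inotify_add_watch");
                return 1;
        }

        while ((len = read(fd, buf, sizeof(buf))) > 0) {
                for (i = 0; i < len; ) {
                        struct inotify_event *ev =
                                (struct inotify_event *)&buf[i];

                        printf("wd=%d mask=0x%08x cookie=%u name=%s\n",
                               ev->wd, (unsigned)ev->mask,
                               (unsigned)ev->cookie,
                               ev->len ? ev->name : "");
                        i += sizeof(*ev) + ev->len;
                }
        }
        return 0;
}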

(sorry it is an attachment, I'm having technical difficulties)

-- 
SUSE Labs, Novell Inc.

[-- Attachment #2: inotify-dentry-flag.patch --]
[-- Type: text/plain, Size: 8874 bytes --]

Previous inotify work avoidance is good when inotify is completely
unused, but it breaks down if even a single watch is in place anywhere
in the system. Robin Holt noticed that udev is one such culprit - it
slows down a 512-thread application on a 512-CPU system from 6 seconds
to 22 minutes.

Solve this by adding a flag in the dentry that tells inotify whether
or not its parent inode has a watch on it. Event queueing to parent
will skip taking locks if this flag is cleared. Setting and clearing
of this flag on all child dentries versus event delivery: this is no
different from the way that inotify_watches modifications were implemented,
in terms of race cases, and that was shown to be equivalent to always
performing the check.

The essential behaviour is that activity occurring _after_ a watch has
been added and _before_ it has been removed will generate events.

Signed-off-by: Nick Piggin <npiggin@suse.de>


Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -799,6 +799,7 @@ void d_instantiate(struct dentry *entry,
 	if (inode)
 		list_add(&entry->d_alias, &inode->i_dentry);
 	entry->d_inode = inode;
+	fsnotify_d_instantiate(entry, inode);
 	spin_unlock(&dcache_lock);
 	security_d_instantiate(entry, inode);
 }
@@ -850,6 +851,7 @@ struct dentry *d_instantiate_unique(stru
 	list_add(&entry->d_alias, &inode->i_dentry);
 do_negative:
 	entry->d_inode = inode;
+	fsnotify_d_instantiate(entry, inode);
 	spin_unlock(&dcache_lock);
 	security_d_instantiate(entry, inode);
 	return NULL;
@@ -980,6 +982,7 @@ struct dentry *d_splice_alias(struct ino
 		new = __d_find_alias(inode, 1);
 		if (new) {
 			BUG_ON(!(new->d_flags & DCACHE_DISCONNECTED));
+			fsnotify_d_instantiate(new, inode);
 			spin_unlock(&dcache_lock);
 			security_d_instantiate(new, inode);
 			d_rehash(dentry);
@@ -989,6 +992,7 @@ struct dentry *d_splice_alias(struct ino
 			/* d_instantiate takes dcache_lock, so we do it by hand */
 			list_add(&dentry->d_alias, &inode->i_dentry);
 			dentry->d_inode = inode;
+			fsnotify_d_instantiate(dentry, inode);
 			spin_unlock(&dcache_lock);
 			security_d_instantiate(dentry, inode);
 			d_rehash(dentry);
@@ -1339,6 +1343,7 @@ already_unhashed:
 
 	list_add(&dentry->d_u.d_child, &dentry->d_parent->d_subdirs);
 	spin_unlock(&target->d_lock);
+	fsnotify_d_move(dentry);
 	spin_unlock(&dentry->d_lock);
 	write_sequnlock(&rename_lock);
 	spin_unlock(&dcache_lock);
Index: linux-2.6/fs/inotify.c
===================================================================
--- linux-2.6.orig/fs/inotify.c
+++ linux-2.6/fs/inotify.c
@@ -38,7 +38,6 @@
 #include <asm/ioctls.h>
 
 static atomic_t inotify_cookie;
-static atomic_t inotify_watches;
 
 static kmem_cache_t *watch_cachep;
 static kmem_cache_t *event_cachep;
@@ -381,6 +380,41 @@ static int find_inode(const char __user 
 }
 
 /*
+ * inotify_inode_watched - returns nonzero if there are watches on this inode
+ * and zero otherwise.  We call this lockless, we do not care if we race.
+ */
+static inline int inotify_inode_watched(struct inode *inode)
+{
+	return !list_empty(&inode->inotify_watches);
+}
+
+/*
+ * Get child dentry flag into synch with parent inode.
+ */
+static void set_dentry_child_flags(struct inode *inode, int watched)
+{
+	struct dentry *alias;
+
+	spin_lock(&dcache_lock);
+	list_for_each_entry(alias, &inode->i_dentry, d_alias) {
+		struct dentry *child;
+
+		list_for_each_entry(child, &alias->d_subdirs, d_u.d_child) {
+			spin_lock(&child->d_lock);
+			if (watched) {
+				WARN_ON(child->d_flags & DCACHE_INOTIFY_PARENT_WATCHED);
+				child->d_flags |= DCACHE_INOTIFY_PARENT_WATCHED;
+			} else {
+				WARN_ON(!(child->d_flags & DCACHE_INOTIFY_PARENT_WATCHED));
+				child->d_flags &= ~DCACHE_INOTIFY_PARENT_WATCHED;
+			}
+			spin_unlock(&child->d_lock);
+		}
+	}
+	spin_unlock(&dcache_lock);
+}
+
+/*
  * create_watch - creates a watch on the given device.
  *
  * Callers must hold dev->sem.  Calls inotify_dev_get_wd() so may sleep.
@@ -426,7 +460,6 @@ static struct inotify_watch *create_watc
 	get_inotify_watch(watch);
 
 	atomic_inc(&dev->user->inotify_watches);
-	atomic_inc(&inotify_watches);
 
 	return watch;
 }
@@ -458,8 +491,10 @@ static void remove_watch_no_event(struct
 	list_del(&watch->i_list);
 	list_del(&watch->d_list);
 
+	if (!inotify_inode_watched(watch->inode))
+		set_dentry_child_flags(watch->inode, 0);
+
 	atomic_dec(&dev->user->inotify_watches);
-	atomic_dec(&inotify_watches);
 	idr_remove(&dev->idr, watch->wd);
 	put_inotify_watch(watch);
 }
@@ -481,16 +516,39 @@ static void remove_watch(struct inotify_
 	remove_watch_no_event(watch, dev);
 }
 
+/* Kernel API */
+
 /*
- * inotify_inode_watched - returns nonzero if there are watches on this inode
- * and zero otherwise.  We call this lockless, we do not care if we race.
+ * inotify_d_instantiate - instantiate dcache entry for inode
  */
-static inline int inotify_inode_watched(struct inode *inode)
+void inotify_d_instantiate(struct dentry *entry, struct inode *inode)
 {
-	return !list_empty(&inode->inotify_watches);
+	struct dentry *parent;
+
+	if (!inode)
+		return;
+
+	WARN_ON(entry->d_flags & DCACHE_INOTIFY_PARENT_WATCHED);
+	spin_lock(&entry->d_lock);
+	parent = entry->d_parent;
+	if (inotify_inode_watched(parent->d_inode))
+		entry->d_flags |= DCACHE_INOTIFY_PARENT_WATCHED;
+	spin_unlock(&entry->d_lock);
 }
 
-/* Kernel API */
+/*
+ * inotify_d_move - dcache entry has been moved
+ */
+void inotify_d_move(struct dentry *entry)
+{
+	struct dentry *parent;
+
+	parent = entry->d_parent;
+	if (inotify_inode_watched(parent->d_inode))
+		entry->d_flags |= DCACHE_INOTIFY_PARENT_WATCHED;
+	else
+		entry->d_flags &= ~DCACHE_INOTIFY_PARENT_WATCHED;
+}
 
 /**
  * inotify_inode_queue_event - queue an event to all watches on this inode
@@ -538,7 +596,7 @@ void inotify_dentry_parent_queue_event(s
 	struct dentry *parent;
 	struct inode *inode;
 
-	if (!atomic_read (&inotify_watches))
+	if (!(dentry->d_flags & DCACHE_INOTIFY_PARENT_WATCHED))
 		return;
 
 	spin_lock(&dentry->d_lock);
@@ -993,6 +1051,9 @@ asmlinkage long sys_inotify_add_watch(in
 		goto out;
 	}
 
+	if (!inotify_inode_watched(inode))
+		set_dentry_child_flags(inode, 1);
+
 	/* Add the watch to the device's and the inode's list */
 	list_add(&watch->d_list, &dev->watches);
 	list_add(&watch->i_list, &inode->inotify_watches);
@@ -1065,7 +1126,6 @@ static int __init inotify_setup(void)
 	inotify_max_user_watches = 8192;
 
 	atomic_set(&inotify_cookie, 0);
-	atomic_set(&inotify_watches, 0);
 
 	watch_cachep = kmem_cache_create("inotify_watch_cache",
 					 sizeof(struct inotify_watch),
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -162,6 +162,8 @@ d_iput:		no		no		no       yes
 #define DCACHE_REFERENCED	0x0008  /* Recently used, don't discard. */
 #define DCACHE_UNHASHED		0x0010	
 
+#define DCACHE_INOTIFY_PARENT_WATCHED	0x0020 /* Parent inode is watched */
+
 extern spinlock_t dcache_lock;
 
 /**
Index: linux-2.6/include/linux/fsnotify.h
===================================================================
--- linux-2.6.orig/include/linux/fsnotify.h
+++ linux-2.6/include/linux/fsnotify.h
@@ -17,6 +17,25 @@
 #include <linux/inotify.h>
 
 /*
+ * fsnotify_d_instantiate - instantiate a dentry for inode
+ * Called with dcache_lock held.
+ */
+static inline void fsnotify_d_instantiate(struct dentry *entry,
+						struct inode *inode)
+{
+	inotify_d_instantiate(entry, inode);
+}
+
+/*
+ * fsnotify_d_move - entry has been moved
+ * Called with dcache_lock and entry->d_lock held.
+ */
+static inline void fsnotify_d_move(struct dentry *entry)
+{
+	inotify_d_move(entry);
+}
+
+/*
  * fsnotify_move - file old_name at old_dir was moved to new_name at new_dir
  */
 static inline void fsnotify_move(struct inode *old_dir, struct inode *new_dir,
Index: linux-2.6/include/linux/inotify.h
===================================================================
--- linux-2.6.orig/include/linux/inotify.h
+++ linux-2.6/include/linux/inotify.h
@@ -71,6 +71,8 @@ struct inotify_event {
 
 #ifdef CONFIG_INOTIFY
 
+extern void inotify_d_instantiate(struct dentry *, struct inode *);
+extern void inotify_d_move(struct dentry *);
 extern void inotify_inode_queue_event(struct inode *, __u32, __u32,
 				      const char *);
 extern void inotify_dentry_parent_queue_event(struct dentry *, __u32, __u32,
@@ -81,6 +83,14 @@ extern u32 inotify_get_cookie(void);
 
 #else
 
+static inline void inotify_d_instantiate(struct dentry *entry, struct inode *inode)
+{
+}
+
+static inline void inotify_d_move(struct dentry *entry)
+{
+}
+
 static inline void inotify_inode_queue_event(struct inode *inode,
 					     __u32 mask, __u32 cookie,
 					     const char *filename)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-24  7:07                 ` Nick Piggin
  2006-02-24  7:16                   ` Andrew Morton
  2006-02-24 18:56                   ` Robin Holt
@ 2006-02-26 16:55                   ` John McCutchan
  2006-02-27 10:11                     ` Nick Piggin
  2 siblings, 1 reply; 27+ messages in thread
From: John McCutchan @ 2006-02-26 16:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, holt, linux-kernel, rml, arnd, hch, Dipankar Sarma

On Fri, 2006-24-02 at 18:07 +1100, Nick Piggin wrote:
> Andrew Morton wrote:
> > John McCutchan <john@johnmccutchan.com> wrote:
> > 
> >> > > @@ -538,7 +537,7 @@
> >> > >  	struct dentry *parent;
> >> > >  	struct inode *inode;
> >> > >  
> >> > > -	if (!atomic_read (&inotify_watches))
> >> > > +	if (!atomic_read (&dentry->d_sb->s_inotify_watches))
> >> > >  		return;
> >> > >  
> >> > 
> >> > What happens here if we're watching a mountpoint - the parent is on a
> >> > different fs?
> >>
> >> There are four cases to consider here.
> >>
> >> Case 1: parent fs watched and child fs watched
> >> 	correct results
> >> Case 2: parent fs watched and child fs not watched
> >> 	We may not deliver an event that should be delivered.
> >> Case 3: parent fs not watched and child fs watched
> >> 	We take d_lock when we don't need to
> >> Case 4: parent fs not watched and child fs not watched
> >> 	correct results
> >>
> >> Case 2 screws us. We have to take the lock to even look at the parent's
> >> dentry->d_sb->s_inotify_watches. I don't know of a way around this one.
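
For reference, a minimal sketch of the per-superblock fast path under
discussion, assuming the proposed s_inotify_watches counter on struct
super_block (inotify_inode_watched() is the existing helper in
fs/inotify.c).  Case 2 above is exactly what the unlocked check cannot
see: the parent may sit on a different superblock across a mountpoint.

void inotify_dentry_parent_queue_event(struct dentry *dentry, u32 mask,
				       u32 cookie, const char *name)
{
	struct dentry *parent;
	struct inode *inode;

	/*
	 * Fast path: bail out if this dentry's own filesystem has no
	 * watches.  A watched parent on a *different* filesystem is
	 * missed here (Case 2).
	 */
	if (!atomic_read(&dentry->d_sb->s_inotify_watches))
		return;

	spin_lock(&dentry->d_lock);
	parent = dentry->d_parent;
	inode = parent->d_inode;

	if (inotify_inode_watched(inode)) {
		dget(parent);
		spin_unlock(&dentry->d_lock);
		inotify_inode_queue_event(inode, mask, cookie, name);
		dput(parent);
	} else
		spin_unlock(&dentry->d_lock);
}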
> > 
> > 
> > Yeah.  There are a lot of "screw"s in this thread.
> > 
> > I wonder if RCU can save us - if we do an rcu_read_lock() we at least know
> > that the dentries won't get deallocated.  Then we can take a look at
> > d_parent (which might not be the parent any more).  Once in a million years
> > we might send a false event or miss sending an event, depending on where
> > our dentry suddenly got moved to.  Not very nice, but at least it won't
> > oops.
> > 
> > (hopefully cc's Dipankar)
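
A rough, purely illustrative sketch of the rcu_read_lock() idea above.
The helper name is invented for this example, the stale-parent races
(a spurious or missed event) are accepted as described, and inode
lifetime is glossed over entirely, so treat it as a thought experiment
rather than a patch:

static void inotify_parent_queue_event_rcu(struct dentry *dentry, u32 mask,
					   u32 cookie, const char *name)
{
	struct dentry *parent;
	struct inode *inode;

	rcu_read_lock();
	/*
	 * The RCU read side keeps the dentry memory from being freed,
	 * but d_parent may change under us: we might peek at the wrong
	 * parent and send a false event or miss one.
	 */
	parent = dentry->d_parent;
	inode = parent->d_inode;
	if (inode && inotify_inode_watched(inode))
		inotify_inode_queue_event(inode, mask, cookie, name);
	rcu_read_unlock();
}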
> 
> I saw this problem when testing my lockless pagecache a while back.
> 
> Attached is a first implementation of what was my idea then of how
> to solve it... note it is pretty rough and I never got around to doing
> much testing of it.
> 
> Basically: moves work out of inotify event time and to inotify attach
> /detach time while staying out of the core VFS.


This looks really good. There might be some corner cases but it looks
like it will solve this problem nicely.

-- 
John McCutchan <john@johnmccutchan.com>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-24  7:19                     ` Nick Piggin
@ 2006-02-26 16:58                       ` John McCutchan
  0 siblings, 0 replies; 27+ messages in thread
From: John McCutchan @ 2006-02-26 16:58 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, holt, linux-kernel, rml, arnd, hch, dipankar

On Fri, 2006-24-02 at 18:19 +1100, Nick Piggin wrote:
> Andrew Morton wrote:
> > Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > 
> >>@@ -538,7 +579,7 @@ void inotify_dentry_parent_queue_event(s
> >>  	struct dentry *parent;
> >>  	struct inode *inode;
> >>  
> >> -	if (!atomic_read (&inotify_watches))
> >> +	if (!(dentry->d_flags & DCACHE_INOTIFY_PARENT_WATCHED))
> > 
> > 
> > Yeah, I think that would work.  One would need to wire up d_move() also -
> > for when a file is moved from a watched to non-watched directory and
> > vice-versa.
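
A hedged sketch of what that d_move() wiring could look like, modelled
on the fsnotify_d_move()/inotify_d_move() hooks in Nick's patch; it
would be called from d_move() with dcache_lock and entry->d_lock held
and simply recomputes the cached parent-watched bit against the new
parent:

void inotify_d_move(struct dentry *entry)
{
	struct dentry *parent = entry->d_parent;

	/* Refresh the "my parent is watched" hint after the move. */
	if (inotify_inode_watched(parent->d_inode))
		entry->d_flags |= DCACHE_INOTIFY_PARENT_WATCHED;
	else
		entry->d_flags &= ~DCACHE_INOTIFY_PARENT_WATCHED;
}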
> > 
> > 
> 
> Oh yeah of course, good catch. Are there any other cases missing?
> 
> ... I'll let the others on this thread digest before spitting out
> another patch.
> 
> John or Robert, is there some kind of inotify test suite around?

Unfortunately there isn't. In the past gamin's test suite was a good
start, but it hasn't been working for a while now. 
 
-- 
John McCutchan <john@johnmccutchan.com>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-26 16:55                   ` udevd is killing file write performance John McCutchan
@ 2006-02-27 10:11                     ` Nick Piggin
  2006-02-27 20:17                       ` John McCutchan
  0 siblings, 1 reply; 27+ messages in thread
From: Nick Piggin @ 2006-02-27 10:11 UTC (permalink / raw)
  To: john; +Cc: Andrew Morton, holt, linux-kernel, rml, arnd, hch, Dipankar Sarma

John McCutchan wrote:
> On Fri, 2006-24-02 at 18:07 +1100, Nick Piggin wrote:

>>I saw this problem when testing my lockless pagecache a while back.
>>
>>Attached is a first implementation of what was my idea then of how
>>to solve it... note it is pretty rough and I never got around to doing
>>much testing of it.
>>
>>Basically: moves work out of inotify event time and to inotify attach
>>/detach time while staying out of the core VFS.
> 
> 
> 
> This looks really good. There might be some corner cases but it looks
> like it will solve this problem nicely.
> 

Thanks. You should see that I sent a new version which fixes several
bugs and cleans up the code a bit.

There might be some areas of potential problems:
- creating and deleting watches on directories with many entries will
   take a long time. Is anyone likely to be creating and destroying
   these things at a very high frequency? Probably nobody cares except
   it might twist some real-time knickers.

- concurrent operations in the same watched directory will incur the
   same scalability penalty. I think this is basically a non-issue since
   the sheer number of events coming out will likely be a bigger problem.
   Doctor, it hurts when I do this.
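
For reference, a sketch of where that cost comes from; the helper name
and the d_u.d_child field follow the 2.6.16-era dcache, and this is
illustrative rather than the exact code in the patch.  Attaching or
detaching a watch has to push the parent-watched bit into every child
dentry under dcache_lock, so it scales with the number of cached
directory entries:

static void set_dentry_child_flags(struct inode *inode, int watched)
{
	struct dentry *alias;

	spin_lock(&dcache_lock);
	list_for_each_entry(alias, &inode->i_dentry, d_alias) {
		struct dentry *child;

		/* Every cached child of every alias gets its flag updated. */
		list_for_each_entry(child, &alias->d_subdirs, d_u.d_child) {
			if (!child->d_inode)
				continue;

			spin_lock(&child->d_lock);
			if (watched)
				child->d_flags |= DCACHE_INOTIFY_PARENT_WATCHED;
			else
				child->d_flags &= ~DCACHE_INOTIFY_PARENT_WATCHED;
			spin_unlock(&child->d_lock);
		}
	}
	spin_unlock(&dcache_lock);
}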

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: udevd is killing file write performance.
  2006-02-27 10:11                     ` Nick Piggin
@ 2006-02-27 20:17                       ` John McCutchan
  0 siblings, 0 replies; 27+ messages in thread
From: John McCutchan @ 2006-02-27 20:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, holt, linux-kernel, rml, arnd, hch, Dipankar Sarma

On Mon, 2006-27-02 at 21:11 +1100, Nick Piggin wrote:
> John McCutchan wrote:
> > On Fri, 2006-24-02 at 18:07 +1100, Nick Piggin wrote:
> 
> >>I saw this problem when testing my lockless pagecache a while back.
> >>
> >>Attached is a first implementation of what was my idea then of how
> >>to solve it... note it is pretty rough and I never got around to doing
> >>much testing of it.
> >>
> >>Basically: moves work out of inotify event time and to inotify attach
> >>/detach time while staying out of the core VFS.
> > 
> > 
> > 
> > This looks really good. There might be some corner cases but it looks
> > like it will solve this problem nicely.
> > 
> 
> Thanks. You should see I sent a new version which fixes several bugs
> and cleans up the code a bit.
> 

Yeah, it looks good. I haven't had time to test it myself, but nothing
jumps out at me as being wrong. I can only say that about the code that
touches inotify -- someone else will need to comment on the rest of the
VFS changes.

> There might be some areas of potential problems:
> - creating and deleting watches on directories with many entries will
>    take a long time. Is anyone likely to be creating and destroying
>    these things at a very high frequency? Probably nobody cares except
>    it might twist some real-time knickers.
> 

That's not a typical inotify usage pattern. Typically a watch is created
and left until the directory is deleted, or the application closes.
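
Purely for illustration, the usage pattern described above -- add one
watch up front and keep it for the life of the program -- looks roughly
like this from userspace, using today's <sys/inotify.h> libc wrappers
and with all error handling trimmed:

#include <stdio.h>
#include <unistd.h>
#include <sys/inotify.h>

int main(void)
{
	char buf[4096]
		__attribute__((aligned(__alignof__(struct inotify_event))));
	int fd = inotify_init();
	ssize_t len;

	/* One watch, created once and left in place until exit. */
	inotify_add_watch(fd, "/etc/udev/rules.d",
			  IN_CREATE | IN_DELETE | IN_MODIFY);

	while ((len = read(fd, buf, sizeof(buf))) > 0) {
		char *p = buf;

		while (p < buf + len) {
			struct inotify_event *ev = (struct inotify_event *)p;

			printf("mask=0x%x name=%s\n", ev->mask,
			       ev->len ? ev->name : "");
			p += sizeof(*ev) + ev->len;
		}
	}
	return 0;
}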

> - concurrent operations in the same watched directory will incur the
>    same scalability penalty. I think this is basically a non-issue since
>    the sheer number of events coming out will likely be a bigger problem.
>    Doctor, it hurts when I do this.
> 

Again, yeah, I don't think we need to worry.

-- 
John McCutchan <john@johnmccutchan.com>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch] inotify: lock avoidance with parent watch status in dentry
  2006-02-25 15:53                     ` [patch] inotify: lock avoidance with parent watch status in dentry Nick Piggin
@ 2006-02-28  0:48                       ` Andrew Morton
  0 siblings, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2006-02-28  0:48 UTC (permalink / raw)
  To: Nick Piggin; +Cc: holt, john, linux-kernel, rml, hch, linux-fsdevel, Al Viro

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>  Previous inotify work avoidance is good when inotify is completely
>  unused, but it breaks down if even a single watch is in place anywhere
>  in the system. Robin Holt notices that udev is one such culprit - it
>  slows down a 512-thread application on a 512 CPU system from 6 seconds
>  to 22 minutes.

A problem is that the audit tree (believe it or not) adds a pile of new
inotify functionality.  I don't know what those changes do, and they might
conflict with the changes you've made (apart from giving us two copies of
inotify_inode_watched()).  The audit changes were apparently only
socialised on the linux-audit mailing list, and my twice-sent patch to
make the audit tree compile has been ignored for a couple of weeks.

So I'm going to bitbucket the audit tree until a) it compiles and b) its
inotify changes have been explained and reviewed and c) we've reviewed
those changes against your optimisations.  I think fixes will be needed.


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2006-02-28  0:58 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-02-22 13:42 udevd is killing file write performance Robin Holt
2006-02-22 13:55 ` Andrew Morton
2006-02-22 16:48 ` John McCutchan
2006-02-22 17:50   ` Robin Holt
2006-02-22 20:05     ` Andrew Morton
2006-02-22 21:50       ` Jeff V. Merkey
2006-02-23 12:56       ` Robin Holt
2006-02-23 13:42         ` David Chinner
2006-02-22 22:52     ` John McCutchan
2006-02-22 23:12       ` Andrew Morton
2006-02-22 23:41         ` John McCutchan
2006-02-24  0:14           ` Andrew Morton
2006-02-24  0:14           ` Andrew Morton
2006-02-24  5:47             ` John McCutchan
2006-02-24  6:00               ` Andrew Morton
2006-02-24  7:07                 ` Nick Piggin
2006-02-24  7:16                   ` Andrew Morton
2006-02-24  7:19                     ` Nick Piggin
2006-02-26 16:58                       ` John McCutchan
2006-02-24 18:56                   ` Robin Holt
2006-02-25  2:44                     ` Nick Piggin
2006-02-25 15:53                     ` [patch] inotify: lock avoidance with parent watch status in dentry Nick Piggin
2006-02-28  0:48                       ` Andrew Morton
2006-02-26 16:55                   ` udevd is killing file write performance John McCutchan
2006-02-27 10:11                     ` Nick Piggin
2006-02-27 20:17                       ` John McCutchan
2006-02-23 20:38         ` Benjamin LaHaise
