All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: sysfs reclaim crash
       [not found]         ` <46042DDF.5060000@google.com>
@ 2007-03-24  3:05           ` Maneesh Soni
  2007-03-27 18:27             ` Ethan Solomita
  2007-04-03  7:38             ` [PATCH] fix sysfs_readdir oops (was Re: sysfs reclaim crash) Maneesh Soni
  0 siblings, 2 replies; 6+ messages in thread
From: Maneesh Soni @ 2007-03-24  3:05 UTC (permalink / raw)
  To: Ethan Solomita
  Cc: Dipankar Sarma, Andrew Morton, Greg KH, Martin Bligh, Rohit Seth,
	viro, LKML

On Fri, Mar 23, 2007 at 12:43:27PM -0700, Ethan Solomita wrote:
>     I ran stress testing overnight and came up with a similar failure
> (s_dentry == NULL) but in a different location. A NULL pointer
> dereference happened in sysfs_readdir():
> 
> |                                if (next->s_dentry)
> |                                        ino =
> next->s_dentry->d_inode->i_ino;
> 
>     It seems that d_inode was NULL. I don't have the pointer to d_inode
> to look at, but I have next and, lo and behold, its s_dentry is now
> NULL, which it clearly wasn't when the if-clause above ran.
> 
>     I tried to reconstruct the sysfs_dirents starting with "next". I
> filled in all the structure contents that I had data for:
> 
> sysfs_dirent 0xffff81000fc61690:
> s_count         1
> s_sibling       ffff81000fc616e8 / ffff81000e0c7468
> s_children      ffff81000fc616a8 / ffff81000fc616a8
> s_element       ffff81000f4ad1b0
>   DOR__ATA1RTS
>   ffffffff8800b600
>   124
> s_type          4
> s_mode          8124
> s_dentry        NULL
> s_iattr         NULL
> s_event         0
> 
> s_sibling.prev:
> s_count         1
> s_sibling       ffff81000fc61738 / ffff81000fc61698
> s_children      ffff81000fc616f8 / ffff81000fc616f8
> s_element       ffff81000f4ad148
> s_type          4
> s_mode          8124
> <unknown>
> 
> s_sibling.next:
> s_count         1
> s_sibling       ffff81000fc61698 / ffff81000fc61648
> s_children      ffff81000e0c7478 / ffff81000e0c7478
> s_element       NULL
> s_type          0
> s_mode          0
> s_dentry        NULL
> s_iattr         NULL
> s_event         0
> 
> s_sibling.next.next:
> s_count         1
> s_sibling       ffff81000e0c7468 / ffff81000fc615f8
> s_children      ffff81000fc61658 / ffff81000fc61658
> s_element       ffff81000f4ad218
>   DOR__ATA1RTS
>   ffffffff8800b600
>   124
> s_type          4
> s_mode          8124
> s_dentry        NULL
> s_iattr         NULL
> s_event         0
> 
> s_sibling.next.next.next:
> s_count         1
> s_sibling       ffff81000fc61648 / ffff81000fc610f8
> s_children      ffff81000fc61608 / ffff81000fc61608
> s_element       ffff81000f4ad280
>   CK??ATCR
>   ffffffff8800b600
>   124
> s_type          4
> s_mode          8124
> s_dentry        NULL
> s_iattr         NULL
> s_event         0
> 
>     I should acknowledge that this is based upon 2.6.18 with some newer
> code backported. If there are fixes since 2.6.18 that we should know
> about I can try backporting them into our kernel.
> 
>     Thanks,
>     -- Ethan


Hi Ethan,

Thank you very much for the crash data. This is helpful. Could you please
test the appended patch. Here we avoid modifying s_dentry in sysfs_d_iput()
and also uses iunique() in sysfs_readdir() instead of accessing s_dentry
to get the inode number. 

Though I was not able to recreate this race without the patch, but I am
running this patch successfully for more than 6 hrs now with the following
loops parallely on a 4-way SMP system. So, at least it has not done anything
bad.

1. while true; do insmod drivers/net/dummy.ko ; rmmod dummy; done
2. [root@llm01 net]# pwd
   /sys/class/net
   [root@llm01 net]# while true; do find | xargs cat > /dev/null; done
3. [root@llm01 sys]# pwd
   /sys
   [root@llm01 sys]# while true; do ls -liR; done

Also, CC-ed Viro and lkml for feedback.

Thanks
Maneesh


o sysfs_d_iput() is invoked in dentry reclaim path under memory pressure. This
  happens without i_mutex. It also nullifies s_dentry just to indicate that
  the associated dentry is evicted. sysfs_readdir() access the s_dentry,
  and gets the inode number from the associated dentry, if there is one, else
  it invokes iunique(). This can create a race situation, and crash while
  accessing the dentry/inode in sysfs_readdir().
  
o The following patch always use i_unique() to get the inode number. This
  is ok as sysfs doesnot have permanent inode numbering. It could be slower
  but avoids the above mentioned race. 

o This also avoids the now unnecessary s_dentry in sysfs_d_iput().

Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
---

 linux-2.6.21-rc4-git8-maneesh/fs/sysfs/dir.c |    6 +-----
 1 files changed, 1 insertion(+), 5 deletions(-)

diff -puN fs/sysfs/dir.c~fix-sysfs-reclaim-race fs/sysfs/dir.c
--- linux-2.6.21-rc4-git8/fs/sysfs/dir.c~fix-sysfs-reclaim-race	2007-03-24 02:06:38.000000000 +0530
+++ linux-2.6.21-rc4-git8-maneesh/fs/sysfs/dir.c	2007-03-24 02:09:26.000000000 +0530
@@ -20,7 +20,6 @@ static void sysfs_d_iput(struct dentry *
 
 	if (sd) {
 		BUG_ON(sd->s_dentry != dentry);
-		sd->s_dentry = NULL;
 		sysfs_put(sd);
 	}
 	iput(inode);
@@ -538,10 +537,7 @@ static int sysfs_readdir(struct file * f
 
 				name = sysfs_get_name(next);
 				len = strlen(name);
-				if (next->s_dentry)
-					ino = next->s_dentry->d_inode->i_ino;
-				else
-					ino = iunique(sysfs_sb, 2);
+				ino = iunique(sysfs_sb, 2);
 
 				if (filldir(dirent, name, len, filp->f_pos, ino,
 						 dt_type(next)) < 0)
_

--
Maneesh Soni
Linux Technology Center,
IBM India Systems and Technology Lab, 
Bangalore, India

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: sysfs reclaim crash
  2007-03-24  3:05           ` sysfs reclaim crash Maneesh Soni
@ 2007-03-27 18:27             ` Ethan Solomita
  2007-03-27 18:49               ` Maneesh Soni
  2007-04-03  7:38             ` [PATCH] fix sysfs_readdir oops (was Re: sysfs reclaim crash) Maneesh Soni
  1 sibling, 1 reply; 6+ messages in thread
From: Ethan Solomita @ 2007-03-27 18:27 UTC (permalink / raw)
  To: maneesh
  Cc: Dipankar Sarma, Andrew Morton, Greg KH, Martin Bligh, Rohit Seth,
	viro, LKML

	Hi Maneesh -- I will start testing with the patch you provided. If you 
come up with any further issues please let me know. Also, if you could 
suggest some additional BUG() lines that I could insert I would 
appreciate it. Since the bug is hard to reproduce, it may be easier to 
catch a race condition in the making via BUG() than an actual failure 
due to a race condition.

	Thanks!
	-- Ethan

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: sysfs reclaim crash
  2007-03-27 18:27             ` Ethan Solomita
@ 2007-03-27 18:49               ` Maneesh Soni
  0 siblings, 0 replies; 6+ messages in thread
From: Maneesh Soni @ 2007-03-27 18:49 UTC (permalink / raw)
  To: Ethan Solomita
  Cc: Dipankar Sarma, Andrew Morton, Greg KH, Martin Bligh, Rohit Seth,
	viro, LKML

On Tue, Mar 27, 2007 at 11:27:14AM -0700, Ethan Solomita wrote:
> 	Hi Maneesh -- I will start testing with the patch you provided. If 
> 	you come up with any further issues please let me know. Also, if you could 
> suggest some additional BUG() lines that I could insert I would 
> appreciate it. Since the bug is hard to reproduce, it may be easier to 
> catch a race condition in the making via BUG() than an actual failure 
> due to a race condition.
> 

Hi Ethan,

Thanks for testing. The BUG_ON in sysfs_d_iput() is still there to catch
the first race we saw. And the second one should not occur as now we are not
using the s_dentry in sysfs_readdir(). 

Regards,
Maneesh

--
Maneesh Soni
Linux Technology Center,
IBM India Systems and Technology Lab, 
Bangalore, India

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH] fix sysfs_readdir oops (was Re: sysfs reclaim crash)
  2007-03-24  3:05           ` sysfs reclaim crash Maneesh Soni
  2007-03-27 18:27             ` Ethan Solomita
@ 2007-04-03  7:38             ` Maneesh Soni
  2007-04-03 20:46               ` Ethan Solomita
  2007-04-07  8:31               ` Tejun Heo
  1 sibling, 2 replies; 6+ messages in thread
From: Maneesh Soni @ 2007-04-03  7:38 UTC (permalink / raw)
  To: Ethan Solomita
  Cc: Dipankar Sarma, Andrew Morton, Greg KH, Martin Bligh, Rohit Seth,
	viro, LKML

On Sat, Mar 24, 2007 at 08:35:35AM +0530, Maneesh Soni wrote:
> On Fri, Mar 23, 2007 at 12:43:27PM -0700, Ethan Solomita wrote:
> >     I ran stress testing overnight and came up with a similar failure
> > (s_dentry == NULL) but in a different location. A NULL pointer
> > dereference happened in sysfs_readdir():
> > 
> > |                                if (next->s_dentry)
> > |                                        ino =
> > next->s_dentry->d_inode->i_ino;
> > 
> >     It seems that d_inode was NULL. I don't have the pointer to d_inode
> > to look at, but I have next and, lo and behold, its s_dentry is now
> > NULL, which it clearly wasn't when the if-clause above ran.
> > 
> >     I tried to reconstruct the sysfs_dirents starting with "next". I
> > filled in all the structure contents that I had data for:
> > 
> > sysfs_dirent 0xffff81000fc61690:
> > s_count         1
> > s_sibling       ffff81000fc616e8 / ffff81000e0c7468
> > s_children      ffff81000fc616a8 / ffff81000fc616a8
> > s_element       ffff81000f4ad1b0
> >   DOR__ATA1RTS
> >   ffffffff8800b600
> >   124
> > s_type          4
> > s_mode          8124
> > s_dentry        NULL
> > s_iattr         NULL
> > s_event         0
> > 
> > s_sibling.prev:
> > s_count         1
> > s_sibling       ffff81000fc61738 / ffff81000fc61698
> > s_children      ffff81000fc616f8 / ffff81000fc616f8
> > s_element       ffff81000f4ad148
> > s_type          4
> > s_mode          8124
> > <unknown>
> > 
> > s_sibling.next:
> > s_count         1
> > s_sibling       ffff81000fc61698 / ffff81000fc61648
> > s_children      ffff81000e0c7478 / ffff81000e0c7478
> > s_element       NULL
> > s_type          0
> > s_mode          0
> > s_dentry        NULL
> > s_iattr         NULL
> > s_event         0
> > 
> > s_sibling.next.next:
> > s_count         1
> > s_sibling       ffff81000e0c7468 / ffff81000fc615f8
> > s_children      ffff81000fc61658 / ffff81000fc61658
> > s_element       ffff81000f4ad218
> >   DOR__ATA1RTS
> >   ffffffff8800b600
> >   124
> > s_type          4
> > s_mode          8124
> > s_dentry        NULL
> > s_iattr         NULL
> > s_event         0
> > 
> > s_sibling.next.next.next:
> > s_count         1
> > s_sibling       ffff81000fc61648 / ffff81000fc610f8
> > s_children      ffff81000fc61608 / ffff81000fc61608
> > s_element       ffff81000f4ad280
> >   CK??ATCR
> >   ffffffff8800b600
> >   124
> > s_type          4
> > s_mode          8124
> > s_dentry        NULL
> > s_iattr         NULL
> > s_event         0
> > 
> >     I should acknowledge that this is based upon 2.6.18 with some newer
> > code backported. If there are fixes since 2.6.18 that we should know
> > about I can try backporting them into our kernel.
> > 
> >     Thanks,
> >     -- Ethan
> 
> 

Hi Ethan / Andrew,

I have modified the previous patch (which was dropped from -mm) and now keeping
the statement making s_dentry as NULL in sysfs_d_iput(), so this should
_safely_ fix sysfs_readdir() oops. 

Please see the other mail for sysfs_d_iput() BUG hit.

Thanks
Maneesh

--
Maneesh Soni
Linux Technology Center,
IBM India Systems and Technology Lab, 
Bangalore, India




o sysfs_d_iput() is invoked in dentry reclaim path under memory pressure. This
  happens without i_mutex. It also nullifies s_dentry to indicate that
  the associated dentry is evicted. sysfs_readdir() accesses the s_dentry,
  and gets the inode number from the associated dentry->d_inode, if 
  there is one, else it invokes iunique(). This can create a race situation,
  and crash while accessing the d_inode in sysfs_readdir(). 

o The race happens when the dentry is getting reclaimed and detached from
  the corresponding sysfs_dirent though sysfs_dirent is still a valid
  node. Accessing dentry fields are ok as it is under RCU but the inode is
  not hence we may see oops accessing dentry->d_inode->i_no.

o The following patch always use i_unique() to get the inode number in
  sysfs_readdir. This is ok as sysfs doesnot have permanent inode numbering.
  It could be slower but avoids the oops. 



Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
---

 linux-2.6.21-rc5-mm3-maneesh/fs/sysfs/dir.c |    5 +----
 1 files changed, 1 insertion(+), 4 deletions(-)

diff -puN fs/sysfs/dir.c~fix-sysfs_readdir-oops fs/sysfs/dir.c
--- linux-2.6.21-rc5-mm3/fs/sysfs/dir.c~fix-sysfs_readdir-oops	2007-04-03 10:39:52.000000000 +0530
+++ linux-2.6.21-rc5-mm3-maneesh/fs/sysfs/dir.c	2007-04-03 10:39:52.000000000 +0530
@@ -538,10 +538,7 @@ static int sysfs_readdir(struct file * f
 
 				name = sysfs_get_name(next);
 				len = strlen(name);
-				if (next->s_dentry)
-					ino = next->s_dentry->d_inode->i_ino;
-				else
-					ino = iunique(sysfs_sb, 2);
+				ino = iunique(sysfs_sb, 2);
 
 				if (filldir(dirent, name, len, filp->f_pos, ino,
 						 dt_type(next)) < 0)
_

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] fix sysfs_readdir oops (was Re: sysfs reclaim crash)
  2007-04-03  7:38             ` [PATCH] fix sysfs_readdir oops (was Re: sysfs reclaim crash) Maneesh Soni
@ 2007-04-03 20:46               ` Ethan Solomita
  2007-04-07  8:31               ` Tejun Heo
  1 sibling, 0 replies; 6+ messages in thread
From: Ethan Solomita @ 2007-04-03 20:46 UTC (permalink / raw)
  To: maneesh
  Cc: Dipankar Sarma, Andrew Morton, Greg KH, Martin Bligh, Rohit Seth,
	viro, LKML

Maneesh Soni wrote:
> I have modified the previous patch (which was dropped from -mm) and now keeping
> the statement making s_dentry as NULL in sysfs_d_iput(), so this should
> _safely_ fix sysfs_readdir() oops. 
>   

    If you could find some additional places in sysfs code to add new
BUG() checks I'd appreciate it. Especially if it turns out that you
can't reproduce it, I'd like to have as many asserts as is reasonable.
    -- Ethan


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] fix sysfs_readdir oops (was Re: sysfs reclaim crash)
  2007-04-03  7:38             ` [PATCH] fix sysfs_readdir oops (was Re: sysfs reclaim crash) Maneesh Soni
  2007-04-03 20:46               ` Ethan Solomita
@ 2007-04-07  8:31               ` Tejun Heo
  1 sibling, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2007-04-07  8:31 UTC (permalink / raw)
  To: maneesh
  Cc: Ethan Solomita, Dipankar Sarma, Andrew Morton, Greg KH,
	Martin Bligh, Rohit Seth, viro, LKML

Hello, Maneesh.

Maneesh Soni wrote:
> o sysfs_d_iput() is invoked in dentry reclaim path under memory pressure. This
>   happens without i_mutex. It also nullifies s_dentry to indicate that
>   the associated dentry is evicted. sysfs_readdir() accesses the s_dentry,
>   and gets the inode number from the associated dentry->d_inode, if 
>   there is one, else it invokes iunique(). This can create a race situation,
>   and crash while accessing the d_inode in sysfs_readdir(). 
> 
> o The race happens when the dentry is getting reclaimed and detached from
>   the corresponding sysfs_dirent though sysfs_dirent is still a valid
>   node. Accessing dentry fields are ok as it is under RCU but the inode is
>   not hence we may see oops accessing dentry->d_inode->i_no.
> 
> o The following patch always use i_unique() to get the inode number in
>   sysfs_readdir. This is ok as sysfs doesnot have permanent inode numbering.
>   It could be slower but avoids the oops. 

This isn't correct as i_unique() assumes the inode is in inode hash
table which isn't true for sysfs.  This can result in duplicate inode
numbers.  Please take a look at the following alternative fix.

  http://article.gmane.org/gmane.linux.kernel/513325

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2007-04-07  8:31 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20070319140238.cbf28b2b.akpm@linux-foundation.org>
     [not found] ` <20070321134654.5wkqpfbpk0ggwks8@imap.linux.ibm.com>
     [not found]   ` <20070321182123.GA12602@in.ibm.com>
     [not found]     ` <20070323043047.GA5641@in.ibm.com>
     [not found]       ` <20070322210504.3b052139.akpm@linux-foundation.org>
     [not found]         ` <46042DDF.5060000@google.com>
2007-03-24  3:05           ` sysfs reclaim crash Maneesh Soni
2007-03-27 18:27             ` Ethan Solomita
2007-03-27 18:49               ` Maneesh Soni
2007-04-03  7:38             ` [PATCH] fix sysfs_readdir oops (was Re: sysfs reclaim crash) Maneesh Soni
2007-04-03 20:46               ` Ethan Solomita
2007-04-07  8:31               ` Tejun Heo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.