linux-rt-users.vger.kernel.org archive mirror
* Re: Observed deadlock in ext4 under 3.2.23-rt37 & 3.2.33-rt50
       [not found] <7A2FC0CD30EF4745AE15F485252D38AC2F45A70C9A@clark>
@ 2013-01-03  3:09 ` Steven Rostedt
  2013-01-03  4:22   ` Theodore Ts'o
  0 siblings, 1 reply; 9+ messages in thread
From: Steven Rostedt @ 2013-01-03  3:09 UTC (permalink / raw)
  To: Staffan Tjernstrom; +Cc: linux-rt-users, tglx, C.Emde, jkacur

On Thu, 2012-12-27 at 09:34 -0600, Staffan Tjernstrom wrote:
> First of all apologies if this should have gone to the ext4
> maintainers instead, but I figured since I'd seen some -rt specific
> patches from you folks in the last few weeks I'd send this your way
> first.
> 
No, your issue is most likely caused by -rt.

I guess the question is: can you try 3.2.35-rt53, or even better the
latest -rt kernel, or even 3.4.24-rt36?


> Unfortunately all I have available to help trace it down at this time
> is the kernel (via /proc/pid/stack) stack of the tasks involved - it
> seems that it can get cleared by a signal hitting the blocking task(s)
> (eg attaching gdb seems to clear the problem). The stack dumps are
> transcribed rather than cut/pasted due to the security regimen around
> the systems the issue has been observed on - apologies for
> transcription errors.
> 

The fact that it clears up when you attach gdb is probably due to a
live lock. That is, something is preventing something else from running;
it's not necessarily a deadlock, where something is blocked on something
that's blocked on it.


> The traces below are from 3.2.23-rt37.56.el6rt (but we've also
> observed it under 3.2.33-rt50.66.el6rt).
> 

It would be best if you could give a full task dump:

Boot the kernel with something like log_buf_len=10M (a printk buffer of
10 megs), and then run 'echo t > /proc/sysrq-trigger' when the lockup
happens. If you can get a copy of the dmesg, that would be great. Of
course you may need to hire a few secretaries to transcribe the
output ;-)

Thanks,

-- Steve

> Trace 1:
> 
> [<ffffffffa01a085d>] jbd2_log_wait_commit+0xcd/0x150
> [<ffffffffa01b74a5>] ext4_sync_file+0x1e5/0x480
> [<ffffffff8117a42b>] vfs_fsync_range+0x2b/0x30
> [<ffffffff8117a44c>] vfs_fsync+0x1c/0x20
> [<ffffffff8117a68a>] do_fsync+0x3a/0x60
> [<ffffffff8117a6c3>] sys_fdatasync+0x13/0x20
> [<ffffffff814e7feb>] system_call_fastpath+0x16/0x1b
> 
> Trace 2:
> 
> [<ffffffffa0199dc5>] do_get_write_access+0x2a5/0x4d0
> [<ffffffffa019a021>] jbd2_journal_get_write_access+0x31/0x50
> [<ffffffffa01e93ce>] __ext4_journal_get_write_access+0x3e/0x80
> [<ffffffffa01be028>] ext4_reserve_inode_write+0x78/0xa0
> [<ffffffffa01be0a6>] ext4_mark_inode_dirty+0x56/0x270
> [<ffffffffa01be41d>] ext4_dirty_inode+0x3d/0x60
> [<ffffffff81173ee0>] __mark_inode_dirty+0x40/0x250
> [<ffffffff81165282>] file_update_time+0xd2/0x160
> [<ffffffff810f9968>] __generic_file_aio_write+0x208/0x460
> [<ffffffff810f9c36>] generic_file_aio_write+0x76/0xf0
> [<ffffffffa01b6f89>] ext4_file_write+0x69/0x280
> [<ffffffff8114a9ea>] do_sync_write+0xea/0x130
> [<ffffffff8114af78>] vfs_write+0xc8/0x190
> [<ffffffff8114b141>] sys_write+0x51/0x90
> [<ffffffff814e7feb>] system_call_fastpath+0x16/0x1b
> 
> When the situation occurs (for us about once every two days or so) we
> get multiple procs in Trace1 and one in Trace2. Stopping the Trace2 task
> clears the situation.
> 
>  
> 
> Hopefully that's enough information to give you good folks a clue as
> to what is going on. If you want more detail, please drop me a line
> and I will see what I can do to obtain it.
> 
>  
> 





* Re: Observed deadlock in ext4 under 3.2.23-rt37 & 3.2.33-rt50
  2013-01-03  3:09 ` Observed deadlock in ext4 under 3.2.23-rt37 & 3.2.33-rt50 Steven Rostedt
@ 2013-01-03  4:22   ` Theodore Ts'o
  2013-01-03 13:21     ` Steven Rostedt
  0 siblings, 1 reply; 9+ messages in thread
From: Theodore Ts'o @ 2013-01-03  4:22 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Staffan Tjernstrom, linux-rt-users, tglx, C.Emde, jkacur

On Wed, Jan 02, 2013 at 10:09:43PM -0500, Steven Rostedt wrote:
> -- Steve
> 
> > Trace 1:
> > [<ffffffffa01a085d>] jbd2_log_wait_commit+0xcd/0x150
> > [<ffffffffa01b74a5>] ext4_sync_file+0x1e5/0x480
> > [<ffffffff8117a42b>] vfs_fsync_range+0x2b/0x30
> > [<ffffffff8117a44c>] vfs_fsync+0x1c/0x20
> > [<ffffffff8117a68a>] do_fsync+0x3a/0x60
> > [<ffffffff8117a6c3>] sys_fdatasync+0x13/0x20
> > [<ffffffff814e7feb>] system_call_fastpath+0x16/0x1b

Is this process running at a real-time priority?  If so, it looks like
a classic priority inversion problem.  fsync() triggers a journal
commit, and then waits for the jbd2 process to do the work.  If you
have real-time threads/processes which prevent the jbd2 process from
scheduling, that would explain what's going on.

In general, real-time processes/threads should *not* be doing file
system I/O, but if you must, you need to make sure that you've
adjusted the jbd2 kernel threads to run at the same or slightly higher
priority than the highest priority process which will be writing to
the file system.
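
As a rough illustration of that adjustment, the sketch below raises one jbd2
kernel thread's priority from userland with sched_setscheduler(); the PID
lookup, the SCHED_FIFO choice and the priority value are placeholders, and
the same thing can be done with chrt -f -p <prio> <pid>:

	/* Minimal sketch: bump a jbd2 kernel thread (PID found beforehand,
	 * e.g. with ps -eo pid,comm | grep jbd2) to an RT priority.
	 * Needs root/CAP_SYS_NICE; the value 71 is only an example. */
	#include <sched.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/types.h>

	int main(int argc, char **argv)
	{
		struct sched_param sp = { .sched_priority = 71 };
		pid_t pid;

		if (argc != 2) {
			fprintf(stderr, "usage: %s <pid-of-jbd2-thread>\n", argv[0]);
			return 1;
		}
		pid = (pid_t)atoi(argv[1]);

		if (sched_setscheduler(pid, SCHED_FIFO, &sp) == -1) {
			perror("sched_setscheduler");
			return 1;
		}
		return 0;
	}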

Cheers,

					- Ted


* Re: Observed deadlock in ext4 under 3.2.23-rt37 & 3.2.33-rt50
  2013-01-03  4:22   ` Theodore Ts'o
@ 2013-01-03 13:21     ` Steven Rostedt
  2013-01-03 14:18       ` Theodore Ts'o
  2013-01-03 14:29       ` Staffan Tjernstrom
  0 siblings, 2 replies; 9+ messages in thread
From: Steven Rostedt @ 2013-01-03 13:21 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Staffan Tjernstrom, linux-rt-users, tglx, C.Emde, jkacur

On Wed, 2013-01-02 at 23:22 -0500, Theodore Ts'o wrote:
> On Wed, Jan 02, 2013 at 10:09:43PM -0500, Steven Rostedt wrote:
> > -- Steve
> > 
> > > Trace 1:
> > > [<ffffffffa01a085d>] jbd2_log_wait_commit+0xcd/0x150
> > > [<ffffffffa01b74a5>] ext4_sync_file+0x1e5/0x480
> > > [<ffffffff8117a42b>] vfs_fsync_range+0x2b/0x30
> > > [<ffffffff8117a44c>] vfs_fsync+0x1c/0x20
> > > [<ffffffff8117a68a>] do_fsync+0x3a/0x60
> > > [<ffffffff8117a6c3>] sys_fdatasync+0x13/0x20
> > > [<ffffffff814e7feb>] system_call_fastpath+0x16/0x1b
> 
> Is this process running at a real-time priority?  If so, it looks like
> a classic priority inversion problem.  fsync() triggers a journal
> commit, and then waits for the jbd2 process to do the work.  If you
> have real-time threads/processes which prevent the jbd2 process from
> scheduling, that would explain what's going on.
> 
> In general, real-time processes/threads should *not* be doing file
> system I/O, but if you must, you need to make sure that you've
> adjusted the jbd2 kernel threads to run at the same or slightly higher
> priority than the highest priority process which will be writing to
> the file system.

With -rt things can get worse too; it can even be caused by kjournald
being upped in priority.

Anytime you have something that does the following in order to break
lock ordering:

repeat:
	lock(A);
	<do something>
	if (!trylock(B)) {
		unlock(A);
		cpu_relax();
		goto repeat;
	}


We can live lock, because spinlocks in -rt turn into a mutex. Thus, the
holder of lock B may not be on another CPU but actually on the current
CPU and is waiting for the process that is in this loop. If that process
happens to be an RT task, then the system stops.

-- Steve




* Re: Observed deadlock in ext4 under 3.2.23-rt37 & 3.2.33-rt50
  2013-01-03 13:21     ` Steven Rostedt
@ 2013-01-03 14:18       ` Theodore Ts'o
  2013-01-03 14:36         ` Staffan Tjernstrom
  2013-01-03 15:35         ` Steven Rostedt
  2013-01-03 14:29       ` Staffan Tjernstrom
  1 sibling, 2 replies; 9+ messages in thread
From: Theodore Ts'o @ 2013-01-03 14:18 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Staffan Tjernstrom, linux-rt-users, tglx, C.Emde, jkacur

On Thu, Jan 03, 2013 at 08:21:31AM -0500, Steven Rostedt wrote:
> Anytime you have something that does the following in order to break
> lock ordering:
> 
> repeat:
> 	lock(A);
> 	<do something>
> 	if (!trylock(B)) {
> 		unlock(A);
> 		cpu_relax();
> 		goto repeat;
> 	}
> 
> 
> We can live lock, because spinlocks in -rt turn into a mutex. Thus, the
> holder of lock B may not be on another CPU but actually on the current
> CPU and is waiting for the process that is in this loop. If that process
> happens to be an RT task, then the system stops.

Good point, but I'm pretty sure we don't have any kind of trylock loops
in either fs/ext4 or fs/jbd2, though.

					- Ted


* RE: Observed deadlock in ext4 under 3.2.23-rt37 & 3.2.33-rt50
  2013-01-03 13:21     ` Steven Rostedt
  2013-01-03 14:18       ` Theodore Ts'o
@ 2013-01-03 14:29       ` Staffan Tjernstrom
  1 sibling, 0 replies; 9+ messages in thread
From: Staffan Tjernstrom @ 2013-01-03 14:29 UTC (permalink / raw)
  To: Steven Rostedt, Theodore Ts'o; +Cc: linux-rt-users, tglx, C.Emde, jkacur

Thanks for all the prompt responses, very much appreciated.

Since this is occurring on production boxes, we're somewhat constrained in troubleshooting (for now we've re-initialized the disks to ext2 in order to get around the journal commits, although that may not be the actual solution as you suggest).

I'll certainly keep my eyes open and if we see something similar I will see if we can get the extra debug info that Steven suggested in his earlier e-mail.

In the example I caught, the process in get_write_access last ran on cpu 0 at SCHED_RR prio 70; the other three (all in jbd2_log_wait_commit) last ran on cpu 6 (the hyperthread sibling of cpu 0) at regular prio 0.

-----Original Message-----
From: Steven Rostedt [mailto:rostedt@goodmis.org] 
Sent: Thursday, January 03, 2013 7:22 AM
To: Theodore Ts'o
Cc: Staffan Tjernstrom; linux-rt-users@vger.kernel.org; tglx@linutronix.de; C.Emde@osadl.org; jkacur@redhat.com
Subject: Re: Observed deadlock in ext4 under 3.2.23-rt37 & 3.2.33-rt50

On Wed, 2013-01-02 at 23:22 -0500, Theodore Ts'o wrote:
> On Wed, Jan 02, 2013 at 10:09:43PM -0500, Steven Rostedt wrote:
> > -- Steve
> > 
> > > Trace 1:
> > > [<ffffffffa01a085d>] jbd2_log_wait_commit+0xcd/0x150 
> > > [<ffffffffa01b74a5>] ext4_sync_file+0x1e5/0x480 
> > > [<ffffffff8117a42b>] vfs_fsync_range+0x2b/0x30 
> > > [<ffffffff8117a44c>] vfs_fsync+0x1c/0x20
> > > [<ffffffff8117a68a>] do_fsync+0x3a/0x60
> > > [<ffffffff8117a6c3>] sys_fdatasync+0x13/0x20
> > > [<ffffffff814e7feb>] system_call_fastpath+0x16/0x1b
> 
> Is this process running at a real-time priority?  If so, it looks like 
> a classic priority inversion problem.  fsync() triggers a journal 
> commit, and then waits for the jbd2 process to do the work.  If you 
> have real-time threads/processes which prevent the jbd2 process from 
> scheduling, that would explain what's going on.
> 
> In general, real-time processes/threads should *not* be doing file 
> system I/O, but if you must, you need to make sure that you've 
> adjusted the jbd2 kernel threads to run at the same or slightly higher 
> priority than the highest priority process which will be writing to 
> the file system.

With -rt things can get worse too; it can even be caused by kjournald being upped in priority.

Anytime you have something that does the following in order to break lock ordering:

repeat:
	lock(A);
	<do something>
	if (!trylock(B)) {
		unlock(A);
		cpu_relax();
		goto repeat;
	}


We can live lock, because spinlocks in -rt turn into a mutex. Thus, the holder of lock B may not be on another CPU but actually on the current CPU and is waiting for the process that is in this loop. If that process happens to be an RT task, then the system stops.

-- Steve




* RE: Observed deadlock in ext4 under 3.2.23-rt37 & 3.2.33-rt50
  2013-01-03 14:18       ` Theodore Ts'o
@ 2013-01-03 14:36         ` Staffan Tjernstrom
  2013-01-03 15:36           ` Theodore Ts'o
  2013-01-03 15:35         ` Steven Rostedt
  1 sibling, 1 reply; 9+ messages in thread
From: Staffan Tjernstrom @ 2013-01-03 14:36 UTC (permalink / raw)
  To: Theodore Ts'o, Steven Rostedt; +Cc: linux-rt-users, tglx, C.Emde, jkacur

This may be completely off-in-newbie land, but I figured I'd throw in what I think I've tracked down.

It looks as if there was a fairly recent patch to turn locks in parts of the code into atomic instructions (apologies - I don't have the patch id to hand atm) in do_get_write_access() amongst others.

Then in turn the C++ standard library loops around calls to write() whilst access isn't available, basically blocking on the atomic (which then in turn doesn't support priority inheritance), causing the wait loop.

Absolutely, one of the first things we need to do in our "userland" is to get the file writes out of the high-priority code - that certainly makes sense.
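
For what it's worth, one minimal sketch of that split (purely illustrative;
the queue sizes and the locking choice are placeholders) is to have the
SCHED_RR thread only copy its data into a queue, and let a normal-priority
helper thread be the only caller of write():

	/* Illustrative only: the real-time thread never touches the fs. */
	#include <pthread.h>
	#include <string.h>
	#include <unistd.h>

	#define SLOTS   64
	#define SLOT_SZ 512

	static char   slot[SLOTS][SLOT_SZ];
	static size_t slot_len[SLOTS];
	static int    head, tail, count;
	static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

	/* Called from the SCHED_RR thread: copies the record and returns. */
	int rt_log(const char *msg, size_t len)
	{
		int queued = 0;

		if (len > SLOT_SZ)
			len = SLOT_SZ;
		pthread_mutex_lock(&qlock);
		if (count < SLOTS) {
			memcpy(slot[tail], msg, len);
			slot_len[tail] = len;
			tail = (tail + 1) % SLOTS;
			count++;
			queued = 1;
		}
		pthread_mutex_unlock(&qlock);
		pthread_cond_signal(&qcond);
		return queued;	/* on overflow we drop rather than block */
	}

	/* Runs at normal priority: the only place that calls write(). */
	void *writer_thread(void *arg)
	{
		int fd = *(int *)arg;

		for (;;) {
			char   buf[SLOT_SZ];
			size_t len;

			pthread_mutex_lock(&qlock);
			while (count == 0)
				pthread_cond_wait(&qcond, &qlock);
			len = slot_len[head];
			memcpy(buf, slot[head], len);
			head = (head + 1) % SLOTS;
			count--;
			pthread_mutex_unlock(&qlock);

			if (write(fd, buf, len) < 0)
				break;	/* error handling elided */
		}
		return NULL;
	}

(In a real program the queue mutex would itself want PTHREAD_PRIO_INHERIT,
or the queue would be lock-free, so the hand-off can't introduce its own
inversion.)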

-----Original Message-----
From: Theodore Ts'o [mailto:tytso@mit.edu] 
Sent: Thursday, January 03, 2013 8:19 AM
To: Steven Rostedt
Cc: Staffan Tjernstrom; linux-rt-users@vger.kernel.org; tglx@linutronix.de; C.Emde@osadl.org; jkacur@redhat.com
Subject: Re: Observed deadlock in ext4 under 3.2.23-rt37 & 3.2.33-rt50

On Thu, Jan 03, 2013 at 08:21:31AM -0500, Steven Rostedt wrote:
> Anytime you have something that does the following in order to break 
> lock ordering:
> 
> repeat:
> 	lock(A);
> 	<do something>
> 	if (!trylock(B)) {
> 		unlock(A);
> 		cpu_relax();
> 		goto repeat;
> 	}
> 
> 
> We can live lock, because spinlocks in -rt turn into a mutex. Thus, 
> the holder of lock B may not be on another CPU but actually on the 
> current CPU and is waiting for the process that is in this loop. If 
> that process happens to be an RT task, then the system stops.

Good point, but I'm pretty sure we don't have any kind of trylock loops in either fs/ext4 or fs/jbd2, though.

					- Ted


* Re: Observed deadlock in ext4 under 3.2.23-rt37 & 3.2.33-rt50
  2013-01-03 14:18       ` Theodore Ts'o
  2013-01-03 14:36         ` Staffan Tjernstrom
@ 2013-01-03 15:35         ` Steven Rostedt
  1 sibling, 0 replies; 9+ messages in thread
From: Steven Rostedt @ 2013-01-03 15:35 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Staffan Tjernstrom, linux-rt-users, tglx, C.Emde, jkacur

On Thu, 2013-01-03 at 09:18 -0500, Theodore Ts'o wrote:
> On Thu, Jan 03, 2013 at 08:21:31AM -0500, Steven Rostedt wrote:
> > Anytime you have something that does the following in order to break
> > lock ordering:
> > 
> > repeat:
> > 	lock(A);
> > 	<do something>
> > 	if (!trylock(B)) {
> > 		unlock(A);
> > 		cpu_relax();
> > 		goto repeat;
> > 	}
> > 
> > 
> > We can live lock, because spinlocks in -rt turn into a mutex. Thus, the
> > holder of lock B may not be on another CPU but actually on the current
> > CPU and is waiting for the process that is in this loop. If that process
> > happens to be an RT task, then the system stops.
> 
> Good point, but I'm prety sure we don't have any kind of trylock loops
> in either fs/ext4 or fs/jbd2, though.
> 

We found a few in the underlying block layer. Those were fixed in
recent versions. 3.2 didn't have that specific problem, though, so I'm
not sure what the original poster's issue is, and I won't be sure without
a full task dump of all tasks.

-- Steve




* Re: Observed deadlock in ext4 under 3.2.23-rt37 & 3.2.33-rt50
  2013-01-03 14:36         ` Staffan Tjernstrom
@ 2013-01-03 15:36           ` Theodore Ts'o
  2013-01-03 15:52             ` Staffan Tjernstrom
  0 siblings, 1 reply; 9+ messages in thread
From: Theodore Ts'o @ 2013-01-03 15:36 UTC (permalink / raw)
  To: Staffan Tjernstrom; +Cc: Steven Rostedt, linux-rt-users, tglx, C.Emde, jkacur

On Thu, Jan 03, 2013 at 08:36:39AM -0600, Staffan Tjernstrom wrote:
> This may be completely off-in-newbie land, but I figured I'd throw in what I think I've tracked down.
> 
> It looks as if there was a fairly recent patch to turn locks in
> parts of the code into atomic instructions (apologies - I don't have
> the patch id to hand atm) in do_get_write_access() amongst others.

In fs/jbd2/transaction.c?  Can you give me the code snippet and/or
function and line number that you're concerned about?

> Then in turn the C++ standard library loops around calls to write()
> whilst access isn't available, basically blocking on the atomic
> (which then in turn doesn't support priority inheritance), causing
> the wait loop.

Yeah, but do_get_write_access() blocks (usually waiting for the jbd2
kernel thread to complete, but possibly on a memory allocation); we
don't return EAGAIN or anything like that.  So I don't see how that
would cause a wait loop.

It's possible we could be returning -ENOMEM; are you looping for all
write failures, or just for EAGAIN/EINTR and partial writes?
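
To make the distinction concrete, a loop that only retries the cases you'd
want to retry looks roughly like the sketch below (illustrative only, not
the libstdc++ code):

	/* Illustrative only: retry EINTR/EAGAIN and short writes, fail the rest. */
	#include <errno.h>
	#include <unistd.h>

	static ssize_t write_all(int fd, const char *buf, size_t len)
	{
		size_t done = 0;

		while (done < len) {
			ssize_t n = write(fd, buf + done, len - done);

			if (n > 0)
				done += n;		/* partial write: keep going */
			else if (n < 0 && (errno == EINTR || errno == EAGAIN))
				continue;		/* retry */
			else
				return -1;		/* e.g. ENOMEM or EIO: give up */
		}
		return (ssize_t)done;
	}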

	      	     	  	    - Ted


* RE: Observed deadlock in ext4 under 3.2.23-rt37 & 3.2.33-rt50
  2013-01-03 15:36           ` Theodore Ts'o
@ 2013-01-03 15:52             ` Staffan Tjernstrom
  0 siblings, 0 replies; 9+ messages in thread
From: Staffan Tjernstrom @ 2013-01-03 15:52 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Steven Rostedt, linux-rt-users, tglx, C.Emde, jkacur

From: Theodore Ts'o [mailto:tytso@mit.edu] 
Sent: Thursday, January 03, 2013 9:37 AM
To: Staffan Tjernstrom
Cc: Steven Rostedt; linux-rt-users@vger.kernel.org; tglx@linutronix.de; C.Emde@osadl.org; jkacur@redhat.com
Subject: Re: Observed deadlock in ext4 under 3.2.23-rt37 & 3.2.33-rt50

>In fs/jbd2/transaction.c?  Can you give me the code snippet and/or function and line number that you're concerned about?

Rather, in include/linux/fs.h and fs/namei.c - I think that's where I ended up in my trace of a previous encounter with the issue (via open() and/or truncate() calls from userland). Coming in from jbd2/transaction.c would make more sense than what I thought I had manually traced out, however.

See http://lxr.linux.no/#linux+v2.6.33.20/fs/namei.c#L325 vs http://lxr.linux.no/#linux+v3.2.33/include/linux/fs.h#L2286 for the change I got suspicious about.

>Yeah, but do_get_write_access() blocks (usually waiting for the jbd2 kernel thread to complete, but possibly on a memory allocation); we don't return >EAGAIN or anything like that.  So I don't see how that would cause a wait loop.

>It's possible we could be returning -ENOMEM; are you looping for all write failures, or just for EAGAIN/EINTR and partial writes?

Blocking would make more sense with what I was seeing - the loop around the write() failures deep inside libstdc++'s output stream code may just have been me not managing to navigate that library particularly well.

