From: Jens Axboe <axboe@fb.com>
To: Theodore Ts'o <tytso@mit.edu>
Cc: Dave Chinner <david@fromorbit.com>,
	linux-ext4@vger.kernel.org, fstests@vger.kernel.org,
	tarasov@vasily.name
Subject: Re: Test generic/299 stalling forever
Date: Mon, 24 Oct 2016 10:28:14 -0600	[thread overview]
Message-ID: <773e0780-6641-ec85-5e78-d04e5a82d6b1@fb.com> (raw)
In-Reply-To: <20161024033852.quinlee4a24mb2e2@thunk.org>

On 10/23/2016 09:38 PM, Theodore Ts'o wrote:
> I enabled some more debugging and it's become clearer what's going
> on.  (See attached for the full log).
>
> The main issue seems to be that once one of the fio jobs is done, it kills
> off the other threads (actually, we're using processes):
>
> process  31848 terminate group_id=0
> process  31848 setting terminate on direct_aio/31846
> process  31848 setting terminate on direct_aio/31848
> process  31848 setting terminate on direct_aio/31849
> process  31848 setting terminate on direct_aio/31851
> process  31848 setting terminate on aio-dio-verifier/31852
> process  31848 setting terminate on buffered-aio-verifier/31854
> process  31851 pid=31851: runstate RUNNING -> FINISHING
> process  31851 terminate group_id=0
> process  31851 setting terminate on direct_aio/31846
> process  31851 setting terminate on direct_aio/31848
> process  31851 setting terminate on direct_aio/31849
> process  31851 setting terminate on direct_aio/31851
> process  31851 setting terminate on aio-dio-verifier/31852
> process  31851 setting terminate on buffered-aio-verifier/31854
> process  31852 pid=31852: runstate RUNNING -> FINISHING
> process  31846 pid=31846: runstate RUNNING -> FINISHING
>     ...
>
> but one or more of the threads don't exit within 60 seconds:
>
> fio: job 'direct_aio' (state=5) hasn't exited in 60 seconds, it appears to be stuck. Doing forceful exit of this job.
> process  31794 pid=31849: runstate RUNNING -> REAPED
> fio: job 'buffered-aio-verifier' (state=5) hasn't exited in 60 seconds, it appears to be stuck. Doing forceful exit of this job.
> process  31794 pid=31854: runstate RUNNING -> REAPED
> process  31794 terminate group_id=-1
>
> The main thread then prints all of the statistics, and calls stat_exit():
>
> stat_exit called by tid: 31794       <---- debugging message which prints gettid()
>
> Unfortunately, these processes aren't actually killed; they are
> marked as reaped, but they are still in the process listing:
>
> root@xfstests:~# ps augxww | grep fio
> root      1585  0.0  0.0      0     0 ?        S<   18:45   0:00 [dm_bufio_cache]
> root      7191  0.0  0.0  12732  2200 pts/1    S+   23:05   0:00 grep fio
> root     31849  1.5  0.2 407208 18876 ?        Ss   22:36   0:26 /root/xfstests/bin/fio /tmp/31503.fio
> root     31854  1.2  0.1 398480 10240 ?        Ssl  22:36   0:22 /root/xfstests/bin/fio /tmp/31503.fio
>
> And if you attach to them with gdb, they are spinning trying to grab
> the stat_mutex(), which they can't get because the main thread has
> already called stat_exit() and then exited.  So these two threads
> did eventually return, but only some time after the 60 seconds had
> passed, and by then they were left hanging on a stat_mutex() they
> will never get.
>
> This probably also explains why you had trouble reproducing it.  It
> requires a disk whose performance is variable enough that under heavy
> load, it might take more than 60 seconds for the direct_aio or
> buffered-aio-verifier thread to close itself out.

Good catch! Yes, that would certainly explain why we are stuck on that
stat_mutex, and why the main thread gave up waiting and ended up in
stat_exit() with one (or more) job threads still running.
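
To make the failure mode concrete, here is a stripped-down sketch of the
pattern using plain pthreads and made-up names - not fio's actual code
(fio's stat_mutex lives in shared memory so the job processes can share
it, and stat_exit() takes it and then frees it):

/* hypothetical stand-alone sketch of the hang, not fio's actual code */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t stat_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *slow_job(void *arg)
{
	(void) arg;

	/* job takes longer than the reap timeout to finish up */
	sleep(5);

	/* late final stats update: this lock never succeeds */
	pthread_mutex_lock(&stat_mutex);
	printf("job: stats updated\n");	/* never reached */
	pthread_mutex_unlock(&stat_mutex);
	return NULL;
}

int main(void)
{
	pthread_t job;

	pthread_create(&job, NULL, slow_job, NULL);

	/*
	 * Main thread waits only a short reap timeout, then gives up,
	 * marks the job reaped and prints the final stats.
	 */
	sleep(1);
	printf("main: job stuck, forcing exit, dumping stats\n");

	/*
	 * stat_exit()-style teardown: grab the mutex and never release
	 * it. In fio it is also freed, and the jobs are separate
	 * processes sharing the mutex, so the main process exiting
	 * doesn't unblock them.
	 */
	pthread_mutex_lock(&stat_mutex);

	pthread_join(job, NULL);	/* hangs: the job is stuck above */
	return 0;
}

The late worker's lock can never succeed, which is exactly the state the
two leftover fio processes were stuck in.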

> And I suspect that once the main thread exited, it also closed out
> the debugging channel, so the deadlock detector probably did trip but
> somehow we just didn't see the output.
>
> So I can imagine some possible fixes.  We could make the thread
> timeout configurable, and/or increase it from 60 seconds to something like
> 300 seconds.  We could make stat_exit() a no-op --- after all, if the
> main thread is exiting, there's no real point to down and then destroy
> the stat_mutex.  And/or we could change the forced reap to send a kill
> -9 to the thread, instead of just marking it as reaped.

We have to clean up - for normal runs it's not a big deal, but if fio
is run in a client/server setup, the backend persists across runs, and
if we leak there, that could be a concern.

How about the below? It bumps the timeout to 5 minutes - 1 minute is a
little on the short side, and we want normal error handling to be out of
the way before the forced reap kicks in. Additionally, the job breaks out
if it has already been marked as exited/reaped, so it avoids grabbing the
stat mutex again.


diff --git a/backend.c b/backend.c
index 093b6a3a290e..f0927abfccb0 100644
--- a/backend.c
+++ b/backend.c
@@ -1723,6 +1723,14 @@ static void *thread_main(void *data)
  			}
  		}

+		/*
+		 * If we took too long to shut down, the main thread could
+		 * already consider us reaped/exited. If that happens, break
+		 * out and clean up.
+		 */
+		if (td->runstate >= TD_EXITED)
+			break;
+
  		clear_state = 1;

  		/*
diff --git a/fio.h b/fio.h
index 080842aef4f8..74c1b306af26 100644
--- a/fio.h
+++ b/fio.h
@@ -588,7 +588,7 @@ extern const char *runstate_to_name(int runstate);
   * Allow 60 seconds for a job to quit on its own, otherwise reap with
   * a vengeance.
   */
-#define FIO_REAP_TIMEOUT	60
+#define FIO_REAP_TIMEOUT	300

  #define TERMINATE_ALL		(-1U)
  extern void fio_terminate_threads(unsigned int);

-- 
Jens Axboe

