* fio main thread got stuck over the weekend
@ 2014-08-11 15:44 scameron
  2014-08-11 16:04 ` scameron
  0 siblings, 1 reply; 20+ messages in thread
From: scameron @ 2014-08-11 15:44 UTC (permalink / raw)
  To: fio; +Cc: scameron


Hi,

I was running an fio job over the weekend and the main thread
seems to have hung.  This is on RHEL6u5 with kernel
2.6.32-431.20.3.el6.x86_64 but with a highly modified hpsa
driver, just in case it matters.

[root@msablackburn saperf]# uname -a
Linux msablackburn.americas.hpqcorp.net 2.6.32-431.20.3.el6.x86_64 #1 SMP Fri Jun 6 18:30:54 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux
[root@msablackburn saperf]# /usr/local/bin/fio --version
fio-2.1.11-3-g789d

The jobfile looks like this (note the runtime=216000, that's 60 hours;
it seems to have hung with about 25 hours remaining).  numjobs=12
across the two job sections accounts for the 24 threads started below.

[root@msablackburn saperf]# cat 2drive_sdr.fio 
[global]
direct=1
ioengine=libaio
norandommap
randrepeat=0
bs=4096
iodepth=1011
#iodepth=96
#numjobs=1
numjobs=12
#numjobs=24
# number_ios=1
runtime=216000
#runtime=54000
time_based=1
group_reporting
thread
gtod_reduce=1
#iodepth_batch=4
#iodepth_batch_complete=4
cpus_allowed=0-5
cpus_allowed_policy=split
rw=randread

[4_KiB_RR_drive_r]
filename=/dev/sdr

[4_KiB_RR_drive_s]
filename=/dev/sds


Here is all the output so far:

---------------8<------------8<-------------8<-----------------
[root@msablackburn saperf]# fio 2drive_sdr.fio
4_KiB_RR_drive_r: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1011
...
4_KiB_RR_drive_s: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1011
...
fio-2.1.11-3-g789d
Starting 24 threads
^Cbs: 24 (f=24): [r(24)] [1.2% done] [1113MB/0KB/0KB /s] [285K/0/0 iops] [eta 14h:48m:47s] 
fio: terminating on signal 2

4_KiB_RR_drive_r: (groupid=0, jobs=24): err= 0: pid=7696: Fri Aug  8 16:49:19 2014
  read : io=881568MB, bw=1309.1MB/s, iops=335341, runt=672989msec
  cpu          : usr=1.75%, sys=23.15%, ctx=1266117, majf=9, minf=835
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=225681428/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1011

Run status group 0 (all jobs):
   READ: io=881568MB, aggrb=1309.1MB/s, minb=1309.1MB/s, maxb=1309.1MB/s, mint=672989msec, maxt=672989msec

Disk stats (read/write):
  sdr: ios=112299535/0, merge=0/0, ticks=13349007/0, in_queue=13054755, util=98.57%
  sds: ios=113359581/0, merge=0/0, ticks=13710834/0, in_queue=13414931, util=98.71%
[root@msablackburn saperf]# vi !$
vi 2drive_sdr.fio
[root@msablackburn saperf]# fio 2drive_sdr.fio
4_KiB_RR_drive_r: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1011
...
4_KiB_RR_drive_s: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1011
...
fio-2.1.11-3-g789d
Starting 24 threads
Jobs: 24 (f=24): [r(24)] [57.1% done] [1408MB/0KB/0KB /s] [361K/0/0 iops] [eta 01d:01h:43m:07s]
---------------8<------------8<-------------8<-----------------

And there it sits, the eta does not update.


I tried connecting to the process with gdb, and doing "info threads" and "bt":

PID of main thread is 8530:

root      8530  598  1.0 2159116 170812 pts/0  Sl+  Aug08 23570:56 fio 2drive_sdr.fio

...
(gdb) info threads
  26 Thread 0x7f65da431700 (LWP 8531)  0x0000003aa4cacced in nanosleep () from /lib64/libc.so.6
  25 Thread 0x7f65d9a30700 (LWP 8532)  0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
  24 Thread 0x7f65d902f700 (LWP 8533)  0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
  23 Thread 0x7f65d862e700 (LWP 8534)  0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
  22 Thread 0x7f65d7c2d700 (LWP 8535)  0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
  21 Thread 0x7f65d722c700 (LWP 8536)  0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
  20 Thread 0x7f65d682b700 (LWP 8537)  0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
  19 Thread 0x7f65d5e2a700 (LWP 8538)  0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
  18 Thread 0x7f65d5429700 (LWP 8539)  0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
  17 Thread 0x7f65d4a28700 (LWP 8540)  0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
  16 Thread 0x7f65d4027700 (LWP 8541)  0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
  15 Thread 0x7f65d3626700 (LWP 8542)  0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
  14 Thread 0x7f65d2c25700 (LWP 8543)  0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
  13 Thread 0x7f65d2222700 (LWP 8544)  0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
  12 Thread 0x7f65d1820700 (LWP 8545)  0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
  11 Thread 0x7f65d0e1e700 (LWP 8546)  td_io_queue (td=0x7f65da496410, io_u=0x7f656804d010) at ioengines.c:344
  10 Thread 0x7f65d041c700 (LWP 8547)  0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
  9 Thread 0x7f65cfa1a700 (LWP 8548)  0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
  8 Thread 0x7f65cf018700 (LWP 8549)  0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
  7 Thread 0x7f65ce616700 (LWP 8550)  0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
  6 Thread 0x7f65cdc14700 (LWP 8551)  fio_libaio_event (td=0x7f65da4b06a8, event=<value optimized out>) at engines/libaio.c:65
  5 Thread 0x7f65cd212700 (LWP 8552)  __get_next_offset (td=0x7f65da4b5a60) at io_u.c:389
  4 Thread 0x7f65cc810700 (LWP 8553)  0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
  3 Thread 0x7f65cbe0e700 (LWP 8554)  td_io_u_unlock (td=0x7f65da4c01d0, min_evts=<value optimized out>, bytes=0x7f65cbe0dd10) at fio.h:619
  2 Thread 0x7f65cb40c700 (LWP 8555)  0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
* 1 Thread 0x7f65e34c6720 (LWP 8530)  0x0000003aa4cacced in nanosleep () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003aa4cacced in nanosleep () from /lib64/libc.so.6
#1  0x0000003aa4ce1e44 in usleep () from /lib64/libc.so.6
#2  0x000000000044afe4 in do_usleep () at backend.c:1727
#3  run_threads () at backend.c:1965
#4  0x000000000044b84d in fio_backend () at backend.c:2068
#5  0x0000003aa4c1ed1d in __libc_start_main () from /lib64/libc.so.6
#6  0x0000000000409c39 in _start ()
(gdb) detach
Detaching from program: /usr/local/bin/fio, process 8530
(gdb) quit
[root@msablackburn saperf]# 

I am guessing the 8530 thread is the one that prints out the eta updates?
Well, maybe not; I see a call to create_disk_util_thread() in backend.c,
so a helper thread probably does it.
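
If I'm reading backend.c right, that helper thread gets set up roughly
like this (a paraphrased sketch, not the exact source; the
startup_mutex handshake is my guess):

static pthread_t disk_util_thread;

static int create_disk_util_thread(void)
{
	int ret;

	ret = pthread_create(&disk_util_thread, NULL, disk_thread_main, NULL);
	if (ret) {
		log_err("Can't create disk util thread: %s\n", strerror(ret));
		return 1;
	}

	/* disk_thread_main() ups startup_mutex once it is running */
	fio_mutex_down(startup_mutex);
	return 0;
}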

I/O still seems to be running, and there are no messages in dmesg about
hung I/O or the SCSI midlayer getting angry and trying to abort things
or anything like that.  It seems to be happily cruising along, except that
the eta isn't updating, and the job should have finished several hours
ago.  I'm leaving it running for now to try to figure out what happened.

Any ideas?  Anything else I should poke at?

-- steve



* Re: fio main thread got stuck over the weekend
  2014-08-11 15:44 fio main thread got stuck over the weekend scameron
@ 2014-08-11 16:04 ` scameron
  2014-08-22 19:04   ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: scameron @ 2014-08-11 16:04 UTC (permalink / raw)
  To: fio; +Cc: scameron

On Mon, Aug 11, 2014 at 10:44:23AM -0500, scameron@beardog.cce.hp.com wrote:
> 
> Hi,
> 
> I was running an fio job over the weekend and the main thread
> seems to have hung.  This is on RHEL6u5 with kernel
> 2.6.32-431.20.3.el6.x86_64 but with a highly modified hpsa
> driver, just in case it matters.
...
> 
> Any ideas?  Anything else I should poke at?

A bit more info:

thread 26 looks to be the one that's supposed to print the eta info:

(gdb) thread 26
[Switching to thread 26 (Thread 0x7f65da431700 (LWP 8531))]#0  0x0000003aa4cacced in nanosleep () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003aa4cacced in nanosleep () from /lib64/libc.so.6
#1  0x0000003aa4ce1e44 in usleep () from /lib64/libc.so.6
#2  0x00000000004474a0 in disk_thread_main (data=<value optimized out>) at backend.c:1992
#3  0x0000003aa50079d1 in start_thread () from /lib64/libpthread.so.0
#4  0x0000003aa4ce8b5d in clone () from /lib64/libc.so.6
(gdb)

static void *disk_thread_main(void *data)
{
        int ret = 0;

        fio_mutex_up(startup_mutex);

        while (threads && !ret) {
                usleep(DISK_UTIL_MSEC * 1000);
                if (!threads)
                        break;
                ret = update_io_ticks();

                if (!is_backend)
                        print_thread_status();
        }

        fio_mutex_up(disk_thread_mutex);
        return NULL;
}

That usleep() is presumably where it's sitting in nanosleep()
(DISK_UTIL_MSEC is 250, I believe, so it wakes four times a second).
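
(glibc's usleep() is just a thin wrapper around nanosleep(), roughly
the following, which is why the sleepers all show up in nanosleep():)

#include <time.h>
#include <unistd.h>

int usleep(useconds_t usec)
{
	struct timespec ts = {
		.tv_sec  = usec / 1000000,
		.tv_nsec = (usec % 1000000) * 1000,
	};

	/* one nanosleep() call, which is the frame gdb shows */
	return nanosleep(&ts, NULL);
}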

from eta.c:

void print_thread_status(void)
{
        struct jobs_eta *je;
        size_t size;

        je = get_jobs_eta(0, &size);
        if (je)
                display_thread_status(je);

        free(je);
}

Maybe that je is coming back NULL?  That is probably driven by the
return value of calc_thread_status(), and at a glance I'm not sure
what calc_thread_status() is doing.
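
For reference, get_jobs_eta() looks roughly like this (abridged; the
free-and-return-NULL in the failure branch is my reading of why je
could come back NULL, not verified):

struct jobs_eta *get_jobs_eta(int force, size_t *size)
{
	struct jobs_eta *je;

	*size = sizeof(*je) + THREAD_RUNSTR_SZ;
	je = malloc(*size);
	memset(je, 0, *size);

	/* if calc_thread_status() decides there is nothing to report,
	 * the buffer is freed and NULL goes back to print_thread_status() */
	if (!calc_thread_status(je, force)) {
		free(je);
		return NULL;
	}

	return je;
}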

-- steve



* Re: fio main thread got stuck over the weekend
  2014-08-11 16:04 ` scameron
@ 2014-08-22 19:04   ` Jens Axboe
  2014-08-22 19:09     ` scameron
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2014-08-22 19:04 UTC (permalink / raw)
  To: scameron, fio

On 2014-08-11 11:04, scameron@beardog.cce.hp.com wrote:
> On Mon, Aug 11, 2014 at 10:44:23AM -0500, scameron@beardog.cce.hp.com wrote:
>>
>> Hi,
>>
>> I was running an fio job over the weekend and the main thread
>> seems to have hung.  This is on RHEL6u5 with kernel
>> 2.6.32-431.20.3.el6.x86_64 but with a highly modified hpsa
>> driver, just in case it matters.
...
>
> Maybe that je is coming back NULL?  That is probably driven by the
> return value of calc_thread_status(), and at a glance I'm not sure
> what calc_thread_status() is doing.

I'll take a look at this next week, been away at a conference since last 
weekend.


-- 
Jens Axboe




* Re: fio main thread got stuck over the weekend
  2014-08-22 19:04   ` Jens Axboe
@ 2014-08-22 19:09     ` scameron
  2014-08-22 19:11       ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: scameron @ 2014-08-22 19:09 UTC (permalink / raw)
  To: Jens Axboe; +Cc: fio, scameron

On Fri, Aug 22, 2014 at 02:04:34PM -0500, Jens Axboe wrote:
> On 2014-08-11 11:04, scameron@beardog.cce.hp.com wrote:
...
> 
> I'll take a look at this next week, been away at a conference since last 
> weekend.

Ok.  Meantime, I had to reclaim the machine for testing, so I no longer
have it just sitting there to debug, and I have not seen the problem again
that I know of.

-- steve




* Re: fio main thread got stuck over the weekend
  2014-08-22 19:09     ` scameron
@ 2014-08-22 19:11       ` Jens Axboe
  2014-12-12 20:32         ` Elliott, Robert (Server Storage)
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2014-08-22 19:11 UTC (permalink / raw)
  To: scameron; +Cc: fio

On 2014-08-22 14:09, scameron@beardog.cce.hp.com wrote:
> On Fri, Aug 22, 2014 at 02:04:34PM -0500, Jens Axboe wrote:
...
>>
>> I'll take a look at this next week, been away at a conference since
>> last weekend.
>
> Ok.  Meantime, I had to reclaim the machine for testing, so I no longer
> have it just sitting there to debug, and I have not seen the problem again
> that I know of.

Clearly a hardware issue :-)

-- 
Jens Axboe




* RE: fio main thread got stuck over the weekend
  2014-08-22 19:11       ` Jens Axboe
@ 2014-12-12 20:32         ` Elliott, Robert (Server Storage)
  2014-12-13  4:49           ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-12-12 20:32 UTC (permalink / raw)
  To: Jens Axboe, stephenmcameron; +Cc: fio



> -----Original Message-----
> From: fio-owner@vger.kernel.org [mailto:fio-owner@vger.kernel.org] On
> Behalf Of Jens Axboe
> Sent: Friday, 22 August, 2014 2:11 PM
> To: scameron@beardog.cce.hp.com
...
> On 2014-08-22 14:09, scameron@beardog.cce.hp.com wrote:
> > On Fri, Aug 22, 2014 at 02:04:34PM -0500, Jens Axboe wrote:
> >> On 2014-08-11 11:04, scameron@beardog.cce.hp.com wrote:
> >>> On Mon, Aug 11, 2014 at 10:44:23AM -0500, scameron@beardog.cce.hp.com
> >>> wrote:
> >>>>
...
> >>>
> >>> from eta.c:
> >>>
> >>> void print_thread_status(void)
> >>> {
> >>>          struct jobs_eta *je;
> >>>          size_t size;
> >>>
> >>>          je = get_jobs_eta(0, &size);
> >>>          if (je)
> >>>                  display_thread_status(je);
> >>>
> >>>          free(je);
> >>> }
> >>>
> >>> Maybe that je is coming back NULL?  That is probably driven by the
> >>> return value of calc_thread_status(), and at a glance I'm not sure
> >>> what calc_thread_status() is doing.
> >>
> >> I'll take a look at this next week, been away at a conference since
> >> last weekend.
> >
> > Ok.  Meantime, I had to reclaim the machine for testing, so I no longer
> > have it just sitting there to debug, and I have not seen the problem again
> > that I know of.
> 
> Clearly a hardware issue :-)
> 
> --
> Jens Axboe

Rerunning a multi-day job to test out the 64-bit counter fixes,
I just saw the same thing after about two days: eta updates stop,
although I/O is still running.

Jobs: 210 (f=210): [r(98),X(14),r(112)] [31.5% done] [2388MB/0KB/0KB /s] [4891K/0/0 iops] [eta 01d:17h:05m:24s]

I notice that get_jobs_eta() makes a malloc() call without
checking the result for NULL; maybe that happened?



commit 4b9d6e40b90029f42c378ee82b130af9ceafffd7
Author: Robert Elliott <elliott@hp.com>
Date:   Fri Dec 12 14:04:11 2014 -0600

    eta.c: check malloc return code
    
    Check the malloc return code in get_jobs_eta.
    
    Signed-off-by: Robert Elliott <elliott@hp.com>

diff --git a/eta.c b/eta.c
index a90f1fb47637..167bf5f62b21 100644
--- a/eta.c
+++ b/eta.c
@@ -572,6 +572,8 @@ struct jobs_eta *get_jobs_eta(int force, size_t *size)
 
 	*size = sizeof(*je) + THREAD_RUNSTR_SZ;
 	je = malloc(*size);
+	if (!je)
+		return NULL;
 	memset(je, 0, *size);
 
 	if (!calc_thread_status(je, force)) {
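
(An alternative would be to let calloc() handle the zeroing, which also
keeps the NULL check right next to the allocation; same effect, just a
thought:)

	*size = sizeof(*je) + THREAD_RUNSTR_SZ;
	je = calloc(1, *size);	/* calloc zeroes, replacing the memset() */
	if (!je)
		return NULL;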


---
Rob Elliott    HP Server Storage



* Re: fio main thread got stuck over the weekend
  2014-12-12 20:32         ` Elliott, Robert (Server Storage)
@ 2014-12-13  4:49           ` Jens Axboe
  2014-12-15 17:33             ` Elliott, Robert (Server Storage)
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2014-12-13  4:49 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage), stephenmcameron; +Cc: fio

On 12/12/2014 01:32 PM, Elliott, Robert (Server Storage) wrote:
> ...
>
> Rerunning a multi-day job to test out the 64-bit counter fixes,
> I just saw the same thing after about two days: eta updates stop,
> although I/O is still running.
>
> Jobs: 210 (f=210): [r(98),X(14),r(112)] [31.5% done] [2388MB/0KB/0KB /s] [4891K/0/0 iops] [eta 01d:17h:05m:24s]
>
> I notice that get_jobs_eta() makes a malloc() call without
> checking the result for NULL; maybe that happened?

If that happened, the frontend would crash (the memset() right below
that malloc() would hit a NULL pointer), so I don't think that's too
likely.  But the patch is still sane, of course :-)

Is this close to when it stopped last time as well?

If you have it running, it would be great to do a gdb attach and see 
what the frontend is up to (or where it might be stuck)...
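
Something along these lines, against fio's main PID (illustrative,
the usual drill):

    gdb -p <pid>
    (gdb) thread apply all bt
    (gdb) detach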

-- 
Jens Axboe




* RE: fio main thread got stuck over the weekend
  2014-12-13  4:49           ` Jens Axboe
@ 2014-12-15 17:33             ` Elliott, Robert (Server Storage)
  2014-12-15 17:37               ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-12-15 17:33 UTC (permalink / raw)
  To: Jens Axboe, stephenmcameron; +Cc: fio

> 
> If you have it running, it would be great to do a gdb attach and see
> what the frontend is up to (or where it might be stuck)...
> 

The last eta was printed over 3 days ago:
	[eta 01d:17h:05m:24s]
but fio is still running.

One device did report several errors during the test before the eta
hang, and there are no I/Os going to that device.

There are 212 threads running out of 224 started.

(gdb) info threads
  212 Thread 0x7fa9a086e700 (LWP 6509)  0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  211 Thread 0x7fa9a006b700 (LWP 6510)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  210 Thread 0x7fa99f869700 (LWP 6511)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  209 Thread 0x7fa99f067700 (LWP 6512)  last_block (td=0x7fa9a094e740, f=0x7fa9a91a8390, ddir=<value optimized out>, b=0x7fa99f066c50) at io_u.c:65
  208 Thread 0x7fa99e865700 (LWP 6513)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  207 Thread 0x7fa99e063700 (LWP 6514)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  206 Thread 0x7fa99d861700 (LWP 6515)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  205 Thread 0x7fa99d05f700 (LWP 6516)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  204 Thread 0x7fa99c85d700 (LWP 6517)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  203 Thread 0x7fa99c05b700 (LWP 6518)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  202 Thread 0x7fa99b859700 (LWP 6519)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  201 Thread 0x7fa99b057700 (LWP 6520)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  200 Thread 0x7fa99a855700 (LWP 6521)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  199 Thread 0x7fa99a053700 (LWP 6522)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  198 Thread 0x7fa999851700 (LWP 6523)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  197 Thread 0x7fa99904e700 (LWP 6524)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  196 Thread 0x7fa99884c700 (LWP 6525)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  195 Thread 0x7fa99804a700 (LWP 6526)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  194 Thread 0x7fa997848700 (LWP 6527)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  193 Thread 0x7fa997046700 (LWP 6528)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  192 Thread 0x7fa996844700 (LWP 6529)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  191 Thread 0x7fa996042700 (LWP 6530)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  190 Thread 0x7fa995840700 (LWP 6531)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  189 Thread 0x7fa99503e700 (LWP 6532)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  188 Thread 0x7fa99483c700 (LWP 6533)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  187 Thread 0x7fa99403a700 (LWP 6534)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  186 Thread 0x7fa993838700 (LWP 6535)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  185 Thread 0x7fa993036700 (LWP 6536)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  184 Thread 0x7fa992834700 (LWP 6537)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  183 Thread 0x7fa992031700 (LWP 6538)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  182 Thread 0x7fa99182f700 (LWP 6539)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  181 Thread 0x7fa99102d700 (LWP 6540)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  180 Thread 0x7fa99082b700 (LWP 6541)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  179 Thread 0x7fa990029700 (LWP 6542)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  178 Thread 0x7fa98f827700 (LWP 6543)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  177 Thread 0x7fa98f025700 (LWP 6544)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  176 Thread 0x7fa98e823700 (LWP 6545)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  175 Thread 0x7fa98e021700 (LWP 6546)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  174 Thread 0x7fa98d81f700 (LWP 6547)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  173 Thread 0x7fa98d01d700 (LWP 6548)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  172 Thread 0x7fa98c81b700 (LWP 6549)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  171 Thread 0x7fa98c019700 (LWP 6550)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  170 Thread 0x7fa98b817700 (LWP 6551)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  169 Thread 0x7fa98b014700 (LWP 6552)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  168 Thread 0x7fa98a812700 (LWP 6553)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  167 Thread 0x7fa98a010700 (LWP 6554)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  166 Thread 0x7fa98980e700 (LWP 6555)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  165 Thread 0x7fa98900c700 (LWP 6556)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  164 Thread 0x7fa98880a700 (LWP 6557)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  163 Thread 0x7fa988008700 (LWP 6558)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  162 Thread 0x7fa987806700 (LWP 6559)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  161 Thread 0x7fa987004700 (LWP 6560)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  160 Thread 0x7fa986802700 (LWP 6561)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  159 Thread 0x7fa986000700 (LWP 6562)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  158 Thread 0x7fa9857fe700 (LWP 6563)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  157 Thread 0x7fa984ffc700 (LWP 6564)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  156 Thread 0x7fa9847fa700 (LWP 6565)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  155 Thread 0x7fa983ff7700 (LWP 6566)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  154 Thread 0x7fa9837f5700 (LWP 6567)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  153 Thread 0x7fa982ff3700 (LWP 6568)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  152 Thread 0x7fa9827f1700 (LWP 6569)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  151 Thread 0x7fa981fef700 (LWP 6570)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  150 Thread 0x7fa9817ed700 (LWP 6571)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  149 Thread 0x7fa980feb700 (LWP 6572)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  148 Thread 0x7fa9807e9700 (LWP 6573)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  147 Thread 0x7fa97ffe7700 (LWP 6574)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  146 Thread 0x7fa97f7e5700 (LWP 6575)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  145 Thread 0x7fa97efe3700 (LWP 6576)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  144 Thread 0x7fa97e7e1700 (LWP 6577)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  143 Thread 0x7fa97dfdf700 (LWP 6578)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  142 Thread 0x7fa97d7dd700 (LWP 6579)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  141 Thread 0x7fa97cfda700 (LWP 6580)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  140 Thread 0x7fa97c7d8700 (LWP 6581)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  139 Thread 0x7fa97bfd6700 (LWP 6582)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  138 Thread 0x7fa97b7d4700 (LWP 6583)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  137 Thread 0x7fa97afd2700 (LWP 6584)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  136 Thread 0x7fa97a7d0700 (LWP 6585)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  135 Thread 0x7fa979fce700 (LWP 6586)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  134 Thread 0x7fa9797cc700 (LWP 6587)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  133 Thread 0x7fa978fca700 (LWP 6588)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  132 Thread 0x7fa9787c8700 (LWP 6589)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  131 Thread 0x7fa977fc6700 (LWP 6590)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  130 Thread 0x7fa9777c4700 (LWP 6591)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  129 Thread 0x7fa976fc2700 (LWP 6592)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  128 Thread 0x7fa9767c0700 (LWP 6593)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  127 Thread 0x7fa975fbd700 (LWP 6594)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  126 Thread 0x7fa9757bb700 (LWP 6595)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  125 Thread 0x7fa974fb9700 (LWP 6596)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  124 Thread 0x7fa9747b7700 (LWP 6597)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  123 Thread 0x7fa973fb5700 (LWP 6598)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  122 Thread 0x7fa9737b3700 (LWP 6599)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  121 Thread 0x7fa972fb1700 (LWP 6600)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  120 Thread 0x7fa9727af700 (LWP 6601)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  119 Thread 0x7fa971fad700 (LWP 6602)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  118 Thread 0x7fa9717ab700 (LWP 6603)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  117 Thread 0x7fa970fa9700 (LWP 6604)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  116 Thread 0x7fa9707a7700 (LWP 6605)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  115 Thread 0x7fa96ffa5700 (LWP 6606)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  114 Thread 0x7fa96f7a3700 (LWP 6607)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  113 Thread 0x7fa9637f6700 (LWP 6622)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  112 Thread 0x7fa962ff5700 (LWP 6623)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  111 Thread 0x7fa9627f4700 (LWP 6624)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  110 Thread 0x7fa961ff3700 (LWP 6625)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  109 Thread 0x7fa9617f2700 (LWP 6626)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  108 Thread 0x7fa960ff1700 (LWP 6627)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  107 Thread 0x7fa9607f0700 (LWP 6628)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  106 Thread 0x7fa95ffef700 (LWP 6629)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  105 Thread 0x7fa95f7ee700 (LWP 6630)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  104 Thread 0x7fa95efed700 (LWP 6631)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  103 Thread 0x7fa95e7ec700 (LWP 6632)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  102 Thread 0x7fa95dfeb700 (LWP 6633)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  101 Thread 0x7fa95d7ea700 (LWP 6634)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  100 Thread 0x7fa95cfe9700 (LWP 6635)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  99 Thread 0x7fa95c7e8700 (LWP 6636)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  98 Thread 0x7fa95bfe7700 (LWP 6637)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  97 Thread 0x7fa95b7e6700 (LWP 6638)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  96 Thread 0x7fa95afe5700 (LWP 6639)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  95 Thread 0x7fa95a7e4700 (LWP 6640)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  94 Thread 0x7fa959fe3700 (LWP 6641)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  93 Thread 0x7fa9597e2700 (LWP 6642)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  92 Thread 0x7fa958fe1700 (LWP 6643)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  91 Thread 0x7fa9587e0700 (LWP 6644)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  90 Thread 0x7fa957fdf700 (LWP 6645)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  89 Thread 0x7fa9577de700 (LWP 6646)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  88 Thread 0x7fa956fdd700 (LWP 6647)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  87 Thread 0x7fa9567dc700 (LWP 6648)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  86 Thread 0x7fa955fdb700 (LWP 6649)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  85 Thread 0x7fa9557da700 (LWP 6650)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  84 Thread 0x7fa954fd9700 (LWP 6651)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  83 Thread 0x7fa9547d8700 (LWP 6652)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  82 Thread 0x7fa953fd7700 (LWP 6653)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  81 Thread 0x7fa9537d6700 (LWP 6654)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  80 Thread 0x7fa952fd5700 (LWP 6655)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  79 Thread 0x7fa9527d4700 (LWP 6656)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  78 Thread 0x7fa951fd3700 (LWP 6657)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  77 Thread 0x7fa9517d2700 (LWP 6658)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  76 Thread 0x7fa950fd1700 (LWP 6659)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  75 Thread 0x7fa9507d0700 (LWP 6660)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  74 Thread 0x7fa94ffcf700 (LWP 6661)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  73 Thread 0x7fa94f7ce700 (LWP 6662)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  72 Thread 0x7fa94efcd700 (LWP 6663)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  71 Thread 0x7fa94e7cc700 (LWP 6664)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  70 Thread 0x7fa94dfcb700 (LWP 6665)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  69 Thread 0x7fa94d7ca700 (LWP 6666)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  68 Thread 0x7fa94cfc9700 (LWP 6667)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  67 Thread 0x7fa94c7c8700 (LWP 6668)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  66 Thread 0x7fa94bfc7700 (LWP 6669)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  65 Thread 0x7fa94b7c6700 (LWP 6670)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  64 Thread 0x7fa94afc5700 (LWP 6671)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  63 Thread 0x7fa94a7c4700 (LWP 6672)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  62 Thread 0x7fa949fc3700 (LWP 6673)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  61 Thread 0x7fa9497c2700 (LWP 6674)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  60 Thread 0x7fa948fc1700 (LWP 6675)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  59 Thread 0x7fa9487c0700 (LWP 6676)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  58 Thread 0x7fa947fbf700 (LWP 6677)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  57 Thread 0x7fa9477be700 (LWP 6678)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  56 Thread 0x7fa946fbd700 (LWP 6679)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  55 Thread 0x7fa9467bc700 (LWP 6680)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  54 Thread 0x7fa945fbb700 (LWP 6681)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  53 Thread 0x7fa9457ba700 (LWP 6682)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  52 Thread 0x7fa944fb9700 (LWP 6683)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  51 Thread 0x7fa9447b8700 (LWP 6684)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  50 Thread 0x7fa943fb7700 (LWP 6685)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  49 Thread 0x7fa9437b6700 (LWP 6686)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  48 Thread 0x7fa942fb5700 (LWP 6687)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  47 Thread 0x7fa9427b4700 (LWP 6688)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  46 Thread 0x7fa941fb3700 (LWP 6689)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  45 Thread 0x7fa9417b2700 (LWP 6690)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  44 Thread 0x7fa940fb1700 (LWP 6691)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  43 Thread 0x7fa9407b0700 (LWP 6692)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  42 Thread 0x7fa93ffaf700 (LWP 6693)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  41 Thread 0x7fa93f7ae700 (LWP 6694)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  40 Thread 0x7fa93efad700 (LWP 6695)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  39 Thread 0x7fa93e7ac700 (LWP 6696)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  38 Thread 0x7fa93dfab700 (LWP 6697)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  37 Thread 0x7fa93d7aa700 (LWP 6698)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  36 Thread 0x7fa93cfa9700 (LWP 6699)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  35 Thread 0x7fa93c7a8700 (LWP 6700)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  34 Thread 0x7fa93bfa7700 (LWP 6701)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  33 Thread 0x7fa93b7a6700 (LWP 6702)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  32 Thread 0x7fa93afa5700 (LWP 6703)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  31 Thread 0x7fa93a7a4700 (LWP 6704)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  30 Thread 0x7fa939fa3700 (LWP 6705)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  29 Thread 0x7fa9397a2700 (LWP 6706)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  28 Thread 0x7fa938fa1700 (LWP 6707)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  27 Thread 0x7fa9387a0700 (LWP 6708)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  26 Thread 0x7fa937f9f700 (LWP 6709)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  25 Thread 0x7fa93779e700 (LWP 6710)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  24 Thread 0x7fa936f9d700 (LWP 6711)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  23 Thread 0x7fa93679c700 (LWP 6712)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  22 Thread 0x7fa935f9b700 (LWP 6713)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  21 Thread 0x7fa93579a700 (LWP 6714)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  20 Thread 0x7fa934f99700 (LWP 6715)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  19 Thread 0x7fa934798700 (LWP 6716)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  18 Thread 0x7fa933f97700 (LWP 6717)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  17 Thread 0x7fa933796700 (LWP 6718)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  16 Thread 0x7fa932f95700 (LWP 6719)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  15 Thread 0x7fa932794700 (LWP 6720)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  14 Thread 0x7fa931f93700 (LWP 6721)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  13 Thread 0x7fa931792700 (LWP 6722)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  12 Thread 0x7fa930f91700 (LWP 6723)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  11 Thread 0x7fa930790700 (LWP 6724)  0x000000000040b466 in td_io_getevents (td=0x7fa9a0da27c0, min=14, max=14, t=0x0) at ioengines.c:237
  10 Thread 0x7fa92ff8f700 (LWP 6725)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  9 Thread 0x7fa92f78e700 (LWP 6726)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  8 Thread 0x7fa92ef8d700 (LWP 6727)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  7 Thread 0x7fa92e78c700 (LWP 6728)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  6 Thread 0x7fa92df8b700 (LWP 6729)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
  5 Thread 0x7fa92d78a700 (LWP 6730)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  4 Thread 0x7fa92cf89700 (LWP 6731)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  3 Thread 0x7fa92c788700 (LWP 6732)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
  2 Thread 0x7fa92bf87700 (LWP 6733)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
* 1 Thread 0x7fa9a99b1720 (LWP 6508)  0x00000039748acced in nanosleep () from /lib64/libc.so.6

I wonder if one of the time values is wrong or overflowing.  There was a problem in
handling ms vs. s units a while back...

(gdb) thread 2
[Switching to thread 2 (Thread 0x7fa92bf87700 (LWP 6733))]#0  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
(gdb) bt
#0  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
#1  0x0000000000457058 in fio_libaio_commit (td=0x7fa9a0dd1860) at engines/libaio.c:255
#2  0x000000000040b395 in td_io_commit (td=0x7fa9a0dd1860) at ioengines.c:396
#3  0x000000000040bea1 in td_io_queue (td=0x7fa9a0dd1860, io_u=0x7fa8e4015400) at ioengines.c:343
#4  0x000000000044a75d in do_io (td=0x7fa9a0dd1860) at backend.c:792
#5  0x000000000044c209 in thread_main (data=0x7fa9a0dd1860) at backend.c:1504
#6  0x0000003974c079d1 in start_thread () from /lib64/libpthread.so.0
#7  0x00000039748e8b7d in clone () from /lib64/libc.so.6
(gdb) select-frame 5
(gdb) print elapsed
$29 = <value optimized out>
(gdb) print td->ts.runtime
$30 = {68075502, 0, 0}
(gdb) print td->start
$31 = {tv_sec = 1099511, tv_usec = 861468}
(gdb) print td->epoch
$32 = {tv_sec = 1031436, tv_usec = 430008}
(gdb) print td->ts.total_run_time
$33 = 0
(gdb) print td->ts.runtime
$35 = {68075502, 0, 0}

68075502 ms is only 18.9 hours.
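
(Checking the units: 68075502 ms ~ 68076 s ~ 18.9 hours, which matches
td->start.tv_sec - td->epoch.tv_sec = 1099511 - 1031436 = 68075 s.)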





^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: fio main thread got stuck over the weekend
  2014-12-15 17:33             ` Elliott, Robert (Server Storage)
@ 2014-12-15 17:37               ` Jens Axboe
  2014-12-15 19:39                 ` Elliott, Robert (Server Storage)
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2014-12-15 17:37 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage), stephenmcameron; +Cc: fio

On 12/15/2014 10:33 AM, Elliott, Robert (Server Storage) wrote:
>>
>> If you have it running, it would be great to do a gdb attach and see
>> what the frontend is up to (or where it might be stuck)...
>>
>
> The last eta printed over 3 days ago was
> 	[eta 01d:17h:05m:24s]
> but it is still running.
>
> One device did report several errors during the test before the eta
> hang, and there are no IOs going to that device.
>
> There are 212 threads running out of 224 started.
>
> (gdb) info threads
>    212 Thread 0x7fa9a086e700 (LWP 6509)  0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>    211 Thread 0x7fa9a006b700 (LWP 6510)  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
>    210 Thread 0x7fa99f869700 (LWP 6511)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
>    209 Thread 0x7fa99f067700 (LWP 6512)  last_block (td=0x7fa9a094e740, f=0x7fa9a91a8390, ddir=<value optimized out>, b=0x7fa99f066c50) at io_u.c:65
>    208 Thread 0x7fa99e865700 (LWP 6513)  0x0000003657600614 in ?? () from /lib64/libaio.so.1
>    [... threads 207 through 2 trimmed from the quote; identical to the full listing above, nearly all in io_submit () from /lib64/libaio.so.1 ...]
> * 1 Thread 0x7fa9a99b1720 (LWP 6508)  0x00000039748acced in nanosleep () from /lib64/libc.so.6
>
> I wonder if one of the time values is wrong or overflowing.  There was a problem in
> handling ms vs. s units a while back...
>
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x7fa92bf87700 (LWP 6733))]#0  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
> (gdb) bt
> #0  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
> #1  0x0000000000457058 in fio_libaio_commit (td=0x7fa9a0dd1860) at engines/libaio.c:255
> #2  0x000000000040b395 in td_io_commit (td=0x7fa9a0dd1860) at ioengines.c:396
> #3  0x000000000040bea1 in td_io_queue (td=0x7fa9a0dd1860, io_u=0x7fa8e4015400) at ioengines.c:343
> #4  0x000000000044a75d in do_io (td=0x7fa9a0dd1860) at backend.c:792
> #5  0x000000000044c209 in thread_main (data=0x7fa9a0dd1860) at backend.c:1504
> #6  0x0000003974c079d1 in start_thread () from /lib64/libpthread.so.0
> #7  0x00000039748e8b7d in clone () from /lib64/libc.so.6
> (gdb) select-frame 5
> (gdb) print elapsed
> $29 = <value optimized out>
> (gdb) print td->ts.runtime
> $30 = {68075502, 0, 0}
> (gdb) print td->start
> $31 = {tv_sec = 1099511, tv_usec = 861468}
> (gdb) print td->epoch
> $32 = {tv_sec = 1031436, tv_usec = 430008}
> (gdb) print td->ts.total_run_time
> $33 = 0
> (gdb) print td->ts.runtime
> $35 = {68075502, 0, 0}
>
> 68075502 ms is only 18.9 hours.

->start is set on entering a new loop, so it's not that unlikely that 
it's just 18.9 hours. Total runtime is summed after the fact. The one 
I think is interesting is the first thread, that is the one that should 
be printing eta stats. And it seems to be stuck:

>    212 Thread 0x7fa9a086e700 (LWP 6509)  0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

can you do a bt on that? This looks like it _might_ be an issue that was 
fixed recently with stats. Are you running fio with any output or status 
options?


-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: fio main thread got stuck over the weekend
  2014-12-15 17:37               ` Jens Axboe
@ 2014-12-15 19:39                 ` Elliott, Robert (Server Storage)
  2014-12-15 20:12                   ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-12-15 19:39 UTC (permalink / raw)
  To: Jens Axboe, stephenmcameron; +Cc: fio



> it's just 18.9 hours. Total runtime is summed after the fact. The one
> I think is interesting is the first thread, that is the one that should
> be printing eta stats. And it seems to be stuck:
> 
>  >    212 Thread 0x7fa9a086e700 (LWP 6509)  0x0000003974c0b98e in
> pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> 
> can you do a bt on that? This looks like it _might_ be an issue that was
> fixed recently with stats. Are you running fio with any output or status
> options?
> 

No output options.

(gdb) thread 212
[Switching to thread 212 (Thread 0x7fa9a086e700 (LWP 6509))]#0  0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000449381 in helper_thread_main (data=<value optimized out>) at backend.c:2127
#2  0x0000003974c079d1 in start_thread () from /lib64/libpthread.so.0
#3  0x00000039748e8b7d in clone () from /lib64/libc.so.6
(gdb) select-frame 1
(gdb) info local
ts = {tv_sec = 1418658662, tv_nsec = 980074000}
tv = {tv_sec = 1418658662, tv_usec = 730074}
ret = <value optimized out>
(gdb) print helper_exit
$36 = 0
(gdb) print nr_thread
$37 = 224
(gdb) print nr_process
$38 = 0
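
(Note the timeout math: ts.tv_nsec - tv.tv_usec * 1000 =
980074000 - 730074000 = 250,000,000 ns, i.e. the helper appears to be
in an ordinary 250 ms periodic wait, and the captured timespec itself
is valid, tv_nsec < 10^9.)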

(gdb) thread 1
[Switching to thread 1 (Thread 0x7fa9a99b1720 (LWP 6508))]#0  0x00000039748acced in nanosleep () from /lib64/libc.so.6
(gdb) bt
#0  0x00000039748acced in nanosleep () from /lib64/libc.so.6
#1  0x00000039748e1e64 in usleep () from /lib64/libc.so.6
#2  0x000000000044d019 in do_usleep () at backend.c:1841
#3  run_threads () at backend.c:2083
#4  0x000000000044d8ab in fio_backend () at backend.c:2199
#5  0x000000397481ed1d in __libc_start_main () from /lib64/libc.so.6
#6  0x000000000040a199 in _start ()
(gdb) select-frame 3
(gdb) info local
td = <value optimized out>
i = <value optimized out>
todo = <value optimized out>
nr_running = 210
m_rate = 0
t_rate = 0
nr_started = <value optimized out>
spent = <value optimized out>
(gdb) print thread_number
$39 = 224

so run_threads() is probably in this loop waiting for 210 threads to quit:
        while (nr_running) {
                reap_threads(&nr_running, &t_rate, &m_rate);
                do_usleep(10000);
        }
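
(That also fits the counts: 224 started - 210 running = 14 threads
gone, exactly one job's numjobs=14 worth, presumably the ones pointed
at the device that errored; note the LWP gap 6608-6621 in the thread
listing.)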

Given:

(gdb) thread 2
[Switching to thread 2 (Thread 0x7fa92bf87700 (LWP 6733))]#0  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
(gdb) print td->error
No symbol "td" in current context.
(gdb) bt
#0  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
#1  0x0000000000457058 in fio_libaio_commit (td=0x7fa9a0dd1860) at engines/libaio.c:255
#2  0x000000000040b395 in td_io_commit (td=0x7fa9a0dd1860) at ioengines.c:396
#3  0x000000000040bea1 in td_io_queue (td=0x7fa9a0dd1860, io_u=0x7fa8e4015400) at ioengines.c:343
#4  0x000000000044a75d in do_io (td=0x7fa9a0dd1860) at backend.c:792
#5  0x000000000044c209 in thread_main (data=0x7fa9a0dd1860) at backend.c:1504
#6  0x0000003974c079d1 in start_thread () from /lib64/libpthread.so.0
#7  0x00000039748e8b7d in clone () from /lib64/libc.so.6
(gdb) select-frame 5
(gdb) print td->error
$40 = 0
(gdb) print td->terminate
$41 = 0
(gdb) print td->done
$42 = 0
(gdb) print td->o.time_based
$43 = 1
(gdb) print o->create_only
$47 = 0

I guess the threads are not leaving this loop in thread_main(): 
        while (keep_running(td)) {
        ...
        }

static int keep_running(struct thread_data *td)
{
        unsigned long long limit;

        if (td->done)
                return 0;
        if (td->o.time_based)
                return 1;
...

After the runtime has elapsed, should td->done or td->terminate
get set?


The jobfile is:
[global]
direct=1
ioengine=libaio
norandommap
randrepeat=0
bs=512
iodepth=96
numjobs=1
numjobs=14
runtime=216000
time_based=1
group_reporting
thread
gtod_reduce=1
iodepth_batch=14
iodepth_batch_complete=14
iodepth_low=14
userspace_reap=1
cpus_allowed=0-13
cpus_allowed_policy=split
rw=randread

[drive_b]
filename=/dev/sdb

[drive_c]
filename=/dev/sdc

[drive_d]
filename=/dev/sdd

[drive_e]
filename=/dev/sde

[drive_f]
filename=/dev/sdf

[drive_g]
filename=/dev/sdg

[drive_h]
filename=/dev/sdh

[drive_i]
filename=/dev/sdi

[drive_j]
filename=/dev/sdj

[drive_k]
filename=/dev/sdk

[drive_l]
filename=/dev/sdl

[drive_m]
filename=/dev/sdm

[drive_n]
filename=/dev/sdn

[drive_o]
filename=/dev/sdo

[drive_p]
filename=/dev/sdp

[drive_q]
filename=/dev/sdq
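
(That is 16 drive sections x numjobs=14 = 224 threads, matching
nr_thread above; where numjobs appears twice in [global], the later
line takes effect.)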





^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: fio main thread got stuck over the weekend
  2014-12-15 19:39                 ` Elliott, Robert (Server Storage)
@ 2014-12-15 20:12                   ` Jens Axboe
  2014-12-15 20:31                     ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2014-12-15 20:12 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage), stephenmcameron; +Cc: fio

On 12/15/2014 12:39 PM, Elliott, Robert (Server Storage) wrote:
> 
> 
>> it's just 18.9 hours. Total runtime is summed after the fact. The one
>> I think is interesting is the first thread, that is the one that should
>> be printing eta stats. And it seems to be stuck:
>>
>>  >    212 Thread 0x7fa9a086e700 (LWP 6509)  0x0000003974c0b98e in
>> pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>
>> can you do a bt on that? This looks like it _might_ be an issue that was
>> fixed recently with stats. Are you running fio with any output or status
>> options?
>>
> 
> No output options.

OK

> (gdb) thread 212
> [Switching to thread 212 (Thread 0x7fa9a086e700 (LWP 6509))]#0  0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> (gdb) bt
> #0  0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x0000000000449381 in helper_thread_main (data=<value optimized out>) at backend.c:2127
> #2  0x0000003974c079d1 in start_thread () from /lib64/libpthread.so.0
> #3  0x00000039748e8b7d in clone () from /lib64/libc.so.6
> (gdb) select-frame 1
> (gdb) info local
> ts = {tv_sec = 1418658662, tv_nsec = 980074000}
> tv = {tv_sec = 1418658662, tv_usec = 730074}
> ret = <value optimized out>
> (gdb) print helper_exit
> $36 = 0
> (gdb) print nr_thread
> $37 = 224
> (gdb) print nr_process
> $38 = 0

This is the helper thread, this is what will output the ETA etc stats.
There's a signal handler that runs and sets/wakes it to do so. So it
would be interesting if you could see what it is up to, set a break
point in there and check what the control flow ends up being.

It's backend.c:helper_thread_main()

backend.c:sig_show_status() should repeatedly fire and end up waking
this thread.
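
For example, something along these lines against the running process
(using the timedwait call site from your bt) would show whether the
helper loop is still cycling:

(gdb) break backend.c:2127
(gdb) continue

If the helper is healthy that breakpoint should hit at least once a
second; if it never fires, the thread is wedged inside
pthread_cond_timedwait().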

> 
> (gdb) thread 1
> [Switching to thread 1 (Thread 0x7fa9a99b1720 (LWP 6508))]#0  0x00000039748acced in nanosleep () from /lib64/libc.so.6
> (gdb) bt
> #0  0x00000039748acced in nanosleep () from /lib64/libc.so.6
> #1  0x00000039748e1e64 in usleep () from /lib64/libc.so.6
> #2  0x000000000044d019 in do_usleep () at backend.c:1841
> #3  run_threads () at backend.c:2083
> #4  0x000000000044d8ab in fio_backend () at backend.c:2199
> #5  0x000000397481ed1d in __libc_start_main () from /lib64/libc.so.6
> #6  0x000000000040a199 in _start ()
>  (gdb) select-frame 3
>  (gdb) info local
> td = <value optimized out>
> i = <value optimized out>
> todo = <value optimized out>
> nr_running = 210
> m_rate = 0
> t_rate = 0
> nr_started = <value optimized out>
> spent = <value optimized out>
> (gdb) print thread_number
> $39 = 224
> 
> so run_threads() is probably in this loop waiting for 210 threads to quit:
>         while (nr_running) {
>                 reap_threads(&nr_running, &t_rate, &m_rate);
>                 do_usleep(10000);
>         }

Yes, which is expected. Once all jobs have been started, it'll
continually sit in that loop and wait for them to finish.

> 
> Given:
> 
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x7fa92bf87700 (LWP 6733))]#0  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
> (gdb) print td->error
> No symbol "td" in current context.
> (gdb) bt
> #0  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
> #1  0x0000000000457058 in fio_libaio_commit (td=0x7fa9a0dd1860) at engines/libaio.c:255
> #2  0x000000000040b395 in td_io_commit (td=0x7fa9a0dd1860) at ioengines.c:396
> #3  0x000000000040bea1 in td_io_queue (td=0x7fa9a0dd1860, io_u=0x7fa8e4015400) at ioengines.c:343
> #4  0x000000000044a75d in do_io (td=0x7fa9a0dd1860) at backend.c:792
> #5  0x000000000044c209 in thread_main (data=0x7fa9a0dd1860) at backend.c:1504
> #6  0x0000003974c079d1 in start_thread () from /lib64/libpthread.so.0
> #7  0x00000039748e8b7d in clone () from /lib64/libc.so.6
> (gdb) select-frame 5
> (gdb) print td->error
> $40 = 0
> (gdb) print td->terminate
> $41 = 0
> (gdb) print td->done
> $42 = 0
> (gdb) print td->o.time_based
> $43 = 1
> (gdb) print o->create_only
> $47 = 0

It's still actively running, correct? Not stuck in io_submit()?

> I guess the threads are not leaving this loop in thread_main(): 
>         while (keep_running(td)) {
>         ...
>         }
> 
> static int keep_running(struct thread_data *td)
> {
>         unsigned long long limit;
> 
>         if (td->done)
>                 return 0;
>         if (td->o.time_based)
>                 return 1;
> ...
> 
> After the runtime has elapsed, should td->done or td->terminate
> get set?

Nope, when the runtime is exceeded, the thread will notice by calling
runtime_exceeded() in do_io(). ->terminate is set if you do ctrl-c for
instance.

Are you seeing evidence that it's running for longer than it should?
From your job file:

> runtime=216000
> time_based=1

it should run for 2.5 days, then exit.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: fio main thread got stuck over the weekend
  2014-12-15 20:12                   ` Jens Axboe
@ 2014-12-15 20:31                     ` Jens Axboe
  2014-12-15 20:49                       ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2014-12-15 20:31 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage), stephenmcameron; +Cc: fio

On 12/15/2014 01:12 PM, Jens Axboe wrote:
>> (gdb) thread 212
>> [Switching to thread 212 (Thread 0x7fa9a086e700 (LWP 6509))]#0  0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>> (gdb) bt
>> #0  0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>> #1  0x0000000000449381 in helper_thread_main (data=<value optimized out>) at backend.c:2127
>> #2  0x0000003974c079d1 in start_thread () from /lib64/libpthread.so.0
>> #3  0x00000039748e8b7d in clone () from /lib64/libc.so.6
>> (gdb) select-frame 1
>> (gdb) info local
>> ts = {tv_sec = 1418658662, tv_nsec = 980074000}
>> tv = {tv_sec = 1418658662, tv_usec = 730074}
>> ret = <value optimized out>
>> (gdb) print helper_exit
>> $36 = 0
>> (gdb) print nr_thread
>> $37 = 224
>> (gdb) print nr_process
>> $38 = 0
> 
> This is the helper thread, this is what will output the ETA etc stats.
> There's a signal handler that runs and sets/wakes it to do so. So it
> would be interesting if you could see what it is up to, set a break
> point in there and check what the control flow ends up being.
> 
> It's backend.c:helper_thread_main()
> 
> backend.c:sig_show_status() should repeatedly fire and end up waking
> this thread.

So that referred to the output status; your bug was on the ETA, correct?
For that, we never get woken up by someone signaling our condvar; we
just wake up every second to do the ETA display. The time conversion
looks wrong, though:

gettimeofday(&tv, NULL);
ts.tv_sec = tv.tv_sec + sec;
ts.tv_nsec = (tv.tv_usec * 1000) + nsec;
if (ts.tv_nsec > 1000000000ULL) {
	ts.tv_nsec -= 1000000000ULL;
	ts.tv_sec++;
}

I wonder what happens if ts.tv_nsec == 10^9; that check really should be
on >= instead. Not sure if this explains it though, I'll conjure up a
test case...
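
The fix should just be flipping that comparison, i.e. something like:

gettimeofday(&tv, NULL);
ts.tv_sec = tv.tv_sec + sec;
ts.tv_nsec = (tv.tv_usec * 1000) + nsec;
if (ts.tv_nsec >= 1000000000ULL) {
	ts.tv_nsec -= 1000000000ULL;
	ts.tv_sec++;
}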

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: fio main thread got stuck over the weekend
  2014-12-15 20:31                     ` Jens Axboe
@ 2014-12-15 20:49                       ` Jens Axboe
  2014-12-16  0:52                         ` Elliott, Robert (Server Storage)
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2014-12-15 20:49 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage), stephenmcameron; +Cc: fio

[-- Attachment #1: Type: text/plain, Size: 2222 bytes --]

On 12/15/2014 01:31 PM, Jens Axboe wrote:
> On 12/15/2014 01:12 PM, Jens Axboe wrote:
>>> (gdb) thread 212
>>> [Switching to thread 212 (Thread 0x7fa9a086e700 (LWP 6509))]#0  0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>> (gdb) bt
>>> #0  0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>> #1  0x0000000000449381 in helper_thread_main (data=<value optimized out>) at backend.c:2127
>>> #2  0x0000003974c079d1 in start_thread () from /lib64/libpthread.so.0
>>> #3  0x00000039748e8b7d in clone () from /lib64/libc.so.6
>>> (gdb) select-frame 1
>>> (gdb) info local
>>> ts = {tv_sec = 1418658662, tv_nsec = 980074000}
>>> tv = {tv_sec = 1418658662, tv_usec = 730074}
>>> ret = <value optimized out>
>>> (gdb) print helper_exit
>>> $36 = 0
>>> (gdb) print nr_thread
>>> $37 = 224
>>> (gdb) print nr_process
>>> $38 = 0
>>
>> This is the helper thread, this is what will output the ETA etc stats.
>> There's a signal handler that runs and sets/wakes it to do so. So it
>> would be interesting if you could see what it is up to, set a break
>> point in there and check what the control flow ends up being.
>>
>> It's backend.c:helper_thread_main()
>>
>> backend.c:sig_show_status() should repeatedly fire and end up waking
>> this thread.
> 
> So that referred to the output status; your bug was on the ETA, correct?
> For that, we never get woken up by someone signaling our condvar; we
> just wake up every second to do the ETA display. The time conversion
> looks wrong, though:
> 
> gettimeofday(&tv, NULL);
> ts.tv_sec = tv.tv_sec + sec;
> ts.tv_nsec = (tv.tv_usec * 1000) + nsec;
> if (ts.tv_nsec > 1000000000ULL) {
> 	ts.tv_nsec -= 1000000000ULL;
> 	ts.tv_sec++;
> }
> 
> I wonder what happens if ts.tv_nsec == 10^9; that check really should be
> on >= instead. Not sure if this explains it though, I'll conjure up a
> test case...

On my system, pthread_cond_timedwait() returns EINVAL if nsec is 10^9 or
higher, which is the sane thing to do. I wonder if it does that at your
end too, or whether it hangs? If it hangs indefinitely, that could
explain it. Test case attached, I checked in a fix for the off-by-one.

-- 
Jens Axboe


[-- Attachment #2: cvtest.c --]
[-- Type: text/x-csrc, Size: 1006 bytes --]

/*
 * gcc -Wall -O2 -o cvtest cvtest.c -lpthread
 */
#include <stdio.h>
#include <limits.h>
#include <inttypes.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/time.h>
#include <string.h>
#include <pthread.h>

#define DISK_UTIL_MSEC	250

struct d {
	pthread_mutex_t *lock;
	pthread_cond_t *cond;
};

static void *thread(void *data)
{
	struct d *d = data;
	struct timespec ts;
	struct timeval tv;
	int ret;

	gettimeofday(&tv, NULL);
	ts.tv_sec = tv.tv_sec + 1;
	/* deliberately one past the max: tv_nsec must stay below 10^9 */
	ts.tv_nsec = 1000000000ULL;

	printf("will wait\n");
	pthread_mutex_lock(d->lock);
	ret = pthread_cond_timedwait(d->cond, d->lock, &ts);
	pthread_mutex_unlock(d->lock);
	printf("done wait %d\n", ret);
	return NULL;
}

int main(int argc, char *argv[])
{
	pthread_cond_t cond;
	pthread_mutex_t lock;
	pthread_t pthread;
	struct d d;
	void *ret;

	pthread_cond_init(&cond, NULL);
	pthread_mutex_init(&lock, NULL);

	d.lock = &lock;
	d.cond = &cond;

	pthread_create(&pthread, NULL, thread, &d);
	pthread_join(pthread, &ret);

	return 0;
}

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: fio main thread got stuck over the weekend
  2014-12-15 20:49                       ` Jens Axboe
@ 2014-12-16  0:52                         ` Elliott, Robert (Server Storage)
  2014-12-16  2:51                           ` Jens Axboe
  2014-12-16 22:43                           ` Jens Axboe
  0 siblings, 2 replies; 20+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-12-16  0:52 UTC (permalink / raw)
  To: Jens Axboe, stephenmcameron; +Cc: fio



> -----Original Message-----
> From: Jens Axboe [mailto:axboe@kernel.dk]
> Sent: Monday, 15 December, 2014 2:50 PM
> To: Elliott, Robert (Server Storage); stephenmcameron@gmail.com
> Cc: fio@vger.kernel.org
> Subject: Re: fio main thread got stuck over the weekend
> 
> On 12/15/2014 01:31 PM, Jens Axboe wrote:
> > On 12/15/2014 01:12 PM, Jens Axboe wrote:
> >>> (gdb) thread 212
> >>> [Switching to thread 212 (Thread 0x7fa9a086e700 (LWP 6509))]#0
> 0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> >>> (gdb) bt
> >>> #0  0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> >>> #1  0x0000000000449381 in helper_thread_main (data=<value optimized
> out>) at backend.c:2127
> >>> #2  0x0000003974c079d1 in start_thread () from /lib64/libpthread.so.0
> >>> #3  0x00000039748e8b7d in clone () from /lib64/libc.so.6
> >>> (gdb) select-frame 1
> >>> (gdb) info local
> >>> ts = {tv_sec = 1418658662, tv_nsec = 980074000}
> >>> tv = {tv_sec = 1418658662, tv_usec = 730074}
> >>> ret = <value optimized out>
> >>> (gdb) print helper_exit
> >>> $36 = 0
> >>> (gdb) print nr_thread
> >>> $37 = 224
> >>> (gdb) print nr_process
> >>> $38 = 0
> >>
> >> This is the helper thread, this is what will output the ETA etc stats.
> >> There's a signal handler that runs and sets/wakes it to do so. So it
> >> would be interesting if you could see what it is up to, set a break
> >> point in there and check what the control flow ends up being.
> >>
> >> It's backend.c:helper_thread_main()
> >>
> >> backend.c:sig_show_status() should repeatedly fire and end up waking
> >> this thread.
> >
> > So that referred to the output status, your bug was on the ETA, correct?

Yes, the first anomaly was no ETAs incrementing, while IO was still
occurring. 

The second anomaly was running forever.


> > For that, we never get woken up by someone signaling our condvar, we
> > just wake up every second to do the ETA display. The time conversion
> > looks wrong, though:
> >
> > gettimeofday(&tv, NULL);
> > ts.tv_sec = tv.tv_sec + sec;
> > ts.tv_nsec = (tv.tv_usec * 1000) + nsec;
> > if (ts.tv_nsec > 1000000000ULL) {
> > 	ts.tv_nsec -= 1000000000ULL;
> > 	ts.tv_sec++;
> > }
> >
> > I wonder what happens if ts.tv_nsec == 10^9, that check really should be
> > on >= instead. Not sure if this explains it though, I'll conjure up a
> > test case...
> 
> On my system, pthread_cond_timedwait() returns EINVAL if nsec is 10^9 or
> higher, which is the sane thing to do. I wonder if it does that at your
> end too, or whether it hangs? If it hangs indefinitely, that could
> explain it. Test case attached, I checked in a fix for the off-by-one.

That program results in EINVAL here:
will wait
done wait 22

In backend.c helper_thread_main():
* should the gettimeofday() return value be checked?
* should fio_gettime() be used instead of gettimeofday() to honor
the clocksource setting?  

In this pool of machines, I have occasionally seen time issues where
some timeserver communication is messed up and "make" complains that
files are newer than expected.  Maybe some time warp occurred during
the test?  




^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: fio main thread got stuck over the weekend
  2014-12-16  0:52                         ` Elliott, Robert (Server Storage)
@ 2014-12-16  2:51                           ` Jens Axboe
  2014-12-16 22:43                           ` Jens Axboe
  1 sibling, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2014-12-16  2:51 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage), stephenmcameron; +Cc: fio

On 12/15/2014 05:52 PM, Elliott, Robert (Server Storage) wrote:
>
>
>> -----Original Message-----
>> From: Jens Axboe [mailto:axboe@kernel.dk]
>> Sent: Monday, 15 December, 2014 2:50 PM
>> To: Elliott, Robert (Server Storage); stephenmcameron@gmail.com
>> Cc: fio@vger.kernel.org
>> Subject: Re: fio main thread got stuck over the weekend
>>
>> On 12/15/2014 01:31 PM, Jens Axboe wrote:
>>> On 12/15/2014 01:12 PM, Jens Axboe wrote:
>>>>> (gdb) thread 212
>>>>> [Switching to thread 212 (Thread 0x7fa9a086e700 (LWP 6509))]#0
>> 0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from
>> /lib64/libpthread.so.0
>>>>> (gdb) bt
>>>>> #0  0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from
>> /lib64/libpthread.so.0
>>>>> #1  0x0000000000449381 in helper_thread_main (data=<value optimized
>> out>) at backend.c:2127
>>>>> #2  0x0000003974c079d1 in start_thread () from /lib64/libpthread.so.0
>>>>> #3  0x00000039748e8b7d in clone () from /lib64/libc.so.6
>>>>> (gdb) select-frame 1
>>>>> (gdb) info local
>>>>> ts = {tv_sec = 1418658662, tv_nsec = 980074000}
>>>>> tv = {tv_sec = 1418658662, tv_usec = 730074}
>>>>> ret = <value optimized out>
>>>>> (gdb) print helper_exit
>>>>> $36 = 0
>>>>> (gdb) print nr_thread
>>>>> $37 = 224
>>>>> (gdb) print nr_process
>>>>> $38 = 0
>>>>
>>>> This is the helper thread, this is what will output the ETA etc stats.
>>>> There's a signal handler that runs and sets/wakes it to do so. So it
>>>> would be interesting if you could see what it is up to, set a break
>>>> point in there and check what the control flow ends up being.
>>>>
>>>> It's backend.c:helper_thread_main()
>>>>
>>>> backend.c:sig_show_status() should repeatedly fire and end up waking
>>>> this thread.
>>>
>>> So that referred to the output status, your bug was on the ETA, correct?
>
> Yes, the first anomaly was no ETAs incrementing, while IO was still
> occurring.
>
> The second anomaly was running forever.

Gotcha

>>> For that, we never get woken up by someone signaling our condvar, we
>>> just wake up every second to do the ETA display. The time conversion
>>> looks wrong, though:
>>>
>>> gettimeofday(&tv, NULL);
>>> ts.tv_sec = tv.tv_sec + sec;
>>> ts.tv_nsec = (tv.tv_usec * 1000) + nsec;
>>> if (ts.tv_nsec > 1000000000ULL) {
>>> 	ts.tv_nsec -= 1000000000ULL;
>>> 	ts.tv_sec++;
>>> }
>>>
>>> I wonder what happens if ts.tv_nsec == 10^9, that check really should be
>>> on >= instead. Not sure if this explains it though, I'll conjure up a
>>> test case...
>>
>> On my system, pthread_cond_timedwait() returns EINVAL if nsec is 10^9 or
>> higher, which is the sane thing to do. I wonder if it does that at your
>> end too, or whether it hangs? If it hangs indefinitely, that could
>> explain it. Test case attached, I checked in a fix for the off-by-one.
>
> That program results in EINVAL here:
> will wait
> done wait 22

OK, which is the sane way to handle it, but who knows what bugs might 
have been in there.

> In backend.c helper_thread_main():
> * should the gettimeofday() return value be checked?

Should not really fail, since we know 'tv' is valid.

> * should fio_gettime() be used instead of gettimeofday() to honor
> the clocksource setting?

That would be the case if we used it internally, but we do not. It's 
only used for passing to the condvar timedwait.

> In this pool of machines, I have occasionally seen time issues where
> some timeserver communication is messed up and "make" complains that
> files are newer than expected.  Maybe some time warp occurred during
> the test?

That could definitely explain it; I'm not sure what else could. A time 
warp could cause both the run to continue and the timedwait to not 
return. I'll switch it to use clock_gettime() if we have it; otherwise 
we really have no option other than gettimeofday()...
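
Something like this, as a rough sketch (not the actual commit; error
handling omitted, and helper_timedwait() plus the CONFIG_CLOCK_GETTIME
guard are assumed names):

#include <pthread.h>
#include <sys/time.h>
#include <time.h>

static int helper_timedwait(pthread_cond_t *cond, pthread_mutex_t *lock,
			    long sec, long nsec)
{
	struct timespec ts;

#ifdef CONFIG_CLOCK_GETTIME		/* assumed configure guard */
	clock_gettime(CLOCK_REALTIME, &ts);
#else
	struct timeval tv;

	gettimeofday(&tv, NULL);
	ts.tv_sec = tv.tv_sec;
	ts.tv_nsec = tv.tv_usec * 1000;
#endif
	/* add the timeout and normalize tv_nsec into [0, 10^9) */
	ts.tv_sec += sec;
	ts.tv_nsec += nsec;
	if (ts.tv_nsec >= 1000000000L) {
		ts.tv_nsec -= 1000000000L;
		ts.tv_sec++;
	}
	return pthread_cond_timedwait(cond, lock, &ts);
}

Note that pthread_cond_timedwait() measures against CLOCK_REALTIME by
default, so a wall clock jump can still move the deadline; being fully
immune to a time warp would additionally need
pthread_condattr_setclock(&attr, CLOCK_MONOTONIC) on the condvar and
CLOCK_MONOTONIC in the clock_gettime() call.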

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: fio main thread got stuck over the weekend
  2014-12-16  0:52                         ` Elliott, Robert (Server Storage)
  2014-12-16  2:51                           ` Jens Axboe
@ 2014-12-16 22:43                           ` Jens Axboe
  2014-12-17  3:52                             ` Elliott, Robert (Server Storage)
  1 sibling, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2014-12-16 22:43 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage), stephenmcameron; +Cc: fio

On 12/15/2014 05:52 PM, Elliott, Robert (Server Storage) wrote:
> 
> 
>> -----Original Message-----
>> From: Jens Axboe [mailto:axboe@kernel.dk]
>> Sent: Monday, 15 December, 2014 2:50 PM
>> To: Elliott, Robert (Server Storage); stephenmcameron@gmail.com
>> Cc: fio@vger.kernel.org
>> Subject: Re: fio main thread got stuck over the weekend
>>
>> On 12/15/2014 01:31 PM, Jens Axboe wrote:
>>> On 12/15/2014 01:12 PM, Jens Axboe wrote:
>>>>> (gdb) thread 212
>>>>> [Switching to thread 212 (Thread 0x7fa9a086e700 (LWP 6509))]#0
>> 0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from
>> /lib64/libpthread.so.0
>>>>> (gdb) bt
>>>>> #0  0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from
>> /lib64/libpthread.so.0
>>>>> #1  0x0000000000449381 in helper_thread_main (data=<value optimized
>> out>) at backend.c:2127
>>>>> #2  0x0000003974c079d1 in start_thread () from /lib64/libpthread.so.0
>>>>> #3  0x00000039748e8b7d in clone () from /lib64/libc.so.6
>>>>> (gdb) select-frame 1
>>>>> (gdb) info local
>>>>> ts = {tv_sec = 1418658662, tv_nsec = 980074000}
>>>>> tv = {tv_sec = 1418658662, tv_usec = 730074}
>>>>> ret = <value optimized out>
>>>>> (gdb) print helper_exit
>>>>> $36 = 0
>>>>> (gdb) print nr_thread
>>>>> $37 = 224
>>>>> (gdb) print nr_process
>>>>> $38 = 0
>>>>
>>>> This is the helper thread, this is what will output the ETA etc stats.
>>>> There's a signal handler that runs and sets/wakes it to do so. So it
>>>> would be interesting if you could see what it is up to, set a break
>>>> point in there and check what the control flow ends up being.
>>>>
>>>> It's backend.c:helper_thread_main()
>>>>
>>>> backend.c:sig_show_status() should repeatedly fire and end up waking
>>>> this thread.
>>>
>>> So that referred to the output status, your bug was on the ETA, correct?
> 
> Yes, the first anomaly was no ETAs incrementing, while IO was still
> occurring. 
> 
> The second anomaly was running forever.
> 
> 
>>> For that, we never get woken up by someone signaling our condvar, we
>>> just wake up every second to do the ETA display. The time conversion
>>> looks wrong, though:
>>>
>>> gettimeofday(&tv, NULL);
>>> ts.tv_sec = tv.tv_sec + sec;
>>> ts.tv_nsec = (tv.tv_usec * 1000) + nsec;
>>> if (ts.tv_nsec > 1000000000ULL) {
>>> 	ts.tv_nsec -= 1000000000ULL;
>>> 	ts.tv_sec++;
>>> }
>>>
>>> I wonder what happens if ts.tv_nsec == 10^9, that check really should be
>>> on >= instead. Not sure if this explains it though, I'll conjure up a
>>> test case...
>>
>> On my system, pthread_cond_timedwait() returns EINVAL if nsec is 10^9 or
>> higher, which is the sane thing to do. I wonder if it does that at your
>> end too, or whether it hangs? If it hangs indefinitely, that could
>> explain it. Test case attached, I checked in a fix for the off-by-one.
> 
> That program results in EINVAL here:
> will wait
> done wait 22
> 
> In backend.c helper_thread_main():
> * should the gettimeofday() return value be checked?
> * should fio_gettime() be used instead of gettimeofday() to honor
> the clocksource setting?  
> 
> In this pool of machines, I have occasionally seen time issues where
> some timeserver communication is messed up and "make" complains that
> files are newer than expected.  Maybe some time warp occurred during
> the test?  

I think I figured out what was going on; another user reported a similar
issue. Since you have it running, can you try and gdb attach to one of
the fio jobs? Then add a breakpoint in backend.c:__update_tv_cache(),
and print td->tv_cache after each call. Just do that a handful of times,
enough so we can see if it's incrementing or if it's stuck at the same
value.
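
Something along these lines should do it, assuming the symbol survives
inlining:

(gdb) break backend.c:__update_tv_cache
(gdb) commands
> silent
> print td->tv_cache
> continue
> end
(gdb) continue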

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: fio main thread got stuck over the weekend
  2014-12-16 22:43                           ` Jens Axboe
@ 2014-12-17  3:52                             ` Elliott, Robert (Server Storage)
  2014-12-17  5:43                               ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-12-17  3:52 UTC (permalink / raw)
  To: Jens Axboe, stephenmcameron; +Cc: fio



> -----Original Message-----
> From: Jens Axboe [mailto:axboe@kernel.dk]
> Sent: Tuesday, 16 December, 2014 4:43 PM
...
> 
> I think I figured out what was going on; another user reported a similar
> issue. Since you have it running, can you try and gdb attach to one of
> the fio jobs? Then add a breakpoint in backend.c:__update_tv_cache(),
> and print td->tv_cache after each call. Just do that a handful of times,
> enough so we can see if it's incrementing or if it's stuck at the same
> value.
> 

That is a static inline function, and its caller do_verify
also got inlined.  I tried a breakpoint by line number in
thread_main instead; although IOs resumed when I continued,
it never triggers.


(gdb) b backend.c:1536
Breakpoint 2 at 0x44c67b: file backend.c, line 1536.
(gdb) cont


objdump -drS around that address is:


        check_update_rusage(td);

        if (td->error)
  44c66e:       83 bb 90 4e 00 00 00    cmpl   $0x0,0x4e90(%rbx)
  44c675:       0f 84 0d 01 00 00       je     44c788 <thread_main+0x1538>

                fio_gettime(&td->start, NULL);

                do_verify(td, verify_bytes);

                fio_mutex_down(stat_mutex);
  44c67b:       48 8b 3d 16 a4 2d 00    mov    0x2da416(%rip),%rdi        # 726a98 <stat_mutex>
  44c682:       e8 a9 40 fe ff          callq  430730 <fio_mutex_down>
                td->ts.runtime[DDIR_READ] += mtime_since_now(&td->start);
  44c687:       4c 8b bb 44 4c 00 00    mov    0x4c44(%rbx),%r15
  44c68e:       48 89 ef                mov    %rbp,%rdi
  44c691:       e8 9a e0 fb ff          callq  40a730 <mtime_since_now>
                fio_gettime(&td->start, NULL);
  44c696:       31 f6                   xor    %esi,%esi
  44c698:       48 89 ef                mov    %rbp,%rdi
                fio_gettime(&td->start, NULL);

                do_verify(td, verify_bytes);

                fio_mutex_down(stat_mutex);
                td->ts.runtime[DDIR_READ] += mtime_since_now(&td->start);
  44c69b:       4c 01 f8                add    %r15,%rax
  44c69e:       48 89 83 44 4c 00 00    mov    %rax,0x4c44(%rbx)
                fio_gettime(&td->start, NULL);
  44c6a5:       e8 56 de fb ff          callq  40a500 <fio_gettime>
                fio_mutex_up(stat_mutex);
  44c6aa:       48 8b 3d e7 a3 2d 00    mov    0x2da3e7(%rip),%rdi        # 726a98 <stat_mutex>
  44c6b1:       e8 0a 40 fe ff          callq  4306c0 <fio_mutex_up>


I printed that value for thread 2, let it run for
a few seconds, then printed again, in case that helps
(tv_usec changes, but tv_sec never does).


(gdb) thread 2
[Switching to thread 2 (Thread 0x7fa92bf87700 (LWP 6733))]#0  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
(gdb) bt
#0  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
#1  0x0000000000457058 in fio_libaio_commit (td=0x7fa9a0dd1860) at engines/libaio.c:255
#2  0x000000000040b395 in td_io_commit (td=0x7fa9a0dd1860) at ioengines.c:396
#3  0x000000000040bea1 in td_io_queue (td=0x7fa9a0dd1860, io_u=0x7fa8e401c980) at ioengines.c:343
#4  0x000000000044a75d in do_io (td=0x7fa9a0dd1860) at backend.c:792
#5  0x000000000044c209 in thread_main (data=0x7fa9a0dd1860) at backend.c:1504
#6  0x0000003974c079d1 in start_thread () from /lib64/libpthread.so.0
#7  0x00000039748e8b7d in clone () from /lib64/libc.so.6
(gdb) print td
No symbol "td" in current context.
(gdb) select-frame 5
(gdb) print td->tv_cache
$51 = {tv_sec = 1099511, tv_usec = 641885}
(gdb) cont
Continuing.
^C
Program received signal SIGINT, Interrupt.
[Switching to Thread 0x7fa9a99b1720 (LWP 6508)]
0x00000039748acced in nanosleep () from /lib64/libc.so.6
(gdb) thread 2
[Switching to thread 2 (Thread 0x7fa92bf87700 (LWP 6733))]#0  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
(gdb) bt
#0  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
#1  0x0000000000457058 in fio_libaio_commit (td=0x7fa9a0dd1860) at engines/libaio.c:255
#2  0x000000000040b47a in td_io_getevents (td=0x7fa9a0dd1860, min=<value optimized out>, max=<value optimized out>, t=<value optimized out>) at ioengines.c:238
#3  0x000000000042b442 in io_u_queued_complete (td=0x7fa9a0dd1860, min_evts=<value optimized out>, bytes=0x7fa92bf86d00) at io_u.c:1793
#4  0x000000000044a45e in do_io (td=0x7fa9a0dd1860) at backend.c:889
#5  0x000000000044c209 in thread_main (data=0x7fa9a0dd1860) at backend.c:1504
#6  0x0000003974c079d1 in start_thread () from /lib64/libpthread.so.0
#7  0x00000039748e8b7d in clone () from /lib64/libc.so.6
(gdb) select-frame 5
(gdb) print td->tv_cache
$52 = {tv_sec = 1099511, tv_usec = 971364}


and again after about a minute:
(gdb) cont
Continuing.
^C
Program received signal SIGINT, Interrupt.
[Switching to Thread 0x7fa9a99b1720 (LWP 6508)]
0x00000039748acced in nanosleep () from /lib64/libc.so.6
(gdb) thread 2
[Switching to thread 2 (Thread 0x7fa92bf87700 (LWP 6733))]#0  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
(gdb) bt
#0  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
#1  0x0000000000457058 in fio_libaio_commit (td=0x7fa9a0dd1860) at engines/libaio.c:255
#2  0x000000000040b395 in td_io_commit (td=0x7fa9a0dd1860) at ioengines.c:396
#3  0x000000000040bea1 in td_io_queue (td=0x7fa9a0dd1860, io_u=0x7fa8e401fe80) at ioengines.c:343
#4  0x000000000044a75d in do_io (td=0x7fa9a0dd1860) at backend.c:792
#5  0x000000000044c209 in thread_main (data=0x7fa9a0dd1860) at backend.c:1504
#6  0x0000003974c079d1 in start_thread () from /lib64/libpthread.so.0
#7  0x00000039748e8b7d in clone () from /lib64/libc.so.6
(gdb) select-frame 5
(gdb) print td->tv_cache
$53 = {tv_sec = 1099511, tv_usec = 578033}




^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: fio main thread got stuck over the weekend
  2014-12-17  3:52                             ` Elliott, Robert (Server Storage)
@ 2014-12-17  5:43                               ` Jens Axboe
  2014-12-17 16:48                                 ` Elliott, Robert (Server Storage)
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2014-12-17  5:43 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage), stephenmcameron; +Cc: fio

On 12/16/2014 08:52 PM, Elliott, Robert (Server Storage) wrote:
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x7fa92bf87700 (LWP 6733))]#0  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
> (gdb) bt
> #0  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
> #1  0x0000000000457058 in fio_libaio_commit (td=0x7fa9a0dd1860) at engines/libaio.c:255
> #2  0x000000000040b395 in td_io_commit (td=0x7fa9a0dd1860) at ioengines.c:396
> #3  0x000000000040bea1 in td_io_queue (td=0x7fa9a0dd1860, io_u=0x7fa8e401c980) at ioengines.c:343
> #4  0x000000000044a75d in do_io (td=0x7fa9a0dd1860) at backend.c:792
> #5  0x000000000044c209 in thread_main (data=0x7fa9a0dd1860) at backend.c:1504
> #6  0x0000003974c079d1 in start_thread () from /lib64/libpthread.so.0
> #7  0x00000039748e8b7d in clone () from /lib64/libc.so.6
> (gdb) print td
> No symbol "td" in current context.
> (gdb) select-frame 5
> (gdb) print td->tv_cache
> $51 = {tv_sec = 1099511, tv_usec = 641885}
                   ^^^^^^^

This is the key. If this multiplication overflows:

usecs = (t * inv_cycles_per_usec) / 16777216UL;

then usecs is 2^64/2^24, which is 1099511627776. Divide that by 10^6 to 
get seconds, and that is 1099511... I initially thought this was a buggy 
backwards timer, but it's just this overflow. Fix:

http://git.kernel.dk/?p=fio.git;a=commit;h=b3fa625b38a638cd1783e9fdcac1b958e37e48fa
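
The wrap is easy to demonstrate standalone (a sketch, not fio code;
8397 is just an example inv_cycles_per_usec value):

#include <stdio.h>
#include <inttypes.h>

int main(void)
{
	uint64_t inv = 8397;		/* example inv_cycles_per_usec */
	uint64_t t_ok = UINT64_MAX / inv;	/* last cycle count whose product fits */
	uint64_t t_bad = t_ok + 1;		/* first count that wraps 64 bits */

	/* same conversion as fio: usecs = (t * inv) / 2^24 */
	printf("ok : %" PRIu64 " sec\n", t_ok * inv / 16777216UL / 1000000);
	printf("bad: %" PRIu64 " sec\n", t_bad * inv / 16777216UL / 1000000);
	return 0;
}

which prints 1099511 seconds for the last good count and 0 for the
first wrapping one, i.e. the clock jumps backwards.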

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: fio main thread got stuck over the weekend
  2014-12-17  5:43                               ` Jens Axboe
@ 2014-12-17 16:48                                 ` Elliott, Robert (Server Storage)
  2014-12-17 17:27                                   ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-12-17 16:48 UTC (permalink / raw)
  To: Jens Axboe, stephenmcameron; +Cc: fio



> -----Original Message-----
> From: Jens Axboe [mailto:axboe@kernel.dk]
> Sent: Tuesday, 16 December, 2014 11:43 PM
...
> > (gdb) print td->tv_cache
> > $51 = {tv_sec = 1099511, tv_usec = 641885}
>                    ^^^^^^^
> 
> This is the key. If this multiplication overflows:
> 
> usecs = (t * inv_cycles_per_usec) / 16777216UL;
> 
> then usecs is 2^64/2^24, which is 1099511627776. Divide that by 10^6 to
> get seconds, and that is 1099511... I initially thought this was a buggy
> backwards timer, but it's just this overflow. Fix:
> 
> http://git.kernel.dk/?p=fio.git;a=commit;h=b3fa625b38a638cd1783e9fdcac1b95
> 8e37e48fa

Good find.  The 64-bit RDTSC won't wrap for over 10 years, but
that multiplication must be stealing too many bits. 
fio --debug=time shows this:
	time     28459 inv_cycles_per_usec=8397
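
Working that multiplication out: the fixed-point divide reserves 24
bits, so the product t * inv_cycles_per_usec overflows 64 bits once the
elapsed time reaches 2^64/2^24 = 2^40 usec, about 1099512 seconds or
12.7 days, independent of the clock speed. With inv_cycles_per_usec =
8397 that is a ~2 GHz TSC (2^24/8397 ~= 1998 cycles/usec), or roughly
2.2 * 10^15 cycles of uptime before the wrap.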

Is anything in the linux kernel susceptible to a similar problem?

Anyway, I detached gdb and hit ^C to terminate fio, confirming that 
the 64-bit counters are working - it's reporting more than 4B IOs
for devices now:
* total IOs: 572,018,473,400
* 15 devices: 37,703,868,929 (example)
* (1 device (sdi) is lower, but fio gave up on it after IO errors)


drive_b: (groupid=0, jobs=224): err= 5 (file:io_u.c:1534, func=io_u error, error=Input/output error): pid=6510: Wed Dec 17 10:30:44 2014
  read : io=272760GB, bw=4102.9MB/s, iops=8402.7K, runt=68076097msec
  cpu          : usr=0.78%, sys=37.24%, ctx=401925558097, majf=0, minf=2927
  IO depths    : 1=0.0%, 2=0.1%, 4=0.1%, 8=0.3%, 16=0.4%, 32=0.6%, >=64=0.3%
     submit    : 0=0.0%, 4=5.8%, 8=0.6%, 16=3.4%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=5.4%, 32=0.1%, 64=0.1%, >=64=0.1%
     issued    : total=r=572018473400/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=96

Run status group 0 (all jobs):
   READ: io=272760GB, aggrb=4102.9MB/s, minb=4102.9MB/s, maxb=4102.9MB/s, mint=68076097msec, maxt=68076097msec

Disk stats (read/write):
  sdb: ios=37703868929/0, merge=0/0, ticks=68454382/0, in_queue=129910494, util=100.00%
  sdc: ios=37712378473/0, merge=0/0, ticks=77399794/0, in_queue=119899831, util=100.00%
  sdd: ios=37707967388/0, merge=0/0, ticks=75432313/0, in_queue=187703422, util=100.00%
  sde: ios=37698239552/0, merge=0/0, ticks=68544916/0, in_queue=179384002, util=100.00%
  sdf: ios=37716299315/0, merge=0/0, ticks=84593922/0, in_queue=199961411, util=100.00%
  sdg: ios=37699099568/0, merge=0/0, ticks=75257519/0, in_queue=175502802, util=100.00%
  sdh: ios=37701332642/0, merge=0/0, ticks=74572660/0, in_queue=192088623, util=100.00%
  sdi: ios= 2218175796/0, merge=0/0, ticks=2303073961/0, in_queue=2306683517, util=86.68%
  sdj: ios=37707221334/0, merge=0/0, ticks=78701197/0, in_queue=180717546, util=100.00%
  sdk: ios=37698155959/0, merge=0/0, ticks=77357378/0, in_queue=187489699, util=100.00%
  sdl: ios=37692589221/0, merge=0/0, ticks=75227670/0, in_queue=186780751, util=100.00%
  sdm: ios=37692481205/0, merge=0/0, ticks=73056936/0, in_queue=187474248, util=100.00%
  sdn: ios=37695237426/0, merge=0/0, ticks=77270328/0, in_queue=191649153, util=100.00%
  sdo: ios=37688493826/0, merge=0/0, ticks=72485901/0, in_queue=182735443, util=100.00%
  sdp: ios=39851839290/0, merge=0/0, ticks=48724465/0, in_queue=139505151, util=100.00%
  sdq: ios=39835069533/0, merge=0/0, ticks=44151762/0, in_queue=136927387, util=100.00%



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: fio main thread got stuck over the weekend
  2014-12-17 16:48                                 ` Elliott, Robert (Server Storage)
@ 2014-12-17 17:27                                   ` Jens Axboe
  0 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2014-12-17 17:27 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage), stephenmcameron; +Cc: fio

On 12/17/2014 09:48 AM, Elliott, Robert (Server Storage) wrote:
>
>
>> -----Original Message-----
>> From: Jens Axboe [mailto:axboe@kernel.dk]
>> Sent: Tuesday, 16 December, 2014 11:43 PM
> ...
>>> (gdb) print td->tv_cache
>>> $51 = {tv_sec = 1099511, tv_usec = 641885}
>>                     ^^^^^^^
>>
>> This is the key. If this multiplication overflows:
>>
>> usecs = (t * inv_cycles_per_usec) / 16777216UL;
>>
>> then usecs is 2^64/2^24, which is 1099511627776. Divide that by 10^6 to
>> get seconds, and that is 1099511... I initially thought this was a buggy
>> backwards timer, but it's just this overflow. Fix:
>>
>> http://git.kernel.dk/?p=fio.git;a=commit;h=b3fa625b38a638cd1783e9fdcac1b95
>> 8e37e48fa
>
> Good find.  The 64-bit RDTSC won't wrap for over 10 years, but
> that multiplication must be stealing too many bits.
> fio --debug=time shows this:
> 	time     28459 inv_cycles_per_usec=8397

I added a second change that offsets the TSC by the initial value, so we 
should have the full 64-bit range available now. And yes, wrapping 
won't be a problem beyond that; it's a good chunk over 10 years, and 
people _probably_ don't run jobs that long :-)
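
In sketch form, the offset amounts to something like this (assumed
shape, not the literal commit; get_cpu_clock() stands in for the raw
TSC read, and the function names here are made up):

static unsigned long long cycles_start;
static unsigned long long inv_cycles_per_usec;	/* set by calibration */

/* record the TSC once at startup */
static void tsc_offset_init(void)
{
	cycles_start = get_cpu_clock();
}

static unsigned long long utime_since_start(void)
{
	unsigned long long t = get_cpu_clock() - cycles_start;

	/* t now counts cycles since fio started rather than since
	 * boot, so the conversion no longer hits the wrap from
	 * accumulated system uptime */
	return (t * inv_cycles_per_usec) / 16777216UL;
}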

> Is anything in the linux kernel susceptible to a similar problem?

I haven't checked; I would assume the kernel offsets by the initial 
value as well.

> Anyway, I detached gdb and hit ^C to terminate fio, confirming that
> the 64-bit counters are working - it's reporting more than 4B IOs
> for devices now:
> * total IOs: 572,018,473,400
> * 15 devices: 37,703,868,929 (example)
> * (1 device (sdi) is lower, but fio gave up on it after IO errors)

Perfect! I'll cut 2.2.0 sometime this week, jfyi.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2014-12-17 17:27 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-11 15:44 fio main thread got stuck over the weekend scameron
2014-08-11 16:04 ` scameron
2014-08-22 19:04   ` Jens Axboe
2014-08-22 19:09     ` scameron
2014-08-22 19:11       ` Jens Axboe
2014-12-12 20:32         ` Elliott, Robert (Server Storage)
2014-12-13  4:49           ` Jens Axboe
2014-12-15 17:33             ` Elliott, Robert (Server Storage)
2014-12-15 17:37               ` Jens Axboe
2014-12-15 19:39                 ` Elliott, Robert (Server Storage)
2014-12-15 20:12                   ` Jens Axboe
2014-12-15 20:31                     ` Jens Axboe
2014-12-15 20:49                       ` Jens Axboe
2014-12-16  0:52                         ` Elliott, Robert (Server Storage)
2014-12-16  2:51                           ` Jens Axboe
2014-12-16 22:43                           ` Jens Axboe
2014-12-17  3:52                             ` Elliott, Robert (Server Storage)
2014-12-17  5:43                               ` Jens Axboe
2014-12-17 16:48                                 ` Elliott, Robert (Server Storage)
2014-12-17 17:27                                   ` Jens Axboe
