* fio server errors
@ 2017-06-06 17:39 Jeff Furlong
  2017-06-07  6:18 ` Sitsofe Wheeler
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Furlong @ 2017-06-06 17:39 UTC (permalink / raw)
  To: fio

Hi All,
When using the client/server mode, I'm having two issues.  First, I seemingly randomly get an inflate error, which happens after several hours, and reproducing it is challenging:

fio: inflate error -5
fio: failed decompressing log
fio: failed converting IO log

I have --write_iops_log turned on, so I'm guessing the IO log transfer across the network is not working smoothly.  I don't have a reliable job file that reproduces the issue, unfortunately.

Second, I also get (less randomly) a timeout on SEND_ETA.  Here are the details:

# fio -version
fio-2.20-39-gc12d

# fio --client=server1 test.job --client=server2 test.job
client <server2>: timeout on SEND_ETA
client <server1>: timeout on SEND_ETA
fio: client: unable to find matching tag (76f680)

# cat test.job
[test_job]
ioengine=libaio
direct=1
rw=randread
norandommap
randrepeat=0
iodepth=256
size=100%
numjobs=1
bs=4k
filename=/dev/nvme0n1
group_reporting
write_lat_log=test_job 
runtime=3m

Any help would be appreciated.  Thanks.

Regards,
Jeff



* Re: fio server errors
  2017-06-06 17:39 fio server errors Jeff Furlong
@ 2017-06-07  6:18 ` Sitsofe Wheeler
  2017-06-08 21:53   ` Jeff Furlong
  0 siblings, 1 reply; 8+ messages in thread
From: Sitsofe Wheeler @ 2017-06-07  6:18 UTC (permalink / raw)
  To: Jeff Furlong; +Cc: fio

On 6 June 2017 at 18:39, Jeff Furlong <jeff.furlong@wdc.com> wrote:
>
> When using the client/server mode, I'm having two issues.  First, I seemingly randomly get an inflate error, which happens after several hours, and reproducing it is challenging:
>
> fio: inflate error -5
> fio: failed decompressing log
> fio: failed converting IO log
>
> I have --write_iops_log turned on, so I'm guessing the IO log transfer across the network is not working smoothly.  I don't have a reliable job file that reproduces the issue, unfortunately.

Hmm, -5 is going to be Z_BUF_ERROR, so the suggestion is that we didn't
have enough space to store the decompressed data. Do you get this
problem with older versions of fio (e.g. 2.19) too? You'll probably
need to add debugging to trace nr_samples and, where possible, total
(e.g. around https://github.com/axboe/fio/blob/master/client.c#L1467 and
https://github.com/axboe/fio/blob/9af5a2450a555a725dd18e6845967cd7cf3aad64/server.c#L1927
).
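
For reference, here is a tiny standalone zlib program (illustrative
only - it is not fio code, and the buffer sizes are made up) showing
how inflate() ends up returning Z_BUF_ERROR (-5) once it runs out of
output space, i.e. when the decompressed data is bigger than the
buffer that was sized for it:

/*
 * Standalone zlib illustration (not fio source).
 * Build with: gcc zbuf.c -lz
 */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    unsigned char raw[2048], comp[4096], out[64];  /* out is deliberately tiny */
    uLongf comp_len = sizeof(comp);
    z_stream zs;
    int ret;

    memset(raw, 'A', sizeof(raw));
    if (compress2(comp, &comp_len, raw, sizeof(raw), 1) != Z_OK)
        return 1;

    memset(&zs, 0, sizeof(zs));
    if (inflateInit(&zs) != Z_OK)
        return 1;

    zs.next_in = comp;
    zs.avail_in = comp_len;
    zs.next_out = out;
    zs.avail_out = sizeof(out);

    ret = inflate(&zs, Z_NO_FLUSH);  /* fills the 64 bytes, returns Z_OK */
    ret = inflate(&zs, Z_NO_FLUSH);  /* no output space left -> Z_BUF_ERROR */
    printf("second inflate() returned %d (Z_BUF_ERROR is %d)\n",
           ret, Z_BUF_ERROR);

    inflateEnd(&zs);
    return 0;
}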

> Second, I also get (less randomly) a timeout on SEND_ETA.  Here are the details:
>
> # fio -version
> fio-2.20-39-gc12d
>
> # fio --client=server1 test.job --client=server2 test.job
> client <server2>: timeout on SEND_ETA
> client <server1>: timeout on SEND_ETA
> fio: client: unable to find matching tag (76f680)
>
> # cat test.job
> [test_job]
> ioengine=libaio
> direct=1
> rw=randread
> norandommap
> randrepeat=0
> iodepth=256
> size=100%
> numjobs=1
> bs=4k
> filename=/dev/nvme0n1
> group_reporting
> write_lat_log=test_job
> runtime=3m

Again, can you reproduce this one with the previous fio release? Other
things that would help: cut the job down to the fewest lines that
still let the problem reproduce, and reduce values (such as iodepth,
runtime and size) as far as possible while the problem still occurs
(e.g. if size=4G still makes it happen, that's useful to know). If it
happens relatively quickly you may be able to use --debug=all to
narrow down the last thing the client did, but be aware this generates
a lot of output...
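
(If all that output is too much for the terminal, something along
these lines keeps it on disk for later grepping - just an example
invocation, adjust the host and job names to yours:)

fio --debug=all --client=server1 test.job > fio-debug.txt 2>&1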

-- 
Sitsofe | http://sucs.org/~sits/


* RE: fio server errors
  2017-06-07  6:18 ` Sitsofe Wheeler
@ 2017-06-08 21:53   ` Jeff Furlong
  2017-06-14  6:13     ` Sitsofe Wheeler
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Furlong @ 2017-06-08 21:53 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: fio

Using fio 2.19 I can reproduce the SEND_ETA issue.  With --debug=all there is indeed a ton of output, but it finishes with roughly:

<server1> diskutil 30748 update io ticks
<server1> diskutil 30748 open stat file: /sys/block/nvme0n1/stat
<server1> diskutil 30748 /sys/block/nvme0n1/stat: 1299576802        0 54165895270 1333065028 863486453        0 63615892884 3819260095        0 57718711 859646548
<server1> diskutil 30748 /sys/block/nvme0n1/stat: stat read ok? 0
<server1> helperthread 30748 since_ss: 0, next_ss: 1000, next_log: 250, msec_to_next_event: 250
<server1> diskutil 30748 update io ticks
<server1> diskutil 30748 open stat file: /sys/block/nvme0n1/stat
<server1> diskutil 30748 /sys/block/nvme0n1/stat: 1299576802        0 54165895270 1333065028 863486453        0 63615892884 3819260095        0 57718711 859646548
<server1> diskutil 30748 /sys/block/nvme0n1/stat: stat read ok? 0
<server1> helperthread 30748 since_ss: 0, next_ss: 1000, next_log: 250, msec_to_next_event: 250
<server1> fio: pid=30750, got signal=11
<server1> process  30748 pid=30750: runstate RUNNING -> REAPED
<server1> process  30748 terminate group_id=-1
<server1> process  30748 setting terminate on test_job/30750
<server1> diskutil 30748 update io ticks
<server1> diskutil 30748 open stat file: /sys/block/nvme0n1/stat
<server1> diskutil 30748 /sys/block/nvme0n1/stat: 1299576802        0 54165895270 1333065028 863486453        0 63615892884 3819260095        0 57718711 859646548
<server1> diskutil 30748 /sys/block/nvme0n1/stat: stat read ok? 0
<server1> helperthread 30748 since_ss: 0, next_ss: 1000, next_log: 500, msec_to_next_event: 157
<server1> process  30748 pid=30750: runstate REAPED -> FINISHING
<server1> process  30748 pid=30750: runstate FINISHING -> REAPED
<server1> net      30748 server sending end stats
<server1> net      30748 ts->ss_state = 0
<server1> net      30748 server sending group run stats
<server1> net      30748 server: sending disk_util 1
<server1>
<server1> parse    30748 free options
<server1> parse    30748 free options
<server1> net      30707 pid 30748 exited, sig=0, exitval=0
<server1> net      30707 server: sending stop (0, 0)
<server1> net      30707 server: sending quit
<server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dceae0
<server1> net      30707 server sending status
<server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dcea50
<server1> net      30707 server sending status
<server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dcea50
<server1> net      30707 server sending status
<server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dceae0
<server1> net      30707 server sending status
<server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dceae0
<server1> net      30707 server sending status
<server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dceae0
<server1> net      30707 server sending status
<server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dcea50
<server1> net      30707 server sending status
parse    30706 free options

If I keep size=100% and runtime at 1m, I can reproduce the issue.  If I change size to 4G (and don’t use time_based) then it runs without any issue.  I suspect it's about the size of the saved log files, which are roughly 400MB each. 

Regards,
Jeff 


-----Original Message-----
From: Sitsofe Wheeler [mailto:sitsofe@gmail.com] 
Sent: Tuesday, June 6, 2017 11:19 PM
To: Jeff Furlong <jeff.furlong@wdc.com>
Cc: fio@vger.kernel.org
Subject: Re: fio server errors

On 6 June 2017 at 18:39, Jeff Furlong <jeff.furlong@wdc.com> wrote:
>
> When using the client/server mode, I'm having two issues.  First, I seemingly randomly get an inflate error, which happens after several hours, and reproducing it is challenging:
>
> fio: inflate error -5
> fio: failed decompressing log
> fio: failed converting IO log
>
> I have --write_iops_log turned on, so I'm guessing the IO log transfer across the network is not working smoothly.  I don't have a reliable job file that reproduces the issue, unfortunately.

Hmm -5 is going to be Z_BUF_ERROR so the suggestion is we didn't have enough space to store the decompressed data. Do you get this problem with older versions (e.g. 2.19) of fio too? You'll probably need to add debugging to trace nr_samples and where possible total (e.g.
around https://github.com/axboe/fio/blob/master/client.c#L1467 and
https://github.com/axboe/fio/blob/9af5a2450a555a725dd18e6845967cd7cf3aad64/server.c#L1927
).

> Second, I also get (less randomly) a timeout on SEND_ETA.  Here are the details:
>
> # fio -version
> fio-2.20-39-gc12d
>
> # fio --client=server1 test.job --client=server2 test.job
> client <server2>: timeout on SEND_ETA
> client <server1>: timeout on SEND_ETA
> fio: client: unable to find matching tag (76f680)
>
> # cat test.job
> [test_job]
> ioengine=libaio
> direct=1
> rw=randread
> norandommap
> randrepeat=0
> iodepth=256
> size=100%
> numjobs=1
> bs=4k
> filename=/dev/nvme0n1
> group_reporting
> write_lat_log=test_job
> runtime=3m

Again, can you reproduce this one with the previous fio release? Other things that would help: cut the job down to the fewest lines that still let the problem reproduce, and reduce values (such as iodepth, runtime and size) as far as possible while the problem still occurs (e.g. if size=4G still makes it happen, that's useful to know). If it happens relatively quickly you may be able to use --debug=all to narrow down the last thing the client did, but be aware this generates a lot of output...

--
Sitsofe | http://sucs.org/~sits/


* Re: fio server errors
  2017-06-08 21:53   ` Jeff Furlong
@ 2017-06-14  6:13     ` Sitsofe Wheeler
  2017-06-14 22:20       ` Jeff Furlong
  0 siblings, 1 reply; 8+ messages in thread
From: Sitsofe Wheeler @ 2017-06-14  6:13 UTC (permalink / raw)
  To: Jeff Furlong; +Cc: fio

On 8 June 2017 at 22:53, Jeff Furlong <jeff.furlong@wdc.com> wrote:
> Using fio 2.19 I can reproduce the SEND_ETA issue.  With debug=all, indeed there is a ton of output, but it finishes with roughly:
>
> <server1> diskutil 30748 update io ticks
> <server1> diskutil 30748 open stat file: /sys/block/nvme0n1/stat
> <server1> diskutil 30748 /sys/block/nvme0n1/stat: 1299576802        0 54165895270 1333065028 863486453        0 63615892884 3819260095        0 57718711 859646548
> <server1> diskutil 30748 /sys/block/nvme0n1/stat: stat read ok? 0
> <server1> helperthread 30748 since_ss: 0, next_ss: 1000, next_log: 250, msec_to_next_event: 250
> <server1> diskutil 30748 update io ticks
> <server1> diskutil 30748 open stat file: /sys/block/nvme0n1/stat
> <server1> diskutil 30748 /sys/block/nvme0n1/stat: 1299576802        0 54165895270 1333065028 863486453        0 63615892884 3819260095        0 57718711 859646548
> <server1> diskutil 30748 /sys/block/nvme0n1/stat: stat read ok? 0
> <server1> helperthread 30748 since_ss: 0, next_ss: 1000, next_log: 250, msec_to_next_event: 250
> <server1> fio: pid=30750, got signal=11
> <server1> process  30748 pid=30750: runstate RUNNING -> REAPED
> <server1> process  30748 terminate group_id=-1
> <server1> process  30748 setting terminate on test_job/30750
> <server1> diskutil 30748 update io ticks
> <server1> diskutil 30748 open stat file: /sys/block/nvme0n1/stat
> <server1> diskutil 30748 /sys/block/nvme0n1/stat: 1299576802        0 54165895270 1333065028 863486453        0 63615892884 3819260095        0 57718711 859646548
> <server1> diskutil 30748 /sys/block/nvme0n1/stat: stat read ok? 0
> <server1> helperthread 30748 since_ss: 0, next_ss: 1000, next_log: 500, msec_to_next_event: 157
> <server1> process  30748 pid=30750: runstate REAPED -> FINISHING
> <server1> process  30748 pid=30750: runstate FINISHING -> REAPED
> <server1> net      30748 server sending end stats
> <server1> net      30748 ts->ss_state = 0
> <server1> net      30748 server sending group run stats
> <server1> net      30748 server: sending disk_util 1
> <server1>
> <server1> parse    30748 free options
> <server1> parse    30748 free options
> <server1> net      30707 pid 30748 exited, sig=0, exitval=0
> <server1> net      30707 server: sending stop (0, 0)
> <server1> net      30707 server: sending quit
> <server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dceae0
> <server1> net      30707 server sending status
> <server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dcea50
> <server1> net      30707 server sending status
> <server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dcea50
> <server1> net      30707 server sending status
> <server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dceae0
> <server1> net      30707 server sending status
> <server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dceae0
> <server1> net      30707 server sending status
> <server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dceae0
> <server1> net      30707 server sending status
> <server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dcea50
> <server1> net      30707 server sending status
> parse    30706 free options

OK, so it's not a newly introduced issue. This looks like the client is
continuing to send SEND_ETA messages even though the server has sent
back an "I've quit" message...

If you whittle the job file as much as possible (e.g. no
write_lat_log, smaller iodepth etc) do the symptoms change? What's the
smallest job file that still reproduces this problem?

-- 
Sitsofe | http://sucs.org/~sits/


* RE: fio server errors
  2017-06-14  6:13     ` Sitsofe Wheeler
@ 2017-06-14 22:20       ` Jeff Furlong
  2017-06-15  6:50         ` Sitsofe Wheeler
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Furlong @ 2017-06-14 22:20 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: fio

On issue one (inflate error), I've gone through some of the code and I think nr_samples should be declared as uint64_t, not uint32_t, at https://github.com/axboe/fio/blob/master/client.c#L1451.  I've changed it on my machine and reran the prior test, but after 25 hours the same issue occurred again.  It's still difficult to say what type of job is causing the issue.
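
For reference, the local edit is nothing more than the declaration
(placement approximate - paraphrasing from memory, the surrounding
code in my tree may differ):

/* client.c, IO log conversion path (approximate; paraphrased) */
uint64_t nr_samples;    /* was: uint32_t nr_samples; */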

On issue two (SEND_ETA), I can reproduce with a slightly smaller job file:

#fio --client=server1 test.job --client=server2 test.job

# cat test.job
[test_job]
ioengine=libaio
direct=1
rw=randread
iodepth=256
size=100%
numjobs=1
bs=4k
filename=/dev/nvme0n1
group_reporting
write_lat_log=test_job
runtime=1m

If I remove write_lat_log=test_job then there is no SEND_ETA issue.  Only when logging (the latency) does this issue occur.  And if I enable write_lat_log=test_job but reduce runtime to 30s, the issue does not occur either.  I must have both write_lat_log=test_job and runtime=1m to produce the issue.  I imagine some count must be exceeded, or the pause while moving the lat data from memory to file at the end of the job is causing a timing conflict (SEND_ETA msg vs. "I've quit" msg).

Regards,
Jeff

-----Original Message-----
From: fio-owner@vger.kernel.org [mailto:fio-owner@vger.kernel.org] On Behalf Of Sitsofe Wheeler
Sent: Tuesday, June 13, 2017 11:13 PM
To: Jeff Furlong <jeff.furlong@wdc.com>
Cc: fio@vger.kernel.org
Subject: Re: fio server errors

On 8 June 2017 at 22:53, Jeff Furlong <jeff.furlong@wdc.com> wrote:
> Using fio 2.19 I can reproduce the SEND_ETA issue.  With debug=all, indeed there is a ton of output, but it finishes with roughly:
>
> <server1> diskutil 30748 update io ticks
> <server1> diskutil 30748 open stat file: /sys/block/nvme0n1/stat
> <server1> diskutil 30748 /sys/block/nvme0n1/stat: 1299576802        0 54165895270 1333065028 863486453        0 63615892884 3819260095        0 57718711 859646548
> <server1> diskutil 30748 /sys/block/nvme0n1/stat: stat read ok? 0
> <server1> helperthread 30748 since_ss: 0, next_ss: 1000, next_log: 250, msec_to_next_event: 250
> <server1> diskutil 30748 update io ticks
> <server1> diskutil 30748 open stat file: /sys/block/nvme0n1/stat
> <server1> diskutil 30748 /sys/block/nvme0n1/stat: 1299576802        0 54165895270 1333065028 863486453        0 63615892884 3819260095        0 57718711 859646548
> <server1> diskutil 30748 /sys/block/nvme0n1/stat: stat read ok? 0
> <server1> helperthread 30748 since_ss: 0, next_ss: 1000, next_log: 250, msec_to_next_event: 250
> <server1> fio: pid=30750, got signal=11
> <server1> process  30748 pid=30750: runstate RUNNING -> REAPED
> <server1> process  30748 terminate group_id=-1
> <server1> process  30748 setting terminate on test_job/30750
> <server1> diskutil 30748 update io ticks
> <server1> diskutil 30748 open stat file: /sys/block/nvme0n1/stat
> <server1> diskutil 30748 /sys/block/nvme0n1/stat: 1299576802        0 54165895270 1333065028 863486453        0 63615892884 3819260095        0 57718711 859646548
> <server1> diskutil 30748 /sys/block/nvme0n1/stat: stat read ok? 0
> <server1> helperthread 30748 since_ss: 0, next_ss: 1000, next_log: 500, msec_to_next_event: 157
> <server1> process  30748 pid=30750: runstate REAPED -> FINISHING
> <server1> process  30748 pid=30750: runstate FINISHING -> REAPED
> <server1> net      30748 server sending end stats
> <server1> net      30748 ts->ss_state = 0
> <server1> net      30748 server sending group run stats
> <server1> net      30748 server: sending disk_util 1
> <server1>
> <server1> parse    30748 free options
> <server1> parse    30748 free options
> <server1> net      30707 pid 30748 exited, sig=0, exitval=0
> <server1> net      30707 server: sending stop (0, 0)
> <server1> net      30707 server: sending quit
> <server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dceae0
> <server1> net      30707 server sending status
> <server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dcea50
> <server1> net      30707 server sending status
> <server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dcea50
> <server1> net      30707 server sending status
> <server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dceae0
> <server1> net      30707 server sending status
> <server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dceae0
> <server1> net      30707 server sending status
> <server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dceae0
> <server1> net      30707 server sending status
> <server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dcea50
> <server1> net      30707 server sending status
> parse    30706 free options

OK, so it's not a newly introduced issue. This looks like the client is continuing to send SEND_ETA messages even though the server has sent back an "I've quit" message...

If you whittle the job file as much as possible (e.g. no write_lat_log, smaller iodepth etc) do the symptoms change? What's the smallest job file that still reproduces this problem?

--
Sitsofe | http://sucs.org/~sits/


* Re: fio server errors
  2017-06-14 22:20       ` Jeff Furlong
@ 2017-06-15  6:50         ` Sitsofe Wheeler
  2017-06-15 22:59           ` Jeff Furlong
  0 siblings, 1 reply; 8+ messages in thread
From: Sitsofe Wheeler @ 2017-06-15  6:50 UTC (permalink / raw)
  To: Jeff Furlong; +Cc: fio

On 14 June 2017 at 23:20, Jeff Furlong <jeff.furlong@wdc.com> wrote:
> On issue one (inflate error), I've gone through some of the code and I think nr_samples should be declared as uint64_t, not uint32_t, at https://github.com/axboe/fio/blob/master/client.c#L1451  I've changed it on my machine and reran the prior test but after 25 hours the same issue occurred again.  It's still difficult to say what type of job is causing the issue.

I see there is another uint32_t nr_samples over in
https://github.com/axboe/fio/blob/master/iolog.c#L801 - do you think
that would need changing too? Is nr_samples slowly growing over time
and have you managed to print its value at the point of the inflate
error?
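
Something as crude as the following next to the existing error message
would do (hand-written sketch, so the variable names are assumptions -
use whatever is actually in scope at that point):

log_err("fio: inflate error %d, nr_samples=%llu, total=%llu\n",
        ret, (unsigned long long) nr_samples,
        (unsigned long long) total);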

> On issue two (SEND_ETA), I can reproduce with a slightly smaller job file:
>
> #fio --client=server1 test.job --client=server2 test.job
>
> # cat test.job
> [test_job]
> ioengine=libaio
> direct=1
> rw=randread
> iodepth=256
> size=100%
> numjobs=1
> bs=4k
> filename=/dev/nvme0n1
> group_reporting
> write_lat_log=test_job
> runtime=1m
>
> If I remove write_lat_log=test_job then there is no SEND_ETA issue.  Only when logging (the latency) does this issue occur.  And, if I enable write_lat_log=test_job but reduce runtime=30s, then the issue does not occur.  I must have write_lat_log=test_job and runtime=1m to produce the issue.  I imagine some count must be exceeded or the pause in moving the lat data from memory to file at the end of the job is causing a timing conflict (SEND_ETA msg vs "I've quit" msg).

I can reproduce the problem here using this job:
[sendeta]
ioengine=libaio
direct=1
rw=read
bs=512
filename=/dev/nullb0
write_lat_log=test_job
disable_clat=1
disable_slat=1
iodepth=512
number_ios=16000000

and running
fio --debug=net --client=localhost sendeta.fio --client=127.0.0.1 sendeta.fio

Using number_ios seems to make the job run for as long as needed to
show the problem regardless of what the iodepth is in my case. From
what I can see the client is forced to spend time on the IOLOG op
which in turn makes it unresponsive to other actions:

net      1709  client: got cmd op ETA from 127.0.0.1 (pdu=161)
net      1709  client: got eta tag 0x2442d90, 2
net      1709  client: handle localhost
net      1709  client: got cmd op ETA from localhost (pdu=161)
net      1709  client: got eta tag 0x2442d90, 1
net      1709  client: request eta (2)k][r=423k,w=0 IOPS][eta 27m:09s]
net      1709  client: requested eta tag 0x2442d90
net      1709  client: handle localhost
net      1709  client: got cmd op ETA from localhost (pdu=161)
net      1709  client: got eta tag 0x2442d90, 2
net      1709  client: handle 127.0.0.1
net      1709  client: got cmd op IOLOG from 127.0.0.1 (pdu=290)
net      1709  client: handle 127.0.0.1
net      1709  client: got cmd op IOLOG from 127.0.0.1 (pdu=290)
net      1709  client: handle 127.0.0.1
net      1709  client: got cmd op ETA from 127.0.0.1 (pdu=161)
net      1709  client: got eta tag 0x2442d90, 1
net      1709  client: handle localhost0k][r=387k,w=0 IOPS][eta 00m:00s]
net      1709  client: got cmd op IOLOG from localhost (pdu=11297502)
net      1709  client: request eta (2)
net      1709  client: requested eta tag 0x2442d90
net      1709  client: handle 127.0.0.1
net      1709  client: got cmd op IOLOG from 127.0.0.1 (pdu=11297330)
net      1709  client: handle localhost
net      1709  client: got cmd op IOLOG from localhost (pdu=290)
net      1709  client: request eta (2)
net      1709  client: requested eta tag 0x2449df0
client <127.0.0.1>: timeout on SEND_ETA
client <localhost>: timeout on SEND_ETA
net      1709  client: handle 127.0.0.1
net      1709  client: got cmd op ETA from 127.0.0.1 (pdu=161)
fio: client: unable to find matching tag (2442c80)

-- 
Sitsofe | http://sucs.org/~sits/


* RE: fio server errors
  2017-06-15  6:50         ` Sitsofe Wheeler
@ 2017-06-15 22:59           ` Jeff Furlong
  2017-06-16 17:58             ` Jeff Furlong
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Furlong @ 2017-06-15 22:59 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: fio

Good catch.  I've fixed the second declaration of uint32_t nr_samples in iolog.c.  It's running so far, but it seems to take a while to fail even in the bad case.  I have printed the value of nr_samples in various tests, but unfortunately not exactly when the inflate error caused fio to exit.  I'm running again, so it should be a matter of time to determine whether changing to uint64_t solves it.

Glad you can reproduce the SEND_ETA issue.  Not sure how to deal with the client's unresponsiveness during the IOLOG op.

Regards,
Jeff

-----Original Message-----
From: Sitsofe Wheeler [mailto:sitsofe@gmail.com] 
Sent: Wednesday, June 14, 2017 11:51 PM
To: Jeff Furlong <jeff.furlong@wdc.com>
Cc: fio@vger.kernel.org
Subject: Re: fio server errors

On 14 June 2017 at 23:20, Jeff Furlong <jeff.furlong@wdc.com> wrote:
> On issue one (inflate error), I've gone through some of the code and I think nr_samples should be declared as uint64_t, not uint32_t, at https://github.com/axboe/fio/blob/master/client.c#L1451  I've changed it on my machine and reran the prior test but after 25 hours the same issue occurred again.  It's still difficult to say what type of job is causing the issue.

I see there is another uint32_t nr_samples over in
https://github.com/axboe/fio/blob/master/iolog.c#L801 - do you think that would need changing too? Is nr_samples slowly growing over time and have you managed to print its value at the point of the inflate error?

> On issue two (SEND_ETA), I can reproduce with a slightly smaller job file:
>
> #fio --client=server1 test.job --client=server2 test.job
>
> # cat test.job
> [test_job]
> ioengine=libaio
> direct=1
> rw=randread
> iodepth=256
> size=100%
> numjobs=1
> bs=4k
> filename=/dev/nvme0n1
> group_reporting
> write_lat_log=test_job
> runtime=1m
>
> If I remove write_lat_log=test_job then there is no SEND_ETA issue.  Only when logging (the latency) does this issue occur.  And, if I enable write_lat_log=test_job but reduce runtime=30s, then the issue does not occur.  I must have write_lat_log=test_job and runtime=1m to produce the issue.  I imagine some count must be exceeded or the pause in moving the lat data from memory to file at the end of the job is causing a timing conflict (SEND_ETA msg vs "I've quit" msg).

I can reproduce the problem here using this job:
[sendeta]
ioengine=libaio
direct=1
rw=read
bs=512
filename=/dev/nullb0
write_lat_log=test_job
disable_clat=1
disable_slat=1
iodepth=512
number_ios=16000000

and running
fio --debug=net --client=localhost sendeta.fio --client=127.0.0.1 sendeta.fio

Using number_ios seems to make the job run for as long as needed to show the problem regardless of what the iodepth is in my case. From what I can see the client is forced to spend time on the IOLOG op which in turn makes it unresponsive to other actions:

net      1709  client: got cmd op ETA from 127.0.0.1 (pdu=161)
net      1709  client: got eta tag 0x2442d90, 2
net      1709  client: handle localhost
net      1709  client: got cmd op ETA from localhost (pdu=161)
net      1709  client: got eta tag 0x2442d90, 1
net      1709  client: request eta (2)k][r=423k,w=0 IOPS][eta 27m:09s]
net      1709  client: requested eta tag 0x2442d90
net      1709  client: handle localhost
net      1709  client: got cmd op ETA from localhost (pdu=161)
net      1709  client: got eta tag 0x2442d90, 2
net      1709  client: handle 127.0.0.1
net      1709  client: got cmd op IOLOG from 127.0.0.1 (pdu=290)
net      1709  client: handle 127.0.0.1
net      1709  client: got cmd op IOLOG from 127.0.0.1 (pdu=290)
net      1709  client: handle 127.0.0.1
net      1709  client: got cmd op ETA from 127.0.0.1 (pdu=161)
net      1709  client: got eta tag 0x2442d90, 1
net      1709  client: handle localhost0k][r=387k,w=0 IOPS][eta 00m:00s]
net      1709  client: got cmd op IOLOG from localhost (pdu=11297502)
net      1709  client: request eta (2)
net      1709  client: requested eta tag 0x2442d90
net      1709  client: handle 127.0.0.1
net      1709  client: got cmd op IOLOG from 127.0.0.1 (pdu=11297330)
net      1709  client: handle localhost
net      1709  client: got cmd op IOLOG from localhost (pdu=290)
net      1709  client: request eta (2)
net      1709  client: requested eta tag 0x2449df0
client <127.0.0.1>: timeout on SEND_ETA
client <localhost>: timeout on SEND_ETA
net      1709  client: handle 127.0.0.1
net      1709  client: got cmd op ETA from 127.0.0.1 (pdu=161)
fio: client: unable to find matching tag (2442c80)

--
Sitsofe | http://sucs.org/~sits/


* RE: fio server errors
  2017-06-15 22:59           ` Jeff Furlong
@ 2017-06-16 17:58             ` Jeff Furlong
  0 siblings, 0 replies; 8+ messages in thread
From: Jeff Furlong @ 2017-06-16 17:58 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: fio

Unfortunately changing to uint64_t did not solve the problem.  At this point I'm getting the bigger hammer and setting server.c's:
static unsigned int has_zlib = 1;
-->
static unsigned int has_zlib = 0;
to disable compression.
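
(For context, that flag sits near the top of server.c; with the edit
the block looks roughly like this - quoting from memory, so the
surrounding lines may differ:)

#ifdef CONFIG_ZLIB
static unsigned int has_zlib = 0;   /* was 1: IO logs now go over the wire uncompressed */
#else
static unsigned int has_zlib = 0;
#endif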

As for SEND_ETA, it seems related to http://git.kernel.dk/cgit/fio/commit/?id=308d69b5d340577b7886696f39753b7ba5ae9e11 but is not a catastrophic error.

Regards,
Jeff

-----Original Message-----
From: fio-owner@vger.kernel.org [mailto:fio-owner@vger.kernel.org] On Behalf Of Jeff Furlong
Sent: Thursday, June 15, 2017 3:59 PM
To: Sitsofe Wheeler <sitsofe@gmail.com>
Cc: fio@vger.kernel.org
Subject: RE: fio server errors

Good catch.  I've fixed the second declaration of uint32_t nr_samples in iolog.c.  It's running so far, but it seems to take a while to fail even in the bad case.  I have printed the value of nr_samples in various tests, but unfortunately not exactly when the inflate error caused fio to exit.  I'm running again, so it should be a matter of time to determine whether changing to uint64_t solves it.

Glad you can reproduce the SEND_ETA issue.  Not sure how to deal with the client's unresponsiveness during the IOLOG op.

Regards,
Jeff

-----Original Message-----
From: Sitsofe Wheeler [mailto:sitsofe@gmail.com] 
Sent: Wednesday, June 14, 2017 11:51 PM
To: Jeff Furlong <jeff.furlong@wdc.com>
Cc: fio@vger.kernel.org
Subject: Re: fio server errors

On 14 June 2017 at 23:20, Jeff Furlong <jeff.furlong@wdc.com> wrote:
> On issue one (inflate error), I've gone through some of the code and I think nr_samples should be declared as uint64_t, not uint32_t, at https://github.com/axboe/fio/blob/master/client.c#L1451  I've changed it on my machine and reran the prior test but after 25 hours the same issue occurred again.  It's still difficult to say what type of job is causing the issue.

I see there is another uint32_t nr_samples over in
https://github.com/axboe/fio/blob/master/iolog.c#L801 - do you think that would need changing too? Is nr_samples slowly growing over time and have you managed to print its value at the point of the inflate error?

> On issue two (SEND_ETA), I can reproduce with a slightly smaller job file:
>
> #fio --client=server1 test.job --client=server2 test.job
>
> # cat test.job
> [test_job]
> ioengine=libaio
> direct=1
> rw=randread
> iodepth=256
> size=100%
> numjobs=1
> bs=4k
> filename=/dev/nvme0n1
> group_reporting
> write_lat_log=test_job
> runtime=1m
>
> If I remove write_lat_log=test_job then there is no SEND_ETA issue.  Only when logging (the latency) does this issue occur.  And, if I enable write_lat_log=test_job but reduce runtime=30s, then the issue does not occur.  I must have write_lat_log=test_job and runtime=1m to produce the issue.  I imagine some count must be exceeded or the pause in moving the lat data from memory to file at the end of the job is causing a timing conflict (SEND_ETA msg vs "I've quit" msg).

I can reproduce the problem here using this job:
[sendeta]
ioengine=libaio
direct=1
rw=read
bs=512
filename=/dev/nullb0
write_lat_log=test_job
disable_clat=1
disable_slat=1
iodepth=512
number_ios=16000000

and running
fio --debug=net --client=localhost sendeta.fio --client=127.0.0.1 sendeta.fio

Using number_ios seems to make the job run for as long as needed to show the problem regardless of what the iodepth is in my case. From what I can see the client is forced to spend time on the IOLOG op which in turn makes it unresponsive to other actions:

net      1709  client: got cmd op ETA from 127.0.0.1 (pdu=161)
net      1709  client: got eta tag 0x2442d90, 2
net      1709  client: handle localhost
net      1709  client: got cmd op ETA from localhost (pdu=161)
net      1709  client: got eta tag 0x2442d90, 1
net      1709  client: request eta (2)k][r=423k,w=0 IOPS][eta 27m:09s]
net      1709  client: requested eta tag 0x2442d90
net      1709  client: handle localhost
net      1709  client: got cmd op ETA from localhost (pdu=161)
net      1709  client: got eta tag 0x2442d90, 2
net      1709  client: handle 127.0.0.1
net      1709  client: got cmd op IOLOG from 127.0.0.1 (pdu=290)
net      1709  client: handle 127.0.0.1
net      1709  client: got cmd op IOLOG from 127.0.0.1 (pdu=290)
net      1709  client: handle 127.0.0.1
net      1709  client: got cmd op ETA from 127.0.0.1 (pdu=161)
net      1709  client: got eta tag 0x2442d90, 1
net      1709  client: handle localhost0k][r=387k,w=0 IOPS][eta 00m:00s]
net      1709  client: got cmd op IOLOG from localhost (pdu=11297502)
net      1709  client: request eta (2)
net      1709  client: requested eta tag 0x2442d90
net      1709  client: handle 127.0.0.1
net      1709  client: got cmd op IOLOG from 127.0.0.1 (pdu=11297330)
net      1709  client: handle localhost
net      1709  client: got cmd op IOLOG from localhost (pdu=290)
net      1709  client: request eta (2)
net      1709  client: requested eta tag 0x2449df0
client <127.0.0.1>: timeout on SEND_ETA
client <localhost>: timeout on SEND_ETA
net      1709  client: handle 127.0.0.1
net      1709  client: got cmd op ETA from 127.0.0.1 (pdu=161)
fio: client: unable to find matching tag (2442c80)

--
Sitsofe | http://sucs.org/~sits/


end of thread

Thread overview: 8+ messages
2017-06-06 17:39 fio server errors Jeff Furlong
2017-06-07  6:18 ` Sitsofe Wheeler
2017-06-08 21:53   ` Jeff Furlong
2017-06-14  6:13     ` Sitsofe Wheeler
2017-06-14 22:20       ` Jeff Furlong
2017-06-15  6:50         ` Sitsofe Wheeler
2017-06-15 22:59           ` Jeff Furlong
2017-06-16 17:58             ` Jeff Furlong
