All of lore.kernel.org
 help / color / mirror / Atom feed
* fio server/client disconnect bug
@ 2018-02-16  0:32 Jeff Furlong
  2018-03-21  0:16 ` Jeff Furlong
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Furlong @ 2018-02-16  0:32 UTC (permalink / raw)
  To: fio

Hi All,
With the near latest fio version (fio-3.3-51-gf2cd) using server/client mode, I occasionally see one client get disconnected early.  Hence, the client's IO summary output does not get reported at the end of the job.  The issue seems to occur between 100 and 300 iterations of a job file.  Worse, the conditions for the early disconnect may be tied to the complexity of the job file (such as numjobs or IOPs of device).  Even worse, when using debug mode, the failure seems to reduce to around 1000 iterations of a job file.

When running with

fio --client=host1 test_job_a --client=host2 test_job_b --debug=process,net

...
net      5027  client: handle host2
net      5027  client: got cmd op IOLOG from host2 (pdu=446)
net      5027  client: handle host2
client: host=host2 disconnected
net      5027  client: removed <host2>
net      5027  client: request eta (1)
net      5027  client: requested eta tag 0x1b52c20
net      5027  client: handle host1
net      5027  client: got cmd op TEXT from host1 (pdu=85)
<host1> net      5028  server: got op [SEND_ETA], pdu=0, tag=1b4a970
net      5027  client: handle host1
net      5027  client: got cmd op TEXT from host1 (pdu=61)
<host1> net      5028  server sending status
...

Then the normal output summary only shows the statistics from host1.  The host2 throughput, latency, etc. are never displayed.  However, the iops, bw, and lat logs all seem to have been updated properly.  Sometimes host2 is disconnected early; sometimes host1 is disconnected early.

Why might host2 be disconnected?  I see disconnects when using a switch with one hop and also with directly connecting host1 to host2.  So dropping network packets seems unlikely.  Could the ETA update not be accurate?  Or is it possible host2 finishes the job faster than host1 and closes the connection too early?

In the event host2's job file finishes early, should we still summarize the IO traffic in the output?  How is that condition handled?

If you have suggestions on other debug options, I would appreciate.  Thanks.

Regards,
Jeff



^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: fio server/client disconnect bug
  2018-02-16  0:32 fio server/client disconnect bug Jeff Furlong
@ 2018-03-21  0:16 ` Jeff Furlong
  2018-03-21 15:19   ` Jens Axboe
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Furlong @ 2018-03-21  0:16 UTC (permalink / raw)
  To: fio

Revisiting this issue.  It seems the call stack is:

fio_handle_clients()
    fio_handle_client()
        case FIO_NET_CMD_TS:
            ops->thread_status(client, cmd);
            .thread_status    = handle_ts
                static void handle_ts(struct fio_client *client, struct fio_net_cmd *cmd)
                {
                    struct cmd_ts_pdu *p = (struct cmd_ts_pdu *) cmd->payload;
                    struct flist_head *opt_list = NULL;
                    struct json_object *tsobj;

                    if (client->opt_lists && p->ts.thread_number <= client->jobs)
                        opt_list = &client->opt_lists[p->ts.thread_number - 1];

                    tsobj = show_thread_status(&p->ts, &p->rs, opt_list, NULL);
                    client->did_stat = true;
                    if (tsobj) {
                        json_object_add_client_info(tsobj, client);
                        json_array_add_value_object(clients_array, tsobj);
                    }

                    if (sum_stat_clients <= 1)
                        return;

                    sum_thread_stats(&client_ts, &p->ts, sum_stat_nr == 1);
                    sum_group_stats(&client_gs, &p->rs);

                    client_ts.members++;
                    client_ts.thread_number = p->ts.thread_number;
                    client_ts.groupid = p->ts.groupid;
                    client_ts.unified_rw_rep = p->ts.unified_rw_rep;
                    client_ts.sig_figs = p->ts.sig_figs;

                    if (++sum_stat_nr == sum_stat_clients) {
                        strcpy(client_ts.name, "All clients");
                        tsobj = show_thread_status(&client_ts, &client_gs, NULL, NULL);
                        if (tsobj) {
                            json_object_add_client_info(tsobj, client);
                            json_array_add_value_object(clients_array, tsobj);
                        }
                    }
                }

And when sum_stat_clients <= 1, we never print "All clients" summary.  Actually, we miss an entire client, so neither the individual client summary is output nor the "all clients" summary is output.  It seems one client finishes just slightly before the other but we remove from the list of clients too quickly.  I tried adjusting the timeout and such, but didn't completely remove the issue.  Any specific thoughts?

I cut down the job files to the smallest I could find to reliably reproduce the issue.  It seems we need to log a few items to reproduce, but the job runtime itself can be quite small.


# cat test_job_a
[test_job_a]
description=test_job_a
ioengine=libaio
direct=1
rw=randread
iodepth=8
bs=64k
filename=/dev/nvme0n1
group_reporting
write_bw_log=test_job_a
write_iops_log=test_job_a
write_lat_log=test_job_a
log_avg_msec=1000
unified_rw_reporting=1
disable_lat=0
disable_clat=0
disable_slat=0
runtime=5s
time_based


# cat test_job_b
[test_job_b]
description=test_job_b
ioengine=libaio
direct=1
rw=randread
iodepth=8
bs=64k
filename=/dev/nvme0n1
group_reporting
write_bw_log=test_job_b
write_iops_log=test_job_b
write_lat_log=test_job_b
log_avg_msec=1000
unified_rw_reporting=1
disable_lat=0
disable_clat=0
disable_slat=0
runtime=5s
time_based


# fio --client=host1 test_job_a --client=host2 test_job_b --output=test_job


Regards,
Jeff


-----Original Message-----
From: fio-owner@vger.kernel.org [mailto:fio-owner@vger.kernel.org] On Behalf Of Jeff Furlong
Sent: Thursday, February 15, 2018 4:33 PM
To: fio@vger.kernel.org
Subject: fio server/client disconnect bug

Hi All,
With the near latest fio version (fio-3.3-51-gf2cd) using server/client mode, I occasionally see one client get disconnected early.  Hence, the client's IO summary output does not get reported at the end of the job.  The issue seems to occur between 100 and 300 iterations of a job file.  Worse, the conditions for the early disconnect may be tied to the complexity of the job file (such as numjobs or IOPs of device).  Even worse, when using debug mode, the failure seems to reduce to around 1000 iterations of a job file.

When running with

fio --client=host1 test_job_a --client=host2 test_job_b --debug=process,net

...
net      5027  client: handle host2
net      5027  client: got cmd op IOLOG from host2 (pdu=446)
net      5027  client: handle host2
client: host=host2 disconnected
net      5027  client: removed <host2>
net      5027  client: request eta (1)
net      5027  client: requested eta tag 0x1b52c20
net      5027  client: handle host1
net      5027  client: got cmd op TEXT from host1 (pdu=85)
<host1> net      5028  server: got op [SEND_ETA], pdu=0, tag=1b4a970
net      5027  client: handle host1
net      5027  client: got cmd op TEXT from host1 (pdu=61)
<host1> net      5028  server sending status
...

Then the normal output summary only shows the statistics from host1.  The host2 throughput, latency, etc. are never displayed.  However, the iops, bw, and lat logs all seem to have been updated properly.  Sometimes host2 is disconnected early; sometimes host1 is disconnected early.

Why might host2 be disconnected?  I see disconnects when using a switch with one hop and also with directly connecting host1 to host2.  So dropping network packets seems unlikely.  Could the ETA update not be accurate?  Or is it possible host2 finishes the job faster than host1 and closes the connection too early?

In the event host2's job file finishes early, should we still summarize the IO traffic in the output?  How is that condition handled?

If you have suggestions on other debug options, I would appreciate.  Thanks.

Regards,
Jeff



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fio server/client disconnect bug
  2018-03-21  0:16 ` Jeff Furlong
@ 2018-03-21 15:19   ` Jens Axboe
  2018-03-21 16:19     ` Jens Axboe
  0 siblings, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2018-03-21 15:19 UTC (permalink / raw)
  To: Jeff Furlong, fio

On 3/20/18 6:16 PM, Jeff Furlong wrote:
> Revisiting this issue.  It seems the call stack is:
> 
> fio_handle_clients()
>     fio_handle_client()
>         case FIO_NET_CMD_TS:
>             ops->thread_status(client, cmd);
>             .thread_status    = handle_ts
>                 static void handle_ts(struct fio_client *client, struct fio_net_cmd *cmd)
>                 {
>                     struct cmd_ts_pdu *p = (struct cmd_ts_pdu *) cmd->payload;
>                     struct flist_head *opt_list = NULL;
>                     struct json_object *tsobj;
> 
>                     if (client->opt_lists && p->ts.thread_number <= client->jobs)
>                         opt_list = &client->opt_lists[p->ts.thread_number - 1];
> 
>                     tsobj = show_thread_status(&p->ts, &p->rs, opt_list, NULL);
>                     client->did_stat = true;
>                     if (tsobj) {
>                         json_object_add_client_info(tsobj, client);
>                         json_array_add_value_object(clients_array, tsobj);
>                     }
> 
>                     if (sum_stat_clients <= 1)
>                         return;
> 
>                     sum_thread_stats(&client_ts, &p->ts, sum_stat_nr == 1);
>                     sum_group_stats(&client_gs, &p->rs);
> 
>                     client_ts.members++;
>                     client_ts.thread_number = p->ts.thread_number;
>                     client_ts.groupid = p->ts.groupid;
>                     client_ts.unified_rw_rep = p->ts.unified_rw_rep;
>                     client_ts.sig_figs = p->ts.sig_figs;
> 
>                     if (++sum_stat_nr == sum_stat_clients) {
>                         strcpy(client_ts.name, "All clients");
>                         tsobj = show_thread_status(&client_ts, &client_gs, NULL, NULL);
>                         if (tsobj) {
>                             json_object_add_client_info(tsobj, client);
>                             json_array_add_value_object(clients_array, tsobj);
>                         }
>                     }
>                 }
> 
> And when sum_stat_clients <= 1, we never print "All clients" summary.
> Actually, we miss an entire client, so neither the individual client
> summary is output nor the "all clients" summary is output.  It seems
> one client finishes just slightly before the other but we remove from
> the list of clients too quickly.  I tried adjusting the timeout and
> such, but didn't completely remove the issue.  Any specific thoughts?

sum_stat_clients is set when we start everything up, so that should
always be '2' for your case. So I'm a little puzzled as to what is going
on here. Do any of the jobs ever end in error, and that's why we are
missing a report from one of the jobs? Or are you referring to timing on
receiving the stats output, somehow racing with each other and we're
missing one of them? The latter could result in displaying just one
output, and never getting ++sum_stat_nr == 2 and displaying the "All
clients" output.

> I cut down the job files to the smallest I could find to reliably
> reproduce the issue.  It seems we need to log a few items to
> reproduce, but the job runtime itself can be quite small.
> 
> 
> # cat test_job_a
> [test_job_a]
> description=test_job_a
> ioengine=libaio
> direct=1
> rw=randread
> iodepth=8
> bs=64k
> filename=/dev/nvme0n1
> group_reporting
> write_bw_log=test_job_a
> write_iops_log=test_job_a
> write_lat_log=test_job_a
> log_avg_msec=1000
> unified_rw_reporting=1
> disable_lat=0
> disable_clat=0
> disable_slat=0
> runtime=5s
> time_based
> 
> 
> # cat test_job_b
> [test_job_b]
> description=test_job_b
> ioengine=libaio
> direct=1
> rw=randread
> iodepth=8
> bs=64k
> filename=/dev/nvme0n1
> group_reporting
> write_bw_log=test_job_b
> write_iops_log=test_job_b
> write_lat_log=test_job_b
> log_avg_msec=1000
> unified_rw_reporting=1
> disable_lat=0
> disable_clat=0
> disable_slat=0
> runtime=5s
> time_based
> 
> 
> # fio --client=host1 test_job_a --client=host2 test_job_b --output=test_job

I've run 2x100 iterations of this, and I haven't been able to reproduce
the issue so far. I'll try and dig some more, but any extra info or
debug output you may have would be very useful.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fio server/client disconnect bug
  2018-03-21 15:19   ` Jens Axboe
@ 2018-03-21 16:19     ` Jens Axboe
  2018-03-21 16:27       ` Jens Axboe
  0 siblings, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2018-03-21 16:19 UTC (permalink / raw)
  To: Jeff Furlong, fio

On 3/21/18 9:19 AM, Jens Axboe wrote:
> On 3/20/18 6:16 PM, Jeff Furlong wrote:
>> Revisiting this issue.  It seems the call stack is:
>>
>> fio_handle_clients()
>>     fio_handle_client()
>>         case FIO_NET_CMD_TS:
>>             ops->thread_status(client, cmd);
>>             .thread_status    = handle_ts
>>                 static void handle_ts(struct fio_client *client, struct fio_net_cmd *cmd)
>>                 {
>>                     struct cmd_ts_pdu *p = (struct cmd_ts_pdu *) cmd->payload;
>>                     struct flist_head *opt_list = NULL;
>>                     struct json_object *tsobj;
>>
>>                     if (client->opt_lists && p->ts.thread_number <= client->jobs)
>>                         opt_list = &client->opt_lists[p->ts.thread_number - 1];
>>
>>                     tsobj = show_thread_status(&p->ts, &p->rs, opt_list, NULL);
>>                     client->did_stat = true;
>>                     if (tsobj) {
>>                         json_object_add_client_info(tsobj, client);
>>                         json_array_add_value_object(clients_array, tsobj);
>>                     }
>>
>>                     if (sum_stat_clients <= 1)
>>                         return;
>>
>>                     sum_thread_stats(&client_ts, &p->ts, sum_stat_nr == 1);
>>                     sum_group_stats(&client_gs, &p->rs);
>>
>>                     client_ts.members++;
>>                     client_ts.thread_number = p->ts.thread_number;
>>                     client_ts.groupid = p->ts.groupid;
>>                     client_ts.unified_rw_rep = p->ts.unified_rw_rep;
>>                     client_ts.sig_figs = p->ts.sig_figs;
>>
>>                     if (++sum_stat_nr == sum_stat_clients) {
>>                         strcpy(client_ts.name, "All clients");
>>                         tsobj = show_thread_status(&client_ts, &client_gs, NULL, NULL);
>>                         if (tsobj) {
>>                             json_object_add_client_info(tsobj, client);
>>                             json_array_add_value_object(clients_array, tsobj);
>>                         }
>>                     }
>>                 }
>>
>> And when sum_stat_clients <= 1, we never print "All clients" summary.
>> Actually, we miss an entire client, so neither the individual client
>> summary is output nor the "all clients" summary is output.  It seems
>> one client finishes just slightly before the other but we remove from
>> the list of clients too quickly.  I tried adjusting the timeout and
>> such, but didn't completely remove the issue.  Any specific thoughts?
> 
> sum_stat_clients is set when we start everything up, so that should
> always be '2' for your case. So I'm a little puzzled as to what is going
> on here. Do any of the jobs ever end in error, and that's why we are
> missing a report from one of the jobs? Or are you referring to timing on
> receiving the stats output, somehow racing with each other and we're
> missing one of them? The latter could result in displaying just one
> output, and never getting ++sum_stat_nr == 2 and displaying the "All
> clients" output.

Does the below patch change anything for you? I forgot that we get
multiple starts (one from each client, of course), which means that we
really should protect the inc from there.

diff --git a/client.c b/client.c
index bff0adc0d972..fb1d1eb233d8 100644
--- a/client.c
+++ b/client.c
@@ -198,7 +198,7 @@ void fio_put_client(struct fio_client *client)
 		free(client->opt_lists);
 
 	if (!client->did_stat)
-		sum_stat_clients--;
+		__sync_fetch_and_sub(&sum_stat_clients, 1);
 
 	if (client->error)
 		error_clients++;
@@ -1440,7 +1440,7 @@ static void handle_start(struct fio_client *client, struct fio_net_cmd *cmd)
 			INIT_FLIST_HEAD(&client->opt_lists[i]);
 	}
 
-	sum_stat_clients += client->nr_stat;
+	__sync_fetch_and_add(&sum_stat_clients, client->nr_stat);
 }
 
 static void handle_stop(struct fio_client *client, struct fio_net_cmd *cmd)

-- 
Jens Axboe



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: fio server/client disconnect bug
  2018-03-21 16:19     ` Jens Axboe
@ 2018-03-21 16:27       ` Jens Axboe
  2018-03-21 16:35         ` Jeff Furlong
  0 siblings, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2018-03-21 16:27 UTC (permalink / raw)
  To: Jeff Furlong, fio

On 3/21/18 10:19 AM, Jens Axboe wrote:
> On 3/21/18 9:19 AM, Jens Axboe wrote:
>> On 3/20/18 6:16 PM, Jeff Furlong wrote:
>>> Revisiting this issue.  It seems the call stack is:
>>>
>>> fio_handle_clients()
>>>     fio_handle_client()
>>>         case FIO_NET_CMD_TS:
>>>             ops->thread_status(client, cmd);
>>>             .thread_status    = handle_ts
>>>                 static void handle_ts(struct fio_client *client, struct fio_net_cmd *cmd)
>>>                 {
>>>                     struct cmd_ts_pdu *p = (struct cmd_ts_pdu *) cmd->payload;
>>>                     struct flist_head *opt_list = NULL;
>>>                     struct json_object *tsobj;
>>>
>>>                     if (client->opt_lists && p->ts.thread_number <= client->jobs)
>>>                         opt_list = &client->opt_lists[p->ts.thread_number - 1];
>>>
>>>                     tsobj = show_thread_status(&p->ts, &p->rs, opt_list, NULL);
>>>                     client->did_stat = true;
>>>                     if (tsobj) {
>>>                         json_object_add_client_info(tsobj, client);
>>>                         json_array_add_value_object(clients_array, tsobj);
>>>                     }
>>>
>>>                     if (sum_stat_clients <= 1)
>>>                         return;
>>>
>>>                     sum_thread_stats(&client_ts, &p->ts, sum_stat_nr == 1);
>>>                     sum_group_stats(&client_gs, &p->rs);
>>>
>>>                     client_ts.members++;
>>>                     client_ts.thread_number = p->ts.thread_number;
>>>                     client_ts.groupid = p->ts.groupid;
>>>                     client_ts.unified_rw_rep = p->ts.unified_rw_rep;
>>>                     client_ts.sig_figs = p->ts.sig_figs;
>>>
>>>                     if (++sum_stat_nr == sum_stat_clients) {
>>>                         strcpy(client_ts.name, "All clients");
>>>                         tsobj = show_thread_status(&client_ts, &client_gs, NULL, NULL);
>>>                         if (tsobj) {
>>>                             json_object_add_client_info(tsobj, client);
>>>                             json_array_add_value_object(clients_array, tsobj);
>>>                         }
>>>                     }
>>>                 }
>>>
>>> And when sum_stat_clients <= 1, we never print "All clients" summary.
>>> Actually, we miss an entire client, so neither the individual client
>>> summary is output nor the "all clients" summary is output.  It seems
>>> one client finishes just slightly before the other but we remove from
>>> the list of clients too quickly.  I tried adjusting the timeout and
>>> such, but didn't completely remove the issue.  Any specific thoughts?
>>
>> sum_stat_clients is set when we start everything up, so that should
>> always be '2' for your case. So I'm a little puzzled as to what is going
>> on here. Do any of the jobs ever end in error, and that's why we are
>> missing a report from one of the jobs? Or are you referring to timing on
>> receiving the stats output, somehow racing with each other and we're
>> missing one of them? The latter could result in displaying just one
>> output, and never getting ++sum_stat_nr == 2 and displaying the "All
>> clients" output.
> 
> Does the below patch change anything for you? I forgot that we get
> multiple starts (one from each client, of course), which means that we
> really should protect the inc from there.

I don't think that's it, we serially handle the clients, so there should
be no room for a race there. Hmm, it's basically back to my theory where
we put a client that hasn't done stats yet. That way we can miss doing
the all clients display, since that condition will never be met. But I
don't see how that could happen, since I'm assuming that both of your
hosts always run to completion without error?

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: fio server/client disconnect bug
  2018-03-21 16:27       ` Jens Axboe
@ 2018-03-21 16:35         ` Jeff Furlong
  2018-03-21 16:38           ` Jens Axboe
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Furlong @ 2018-03-21 16:35 UTC (permalink / raw)
  To: Jens Axboe, fio

No client ends in error.  In fact, I get a full set of iops/lat/bw logs from both clients.  And inspection of those logs looks good; I see the last 1000ms timestamp is valid for the duration of runtime.  But only one client prints the summary info to the output file.

Regards,
Jeff


-----Original Message-----
From: Jens Axboe [mailto:axboe@kernel.dk] 
Sent: Wednesday, March 21, 2018 9:28 AM
To: Jeff Furlong <jeff.furlong@wdc.com>; fio@vger.kernel.org
Subject: Re: fio server/client disconnect bug

On 3/21/18 10:19 AM, Jens Axboe wrote:
> On 3/21/18 9:19 AM, Jens Axboe wrote:
>> On 3/20/18 6:16 PM, Jeff Furlong wrote:
>>> Revisiting this issue.  It seems the call stack is:
>>>
>>> fio_handle_clients()
>>>     fio_handle_client()
>>>         case FIO_NET_CMD_TS:
>>>             ops->thread_status(client, cmd);
>>>             .thread_status    = handle_ts
>>>                 static void handle_ts(struct fio_client *client, struct fio_net_cmd *cmd)
>>>                 {
>>>                     struct cmd_ts_pdu *p = (struct cmd_ts_pdu *) cmd->payload;
>>>                     struct flist_head *opt_list = NULL;
>>>                     struct json_object *tsobj;
>>>
>>>                     if (client->opt_lists && p->ts.thread_number <= client->jobs)
>>>                         opt_list = 
>>> &client->opt_lists[p->ts.thread_number - 1];
>>>
>>>                     tsobj = show_thread_status(&p->ts, &p->rs, opt_list, NULL);
>>>                     client->did_stat = true;
>>>                     if (tsobj) {
>>>                         json_object_add_client_info(tsobj, client);
>>>                         json_array_add_value_object(clients_array, tsobj);
>>>                     }
>>>
>>>                     if (sum_stat_clients <= 1)
>>>                         return;
>>>
>>>                     sum_thread_stats(&client_ts, &p->ts, sum_stat_nr == 1);
>>>                     sum_group_stats(&client_gs, &p->rs);
>>>
>>>                     client_ts.members++;
>>>                     client_ts.thread_number = p->ts.thread_number;
>>>                     client_ts.groupid = p->ts.groupid;
>>>                     client_ts.unified_rw_rep = p->ts.unified_rw_rep;
>>>                     client_ts.sig_figs = p->ts.sig_figs;
>>>
>>>                     if (++sum_stat_nr == sum_stat_clients) {
>>>                         strcpy(client_ts.name, "All clients");
>>>                         tsobj = show_thread_status(&client_ts, &client_gs, NULL, NULL);
>>>                         if (tsobj) {
>>>                             json_object_add_client_info(tsobj, client);
>>>                             json_array_add_value_object(clients_array, tsobj);
>>>                         }
>>>                     }
>>>                 }
>>>
>>> And when sum_stat_clients <= 1, we never print "All clients" summary.
>>> Actually, we miss an entire client, so neither the individual client 
>>> summary is output nor the "all clients" summary is output.  It seems 
>>> one client finishes just slightly before the other but we remove 
>>> from the list of clients too quickly.  I tried adjusting the timeout 
>>> and such, but didn't completely remove the issue.  Any specific thoughts?
>>
>> sum_stat_clients is set when we start everything up, so that should 
>> always be '2' for your case. So I'm a little puzzled as to what is 
>> going on here. Do any of the jobs ever end in error, and that's why 
>> we are missing a report from one of the jobs? Or are you referring to 
>> timing on receiving the stats output, somehow racing with each other 
>> and we're missing one of them? The latter could result in displaying 
>> just one output, and never getting ++sum_stat_nr == 2 and displaying 
>> the "All clients" output.
> 
> Does the below patch change anything for you? I forgot that we get 
> multiple starts (one from each client, of course), which means that we 
> really should protect the inc from there.

I don't think that's it, we serially handle the clients, so there should be no room for a race there. Hmm, it's basically back to my theory where we put a client that hasn't done stats yet. That way we can miss doing the all clients display, since that condition will never be met. But I don't see how that could happen, since I'm assuming that both of your hosts always run to completion without error?

--
Jens Axboe



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fio server/client disconnect bug
  2018-03-21 16:35         ` Jeff Furlong
@ 2018-03-21 16:38           ` Jens Axboe
  2018-03-21 16:58             ` Jeff Furlong
  0 siblings, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2018-03-21 16:38 UTC (permalink / raw)
  To: Jeff Furlong, fio

On 3/21/18 10:35 AM, Jeff Furlong wrote:
> No client ends in error.  In fact, I get a full set of iops/lat/bw logs from both clients.  And inspection of those logs looks good; I see the last 1000ms timestamp is valid for the duration of runtime.  But only one client prints the summary info to the output file.

Can you try with this debug patch and see if you can reproduce? Comparing
a good and bad run would likely be key in figuring out wtf is going on
here.

diff --git a/client.c b/client.c
index bff0adc0d972..76c89a496964 100644
--- a/client.c
+++ b/client.c
@@ -1664,7 +1664,7 @@ int fio_handle_client(struct fio_client *client)
 	if (!cmd)
 		return 0;
 
-	dprint(FD_NET, "client: got cmd op %s from %s (pdu=%u)\n",
+	printf("client: got cmd op %s from %s (pdu=%u)\n",
 		fio_server_op(cmd->opcode), client->hostname, cmd->pdu_len);
 
 	client->last_cmd = cmd->opcode;

-- 
Jens Axboe



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* RE: fio server/client disconnect bug
  2018-03-21 16:38           ` Jens Axboe
@ 2018-03-21 16:58             ` Jeff Furlong
  2018-03-21 17:12               ` Jens Axboe
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Furlong @ 2018-03-21 16:58 UTC (permalink / raw)
  To: Jens Axboe, fio

Patch applied to latest git.  Failed after 26 iterations.  The host1 did not print output to the summary file.

client: got cmd op PROBE from host2 (pdu=112)
client: got cmd op PROBE from host1 (pdu=112)
client: got cmd op START from host2 (pdu=8)
client: got cmd op START from host1 (pdu=8)
client: got cmd op ADD_JOB from host2 (pdu=23204)
client: got cmd op TEXT from host2 (pdu=70)
client: got cmd op TEXT from host2 (pdu=90)
client: got cmd op TEXT from host2 (pdu=51)
client: got cmd op ADD_JOB from host1 (pdu=23204)
client: got cmd op START from host2 (pdu=8)
client: got cmd op TEXT from host1 (pdu=70)
client: got cmd op TEXT from host1 (pdu=90)
client: got cmd op TEXT from host1 (pdu=51)
client: got cmd op START from host1 (pdu=8)
client: got cmd op TEXT from host2 (pdu=33)
client: got cmd op TEXT from host2 (pdu=33)
client: got cmd op TEXT from host2 (pdu=25)
client: got cmd op TEXT from host1 (pdu=33)
client: got cmd op TEXT from host1 (pdu=33)
client: got cmd op TEXT from host1 (pdu=25)
client: got cmd op SERVER_START from host2 (pdu=0)
client: got cmd op SERVER_START from host1 (pdu=0)
client: got cmd op ETA from host2 (pdu=165)
client: got cmd op ETA from host1 (pdu=165)
client: got cmd op ETA from host2 (pdu=165)
client: got cmd op ETA from host1 (pdu=165)
client: got cmd op ETA from host2 (pdu=165)
client: got cmd op ETA from host1 (pdu=165)
client: got cmd op ETA from host2 (pdu=165)=41.6k,w=0 IOPS][eta 00m:02s]
client: got cmd op ETA from host1 (pdu=165)
client: got cmd op ETA from host2 (pdu=165)=41.6k,w=0 IOPS][eta 00m:01s]
client: got cmd op ETA from host1 (pdu=165)
client: got cmd op IOLOG from host2 (pdu=338)41.6k,w=0 IOPS][eta 00m:00s]
client: got cmd op IOLOG from host2 (pdu=327)
client: got cmd op IOLOG from host2 (pdu=338)
client: got cmd op IOLOG from host1 (pdu=337)
client: got cmd op IOLOG from host1 (pdu=328)
client: got cmd op IOLOG from host1 (pdu=336)
client: got cmd op IOLOG from host2 (pdu=330)
client: got cmd op IOLOG from host2 (pdu=327)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op TS from host2 (pdu=94888)
client: got cmd op GS from host2 (pdu=164)
client: got cmd op IOLOG from host1 (pdu=324)
client: got cmd op DISK_UTIL from host2 (pdu=256)
client: got cmd op IOLOG from host1 (pdu=320)
client: got cmd op TEXT from host2 (pdu=25)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op STOP from host2 (pdu=8)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op QUIT from host2 (pdu=0)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op STOP from host1 (pdu=8)
client: got cmd op QUIT from host1 (pdu=0)

Here is a "good" run, but one thing I note is that the host2 prints to the summary file first, then host1 prints, then the "all clients" print.  But I list the clients in the order of host1 then host2 when running the fio cmd.

client: got cmd op PROBE from host2 (pdu=112)
client: got cmd op PROBE from host1 (pdu=112)
client: got cmd op START from host2 (pdu=8)
client: got cmd op START from host1 (pdu=8)
client: got cmd op ADD_JOB from host2 (pdu=23204)
client: got cmd op TEXT from host2 (pdu=70)
client: got cmd op TEXT from host2 (pdu=90)
client: got cmd op TEXT from host2 (pdu=51)
client: got cmd op START from host2 (pdu=8)
client: got cmd op ADD_JOB from host1 (pdu=23204)
client: got cmd op TEXT from host1 (pdu=70)
client: got cmd op TEXT from host1 (pdu=90)
client: got cmd op TEXT from host1 (pdu=51)
client: got cmd op START from host1 (pdu=8)
client: got cmd op TEXT from host2 (pdu=33)
client: got cmd op TEXT from host2 (pdu=33)
client: got cmd op TEXT from host2 (pdu=25)
client: got cmd op SERVER_START from host2 (pdu=0)
client: got cmd op TEXT from host1 (pdu=33)
client: got cmd op TEXT from host1 (pdu=33)
client: got cmd op TEXT from host1 (pdu=25)
client: got cmd op SERVER_START from host1 (pdu=0)
client: got cmd op ETA from host2 (pdu=165)
client: got cmd op ETA from host1 (pdu=165)
client: got cmd op ETA from host2 (pdu=165)
client: got cmd op ETA from host1 (pdu=165)
client: got cmd op ETA from host2 (pdu=165)
client: got cmd op ETA from host1 (pdu=165)
client: got cmd op ETA from host2 (pdu=165)=41.6k,w=0 IOPS][eta 00m:02s]
client: got cmd op ETA from host1 (pdu=165)
client: got cmd op ETA from host2 (pdu=165)=41.8k,w=0 IOPS][eta 00m:01s]
client: got cmd op ETA from host1 (pdu=165)
client: got cmd op IOLOG from host2 (pdu=338)41.8k,w=0 IOPS][eta 00m:00s]
client: got cmd op IOLOG from host2 (pdu=328)
client: got cmd op IOLOG from host2 (pdu=338)
client: got cmd op IOLOG from host1 (pdu=337)
client: got cmd op IOLOG from host1 (pdu=328)
client: got cmd op IOLOG from host1 (pdu=337)
client: got cmd op IOLOG from host2 (pdu=323)
client: got cmd op IOLOG from host2 (pdu=321)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op TS from host2 (pdu=94888)
client: got cmd op GS from host2 (pdu=164)
client: got cmd op DISK_UTIL from host2 (pdu=256)
client: got cmd op TEXT from host2 (pdu=25)
client: got cmd op STOP from host2 (pdu=8)
client: got cmd op QUIT from host2 (pdu=0)
client: got cmd op IOLOG from host1 (pdu=323)
client: got cmd op IOLOG from host1 (pdu=321)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op TS from host1 (pdu=94888)
client: got cmd op GS from host1 (pdu=164)
client: got cmd op DISK_UTIL from host1 (pdu=256)
client: got cmd op TEXT from host1 (pdu=25)
client: got cmd op STOP from host1 (pdu=8)
client: got cmd op QUIT from host1 (pdu=0)

Regards,
Jeff

-----Original Message-----
From: Jens Axboe [mailto:axboe@kernel.dk] 
Sent: Wednesday, March 21, 2018 9:39 AM
To: Jeff Furlong <jeff.furlong@wdc.com>; fio@vger.kernel.org
Subject: Re: fio server/client disconnect bug

On 3/21/18 10:35 AM, Jeff Furlong wrote:
> No client ends in error.  In fact, I get a full set of iops/lat/bw logs from both clients.  And inspection of those logs looks good; I see the last 1000ms timestamp is valid for the duration of runtime.  But only one client prints the summary info to the output file.

Can you try with this debug patch and see if you can reproduce? Comparing a good and bad run would likely be key in figuring out wtf is going on here.

diff --git a/client.c b/client.c
index bff0adc0d972..76c89a496964 100644
--- a/client.c
+++ b/client.c
@@ -1664,7 +1664,7 @@ int fio_handle_client(struct fio_client *client)
 	if (!cmd)
 		return 0;
 
-	dprint(FD_NET, "client: got cmd op %s from %s (pdu=%u)\n",
+	printf("client: got cmd op %s from %s (pdu=%u)\n",
 		fio_server_op(cmd->opcode), client->hostname, cmd->pdu_len);
 
 	client->last_cmd = cmd->opcode;

--
Jens Axboe



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: fio server/client disconnect bug
  2018-03-21 16:58             ` Jeff Furlong
@ 2018-03-21 17:12               ` Jens Axboe
  2018-03-21 18:16                 ` Jeff Furlong
  0 siblings, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2018-03-21 17:12 UTC (permalink / raw)
  To: Jeff Furlong, fio

On 3/21/18 10:58 AM, Jeff Furlong wrote:
> Patch applied to latest git.  Failed after 26 iterations.  The host1 did not print output to the summary file.
> 
> client: got cmd op PROBE from host2 (pdu=112)
> client: got cmd op PROBE from host1 (pdu=112)
> client: got cmd op START from host2 (pdu=8)
> client: got cmd op START from host1 (pdu=8)
> client: got cmd op ADD_JOB from host2 (pdu=23204)
> client: got cmd op TEXT from host2 (pdu=70)
> client: got cmd op TEXT from host2 (pdu=90)
> client: got cmd op TEXT from host2 (pdu=51)
> client: got cmd op ADD_JOB from host1 (pdu=23204)
> client: got cmd op START from host2 (pdu=8)
> client: got cmd op TEXT from host1 (pdu=70)
> client: got cmd op TEXT from host1 (pdu=90)
> client: got cmd op TEXT from host1 (pdu=51)
> client: got cmd op START from host1 (pdu=8)
> client: got cmd op TEXT from host2 (pdu=33)
> client: got cmd op TEXT from host2 (pdu=33)
> client: got cmd op TEXT from host2 (pdu=25)
> client: got cmd op TEXT from host1 (pdu=33)
> client: got cmd op TEXT from host1 (pdu=33)
> client: got cmd op TEXT from host1 (pdu=25)
> client: got cmd op SERVER_START from host2 (pdu=0)
> client: got cmd op SERVER_START from host1 (pdu=0)
> client: got cmd op ETA from host2 (pdu=165)
> client: got cmd op ETA from host1 (pdu=165)
> client: got cmd op ETA from host2 (pdu=165)
> client: got cmd op ETA from host1 (pdu=165)
> client: got cmd op ETA from host2 (pdu=165)
> client: got cmd op ETA from host1 (pdu=165)
> client: got cmd op ETA from host2 (pdu=165)=41.6k,w=0 IOPS][eta 00m:02s]
> client: got cmd op ETA from host1 (pdu=165)
> client: got cmd op ETA from host2 (pdu=165)=41.6k,w=0 IOPS][eta 00m:01s]
> client: got cmd op ETA from host1 (pdu=165)
> client: got cmd op IOLOG from host2 (pdu=338)41.6k,w=0 IOPS][eta 00m:00s]
> client: got cmd op IOLOG from host2 (pdu=327)
> client: got cmd op IOLOG from host2 (pdu=338)
> client: got cmd op IOLOG from host1 (pdu=337)
> client: got cmd op IOLOG from host1 (pdu=328)
> client: got cmd op IOLOG from host1 (pdu=336)
> client: got cmd op IOLOG from host2 (pdu=330)
> client: got cmd op IOLOG from host2 (pdu=327)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op TS from host2 (pdu=94888)
> client: got cmd op GS from host2 (pdu=164)
> client: got cmd op IOLOG from host1 (pdu=324)
> client: got cmd op DISK_UTIL from host2 (pdu=256)
> client: got cmd op IOLOG from host1 (pdu=320)
> client: got cmd op TEXT from host2 (pdu=25)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op STOP from host2 (pdu=8)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op QUIT from host2 (pdu=0)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op STOP from host1 (pdu=8)
> client: got cmd op QUIT from host1 (pdu=0)

So this is weird, after the JOB_OPT it should have send the gs/ts and so
on summary. Which is what triggers the stat display. Can you remove the
previous patch, and add this debug patch with some more printf goodness
in it?


diff --git a/client.c b/client.c
index bff0adc0d972..0f740b5f3057 100644
--- a/client.c
+++ b/client.c
@@ -179,6 +179,8 @@ static struct fio_client *find_client_by_fd(int fd)
 
 void fio_put_client(struct fio_client *client)
 {
+	printf("client %s removed %d\n", client->hostname, client->refs);
+
 	if (--client->refs)
 		return;
 
@@ -1664,7 +1666,7 @@ int fio_handle_client(struct fio_client *client)
 	if (!cmd)
 		return 0;
 
-	dprint(FD_NET, "client: got cmd op %s from %s (pdu=%u)\n",
+	printf("client: got cmd op %s from %s (pdu=%u)\n",
 		fio_server_op(cmd->opcode), client->hostname, cmd->pdu_len);
 
 	client->last_cmd = cmd->opcode;
diff --git a/stat.c b/stat.c
index 98ab63893900..c23a3c9ddabc 100644
--- a/stat.c
+++ b/stat.c
@@ -1882,6 +1882,8 @@ void __show_run_stats(void)
 		json_object_add_value_array(root, "jobs", array);
 	}
 
+	printf("back=%d, groupid=%d, nr_ts=%d\n", is_backend, groupid, nr_ts);
+
 	if (is_backend)
 		fio_server_send_job_options(&get_global_options()->opt_list, -1U);
 

-- 
Jens Axboe



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* RE: fio server/client disconnect bug
  2018-03-21 17:12               ` Jens Axboe
@ 2018-03-21 18:16                 ` Jeff Furlong
  2018-03-21 18:38                   ` Jens Axboe
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Furlong @ 2018-03-21 18:16 UTC (permalink / raw)
  To: Jens Axboe, fio

Here's the failing case where host1 does not print:

back=1, groupid=0, nr_ts=1
client: got cmd op PROBE from host2 (pdu=112)
client host2 removed 2
client: got cmd op PROBE from host1 (pdu=112)
client host1 removed 2
client: got cmd op START from host2 (pdu=8)
client host2 removed 2
client: got cmd op START from host1 (pdu=8)
client host1 removed 2
client: got cmd op ADD_JOB from host2 (pdu=23204)
client host2 removed 2
client: got cmd op TEXT from host2 (pdu=70)
client host2 removed 2
client: got cmd op TEXT from host2 (pdu=90)
client host2 removed 2
client: got cmd op ADD_JOB from host1 (pdu=23204)
client host1 removed 2
client: got cmd op TEXT from host2 (pdu=51)
client host2 removed 2
client: got cmd op TEXT from host1 (pdu=70)
client host1 removed 2
client: got cmd op START from host2 (pdu=8)
client host2 removed 2
client: got cmd op TEXT from host1 (pdu=90)
client host1 removed 2
client: got cmd op TEXT from host1 (pdu=51)
client host1 removed 2
client: got cmd op START from host1 (pdu=8)
client host1 removed 2
client: got cmd op TEXT from host2 (pdu=33)
client host2 removed 2
client: got cmd op TEXT from host2 (pdu=33)
client host2 removed 2
client: got cmd op TEXT from host2 (pdu=25)
client host2 removed 2
client: got cmd op TEXT from host1 (pdu=33)
client host1 removed 2
client: got cmd op TEXT from host1 (pdu=33)
client host1 removed 2
client: got cmd op TEXT from host1 (pdu=25)
client host1 removed 2
client: got cmd op SERVER_START from host2 (pdu=0)
client host2 removed 2
client: got cmd op SERVER_START from host1 (pdu=0)
client host1 removed 2
client: got cmd op ETA from host2 (pdu=165)
client host2 removed 2
client: got cmd op ETA from host1 (pdu=165)
client host1 removed 2
client: got cmd op ETA from host2 (pdu=165)
client host2 removed 2
client: got cmd op ETA from host1 (pdu=165)
client host1 removed 2
client: got cmd op ETA from host2 (pdu=165)
client host2 removed 2
client: got cmd op ETA from host1 (pdu=165)
client host1 removed 25.0%][r=2667M,w=0k][r=41.7k,w=0 IOPS][eta 00m:02s]
client: got cmd op ETA from host2 (pdu=165)
client host2 removed 2
client: got cmd op ETA from host1 (pdu=165)
client host1 removed 28.9%][r=2661M,w=0k][r=41.6k,w=0 IOPS][eta 00m:01s]
client: got cmd op ETA from host2 (pdu=165)
client host2 removed 2
client: got cmd op ETA from host1 (pdu=165)
client host1 removed 200.0%][r=2669M,w=0k][r=41.7k,w=0 IOPS][eta 00m:00s]
client: got cmd op IOLOG from host2 (pdu=338)
client host2 removed 2
client: got cmd op IOLOG from host2 (pdu=327)
client host2 removed 2
client: got cmd op IOLOG from host2 (pdu=337)
client host2 removed 2
client: got cmd op IOLOG from host1 (pdu=338)
client host1 removed 2
client: got cmd op IOLOG from host1 (pdu=325)
client host1 removed 2
client: got cmd op IOLOG from host1 (pdu=337)
client host1 removed 2
client: got cmd op IOLOG from host1 (pdu=323)
client host1 removed 2
client: got cmd op IOLOG from host1 (pdu=320)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op STOP from host1 (pdu=8)
client host1 removed 2
client: got cmd op QUIT from host1 (pdu=0)
client host1 removed 2
client host1 removed 1
client: got cmd op IOLOG from host2 (pdu=331)
client host2 removed 2
client: got cmd op IOLOG from host2 (pdu=327)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op TS from host2 (pdu=94888)
client host2 removed 2
client: got cmd op GS from host2 (pdu=164)
client host2 removed 2
client: got cmd op DISK_UTIL from host2 (pdu=256)
client host2 removed 2
client: got cmd op TEXT from host2 (pdu=25)
client host2 removed 2
client: got cmd op STOP from host2 (pdu=8)
client host2 removed 2
client: got cmd op QUIT from host2 (pdu=0)
client host2 removed 2
client host2 removed 1

And here's a good case:

client: got cmd op PROBE from host2 (pdu=112)
client host2 removed 2
client: got cmd op PROBE from host1 (pdu=112)
client host1 removed 2
client: got cmd op START from host2 (pdu=8)
client host2 removed 2
client: got cmd op START from host1 (pdu=8)
client host1 removed 2
client: got cmd op ADD_JOB from host2 (pdu=23204)
client host2 removed 2
client: got cmd op TEXT from host2 (pdu=70)
client host2 removed 2
client: got cmd op TEXT from host2 (pdu=90)
client host2 removed 2
client: got cmd op TEXT from host2 (pdu=51)
client host2 removed 2
client: got cmd op START from host2 (pdu=8)
client host2 removed 2
client: got cmd op ADD_JOB from host1 (pdu=23204)
client host1 removed 2
client: got cmd op TEXT from host1 (pdu=70)
client host1 removed 2
client: got cmd op TEXT from host1 (pdu=90)
client host1 removed 2
client: got cmd op TEXT from host1 (pdu=51)
client host1 removed 2
client: got cmd op START from host1 (pdu=8)
client host1 removed 2
client: got cmd op TEXT from host2 (pdu=33)
client host2 removed 2
client: got cmd op TEXT from host2 (pdu=33)
client host2 removed 2
client: got cmd op TEXT from host2 (pdu=25)
client host2 removed 2
client: got cmd op SERVER_START from host2 (pdu=0)
client host2 removed 2
client: got cmd op TEXT from host1 (pdu=33)
client host1 removed 2
client: got cmd op TEXT from host1 (pdu=33)
client host1 removed 2
client: got cmd op TEXT from host1 (pdu=25)
client host1 removed 2
client: got cmd op SERVER_START from host1 (pdu=0)
client host1 removed 2
client: got cmd op ETA from host2 (pdu=165)
client host2 removed 2
client: got cmd op ETA from host1 (pdu=165)
client host1 removed 2
client: got cmd op ETA from host2 (pdu=165)
client host2 removed 2
client: got cmd op ETA from host1 (pdu=165)
client host1 removed 2
client: got cmd op ETA from host2 (pdu=165)
client host2 removed 2
client: got cmd op ETA from host1 (pdu=165)
client host1 removed 25.0%][r=2667M,w=0k][r=41.7k,w=0 IOPS][eta 00m:02s]
client: got cmd op ETA from host2 (pdu=165)
client host2 removed 2
client: got cmd op ETA from host1 (pdu=165)
client host1 removed 28.9%][r=2669M,w=0k][r=41.7k,w=0 IOPS][eta 00m:01s]
client: got cmd op ETA from host2 (pdu=165)
client host2 removed 2
client: got cmd op ETA from host1 (pdu=165)
client host1 removed 200.0%][r=2675M,w=0k][r=41.8k,w=0 IOPS][eta 00m:00s]
client: got cmd op IOLOG from host2 (pdu=336)
client host2 removed 2
client: got cmd op IOLOG from host2 (pdu=327)
client host2 removed 2
client: got cmd op IOLOG from host2 (pdu=337)
client host2 removed 2
client: got cmd op IOLOG from host1 (pdu=338)
client host1 removed 2
client: got cmd op IOLOG from host1 (pdu=327)
client host1 removed 2
client: got cmd op IOLOG from host1 (pdu=338)
client host1 removed 2
client: got cmd op IOLOG from host2 (pdu=330)
client host2 removed 2
client: got cmd op IOLOG from host2 (pdu=327)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op JOB_OPT from host2 (pdu=200)
client host2 removed 2
client: got cmd op TS from host2 (pdu=94888)
client host2 removed 2
client: got cmd op GS from host2 (pdu=164)
client host2 removed 2
client: got cmd op DISK_UTIL from host2 (pdu=256)
client host2 removed 2
client: got cmd op TEXT from host2 (pdu=25)
client host2 removed 2
client: got cmd op STOP from host2 (pdu=8)
client host2 removed 2
client: got cmd op QUIT from host2 (pdu=0)
client host2 removed 2
client host2 removed 1
client: got cmd op IOLOG from host1 (pdu=330)
client host1 removed 2
client: got cmd op IOLOG from host1 (pdu=327)
back=1, groupid=0, nr_ts=1
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op JOB_OPT from host1 (pdu=200)
client host1 removed 2
client: got cmd op TS from host1 (pdu=94888)
client host1 removed 2
client: got cmd op GS from host1 (pdu=164)
client host1 removed 2
client: got cmd op DISK_UTIL from host1 (pdu=256)
client host1 removed 2
client: got cmd op TEXT from host1 (pdu=25)
client host1 removed 2
client: got cmd op STOP from host1 (pdu=8)
client host1 removed 2
client: got cmd op QUIT from host1 (pdu=0)
client host1 removed 2
client host1 removed 1

Regards,
Jeff


-----Original Message-----
From: Jens Axboe [mailto:axboe@kernel.dk] 
Sent: Wednesday, March 21, 2018 10:12 AM
To: Jeff Furlong <jeff.furlong@wdc.com>; fio@vger.kernel.org
Subject: Re: fio server/client disconnect bug

On 3/21/18 10:58 AM, Jeff Furlong wrote:
> Patch applied to latest git.  Failed after 26 iterations.  The host1 did not print output to the summary file.
> 
> client: got cmd op PROBE from host2 (pdu=112)
> client: got cmd op PROBE from host1 (pdu=112)
> client: got cmd op START from host2 (pdu=8)
> client: got cmd op START from host1 (pdu=8)
> client: got cmd op ADD_JOB from host2 (pdu=23204)
> client: got cmd op TEXT from host2 (pdu=70)
> client: got cmd op TEXT from host2 (pdu=90)
> client: got cmd op TEXT from host2 (pdu=51)
> client: got cmd op ADD_JOB from host1 (pdu=23204)
> client: got cmd op START from host2 (pdu=8)
> client: got cmd op TEXT from host1 (pdu=70)
> client: got cmd op TEXT from host1 (pdu=90)
> client: got cmd op TEXT from host1 (pdu=51)
> client: got cmd op START from host1 (pdu=8)
> client: got cmd op TEXT from host2 (pdu=33)
> client: got cmd op TEXT from host2 (pdu=33)
> client: got cmd op TEXT from host2 (pdu=25)
> client: got cmd op TEXT from host1 (pdu=33)
> client: got cmd op TEXT from host1 (pdu=33)
> client: got cmd op TEXT from host1 (pdu=25)
> client: got cmd op SERVER_START from host2 (pdu=0)
> client: got cmd op SERVER_START from host1 (pdu=0)
> client: got cmd op ETA from host2 (pdu=165)
> client: got cmd op ETA from host1 (pdu=165)
> client: got cmd op ETA from host2 (pdu=165)
> client: got cmd op ETA from host1 (pdu=165)
> client: got cmd op ETA from host2 (pdu=165)
> client: got cmd op ETA from host1 (pdu=165)
> client: got cmd op ETA from host2 (pdu=165)=41.6k,w=0 IOPS][eta 
> 00m:02s]
> client: got cmd op ETA from host1 (pdu=165)
> client: got cmd op ETA from host2 (pdu=165)=41.6k,w=0 IOPS][eta 
> 00m:01s]
> client: got cmd op ETA from host1 (pdu=165)
> client: got cmd op IOLOG from host2 (pdu=338)41.6k,w=0 IOPS][eta 
> 00m:00s]
> client: got cmd op IOLOG from host2 (pdu=327)
> client: got cmd op IOLOG from host2 (pdu=338)
> client: got cmd op IOLOG from host1 (pdu=337)
> client: got cmd op IOLOG from host1 (pdu=328)
> client: got cmd op IOLOG from host1 (pdu=336)
> client: got cmd op IOLOG from host2 (pdu=330)
> client: got cmd op IOLOG from host2 (pdu=327)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client: got cmd op TS from host2 (pdu=94888)
> client: got cmd op GS from host2 (pdu=164)
> client: got cmd op IOLOG from host1 (pdu=324)
> client: got cmd op DISK_UTIL from host2 (pdu=256)
> client: got cmd op IOLOG from host1 (pdu=320)
> client: got cmd op TEXT from host2 (pdu=25)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op STOP from host2 (pdu=8)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op QUIT from host2 (pdu=0)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client: got cmd op STOP from host1 (pdu=8)
> client: got cmd op QUIT from host1 (pdu=0)

So this is weird, after the JOB_OPT it should have send the gs/ts and so on summary. Which is what triggers the stat display. Can you remove the previous patch, and add this debug patch with some more printf goodness in it?


diff --git a/client.c b/client.c
index bff0adc0d972..0f740b5f3057 100644
--- a/client.c
+++ b/client.c
@@ -179,6 +179,8 @@ static struct fio_client *find_client_by_fd(int fd)
 
 void fio_put_client(struct fio_client *client)  {
+	printf("client %s removed %d\n", client->hostname, client->refs);
+
 	if (--client->refs)
 		return;
 
@@ -1664,7 +1666,7 @@ int fio_handle_client(struct fio_client *client)
 	if (!cmd)
 		return 0;
 
-	dprint(FD_NET, "client: got cmd op %s from %s (pdu=%u)\n",
+	printf("client: got cmd op %s from %s (pdu=%u)\n",
 		fio_server_op(cmd->opcode), client->hostname, cmd->pdu_len);
 
 	client->last_cmd = cmd->opcode;
diff --git a/stat.c b/stat.c
index 98ab63893900..c23a3c9ddabc 100644
--- a/stat.c
+++ b/stat.c
@@ -1882,6 +1882,8 @@ void __show_run_stats(void)
 		json_object_add_value_array(root, "jobs", array);
 	}
 
+	printf("back=%d, groupid=%d, nr_ts=%d\n", is_backend, groupid, nr_ts);
+
 	if (is_backend)
 		fio_server_send_job_options(&get_global_options()->opt_list, -1U);
 

--
Jens Axboe



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: fio server/client disconnect bug
  2018-03-21 18:16                 ` Jeff Furlong
@ 2018-03-21 18:38                   ` Jens Axboe
  2018-03-21 19:03                     ` Jens Axboe
  2018-03-21 19:23                     ` Jeff Furlong
  0 siblings, 2 replies; 21+ messages in thread
From: Jens Axboe @ 2018-03-21 18:38 UTC (permalink / raw)
  To: Jeff Furlong, fio

On 3/21/18 12:16 PM, Jeff Furlong wrote:
> Here's the failing case where host1 does not print:
> 
> back=1, groupid=0, nr_ts=1
> client: got cmd op PROBE from host2 (pdu=112)
> client host2 removed 2
> client: got cmd op PROBE from host1 (pdu=112)
> client host1 removed 2
> client: got cmd op START from host2 (pdu=8)
> client host2 removed 2
> client: got cmd op START from host1 (pdu=8)
> client host1 removed 2
> client: got cmd op ADD_JOB from host2 (pdu=23204)
> client host2 removed 2
> client: got cmd op TEXT from host2 (pdu=70)
> client host2 removed 2
> client: got cmd op TEXT from host2 (pdu=90)
> client host2 removed 2
> client: got cmd op ADD_JOB from host1 (pdu=23204)
> client host1 removed 2
> client: got cmd op TEXT from host2 (pdu=51)
> client host2 removed 2
> client: got cmd op TEXT from host1 (pdu=70)
> client host1 removed 2
> client: got cmd op START from host2 (pdu=8)
> client host2 removed 2
> client: got cmd op TEXT from host1 (pdu=90)
> client host1 removed 2
> client: got cmd op TEXT from host1 (pdu=51)
> client host1 removed 2
> client: got cmd op START from host1 (pdu=8)
> client host1 removed 2
> client: got cmd op TEXT from host2 (pdu=33)
> client host2 removed 2
> client: got cmd op TEXT from host2 (pdu=33)
> client host2 removed 2
> client: got cmd op TEXT from host2 (pdu=25)
> client host2 removed 2
> client: got cmd op TEXT from host1 (pdu=33)
> client host1 removed 2
> client: got cmd op TEXT from host1 (pdu=33)
> client host1 removed 2
> client: got cmd op TEXT from host1 (pdu=25)
> client host1 removed 2
> client: got cmd op SERVER_START from host2 (pdu=0)
> client host2 removed 2
> client: got cmd op SERVER_START from host1 (pdu=0)
> client host1 removed 2
> client: got cmd op ETA from host2 (pdu=165)
> client host2 removed 2
> client: got cmd op ETA from host1 (pdu=165)
> client host1 removed 2
> client: got cmd op ETA from host2 (pdu=165)
> client host2 removed 2
> client: got cmd op ETA from host1 (pdu=165)
> client host1 removed 2
> client: got cmd op ETA from host2 (pdu=165)
> client host2 removed 2
> client: got cmd op ETA from host1 (pdu=165)
> client host1 removed 25.0%][r=2667M,w=0k][r=41.7k,w=0 IOPS][eta 00m:02s]
> client: got cmd op ETA from host2 (pdu=165)
> client host2 removed 2
> client: got cmd op ETA from host1 (pdu=165)
> client host1 removed 28.9%][r=2661M,w=0k][r=41.6k,w=0 IOPS][eta 00m:01s]
> client: got cmd op ETA from host2 (pdu=165)
> client host2 removed 2
> client: got cmd op ETA from host1 (pdu=165)
> client host1 removed 200.0%][r=2669M,w=0k][r=41.7k,w=0 IOPS][eta 00m:00s]
> client: got cmd op IOLOG from host2 (pdu=338)
> client host2 removed 2
> client: got cmd op IOLOG from host2 (pdu=327)
> client host2 removed 2
> client: got cmd op IOLOG from host2 (pdu=337)
> client host2 removed 2
> client: got cmd op IOLOG from host1 (pdu=338)
> client host1 removed 2
> client: got cmd op IOLOG from host1 (pdu=325)
> client host1 removed 2
> client: got cmd op IOLOG from host1 (pdu=337)
> client host1 removed 2
> client: got cmd op IOLOG from host1 (pdu=323)
> client host1 removed 2
> client: got cmd op IOLOG from host1 (pdu=320)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op STOP from host1 (pdu=8)
> client host1 removed 2
> client: got cmd op QUIT from host1 (pdu=0)
> client host1 removed 2
> client host1 removed 1
> client: got cmd op IOLOG from host2 (pdu=331)
> client host2 removed 2
> client: got cmd op IOLOG from host2 (pdu=327)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op TS from host2 (pdu=94888)
> client host2 removed 2
> client: got cmd op GS from host2 (pdu=164)
> client host2 removed 2
> client: got cmd op DISK_UTIL from host2 (pdu=256)
> client host2 removed 2
> client: got cmd op TEXT from host2 (pdu=25)
> client host2 removed 2
> client: got cmd op STOP from host2 (pdu=8)
> client host2 removed 2
> client: got cmd op QUIT from host2 (pdu=0)
> client host2 removed 2
> client host2 removed 1
> 
> And here's a good case:
> 
> client: got cmd op PROBE from host2 (pdu=112)
> client host2 removed 2
> client: got cmd op PROBE from host1 (pdu=112)
> client host1 removed 2
> client: got cmd op START from host2 (pdu=8)
> client host2 removed 2
> client: got cmd op START from host1 (pdu=8)
> client host1 removed 2
> client: got cmd op ADD_JOB from host2 (pdu=23204)
> client host2 removed 2
> client: got cmd op TEXT from host2 (pdu=70)
> client host2 removed 2
> client: got cmd op TEXT from host2 (pdu=90)
> client host2 removed 2
> client: got cmd op TEXT from host2 (pdu=51)
> client host2 removed 2
> client: got cmd op START from host2 (pdu=8)
> client host2 removed 2
> client: got cmd op ADD_JOB from host1 (pdu=23204)
> client host1 removed 2
> client: got cmd op TEXT from host1 (pdu=70)
> client host1 removed 2
> client: got cmd op TEXT from host1 (pdu=90)
> client host1 removed 2
> client: got cmd op TEXT from host1 (pdu=51)
> client host1 removed 2
> client: got cmd op START from host1 (pdu=8)
> client host1 removed 2
> client: got cmd op TEXT from host2 (pdu=33)
> client host2 removed 2
> client: got cmd op TEXT from host2 (pdu=33)
> client host2 removed 2
> client: got cmd op TEXT from host2 (pdu=25)
> client host2 removed 2
> client: got cmd op SERVER_START from host2 (pdu=0)
> client host2 removed 2
> client: got cmd op TEXT from host1 (pdu=33)
> client host1 removed 2
> client: got cmd op TEXT from host1 (pdu=33)
> client host1 removed 2
> client: got cmd op TEXT from host1 (pdu=25)
> client host1 removed 2
> client: got cmd op SERVER_START from host1 (pdu=0)
> client host1 removed 2
> client: got cmd op ETA from host2 (pdu=165)
> client host2 removed 2
> client: got cmd op ETA from host1 (pdu=165)
> client host1 removed 2
> client: got cmd op ETA from host2 (pdu=165)
> client host2 removed 2
> client: got cmd op ETA from host1 (pdu=165)
> client host1 removed 2
> client: got cmd op ETA from host2 (pdu=165)
> client host2 removed 2
> client: got cmd op ETA from host1 (pdu=165)
> client host1 removed 25.0%][r=2667M,w=0k][r=41.7k,w=0 IOPS][eta 00m:02s]
> client: got cmd op ETA from host2 (pdu=165)
> client host2 removed 2
> client: got cmd op ETA from host1 (pdu=165)
> client host1 removed 28.9%][r=2669M,w=0k][r=41.7k,w=0 IOPS][eta 00m:01s]
> client: got cmd op ETA from host2 (pdu=165)
> client host2 removed 2
> client: got cmd op ETA from host1 (pdu=165)
> client host1 removed 200.0%][r=2675M,w=0k][r=41.8k,w=0 IOPS][eta 00m:00s]
> client: got cmd op IOLOG from host2 (pdu=336)
> client host2 removed 2
> client: got cmd op IOLOG from host2 (pdu=327)
> client host2 removed 2
> client: got cmd op IOLOG from host2 (pdu=337)
> client host2 removed 2
> client: got cmd op IOLOG from host1 (pdu=338)
> client host1 removed 2
> client: got cmd op IOLOG from host1 (pdu=327)
> client host1 removed 2
> client: got cmd op IOLOG from host1 (pdu=338)
> client host1 removed 2
> client: got cmd op IOLOG from host2 (pdu=330)
> client host2 removed 2
> client: got cmd op IOLOG from host2 (pdu=327)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200)
> client host2 removed 2
> client: got cmd op TS from host2 (pdu=94888)
> client host2 removed 2
> client: got cmd op GS from host2 (pdu=164)
> client host2 removed 2
> client: got cmd op DISK_UTIL from host2 (pdu=256)
> client host2 removed 2
> client: got cmd op TEXT from host2 (pdu=25)
> client host2 removed 2
> client: got cmd op STOP from host2 (pdu=8)
> client host2 removed 2
> client: got cmd op QUIT from host2 (pdu=0)
> client host2 removed 2
> client host2 removed 1
> client: got cmd op IOLOG from host1 (pdu=330)
> client host1 removed 2
> client: got cmd op IOLOG from host1 (pdu=327)
> back=1, groupid=0, nr_ts=1
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200)
> client host1 removed 2
> client: got cmd op TS from host1 (pdu=94888)
> client host1 removed 2
> client: got cmd op GS from host1 (pdu=164)
> client host1 removed 2
> client: got cmd op DISK_UTIL from host1 (pdu=256)
> client host1 removed 2
> client: got cmd op TEXT from host1 (pdu=25)
> client host1 removed 2
> client: got cmd op STOP from host1 (pdu=8)
> client host1 removed 2
> client: got cmd op QUIT from host1 (pdu=0)
> client host1 removed 2
> client host1 removed 1

So puzzling... It prints all the right things, but for some reason we
are not doing the ts/gs/disk_util send for the host in some cases.
For the life of me, I can't see how that would happen! We just go
straight from the job option sends, and then don't do the rest in
__show_run_stats().

And thanks for the quick turn-around on the debugging, it's key
to getting this resolved since I can't reproduce this myself. Can you
run with this one? Backout the previous one first.


diff --git a/client.c b/client.c
index bff0adc0d972..76c89a496964 100644
--- a/client.c
+++ b/client.c
@@ -1664,7 +1664,7 @@ int fio_handle_client(struct fio_client *client)
 	if (!cmd)
 		return 0;
 
-	dprint(FD_NET, "client: got cmd op %s from %s (pdu=%u)\n",
+	printf("client: got cmd op %s from %s (pdu=%u)\n",
 		fio_server_op(cmd->opcode), client->hostname, cmd->pdu_len);
 
 	client->last_cmd = cmd->opcode;
diff --git a/server.c b/server.c
index 65d4484edae5..e78f6f64824d 100644
--- a/server.c
+++ b/server.c
@@ -551,6 +551,8 @@ static void fio_net_queue_entry(struct sk_entry *entry)
 {
 	struct sk_out *sk_out = pthread_getspecific(sk_out_key);
 
+	printf("queue cmd op=%x, sock=%d\n", entry->opcode, sk_out->sk);
+
 	if (entry->flags & SK_F_INLINE)
 		handle_sk_entry(sk_out, entry);
 	else {
@@ -1177,6 +1179,7 @@ static int handle_xmits(struct sk_out *sk_out)
 	while (!flist_empty(&list)) {
 		entry = flist_entry(list.next, struct sk_entry, list);
 		flist_del(&entry->list);
+		printf("xmit op %x\n", entry->opcode);
 		ret += handle_sk_entry(sk_out, entry);
 	}
 
diff --git a/stat.c b/stat.c
index 98ab63893900..558786f52751 100644
--- a/stat.c
+++ b/stat.c
@@ -1891,6 +1891,7 @@ void __show_run_stats(void)
 
 		if (is_backend) {
 			fio_server_send_job_options(opt_lists[i], i);
+			printf("should send TS\n");
 			fio_server_send_ts(ts, rs);
 		} else {
 			if (output_format & FIO_OUTPUT_TERSE)
@@ -1905,6 +1906,7 @@ void __show_run_stats(void)
 	}
 	if (!is_backend && (output_format & FIO_OUTPUT_JSON)) {
 		/* disk util stats, if any */
+		printf("should send DU\n");
 		show_disk_util(1, root, &output[__FIO_OUTPUT_JSON]);
 
 		show_idle_prof_stats(FIO_OUTPUT_JSON, root, &output[__FIO_OUTPUT_JSON]);
@@ -1917,6 +1919,8 @@ void __show_run_stats(void)
 	for (i = 0; i < groupid + 1; i++) {
 		rs = &runstats[i];
 
+		printf("should send GS\n");
+
 		rs->groupid = i;
 		if (is_backend)
 			fio_server_send_gs(rs);

-- 
Jens Axboe



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: fio server/client disconnect bug
  2018-03-21 18:38                   ` Jens Axboe
@ 2018-03-21 19:03                     ` Jens Axboe
  2018-03-21 19:10                       ` Jens Axboe
  2018-03-21 19:23                     ` Jeff Furlong
  1 sibling, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2018-03-21 19:03 UTC (permalink / raw)
  To: Jeff Furlong, fio

Shoot, I think I see what it is. Some commands we punt to out-of-line
sending, others we do inline. STOP/QUIT are inline, whereas the ts/gs/du
are not. So we could potentially have a situation where we queue up the
ts/gs/du, but then process the STOP/QUIT first. If we do that, then we
don't get any output...

Let me think about this for a bit, and see what I can come up with.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fio server/client disconnect bug
  2018-03-21 19:03                     ` Jens Axboe
@ 2018-03-21 19:10                       ` Jens Axboe
  2018-03-21 19:44                         ` Jens Axboe
  2018-03-21 19:45                         ` Jeff Furlong
  0 siblings, 2 replies; 21+ messages in thread
From: Jens Axboe @ 2018-03-21 19:10 UTC (permalink / raw)
  To: Jeff Furlong, fio

On 3/21/18 1:03 PM, Jens Axboe wrote:
> Shoot, I think I see what it is. Some commands we punt to out-of-line
> sending, others we do inline. STOP/QUIT are inline, whereas the ts/gs/du
> are not. So we could potentially have a situation where we queue up the
> ts/gs/du, but then process the STOP/QUIT first. If we do that, then we
> don't get any output...
> 
> Let me think about this for a bit, and see what I can come up with.

Can you try with this? Can be improved to have the thread signal us,
but it should be a rare occurrence and a 1ms stall from this context
isn't a concern.

diff --git a/server.c b/server.c
index 65d4484edae5..52bcf7bf44c6 100644
--- a/server.c
+++ b/server.c
@@ -551,9 +551,15 @@ static void fio_net_queue_entry(struct sk_entry *entry)
 {
 	struct sk_out *sk_out = pthread_getspecific(sk_out_key);
 
-	if (entry->flags & SK_F_INLINE)
+	if (entry->flags & SK_F_INLINE) {
+		/*
+		 * Honor ordering - if the out-of-line queue isn't empty,
+		 * wait for it to drain before proceeding.
+		 */
+		while (!flist_empty(&sk_out->list))
+			usleep(1000);
 		handle_sk_entry(sk_out, entry);
-	else {
+	} else {
 		sk_lock(sk_out);
 		flist_add_tail(&entry->list, &sk_out->list);
 		sk_unlock(sk_out);

-- 
Jens Axboe



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* RE: fio server/client disconnect bug
  2018-03-21 18:38                   ` Jens Axboe
  2018-03-21 19:03                     ` Jens Axboe
@ 2018-03-21 19:23                     ` Jeff Furlong
  1 sibling, 0 replies; 21+ messages in thread
From: Jeff Furlong @ 2018-03-21 19:23 UTC (permalink / raw)
  To: Jens Axboe, fio

Here's the failing case:

client: got cmd op PROBE from host2 (pdu=112)
client: got cmd op PROBE from host1 (pdu=112)
client: got cmd op START from host2 (pdu=8)
client: got cmd op START from host1 (pdu=8)
client: got cmd op ADD_JOB from host2 (pdu=23204)
client: got cmd op TEXT from host2 (pdu=70)
client: got cmd op ADD_JOB from host1 (pdu=23204)
client: got cmd op TEXT from host2 (pdu=90)
client: got cmd op TEXT from host1 (pdu=70)
client: got cmd op TEXT from host2 (pdu=51)
client: got cmd op TEXT from host1 (pdu=90)
client: got cmd op START from host2 (pdu=8)
client: got cmd op TEXT from host1 (pdu=51)
client: got cmd op START from host1 (pdu=8)
client: got cmd op TEXT from host2 (pdu=33)
client: got cmd op TEXT from host2 (pdu=33)
client: got cmd op TEXT from host2 (pdu=25)
client: got cmd op TEXT from host1 (pdu=33)
client: got cmd op TEXT from host1 (pdu=33)
client: got cmd op TEXT from host1 (pdu=25)
client: got cmd op SERVER_START from host2 (pdu=0)
client: got cmd op SERVER_START from host1 (pdu=0)
client: got cmd op ETA from host2 (pdu=165)
client: got cmd op ETA from host1 (pdu=165)
client: got cmd op ETA from host2 (pdu=165)
client: got cmd op ETA from host1 (pdu=165)
client: got cmd op ETA from host2 (pdu=165)
client: got cmd op ETA from host1 (pdu=165)
client: got cmd op ETA from host2 (pdu=165)=41.7k,w=0 IOPS][eta 00m:02s]
client: got cmd op ETA from host1 (pdu=165)
client: got cmd op ETA from host2 (pdu=165)=41.7k,w=0 IOPS][eta 00m:01s]
client: got cmd op ETA from host1 (pdu=165)
client: got cmd op IOLOG from host2 (pdu=338)41.8k,w=0 IOPS][eta 00m:00s]
client: got cmd op IOLOG from host2 (pdu=328)
client: got cmd op IOLOG from host1 (pdu=338)
client: got cmd op IOLOG from host2 (pdu=338)
client: got cmd op IOLOG from host1 (pdu=329)
client: got cmd op IOLOG from host1 (pdu=338)
client: got cmd op IOLOG from host2 (pdu=331)
client: got cmd op IOLOG from host2 (pdu=327)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op TS from host2 (pdu=94888)
client: got cmd op STOP from host2 (pdu=8)
client: got cmd op IOLOG from host1 (pdu=329)
client: got cmd op QUIT from host2 (pdu=0)
client: got cmd op IOLOG from host1 (pdu=327)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op STOP from host1 (pdu=8)
client: got cmd op QUIT from host1 (pdu=0)

And a good case:

client: got cmd op PROBE from host2 (pdu=112)
client: got cmd op PROBE from host1 (pdu=112)
client: got cmd op START from host2 (pdu=8)
client: got cmd op START from host1 (pdu=8)
client: got cmd op ADD_JOB from host1 (pdu=23204)
client: got cmd op TEXT from host1 (pdu=70)
client: got cmd op TEXT from host1 (pdu=90)
client: got cmd op TEXT from host1 (pdu=51)
client: got cmd op START from host1 (pdu=8)
client: got cmd op ADD_JOB from host2 (pdu=23204)
client: got cmd op TEXT from host2 (pdu=70)
client: got cmd op TEXT from host2 (pdu=90)
client: got cmd op TEXT from host2 (pdu=51)
client: got cmd op START from host2 (pdu=8)
client: got cmd op TEXT from host1 (pdu=33)
client: got cmd op TEXT from host1 (pdu=33)
client: got cmd op TEXT from host1 (pdu=25)
client: got cmd op TEXT from host2 (pdu=33)
client: got cmd op TEXT from host2 (pdu=33)
client: got cmd op TEXT from host2 (pdu=25)
client: got cmd op SERVER_START from host1 (pdu=0)
client: got cmd op SERVER_START from host2 (pdu=0)
client: got cmd op ETA from host1 (pdu=165)
client: got cmd op ETA from host2 (pdu=165)
client: got cmd op ETA from host1 (pdu=165)
client: got cmd op ETA from host2 (pdu=165)
client: got cmd op ETA from host1 (pdu=165)
client: got cmd op ETA from host2 (pdu=165)
client: got cmd op ETA from host1 (pdu=165)=41.5k,w=0 IOPS][eta 00m:02s]
client: got cmd op ETA from host2 (pdu=165)
client: got cmd op ETA from host1 (pdu=165)=41.5k,w=0 IOPS][eta 00m:01s]
client: got cmd op ETA from host2 (pdu=165)
client: got cmd op IOLOG from host1 (pdu=336)41.7k,w=0 IOPS][eta 00m:00s]
client: got cmd op IOLOG from host1 (pdu=328)
client: got cmd op IOLOG from host1 (pdu=338)
client: got cmd op IOLOG from host2 (pdu=335)
client: got cmd op IOLOG from host2 (pdu=327)
client: got cmd op IOLOG from host2 (pdu=337)
client: got cmd op IOLOG from host1 (pdu=329)
client: got cmd op IOLOG from host1 (pdu=328)
should send TS
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op JOB_OPT from host1 (pdu=200)
should send GS
client: got cmd op JOB_OPT from host1 (pdu=200)
client: got cmd op TS from host1 (pdu=94888)
client: got cmd op STOP from host1 (pdu=8)
client: got cmd op QUIT from host1 (pdu=0)
client: got cmd op IOLOG from host2 (pdu=322)
client: got cmd op IOLOG from host2 (pdu=321)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op JOB_OPT from host2 (pdu=200)
client: got cmd op TS from host2 (pdu=94888)
client: got cmd op STOP from host2 (pdu=8)
client: got cmd op QUIT from host2 (pdu=0)

Let me try your latest patch.

Regards,
Jeff


-----Original Message-----
From: Jens Axboe [mailto:axboe@kernel.dk] 
Sent: Wednesday, March 21, 2018 11:39 AM
To: Jeff Furlong <jeff.furlong@wdc.com>; fio@vger.kernel.org
Subject: Re: fio server/client disconnect bug

On 3/21/18 12:16 PM, Jeff Furlong wrote:
> Here's the failing case where host1 does not print:
> 
> back=1, groupid=0, nr_ts=1
> client: got cmd op PROBE from host2 (pdu=112) client host2 removed 2
> client: got cmd op PROBE from host1 (pdu=112) client host1 removed 2
> client: got cmd op START from host2 (pdu=8) client host2 removed 2
> client: got cmd op START from host1 (pdu=8) client host1 removed 2
> client: got cmd op ADD_JOB from host2 (pdu=23204) client host2 removed 
> 2
> client: got cmd op TEXT from host2 (pdu=70) client host2 removed 2
> client: got cmd op TEXT from host2 (pdu=90) client host2 removed 2
> client: got cmd op ADD_JOB from host1 (pdu=23204) client host1 removed 
> 2
> client: got cmd op TEXT from host2 (pdu=51) client host2 removed 2
> client: got cmd op TEXT from host1 (pdu=70) client host1 removed 2
> client: got cmd op START from host2 (pdu=8) client host2 removed 2
> client: got cmd op TEXT from host1 (pdu=90) client host1 removed 2
> client: got cmd op TEXT from host1 (pdu=51) client host1 removed 2
> client: got cmd op START from host1 (pdu=8) client host1 removed 2
> client: got cmd op TEXT from host2 (pdu=33) client host2 removed 2
> client: got cmd op TEXT from host2 (pdu=33) client host2 removed 2
> client: got cmd op TEXT from host2 (pdu=25) client host2 removed 2
> client: got cmd op TEXT from host1 (pdu=33) client host1 removed 2
> client: got cmd op TEXT from host1 (pdu=33) client host1 removed 2
> client: got cmd op TEXT from host1 (pdu=25) client host1 removed 2
> client: got cmd op SERVER_START from host2 (pdu=0) client host2 
> removed 2
> client: got cmd op SERVER_START from host1 (pdu=0) client host1 
> removed 2
> client: got cmd op ETA from host2 (pdu=165) client host2 removed 2
> client: got cmd op ETA from host1 (pdu=165) client host1 removed 2
> client: got cmd op ETA from host2 (pdu=165) client host2 removed 2
> client: got cmd op ETA from host1 (pdu=165) client host1 removed 2
> client: got cmd op ETA from host2 (pdu=165) client host2 removed 2
> client: got cmd op ETA from host1 (pdu=165) client host1 removed 
> 25.0%][r=2667M,w=0k][r=41.7k,w=0 IOPS][eta 00m:02s]
> client: got cmd op ETA from host2 (pdu=165) client host2 removed 2
> client: got cmd op ETA from host1 (pdu=165) client host1 removed 
> 28.9%][r=2661M,w=0k][r=41.6k,w=0 IOPS][eta 00m:01s]
> client: got cmd op ETA from host2 (pdu=165) client host2 removed 2
> client: got cmd op ETA from host1 (pdu=165) client host1 removed 
> 200.0%][r=2669M,w=0k][r=41.7k,w=0 IOPS][eta 00m:00s]
> client: got cmd op IOLOG from host2 (pdu=338) client host2 removed 2
> client: got cmd op IOLOG from host2 (pdu=327) client host2 removed 2
> client: got cmd op IOLOG from host2 (pdu=337) client host2 removed 2
> client: got cmd op IOLOG from host1 (pdu=338) client host1 removed 2
> client: got cmd op IOLOG from host1 (pdu=325) client host1 removed 2
> client: got cmd op IOLOG from host1 (pdu=337) client host1 removed 2
> client: got cmd op IOLOG from host1 (pdu=323) client host1 removed 2
> client: got cmd op IOLOG from host1 (pdu=320) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op STOP from host1 (pdu=8) client host1 removed 2
> client: got cmd op QUIT from host1 (pdu=0) client host1 removed 2 
> client host1 removed 1
> client: got cmd op IOLOG from host2 (pdu=331) client host2 removed 2
> client: got cmd op IOLOG from host2 (pdu=327) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op TS from host2 (pdu=94888) client host2 removed 2
> client: got cmd op GS from host2 (pdu=164) client host2 removed 2
> client: got cmd op DISK_UTIL from host2 (pdu=256) client host2 removed 
> 2
> client: got cmd op TEXT from host2 (pdu=25) client host2 removed 2
> client: got cmd op STOP from host2 (pdu=8) client host2 removed 2
> client: got cmd op QUIT from host2 (pdu=0) client host2 removed 2 
> client host2 removed 1
> 
> And here's a good case:
> 
> client: got cmd op PROBE from host2 (pdu=112) client host2 removed 2
> client: got cmd op PROBE from host1 (pdu=112) client host1 removed 2
> client: got cmd op START from host2 (pdu=8) client host2 removed 2
> client: got cmd op START from host1 (pdu=8) client host1 removed 2
> client: got cmd op ADD_JOB from host2 (pdu=23204) client host2 removed 
> 2
> client: got cmd op TEXT from host2 (pdu=70) client host2 removed 2
> client: got cmd op TEXT from host2 (pdu=90) client host2 removed 2
> client: got cmd op TEXT from host2 (pdu=51) client host2 removed 2
> client: got cmd op START from host2 (pdu=8) client host2 removed 2
> client: got cmd op ADD_JOB from host1 (pdu=23204) client host1 removed 
> 2
> client: got cmd op TEXT from host1 (pdu=70) client host1 removed 2
> client: got cmd op TEXT from host1 (pdu=90) client host1 removed 2
> client: got cmd op TEXT from host1 (pdu=51) client host1 removed 2
> client: got cmd op START from host1 (pdu=8) client host1 removed 2
> client: got cmd op TEXT from host2 (pdu=33) client host2 removed 2
> client: got cmd op TEXT from host2 (pdu=33) client host2 removed 2
> client: got cmd op TEXT from host2 (pdu=25) client host2 removed 2
> client: got cmd op SERVER_START from host2 (pdu=0) client host2 
> removed 2
> client: got cmd op TEXT from host1 (pdu=33) client host1 removed 2
> client: got cmd op TEXT from host1 (pdu=33) client host1 removed 2
> client: got cmd op TEXT from host1 (pdu=25) client host1 removed 2
> client: got cmd op SERVER_START from host1 (pdu=0) client host1 
> removed 2
> client: got cmd op ETA from host2 (pdu=165) client host2 removed 2
> client: got cmd op ETA from host1 (pdu=165) client host1 removed 2
> client: got cmd op ETA from host2 (pdu=165) client host2 removed 2
> client: got cmd op ETA from host1 (pdu=165) client host1 removed 2
> client: got cmd op ETA from host2 (pdu=165) client host2 removed 2
> client: got cmd op ETA from host1 (pdu=165) client host1 removed 
> 25.0%][r=2667M,w=0k][r=41.7k,w=0 IOPS][eta 00m:02s]
> client: got cmd op ETA from host2 (pdu=165) client host2 removed 2
> client: got cmd op ETA from host1 (pdu=165) client host1 removed 
> 28.9%][r=2669M,w=0k][r=41.7k,w=0 IOPS][eta 00m:01s]
> client: got cmd op ETA from host2 (pdu=165) client host2 removed 2
> client: got cmd op ETA from host1 (pdu=165) client host1 removed 
> 200.0%][r=2675M,w=0k][r=41.8k,w=0 IOPS][eta 00m:00s]
> client: got cmd op IOLOG from host2 (pdu=336) client host2 removed 2
> client: got cmd op IOLOG from host2 (pdu=327) client host2 removed 2
> client: got cmd op IOLOG from host2 (pdu=337) client host2 removed 2
> client: got cmd op IOLOG from host1 (pdu=338) client host1 removed 2
> client: got cmd op IOLOG from host1 (pdu=327) client host1 removed 2
> client: got cmd op IOLOG from host1 (pdu=338) client host1 removed 2
> client: got cmd op IOLOG from host2 (pdu=330) client host2 removed 2
> client: got cmd op IOLOG from host2 (pdu=327) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op JOB_OPT from host2 (pdu=200) client host2 removed 2
> client: got cmd op TS from host2 (pdu=94888) client host2 removed 2
> client: got cmd op GS from host2 (pdu=164) client host2 removed 2
> client: got cmd op DISK_UTIL from host2 (pdu=256) client host2 removed 
> 2
> client: got cmd op TEXT from host2 (pdu=25) client host2 removed 2
> client: got cmd op STOP from host2 (pdu=8) client host2 removed 2
> client: got cmd op QUIT from host2 (pdu=0) client host2 removed 2 
> client host2 removed 1
> client: got cmd op IOLOG from host1 (pdu=330) client host1 removed 2
> client: got cmd op IOLOG from host1 (pdu=327) back=1, groupid=0, 
> nr_ts=1 client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op JOB_OPT from host1 (pdu=200) client host1 removed 2
> client: got cmd op TS from host1 (pdu=94888) client host1 removed 2
> client: got cmd op GS from host1 (pdu=164) client host1 removed 2
> client: got cmd op DISK_UTIL from host1 (pdu=256) client host1 removed 
> 2
> client: got cmd op TEXT from host1 (pdu=25) client host1 removed 2
> client: got cmd op STOP from host1 (pdu=8) client host1 removed 2
> client: got cmd op QUIT from host1 (pdu=0) client host1 removed 2 
> client host1 removed 1

So puzzling... It prints all the right things, but for some reason we are not doing the ts/gs/disk_util send for the host in some cases.
For the life of me, I can't see how that would happen! We just go straight from the job option sends, and then don't do the rest in __show_run_stats().

And thanks for the quick turn-around on the debugging, it's key to getting this resolved since I can't reproduce this myself. Can you run with this one? Backout the previous one first.


diff --git a/client.c b/client.c
index bff0adc0d972..76c89a496964 100644
--- a/client.c
+++ b/client.c
@@ -1664,7 +1664,7 @@ int fio_handle_client(struct fio_client *client)
 	if (!cmd)
 		return 0;
 
-	dprint(FD_NET, "client: got cmd op %s from %s (pdu=%u)\n",
+	printf("client: got cmd op %s from %s (pdu=%u)\n",
 		fio_server_op(cmd->opcode), client->hostname, cmd->pdu_len);
 
 	client->last_cmd = cmd->opcode;
diff --git a/server.c b/server.c
index 65d4484edae5..e78f6f64824d 100644
--- a/server.c
+++ b/server.c
@@ -551,6 +551,8 @@ static void fio_net_queue_entry(struct sk_entry *entry)  {
 	struct sk_out *sk_out = pthread_getspecific(sk_out_key);
 
+	printf("queue cmd op=%x, sock=%d\n", entry->opcode, sk_out->sk);
+
 	if (entry->flags & SK_F_INLINE)
 		handle_sk_entry(sk_out, entry);
 	else {
@@ -1177,6 +1179,7 @@ static int handle_xmits(struct sk_out *sk_out)
 	while (!flist_empty(&list)) {
 		entry = flist_entry(list.next, struct sk_entry, list);
 		flist_del(&entry->list);
+		printf("xmit op %x\n", entry->opcode);
 		ret += handle_sk_entry(sk_out, entry);
 	}
 
diff --git a/stat.c b/stat.c
index 98ab63893900..558786f52751 100644
--- a/stat.c
+++ b/stat.c
@@ -1891,6 +1891,7 @@ void __show_run_stats(void)
 
 		if (is_backend) {
 			fio_server_send_job_options(opt_lists[i], i);
+			printf("should send TS\n");
 			fio_server_send_ts(ts, rs);
 		} else {
 			if (output_format & FIO_OUTPUT_TERSE) @@ -1905,6 +1906,7 @@ void __show_run_stats(void)
 	}
 	if (!is_backend && (output_format & FIO_OUTPUT_JSON)) {
 		/* disk util stats, if any */
+		printf("should send DU\n");
 		show_disk_util(1, root, &output[__FIO_OUTPUT_JSON]);
 
 		show_idle_prof_stats(FIO_OUTPUT_JSON, root, &output[__FIO_OUTPUT_JSON]); @@ -1917,6 +1919,8 @@ void __show_run_stats(void)
 	for (i = 0; i < groupid + 1; i++) {
 		rs = &runstats[i];
 
+		printf("should send GS\n");
+
 		rs->groupid = i;
 		if (is_backend)
 			fio_server_send_gs(rs);

--
Jens Axboe



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: fio server/client disconnect bug
  2018-03-21 19:10                       ` Jens Axboe
@ 2018-03-21 19:44                         ` Jens Axboe
  2018-03-21 19:45                         ` Jeff Furlong
  1 sibling, 0 replies; 21+ messages in thread
From: Jens Axboe @ 2018-03-21 19:44 UTC (permalink / raw)
  To: Jeff Furlong, fio

On 3/21/18 1:10 PM, Jens Axboe wrote:
> On 3/21/18 1:03 PM, Jens Axboe wrote:
>> Shoot, I think I see what it is. Some commands we punt to out-of-line
>> sending, others we do inline. STOP/QUIT are inline, whereas the ts/gs/du
>> are not. So we could potentially have a situation where we queue up the
>> ts/gs/du, but then process the STOP/QUIT first. If we do that, then we
>> don't get any output...
>>
>> Let me think about this for a bit, and see what I can come up with.
> 
> Can you try with this? Can be improved to have the thread signal us,
> but it should be a rare occurrence and a 1ms stall from this context
> isn't a concern.
> 
> diff --git a/server.c b/server.c
> index 65d4484edae5..52bcf7bf44c6 100644
> --- a/server.c
> +++ b/server.c
> @@ -551,9 +551,15 @@ static void fio_net_queue_entry(struct sk_entry *entry)
>  {
>  	struct sk_out *sk_out = pthread_getspecific(sk_out_key);
>  
> -	if (entry->flags & SK_F_INLINE)
> +	if (entry->flags & SK_F_INLINE) {
> +		/*
> +		 * Honor ordering - if the out-of-line queue isn't empty,
> +		 * wait for it to drain before proceeding.
> +		 */
> +		while (!flist_empty(&sk_out->list))
> +			usleep(1000);
>  		handle_sk_entry(sk_out, entry);
> -	else {
> +	} else {
>  		sk_lock(sk_out);
>  		flist_add_tail(&entry->list, &sk_out->list);
>  		sk_unlock(sk_out);

Actually, this probably won't work since we also end up doing
inline commands from the loop that handles the out-of-line
commands. So the above can deadlock in that scenario.

Back to the drawing board...

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: fio server/client disconnect bug
  2018-03-21 19:10                       ` Jens Axboe
  2018-03-21 19:44                         ` Jens Axboe
@ 2018-03-21 19:45                         ` Jeff Furlong
  2018-03-21 19:49                           ` Jens Axboe
  1 sibling, 1 reply; 21+ messages in thread
From: Jeff Furlong @ 2018-03-21 19:45 UTC (permalink / raw)
  To: Jens Axboe, fio

There is noticeable slowdown with this patch.  Seems some iterations (even successful) are taking several seconds of usleep().  Eventually hit a new issue:

fio: client host1, timeout on cmd SEND_ETA
fio: client host1 timed out

But host1 still logged the iops/bw/lat logs, but did not send summary to output.

Regards,
Jeff

-----Original Message-----
From: Jens Axboe [mailto:axboe@kernel.dk] 
Sent: Wednesday, March 21, 2018 12:10 PM
To: Jeff Furlong <jeff.furlong@wdc.com>; fio@vger.kernel.org
Subject: Re: fio server/client disconnect bug

On 3/21/18 1:03 PM, Jens Axboe wrote:
> Shoot, I think I see what it is. Some commands we punt to out-of-line 
> sending, others we do inline. STOP/QUIT are inline, whereas the 
> ts/gs/du are not. So we could potentially have a situation where we 
> queue up the ts/gs/du, but then process the STOP/QUIT first. If we do 
> that, then we don't get any output...
> 
> Let me think about this for a bit, and see what I can come up with.

Can you try with this? Can be improved to have the thread signal us, but it should be a rare occurrence and a 1ms stall from this context isn't a concern.

diff --git a/server.c b/server.c
index 65d4484edae5..52bcf7bf44c6 100644
--- a/server.c
+++ b/server.c
@@ -551,9 +551,15 @@ static void fio_net_queue_entry(struct sk_entry *entry)  {
 	struct sk_out *sk_out = pthread_getspecific(sk_out_key);
 
-	if (entry->flags & SK_F_INLINE)
+	if (entry->flags & SK_F_INLINE) {
+		/*
+		 * Honor ordering - if the out-of-line queue isn't empty,
+		 * wait for it to drain before proceeding.
+		 */
+		while (!flist_empty(&sk_out->list))
+			usleep(1000);
 		handle_sk_entry(sk_out, entry);
-	else {
+	} else {
 		sk_lock(sk_out);
 		flist_add_tail(&entry->list, &sk_out->list);
 		sk_unlock(sk_out);

--
Jens Axboe



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: fio server/client disconnect bug
  2018-03-21 19:45                         ` Jeff Furlong
@ 2018-03-21 19:49                           ` Jens Axboe
  2018-03-21 21:45                             ` Jeff Furlong
  0 siblings, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2018-03-21 19:49 UTC (permalink / raw)
  To: Jeff Furlong, fio

On 3/21/18 1:45 PM, Jeff Furlong wrote:
> There is noticeable slowdown with this patch.  Seems some iterations (even successful) are taking several seconds of usleep().  Eventually hit a new issue:
> 
> fio: client host1, timeout on cmd SEND_ETA
> fio: client host1 timed out
> 
> But host1 still logged the iops/bw/lat logs, but did not send summary to output.

Can you try this one?


diff --git a/server.c b/server.c
index 65d4484edae5..6064d8172a3b 100644
--- a/server.c
+++ b/server.c
@@ -616,7 +616,7 @@ static int fio_net_queue_quit(void)
 {
 	dprint(FD_NET, "server: sending quit\n");
 
-	return fio_net_queue_cmd(FIO_NET_CMD_QUIT, NULL, 0, NULL, SK_F_SIMPLE | SK_F_INLINE);
+	return fio_net_queue_cmd(FIO_NET_CMD_QUIT, NULL, 0, NULL, SK_F_SIMPLE);
 }
 
 int fio_net_send_quit(int sk)
@@ -636,7 +636,7 @@ static int fio_net_send_ack(struct fio_net_cmd *cmd, int error, int signal)
 
 	epdu.error = __cpu_to_le32(error);
 	epdu.signal = __cpu_to_le32(signal);
-	return fio_net_queue_cmd(FIO_NET_CMD_STOP, &epdu, sizeof(epdu), &tag, SK_F_COPY | SK_F_INLINE);
+	return fio_net_queue_cmd(FIO_NET_CMD_STOP, &epdu, sizeof(epdu), &tag, SK_F_COPY);
 }
 
 static int fio_net_queue_stop(int error, int signal)

-- 
Jens Axboe



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* RE: fio server/client disconnect bug
  2018-03-21 19:49                           ` Jens Axboe
@ 2018-03-21 21:45                             ` Jeff Furlong
  2018-03-21 21:47                               ` Jens Axboe
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Furlong @ 2018-03-21 21:45 UTC (permalink / raw)
  To: Jens Axboe, fio

Ran over 600 iterations so far without issue.  Will keep running to get a higher confidence.  That way this issue will be fresh in our minds, given the fun sequence we just traced through.

So the summary is to not add the QUIT command to the inline queue, and just process it like most other typical commands?  The only thing that gets processed separately are errors or triggers?

Regards,
Jeff


-----Original Message-----
From: Jens Axboe [mailto:axboe@kernel.dk] 
Sent: Wednesday, March 21, 2018 12:50 PM
To: Jeff Furlong <jeff.furlong@wdc.com>; fio@vger.kernel.org
Subject: Re: fio server/client disconnect bug

On 3/21/18 1:45 PM, Jeff Furlong wrote:
> There is noticeable slowdown with this patch.  Seems some iterations (even successful) are taking several seconds of usleep().  Eventually hit a new issue:
> 
> fio: client host1, timeout on cmd SEND_ETA
> fio: client host1 timed out
> 
> But host1 still logged the iops/bw/lat logs, but did not send summary to output.

Can you try this one?


diff --git a/server.c b/server.c
index 65d4484edae5..6064d8172a3b 100644
--- a/server.c
+++ b/server.c
@@ -616,7 +616,7 @@ static int fio_net_queue_quit(void)  {
 	dprint(FD_NET, "server: sending quit\n");
 
-	return fio_net_queue_cmd(FIO_NET_CMD_QUIT, NULL, 0, NULL, SK_F_SIMPLE | SK_F_INLINE);
+	return fio_net_queue_cmd(FIO_NET_CMD_QUIT, NULL, 0, NULL, 
+SK_F_SIMPLE);
 }
 
 int fio_net_send_quit(int sk)
@@ -636,7 +636,7 @@ static int fio_net_send_ack(struct fio_net_cmd *cmd, int error, int signal)
 
 	epdu.error = __cpu_to_le32(error);
 	epdu.signal = __cpu_to_le32(signal);
-	return fio_net_queue_cmd(FIO_NET_CMD_STOP, &epdu, sizeof(epdu), &tag, SK_F_COPY | SK_F_INLINE);
+	return fio_net_queue_cmd(FIO_NET_CMD_STOP, &epdu, sizeof(epdu), &tag, 
+SK_F_COPY);
 }
 
 static int fio_net_queue_stop(int error, int signal)

--
Jens Axboe



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: fio server/client disconnect bug
  2018-03-21 21:45                             ` Jeff Furlong
@ 2018-03-21 21:47                               ` Jens Axboe
  2018-03-21 21:51                                 ` Jens Axboe
  0 siblings, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2018-03-21 21:47 UTC (permalink / raw)
  To: Jeff Furlong, fio

OK good, I will commit the fix.

The summary is not to do those commands inline. Usually we don't have
issues with inline vs out-of-line ordering, since we don't queue
deeply or depend on ordering. But for this specific case, it's pretty
unfortunate if commands we already queued up are not processed, before
the client sends a stop/quit and finishes.


On 3/21/18 3:45 PM, Jeff Furlong wrote:
> Ran over 600 iterations so far without issue.  Will keep running to get a higher confidence.  That way this issue will be fresh in our minds, given the fun sequence we just traced through.
> 
> So the summary is to not add the QUIT command to the inline queue, and just process it like most other typical commands?  The only thing that gets processed separately are errors or triggers?
> 
> Regards,
> Jeff
> 
> 
> -----Original Message-----
> From: Jens Axboe [mailto:axboe@kernel.dk] 
> Sent: Wednesday, March 21, 2018 12:50 PM
> To: Jeff Furlong <jeff.furlong@wdc.com>; fio@vger.kernel.org
> Subject: Re: fio server/client disconnect bug
> 
> On 3/21/18 1:45 PM, Jeff Furlong wrote:
>> There is noticeable slowdown with this patch.  Seems some iterations (even successful) are taking several seconds of usleep().  Eventually hit a new issue:
>>
>> fio: client host1, timeout on cmd SEND_ETA
>> fio: client host1 timed out
>>
>> But host1 still logged the iops/bw/lat logs, but did not send summary to output.
> 
> Can you try this one?
> 
> 
> diff --git a/server.c b/server.c
> index 65d4484edae5..6064d8172a3b 100644
> --- a/server.c
> +++ b/server.c
> @@ -616,7 +616,7 @@ static int fio_net_queue_quit(void)  {
>  	dprint(FD_NET, "server: sending quit\n");
>  
> -	return fio_net_queue_cmd(FIO_NET_CMD_QUIT, NULL, 0, NULL, SK_F_SIMPLE | SK_F_INLINE);
> +	return fio_net_queue_cmd(FIO_NET_CMD_QUIT, NULL, 0, NULL, 
> +SK_F_SIMPLE);
>  }
>  
>  int fio_net_send_quit(int sk)
> @@ -636,7 +636,7 @@ static int fio_net_send_ack(struct fio_net_cmd *cmd, int error, int signal)
>  
>  	epdu.error = __cpu_to_le32(error);
>  	epdu.signal = __cpu_to_le32(signal);
> -	return fio_net_queue_cmd(FIO_NET_CMD_STOP, &epdu, sizeof(epdu), &tag, SK_F_COPY | SK_F_INLINE);
> +	return fio_net_queue_cmd(FIO_NET_CMD_STOP, &epdu, sizeof(epdu), &tag, 
> +SK_F_COPY);
>  }
>  
>  static int fio_net_queue_stop(int error, int signal)
> 
> --
> Jens Axboe
> 


-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fio server/client disconnect bug
  2018-03-21 21:47                               ` Jens Axboe
@ 2018-03-21 21:51                                 ` Jens Axboe
  2018-03-21 21:58                                   ` Jeff Furlong
  0 siblings, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2018-03-21 21:51 UTC (permalink / raw)
  To: Jeff Furlong, fio

Done: http://git.kernel.dk/cgit/fio/commit/?id=b647a293d70949b8c401772936a8000d10eb1bb9


On 3/21/18 3:47 PM, Jens Axboe wrote:
> OK good, I will commit the fix.
> 
> The summary is not to do those commands inline. Usually we don't have
> issues with inline vs out-of-line ordering, since we don't queue
> deeply or depend on ordering. But for this specific case, it's pretty
> unfortunate if commands we already queued up are not processed, before
> the client sends a stop/quit and finishes.
> 
> 
> On 3/21/18 3:45 PM, Jeff Furlong wrote:
>> Ran over 600 iterations so far without issue.  Will keep running to get a higher confidence.  That way this issue will be fresh in our minds, given the fun sequence we just traced through.
>>
>> So the summary is to not add the QUIT command to the inline queue, and just process it like most other typical commands?  The only thing that gets processed separately are errors or triggers?
>>
>> Regards,
>> Jeff
>>
>>
>> -----Original Message-----
>> From: Jens Axboe [mailto:axboe@kernel.dk] 
>> Sent: Wednesday, March 21, 2018 12:50 PM
>> To: Jeff Furlong <jeff.furlong@wdc.com>; fio@vger.kernel.org
>> Subject: Re: fio server/client disconnect bug
>>
>> On 3/21/18 1:45 PM, Jeff Furlong wrote:
>>> There is noticeable slowdown with this patch.  Seems some iterations (even successful) are taking several seconds of usleep().  Eventually hit a new issue:
>>>
>>> fio: client host1, timeout on cmd SEND_ETA
>>> fio: client host1 timed out
>>>
>>> But host1 still logged the iops/bw/lat logs, but did not send summary to output.
>>
>> Can you try this one?
>>
>>
>> diff --git a/server.c b/server.c
>> index 65d4484edae5..6064d8172a3b 100644
>> --- a/server.c
>> +++ b/server.c
>> @@ -616,7 +616,7 @@ static int fio_net_queue_quit(void)  {
>>  	dprint(FD_NET, "server: sending quit\n");
>>  
>> -	return fio_net_queue_cmd(FIO_NET_CMD_QUIT, NULL, 0, NULL, SK_F_SIMPLE | SK_F_INLINE);
>> +	return fio_net_queue_cmd(FIO_NET_CMD_QUIT, NULL, 0, NULL, 
>> +SK_F_SIMPLE);
>>  }
>>  
>>  int fio_net_send_quit(int sk)
>> @@ -636,7 +636,7 @@ static int fio_net_send_ack(struct fio_net_cmd *cmd, int error, int signal)
>>  
>>  	epdu.error = __cpu_to_le32(error);
>>  	epdu.signal = __cpu_to_le32(signal);
>> -	return fio_net_queue_cmd(FIO_NET_CMD_STOP, &epdu, sizeof(epdu), &tag, SK_F_COPY | SK_F_INLINE);
>> +	return fio_net_queue_cmd(FIO_NET_CMD_STOP, &epdu, sizeof(epdu), &tag, 
>> +SK_F_COPY);
>>  }
>>  
>>  static int fio_net_queue_stop(int error, int signal)
>>
>> --
>> Jens Axboe
>>
> 
> 


-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: fio server/client disconnect bug
  2018-03-21 21:51                                 ` Jens Axboe
@ 2018-03-21 21:58                                   ` Jeff Furlong
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff Furlong @ 2018-03-21 21:58 UTC (permalink / raw)
  To: Jens Axboe, fio

Great.  Makes sense.  Thanks for the help here.

Regards,
Jeff


-----Original Message-----
From: Jens Axboe [mailto:axboe@kernel.dk] 
Sent: Wednesday, March 21, 2018 2:51 PM
To: Jeff Furlong <jeff.furlong@wdc.com>; fio@vger.kernel.org
Subject: Re: fio server/client disconnect bug

Done: http://git.kernel.dk/cgit/fio/commit/?id=b647a293d70949b8c401772936a8000d10eb1bb9


On 3/21/18 3:47 PM, Jens Axboe wrote:
> OK good, I will commit the fix.
> 
> The summary is not to do those commands inline. Usually we don't have 
> issues with inline vs out-of-line ordering, since we don't queue 
> deeply or depend on ordering. But for this specific case, it's pretty 
> unfortunate if commands we already queued up are not processed, before 
> the client sends a stop/quit and finishes.
> 
> 
> On 3/21/18 3:45 PM, Jeff Furlong wrote:
>> Ran over 600 iterations so far without issue.  Will keep running to get a higher confidence.  That way this issue will be fresh in our minds, given the fun sequence we just traced through.
>>
>> So the summary is to not add the QUIT command to the inline queue, and just process it like most other typical commands?  The only thing that gets processed separately are errors or triggers?
>>
>> Regards,
>> Jeff
>>
>>
>> -----Original Message-----
>> From: Jens Axboe [mailto:axboe@kernel.dk]
>> Sent: Wednesday, March 21, 2018 12:50 PM
>> To: Jeff Furlong <jeff.furlong@wdc.com>; fio@vger.kernel.org
>> Subject: Re: fio server/client disconnect bug
>>
>> On 3/21/18 1:45 PM, Jeff Furlong wrote:
>>> There is noticeable slowdown with this patch.  Seems some iterations (even successful) are taking several seconds of usleep().  Eventually hit a new issue:
>>>
>>> fio: client host1, timeout on cmd SEND_ETA
>>> fio: client host1 timed out
>>>
>>> But host1 still logged the iops/bw/lat logs, but did not send summary to output.
>>
>> Can you try this one?
>>
>>
>> diff --git a/server.c b/server.c
>> index 65d4484edae5..6064d8172a3b 100644
>> --- a/server.c
>> +++ b/server.c
>> @@ -616,7 +616,7 @@ static int fio_net_queue_quit(void)  {
>>  	dprint(FD_NET, "server: sending quit\n");
>>  
>> -	return fio_net_queue_cmd(FIO_NET_CMD_QUIT, NULL, 0, NULL, SK_F_SIMPLE | SK_F_INLINE);
>> +	return fio_net_queue_cmd(FIO_NET_CMD_QUIT, NULL, 0, NULL, 
>> +SK_F_SIMPLE);
>>  }
>>  
>>  int fio_net_send_quit(int sk)
>> @@ -636,7 +636,7 @@ static int fio_net_send_ack(struct fio_net_cmd 
>> *cmd, int error, int signal)
>>  
>>  	epdu.error = __cpu_to_le32(error);
>>  	epdu.signal = __cpu_to_le32(signal);
>> -	return fio_net_queue_cmd(FIO_NET_CMD_STOP, &epdu, sizeof(epdu), &tag, SK_F_COPY | SK_F_INLINE);
>> +	return fio_net_queue_cmd(FIO_NET_CMD_STOP, &epdu, sizeof(epdu), 
>> +&tag, SK_F_COPY);
>>  }
>>  
>>  static int fio_net_queue_stop(int error, int signal)
>>
>> --
>> Jens Axboe
>>
> 
> 


--
Jens Axboe



^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2018-03-21 21:58 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-16  0:32 fio server/client disconnect bug Jeff Furlong
2018-03-21  0:16 ` Jeff Furlong
2018-03-21 15:19   ` Jens Axboe
2018-03-21 16:19     ` Jens Axboe
2018-03-21 16:27       ` Jens Axboe
2018-03-21 16:35         ` Jeff Furlong
2018-03-21 16:38           ` Jens Axboe
2018-03-21 16:58             ` Jeff Furlong
2018-03-21 17:12               ` Jens Axboe
2018-03-21 18:16                 ` Jeff Furlong
2018-03-21 18:38                   ` Jens Axboe
2018-03-21 19:03                     ` Jens Axboe
2018-03-21 19:10                       ` Jens Axboe
2018-03-21 19:44                         ` Jens Axboe
2018-03-21 19:45                         ` Jeff Furlong
2018-03-21 19:49                           ` Jens Axboe
2018-03-21 21:45                             ` Jeff Furlong
2018-03-21 21:47                               ` Jens Axboe
2018-03-21 21:51                                 ` Jens Axboe
2018-03-21 21:58                                   ` Jeff Furlong
2018-03-21 19:23                     ` Jeff Furlong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.