* Meta verification regression starting with fio 2.1.5
@ 2014-05-06 18:25 Stoo Davies
2014-05-06 21:59 ` Jens Axboe
0 siblings, 1 reply; 4+ messages in thread
From: Stoo Davies @ 2014-05-06 18:25 UTC (permalink / raw)
To: fio
I'm doing some powerfail recovery testing on a storage array over iSCSI.
Host is RHEL 6.4 kernel 2.6.32-358.el6.x86_64.
With fio 2.1.2 -> 2.1.4 the job file below rides through the disks going
away, and continues I/O after they come back, without reporting any errors.
With fio 2.1.5 -> 2.1.8 when the disks come back fio immediately reports
a meta verification error.
I captured a trace with an finisar analyzer, and can see that after the
disks come back and the host logs back in, a read is issued for an lba
which was never written to.
Since I don't see verification errors outside of the powerfail testing,
I suspect fio isn't correctly handling failed writes during the time the
disks are unavailable.
The trace file is rather large, but I can make it available if you need
to see it.
[whee]
bs=8k
thread=4
time_based=1
runtime=864000
readwrite=randrw
direct=1
iodepth=128
ioengine=libaio
size=100%
verify=meta
do_verify=1
verify_fatal=1
verify_dump=1
verify_backlog=8192
buffer_compress_percentage=95
ignore_error=ENODEV:EIO,ENODEV:EIO,ENODEV:EIO
filename=/dev/mapper/lun0
.
.
filename=/dev/mapper/lun9
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: Meta verification regression starting with fio 2.1.5
2014-05-06 18:25 Meta verification regression starting with fio 2.1.5 Stoo Davies
@ 2014-05-06 21:59 ` Jens Axboe
2014-05-07 0:35 ` Stoo Davies
0 siblings, 1 reply; 4+ messages in thread
From: Jens Axboe @ 2014-05-06 21:59 UTC (permalink / raw)
To: Stoo Davies, fio
[-- Attachment #1: Type: text/plain, Size: 1557 bytes --]
On 05/06/2014 12:25 PM, Stoo Davies wrote:
> I'm doing some powerfail recovery testing on a storage array over iSCSI.
> Host is RHEL 6.4 kernel 2.6.32-358.el6.x86_64.
>
> With fio 2.1.2 -> 2.1.4 the job file below rides through the disks going
> away, and continues I/O after they come back, without reporting any errors.
> With fio 2.1.5 -> 2.1.8 when the disks come back fio immediately reports
> a meta verification error.
>
> I captured a trace with an finisar analyzer, and can see that after the
> disks come back and the host logs back in, a read is issued for an lba
> which was never written to.
> Since I don't see verification errors outside of the powerfail testing,
> I suspect fio isn't correctly handling failed writes during the time the
> disks are unavailable.
>
> The trace file is rather large, but I can make it available if you need
> to see it.
>
> [whee]
> bs=8k
> thread=4
> time_based=1
> runtime=864000
> readwrite=randrw
> direct=1
> iodepth=128
> ioengine=libaio
> size=100%
> verify=meta
> do_verify=1
> verify_fatal=1
> verify_dump=1
> verify_backlog=8192
> buffer_compress_percentage=95
> ignore_error=ENODEV:EIO,ENODEV:EIO,ENODEV:EIO
> filename=/dev/mapper/lun0
> .
> .
> filename=/dev/mapper/lun9
2.1.5 did indeed change when the IO was logged for verification, so that
does explain why it fails for you now. That's a problem.
Can you try with this patch? I'm not going to commit it yet, I want to
carefully audit all paths to ensure we also unlog or trim an io_piece,
if we don't fully complete it.
--
Jens Axboe
[-- Attachment #2: unlog.patch --]
[-- Type: text/x-patch, Size: 2301 bytes --]
diff --git a/backend.c b/backend.c
index 9deef284e36b..e6a47716094f 100644
--- a/backend.c
+++ b/backend.c
@@ -780,6 +780,7 @@ static uint64_t do_io(struct thread_data *td)
case FIO_Q_COMPLETED:
if (io_u->error) {
ret = -io_u->error;
+ unlog_io_piece(td, io_u);
clear_io_u(td, io_u);
} else if (io_u->resid) {
int bytes = io_u->xfer_buflen - io_u->resid;
@@ -830,6 +831,7 @@ sync_done:
bytes_issued += io_u->xfer_buflen;
break;
case FIO_Q_BUSY:
+ unlog_io_piece(td, io_u);
requeue_io_u(td, &io_u);
ret2 = td_io_commit(td);
if (ret2 < 0)
diff --git a/io_u.c b/io_u.c
index 4b0b5a7a11bd..e132fd9d2d98 100644
--- a/io_u.c
+++ b/io_u.c
@@ -1622,8 +1622,15 @@ static void io_completed(struct thread_data *td, struct io_u *io_u,
* Mark IO ok to verify
*/
if (io_u->ipo) {
- io_u->ipo->flags &= ~IP_F_IN_FLIGHT;
- write_barrier();
+ /*
+ * Remove errored entry from the verification list
+ */
+ if (io_u->error)
+ unlog_io_piece(td, io_u);
+ else {
+ io_u->ipo->flags &= ~IP_F_IN_FLIGHT;
+ write_barrier();
+ }
}
td_io_u_unlock(td);
diff --git a/iolog.c b/iolog.c
index f49895929c34..33ec07afed4e 100644
--- a/iolog.c
+++ b/iolog.c
@@ -268,6 +268,23 @@ restart:
td->io_hist_len++;
}
+void unlog_io_piece(struct thread_data *td, struct io_u *io_u)
+{
+ struct io_piece *ipo = io_u->ipo;
+
+ if (!ipo)
+ return;
+
+ if (ipo->flags & IP_F_ONRB)
+ rb_erase(&ipo->rb_node, &td->io_hist_tree);
+ else if (ipo->flags & IP_F_ONLIST)
+ flist_del(&ipo->list);
+
+ free(ipo);
+ io_u->ipo = NULL;
+ td->io_hist_len--;
+}
+
void write_iolog_close(struct thread_data *td)
{
fflush(td->iolog_f);
diff --git a/iolog.h b/iolog.h
index 50d09e26bfbe..8b1d5880da74 100644
--- a/iolog.h
+++ b/iolog.h
@@ -110,6 +110,7 @@ extern void log_io_u(struct thread_data *, struct io_u *);
extern void log_file(struct thread_data *, struct fio_file *, enum file_log_act);
extern int __must_check init_iolog(struct thread_data *td);
extern void log_io_piece(struct thread_data *, struct io_u *);
+extern void unlog_io_piece(struct thread_data *, struct io_u *);
extern void queue_io_piece(struct thread_data *, struct io_piece *);
extern void prune_io_piece_log(struct thread_data *);
extern void write_iolog_close(struct thread_data *);
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: Meta verification regression starting with fio 2.1.5
2014-05-06 21:59 ` Jens Axboe
@ 2014-05-07 0:35 ` Stoo Davies
2014-05-07 1:10 ` Jens Axboe
0 siblings, 1 reply; 4+ messages in thread
From: Stoo Davies @ 2014-05-07 0:35 UTC (permalink / raw)
To: Jens Axboe, fio
On 05/06/2014 02:59 PM, Jens Axboe wrote:
> 2.1.5 did indeed change when the IO was logged for verification, so that
> does explain why it fails for you now. That's a problem.
>
> Can you try with this patch? I'm not going to commit it yet, I want to
> carefully audit all paths to ensure we also unlog or trim an io_piece,
> if we don't fully complete it.
>
10 loops through the powerfail test and all the hosts are still happy.
Looks good to me, thanks for the fast response.
Stoo
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: Meta verification regression starting with fio 2.1.5
2014-05-07 0:35 ` Stoo Davies
@ 2014-05-07 1:10 ` Jens Axboe
0 siblings, 0 replies; 4+ messages in thread
From: Jens Axboe @ 2014-05-07 1:10 UTC (permalink / raw)
To: Stoo Davies, fio
On 2014-05-06 18:35, Stoo Davies wrote:
> On 05/06/2014 02:59 PM, Jens Axboe wrote:
>> 2.1.5 did indeed change when the IO was logged for verification, so that
>> does explain why it fails for you now. That's a problem.
>>
>> Can you try with this patch? I'm not going to commit it yet, I want to
>> carefully audit all paths to ensure we also unlog or trim an io_piece,
>> if we don't fully complete it.
>>
> 10 loops through the powerfail test and all the hosts are still happy.
> Looks good to me, thanks for the fast response.
Thanks a lot for reporting it, I have now committed a patch that is
basically what I sent you, but catches the partial completion case as well.
--
Jens Axboe
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2014-05-07 1:10 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-06 18:25 Meta verification regression starting with fio 2.1.5 Stoo Davies
2014-05-06 21:59 ` Jens Axboe
2014-05-07 0:35 ` Stoo Davies
2014-05-07 1:10 ` Jens Axboe
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.