Fio Checksum tracking and enhanced trim workloads

* Fio Checksum tracking and enhanced trim workloads
@ 2017-05-08  3:54 paul houlihan
  2017-05-08 14:18 ` Fwd: " Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: paul houlihan @ 2017-05-08  3:54 UTC (permalink / raw)
  To: fio, Jens Axboe

[-- Attachment #1: Type: text/plain, Size: 26048 bytes --]

I have a submission for fio that enhances the data corruption detection and
diagnosis capabilities taking fio from pretty good corruption detection to
absolute guarantees. I would like these changes on the tracking branch????
to be reviewed and considered for inclusion to fio. A quick review would be
helpful as I am losing access to test systems shortly.

These changes were used by a Virtual Machine caching company to assure data
integrity. Most testing was on Linux 64 bits and windows 32/64 bits. The
windows build still had an issue with compile time asserts in libfio.c that
I worked around by commenting out the asserts as this looked like a
performance restriction. This should be researched more. The initial
development was on version fio 2.2.10 sources and I just ported the changes
to fio latest sources and tested on linux but haven’t yet test on windows.
No testing on all other fio supported OSes was done, although the changes
are almost exclusively to OS independent code.

The absolute guarantees are brought about by tracking checksums to prevent
a stale but intact prior version of a block being returned and by verifying
all reads. I was surprised to learn about the number of times fio performed
concurrent I/O to the same blocks which yields indeterminate results that
prevent data integrity verification. Thus a number of options are not
supported when tracking is enabled.

Finally I have enhanced the usage of trims and am able to verify data
integrity of these operations in an integrated fashion.

Here is a list of changes in this submission:

 * Bug where expected version of verify_interval is not generated
correctly, dummy io_u not setup correctly

 * Bug where unknown header_interval referenced in HOWTO, fixed a bunch of
typos.

 * Bug where windows hangs on nano sleep in windows 7.

 * Also stonewall= option does not seem to work on windows 7, seems fixed
in later releases so painfully worked around this by having separate init
and run fio scripts. No change was made here but just mentioning this in
passing.

 * Fixed bug where FD_IO logging was screwed up in io_c.h. Here is example
of logging problem:

io       2212  io complete: io_u 0x787280: off=1048576/len=2097152/ddir=0io
      2212  /b.datio       2212

io       2212  fill_io_u: io_u 0x787280: off=3145728/len=2097152/ddir=1io
    2212  /b.datio       2212

io       2212  prep: io_u 0x787280: off=3145728/len=2097152/ddir=1io
2212  /b.datio       2212

io       2212  ->prep(0x787280)=0

io       2212  queue: io_u 0x787280: off=3145728/len=2097152/ddir=1io
2212  /b.datio       2212

 * In order to make fio into an superb data integrity test tool, a number
of shortcomings were addressed. New verify_track switch enables in memory
tracking of checksums within each fio job, preventing a block from rolling
back to prior version. The in memory checksums can be written to a tracking
log file to provide an absolute checksum guarantees between fio jobs or
between fio runs. Verification of trim operations is supported in an
integrated fashion. See HOWTO description of verify_tracking.
verify_tacking_log, verify_tracking_required, verify_tracking_dir,
verify_trim_zero

 * Enhanced description surrounding corruption added to HOWTO as well as
providing some corruption analyze tools.

 * Bad header will dump received buffer into *.received before you gave you
an error message

 * If verify_interval is less than the block size, fio will now always dump
the complete buffer in an additional file called *.complete. Seeing whole
buffer can reveal more about the corruption pattern.

 * Changed the printing of the hex checksum to display in MSB to LSB order
to facilitate compares to memory dumps and debug logging

 * Added a dump of the complete return buffer on trim write verification
failure.

 * Debug logging was being truncated at the end of a job so you could not
see the full set of debug log messages, so added a log flush at the end of
each job if debug= switch is used.

 * rw=readwrite seems to have independent last_pos read/write pointers as
you sequentially access the file. If the mix is 50/50 then you could have
fio reading and writing the same block as the read and write pointer cross
each other which is not reliably verifiable. This pattern result is chaos
and contradicts all the other sequential patterns and even randrw.
Overlapping I/O makes little sense and is usually a sign of a broken
application. Moreover readwrite workload would not complete a sequential
pass over the entire file which everyone I spoke to assumed it was doing.
So a change was made to the existing read/write workload functionality. Now
the max of the file’s last_pos pointers for DDIR_READ and DDIR_WRITE are
used for selecting the next offset as we sequentially scan a file. If the
old behavior is somehow useful then an option can be added to preserve it.
If preserved, it should never be the default and should disable
verification.

My changes revolve around maintaining the last_pos array in a special way.
When multiple operations (read/write/trim) are requested by a workload then
as the last position is changed, the changes are reflected in all three
entries in the array. This way a randomly selected next operation always
use the right last_pos. However we retained the old behavior for single
operation workloads and for trimwrite which operates like a single
operation workload.

 * Synchronous Trim I/O completions were not updating bytes_issued in
backend.c and thus trimwrite was actually making 2 passes of the file.

 * I kept the new verify_tracking verification entirely separate from the
experimental_verify code. These new tracking changes provides fully
persistent verification of trims integrated into standard verify, so we
might want to consider deprecating support for experimental_verify. Note
that verify_track and experimental_verify cannot both be enabled.

 * With the wide adoption of thin LUN datastores and recent expanded OS
support for trim operations to reclaim unused space, testing trim
operations in a wide variety of contexts has been a necessity. Added some
new trim I/O workloads to the existing trim workloads, that require use of
verify_tracking option to verify:

trim Sequential trims

readtrim Sequential mixed reads and trims

writetrim Sequential mixed writes and trims.

Each block will be trimmed or written.

readwritetrim Sequential mixed reads/writes/trims

randtrim Random trims

randreadtrim Random mixed reads and trims

randwritetrim Random mixed writes and trims

randrwt Random mixed reads/writes/trims

 * A second change to existing fio functionality involves an inconsistency
of counting read verification bytes against the size= argument. Some rw=
workloads count read verification I/Os or bytes against size= values (like
readwrite and randrw) and some do not  like write, trim and trimwrite.
Counting read verifications bytes makes it hard to predict the number of
bytes or I/Os that will be performed in the readwrite workload and the new
rw= workloads increases the unpredictability with even more read
verifications in a readwritetrim workload. Normally I expect that fio
should process all the bytes in a file pass but when the bytes from read
verifies count towards the total bytes to process in size=, only part of
the file is processed. So I made it consistent for size and io_limit by not
counting read verify bytes. One could argue that number_os= could also be
similarly changed but I left this alone and it still uses raw I/O counts
which include read verification I/Os. Another justification is that
this_io_bytes never records verification reads for the dry_run and we need
dry_run and do_io to be in synch. Note this explains why I removed code to
add extra bytes to total_bytes in do_io for verify_backlog.

 * Seems like the processing of TD_F_VER_NONE is backwards from its name.
If verify != VERIFY_NONE then the bit is set but the name implies it should
be clear. So now it sets the bit only if verify == VERIFY_NONE to avoid
this very confusing state.

 * Added a sync and invalidate after the close in iolog.c ipo_special().
This is needed if you capture checksums in the tracking log and there is a
close followed immediately by an open. The close is not immediate if you
have iodepth set to a large number. The file is still marked “open” but
“closing” on return from the close  and will close only after the last I/O
completes. The sync avoids the assert on trying to open an already open
file which has a close pending.

 * —read_iolog does not support trims at this time.

 * io_u.c get_next_seq_offset() seems to suggest that ddir_seq_add can be
negative but there are a number of unhandled cases with such a setting. Add
TODOs to document issues. I have a number of reservations about the
correctness of get_next_seq_offset(). Note whenever I saw a possible
problem in the code but did not have time to research it, I added a TODO
comment.

 * io_u.c get_next_seq_offset() has a problem when it uses fixed value when
relative values are what is being manipulated, so this code:

if (pos >= f->real_file_size)

pos = f->file_offset;

should be:

if (pos >= f->io_size)

pos = 0;

 * Given there are a couple of changes to existing fio workload behavior,
you might want to consider going to a V3.0.

Here are two new sections on Verification Tracking and Data Corruption
Troubleshooting from HOWTO:

Verification Tracking

---------------------

Absolute data integrity guarantee is the primary mission of a storage

software/hardware subsystem. Fio is good at detecting data corruption but

there are gaps. Currently only when rw option is set to read only are

workload reads verified. It is desirable to validate all reads in addition

to writes to protect against data rolling back to earlier versions.

With the addition of the block's offset to the header in recent fio
releases,

block data returned for another block will be flagged as corrupt. However

a limitation of the fio header and data embedded checksums is that fio
cannot

detect if a prior intact version of a block was returned on a read. If the

header and data checksum match the block is declared valid.

These limitations can be addressed by setting the verify_track option which

allocates a memory array to track the header and data checksums to assure

data integrity is absolute. The array starts out empty at the beginning of

each fio job and is filled in as reads or writes occur, once defined the

checksums from succeeding I/Os must all match. This option extends checksum

verification to all reads in all workloads, not just the read-only
workloads.

However use of verify_track requires that fio avoid overlapping, concurrent

reads and writes to the same block. Reading and writing a block at the same

time yields indeterminate results and making guaranteeing data integrity

impossible. So some fio options where this is a risk are disabled when using

verify_track. See verify_track argument for list of restrictions.

Even better verification would validate data more persistently. You would

like to track checksums persistently between fio jobs or between runs of fio

which could be after a shutdown/restart of the system or on a different
system

that shares storage. Proving seamless data integrity from the application

perspective over complex failover and recovery situations like reverting a

virtual machine to a prior snapshot is quite valuable.

Also the popularity of thin LUNs in the storage world has caused problems

if the unused disk space is not reclaimed by use of trims. So we would like

to have the ability to mix and match trims with reads and writes. The rw
option

now supports a full set of combinations and the rwtmix=read%,write%,trim%
option

allows specifying the mix percentages of all three types of I/O in one
argument.

However trims do have special requirements as documented under the rw
option.

Finally we would like to verify trims operations. If you read a trimmed
block

before re-writing the block, it should return a block of zeroes.

The verify_track_log option permits persistent checksum tracking and

verification of trims by enabling the saving of the tracking array to a
tracking

log on the close of a data file at the end of a fio job and reading it back
in

at the next start. A clean shutdown of fio is needed for tracking log to be

persistent. When no errors occur checksum context is automatically preserved

between fio jobs and fio runs. On revert of a virtual machine snapshot if

the tracking log is restored from the time of the snapshot then checksum

context is again preserved. There is a tracking log for each data file.

Tracking log filename format is: [dir] / [filename].tracking.log

where:

   filename - is name of file system file or block device name like “sdb”

   dir - is log directory that defaults to directory of data file.

         For block devices, dir defaults to the process current default

         directory.

The tracking log is plain text and contains data from when it was first
created:

the data file name it is tracking, the size of the data file, the starting

file offset for I/Os, its verify_interval option setting. From the last

save of the log it has: timestamp of last save and a checksum of the

tracking log contents. For checksums, Bit 0 = 1 defines a valid checksum.

Bit 0 = 0 signifies special case entries (dddddddc indicates a trimmed block

and 0 indicates an undefined entry).

Tracking Log Example with "--" comments added:

$ cat xxx.tracking.log

Fio-tracking-log-version: 1

DataFileName: xxx

DataFileSize: 2048

DataFileOffset: 0

DataFileVerifyInterval: 512

TrackingLogSaveTimestamp: 2017-02-23T14:25:32.446981

TrackingLogChecksum: cae34cd8

VerifyIntervalChecksums:

4028ab33    -- Checksums from read or write of 3 blocks, Bit 0 = 1

a450bffb

81858a3

dddddddc    -- Means trimmed block, Bit 0 = 0

0           -- Means undefined entry never been accessed, Bit 0 = 0

$

Tracking arguments are:

verify_track=bool - enables checksum tracking in memory

verify_track_log=bool - enable savings and restoring of tracking log

verify_track_required=bool - By default fio will create a log on the fly.

    If a log is found at the start it is read and then the log file is
deleted.

    If any error occurs during the fio run then the tracking log is not

    written on close so compromised logs do not cause false failures.
However

    testing requiring absolute data integrity guarantees will want to use
this

    option to require that the tracking log always be present between fio
jobs

    or at the start of a new fio run.

verify_track_dir=str - Specifies dir to place all tracking logs. It is
advisable

    when evaluating the data integrity of device to place the tracking log
on a

    different, more trusted device.

verify_track_trim_zero=bool - When no tracking array entry exists, this
option

    allows a zeroed block from prior fio run to be treated as previously
trimmed

    instead of as data corruption. Once the array entry for a block is
defined,

    this option is no longer used as the array entry determines the required

    verification.

debug=chksum - a new debug option allows tracing of all checksum entry

    additions/changes to the tracking array or entry use in verification

There are a couple considerations to be aware of when using tracking log.

Tracking log is sticky. If you change the following options that make

the tracking log no longer match the data layout then you will receive

a persistent error until the tracking log is recreated: size= or offset=

or verify_interval= options. You do get a friendly error indicating

what tracking log file to delete to start with a fresh tracking log. Note

if a fio run an fails with other errors, the tracking log is discarded so
that

stale checksums do not cause false failures on subsequent runs.

The tracking log uses 4 bytes for tracking each verify_interval block

in the data file or block device as specified by 4*(size/verify_interval).

So there are scaling implications for memory usage and log file size.

However blocks are only tracked for the active I/O range from:

offset - (offset+size-1).

The performance impact of the few extra I/Os to read and write the tracking
log

between fio jobs and fio runs is negligible since one is not usually
verifying

data when doing performance studies. There is no overhead when verify
tracking

is disabled and no extra I/Os when verify_track_log is disabled.

Data Corruption Troubleshooting

-------------------------------

When a corruption occurs immediate analysis can reveal many clues as to the

source of the corruption. Is the corruption persistent? In memory and on
disk?

The exact pattern of the corruption is often revealing: At the beginning of

an I/O block? Sector aligned? All zeroes or garbage? What is the exact range

of the corruption? Is corruption a stale but intact prior version of the

block?

When a corruption is detected, three possible corrupt data files are
created:

*.received - the corrupt data which is possibly a verify_interval block
within

              the full block used in the I/O.

*. complete - the full block used in the I/O

*. expected - if the block's header is intact, the expected data pattern for

              the *.received block can be generated

Two scripts exist in the analyze directory to assist in analysis:

corruption_triage.sh - a bash script that contains a sequence of diagnostic

              steps

fio_header.py - a python script that displays the contents of the block
header

              in a corrupt data file.

Here are the related parameter descriptions from HOWTO:

option verify_track=bool

Fio normally verifies data within a verify_interval with checksums and file

offsets embedded in the data. However a prior version of a block could be

returned and verified successfully. When verify_track is enabled the
checksum

for every verify_interval in the file is stored in a table and all read data

must match the checksums in the table. The tracking table is sized as

(size / verify_interval) * 4 bytes. For very large size= option settings,

such a large memory allocation may impact testing. Reads assume that the
entire

file has been previously written with a verification format using the same

verify_interval. When verify_track is enabled, all reads are verified,
whether

writes are present in the workload or not. Sharing files by threads within
a job

is supported but not between jobs running concurrently so use the stonewall

option when more than one non-global job is present. Verify of trimmed
blocks

is described for the verify_track_trim_zero option. When disabled, fio falls

back on verification described under the verify option. The restrictions
when

enabling the verify_track option are:

- randommap is required

- softrandommap is not supported

- lfsr random generator not supported when using multiple block sizes

- stonewall option required when more than one job present

- file size must be an even multiple of the block size when iodepth > 1

- verify_backlog not supported when iodepth > 1

- verify_async is not supported

- file sharing between concurrent jobs not supported

- numjobs must be 1

- io_submit_mode must be set to "inline"

- verify=null or pattern are not supported

- verify_only is not supported

- io_submit_mode must be set to 'inline'

- supplying a sequence number with rw option is not supported

- experimental_verify is not supported

Defaults to off.

You can enable verify_track for individual jobs and each job will start with

a empty table which is filled in as each block is initially read or written
and

enforced on subsequent reads within the job. For persistent tracking of
checksums

between jobs or fio runs, see verify_track_log.

option verify_track_log=bool

If set when verify_track is set then on a clean shutdown, fio writes the
checksum

for each data block that has been read or written to a log named

(datafilename).tracking.log. If set when fio reopens this data file and a
tracking

log exists then the checksums are read into the tracking table and used to
validate

every subsequent read. This allows rigorous validation of data integrity as
data

files are passed between fio jobs or over the termination of fio and
restart on

the same system or on another system or after an OS reboot. Reverting a
virtual

machine to a snapshot can be tested by saving the tracking log after a
successful

fio run and later restoring the saved log after reverting the virtual
machine.

The log is deleted after being read in, so on abnormal termination no stale

checksums can be used. This option, the data file size and verify_interval

parameters should not change between jobs in the same run or on restart of
fio.

Defaults to off. verify_track_dir defines the tracking log's directory.

option verify_track_required=bool

If set when verify_track_log is set then the tracking log for each file
must exist

at the start of a fio job or an error is returned. Defaults to off which is

the case for the first job in a new fio run. Subsequent jobs in this run can

require use of the tracking log. If set to off then any tracking log found
will be

used otherwise an empty tracking table is used. If a prior fio run created a

tracking log for the data file then all jobs can require use of the
tracking log.

option verify_track_dir=str

If verify_track_log is set then this defines the single directory for all
tracking

logs. The default is to use the same directory where each data file resides.

When filename points to a block device or pipe then the directory defaults
to the

current process default directory. To assure data integrity of the tracking
log,

each tracking log also contains its own checksum. However when checking a
device

for data integrity it is advisable to place tracking logs containing
checksums on

a different, more trusted device.

option verify_track_trim_zero=bool

Typically a read of a trimmed block that has not been re-written will
return a block

of zeros. If set with verify_tracking enabled then all zeroed blocks with
no tracking

information are assumed to have resulted from a trim. If clear zeroed
blocks are

treated as corruption. If your device does not return zeroed blocks for
reads after

a trim then it cannot participate in tracking verification. Fio sets to 1
if trims

are present in the rw argument and defaults 0 otherwise. You would only use
this when

verify_tracking is enabled, trims are not specified in the rw argument and
a prior

fio job or run had performed trims.

option readwrite=str, rw=str

Type of I/O pattern. Accepted values are:

read

Sequential reads.

write

Sequential writes.

randwrite

Random writes.

randread

Random reads.

rw,readwrite

Sequential mixed reads or writes.

randrw

Random mixed reads or writes.

Trim I/O has several requirements:

- File system and OS support varies but Linux block devices

  accept trims. You need privilege to write to a Linux block device.

  See example fio: track-mem.fio

- Often minimal block size required. Linux on VMware requires

  at least 1 MB in size aligned on 1 MB boundary

- VMware requires minimum VM OS hardware level 11

- To verify the trim I/Os requires verify_track

Trim I/O patterns are:

trim

Sequential trims

readtrim

Sequential mixed reads or trims

trimwrite

Sequential mixed trim then write. Each block

will be trimmed first, then written to.

writetrim

Sequential mixed writes or trims.

Each block will be trimmed or written.

rwt,readwritetrim

Sequential mixed reads/writes/trims

randtrim

Random trims

randreadtrim

Random mixed reads or trims

randwritetrim

Random mixed writes or trims

randrwt

Random mixed reads/writes/trims

Fio defaults to read if the option is not specified.  For the mixed I/O

types, the default is to split them 50/50.  For certain types of I/O the

result may still be skewed a bit, since the speed may be different. It is

possible to specify a number of I/O's to do before getting a new offset,

this is done by appending a ``:[nr]`` to the end of the string given.  For a

random read, it would look like ``rw=randread:8`` for passing in an offset

modifier with a value of 8. If the suffix is used with a sequential I/O

pattern, then the value specified will be added to the generated offset for

each I/O.  For instance, using ``rw=write:4k`` will skip 4k for every

write. It turns sequential I/O into sequential I/O with holes.  See the

:option:`rw_sequencer` option. Storage array vendors often require

trims to use a minimum block size.

option rwtmix=int[,int][,int]

When trims along with reads and/or writes are specified in the rw option
then

this is the preferred argument for specifying mix percentages. The argument
is

of the form: read,write,trim and the percentages must total 100.  Note any

argument may be empty to leave that value at its default from rwmix*
arguments

of 50,50,0. If a trailing comma isn't given, the remainder will inherit

the last value set.

[-- Attachment #2: Type: text/html, Size: 57590 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread