* md road-map: 2011
@ 2011-02-16 10:27 NeilBrown
  2011-02-16 11:28 ` Giovanni Tessore
                   ` (7 more replies)
  0 siblings, 8 replies; 52+ messages in thread
From: NeilBrown @ 2011-02-16 10:27 UTC (permalink / raw)
  To: linux-raid


Hi all,
 I wrote this today and posted it at
http://neil.brown.name/blog/20110216044002

I thought it might be worth posting it here too...

NeilBrown


-------------------------


It is about 2 years since I last published a road-map[1] for md/raid
so I thought it was time for another one.  Unfortunately quite a few
things on the previous list remain undone, but there has been some
progress.

I think one of the problems with some to-do lists is that they aren't
detailed enough.  High-level design, low level design, implementation,
and testing are all very different sorts of tasks that seem to require
different styles of thinking and so are best done separately.  As
writing up a road-map is a high-level design task it makes sense to do
the full high-level design at that point so that the tasks are
detailed enough to be addressed individually with little reference to
the other tasks in the list (except what is explicit in the road map).

A particular need I am finding for this road map is to make explicit
the required ordering and interdependence of certain tasks.  Hopefully
that will make it easier to address them in an appropriate order, and
mean that I waste less time saying "this is too hard, I might go read
some email instead".

So the following is a detailed road-map for md raid for the coming
months.

[1] http://neil.brown.name/blog/20090129234603

Bad Block Log
-------------

As devices grow in capacity, the chance of finding a bad block
increases, and the time taken to recover to a spare also increases.
So the practice of ejecting a device from the array as soon as a
write-error is detected is getting more and more problematic.

For some time we have avoided ejecting devices for read errors, by
computing the expected data from elsewhere and writing back to the
device - hopefully fixing the read error.  However this cannot help
degraded arrays and they will still eject a device (and hence fail the
whole array) on a single read error.  This is not good.

A particular problem is that when a device does fail and we need to
recover the data, we typically read all of the blocks on all of the
other devices.  If we are going to hit any read errors, this is the most
likely time, and also this is the worst possible time and it will mean
that the recovery doesn't complete and the array gets stuck in a
degraded state and is very susceptible to substantial loss if another
failure happens.

Part of the answer to this is to implement a "bad block log".  This is
a record of blocks that are known to be bad.  i.e. either a read or a
write has recently failed.  Doing this allows us to just eject that
block from the array rather than the whole device.  Similarly instead
of failing the whole array, we can fail just one stripe.  Certainly
this can mean data loss, but the loss of a few K is much less
traumatic than the loss of a terabyte.

But using a bad block list isn't just about keeping the data loss
small, it can be about keeping it to zero.  If we get a write error on
a block in a non-degraded array, then recording the bad block means we
lose redundancy in just that stripe rather than losing it across the
whole array.  If we then lose a different block on a different drive,
the ability to record the bad block means that we can continue without
data loss.  Had we needed to eject both whole drives from the array we
would have lost access to all of our data.

The bad block list must be recorded to stable storage to be useful, so
it really needs to be on the same drives that store the data.  The
bad-block list for a particular device is only of any interest to that
device.  Keeping information about one device on another is pointless.
So we don't have a bad block list for the whole array, we keep
multiple lists, one for each device.

It would be best to keep at least two copies of the bad block list so
that if the place where the list is stored goes bad we can keep
working with the
device.  The same logic applies to other metadata which currently
cannot be duplicated.  So implementing this feature will not address
metadata redundancy.  A separate feature should address metadata
redundancy and it can duplicate the bad block list as well as other
metadata.

There are doubtlessly lots of ways that the bad block list could be
stored, but we need to settle on one.  For externally managed metadata
we need to make the list accessible via sysfs in a generic way so that
a user-space program can store it as appropriate.

So: for v0.90 we choose not to store a bad block list.  There isn't
anywhere convenient to store it and new installations of v0.90 are not
really encouraged.

For v1.x metadata we record in the metadata an offset (from the
superblock) and a size for a table, and a 'shift' value which can be
used to shift from sector addresses to block numbers.  Thus the unit
that is failed when an error is detected can be larger than one
sector.

Each entry in the table is 64 bits in little-endian.  The most
significant 55 bits store a block number which allows for 16 exbibytes
with 512-byte blocks, or more if a larger shift size is used.  The
remaining 9 bits store a length of the bad range which can range from
1 to 512.  As bad blocks can often be consecutive, this is expected to
allow the list to be quite efficient.  A value of all 1's cannot
correctly identify a bad range of blocks and so it is used to pad out
the tail of the list.
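
As a concrete illustration of the entry format just described, here is
a minimal sketch in C of packing and unpacking one 64-bit entry.  The
assumption that the 9-bit field stores 'length - 1' (so that 1-512
fits) is mine; the text does not spell out the exact encoding, so treat
this as illustrative rather than the on-disk format.

  /* Sketch: pack/unpack one bad-block table entry as described above.
   * Assumption: the 9-bit length field stores (length - 1) so that
   * lengths 1..512 fit; the real on-disk encoding may differ.  Entries
   * are stored little-endian on disk; endian conversion is omitted. */
  #include <stdint.h>
  #include <stdio.h>

  #define BB_LEN_BITS  9
  #define BB_LEN_MASK  ((1ULL << BB_LEN_BITS) - 1)   /* 0x1ff */
  #define BB_PAD       (~0ULL)                       /* all 1's pads the list */

  static uint64_t bb_pack(uint64_t block, unsigned int len)
  {
      /* the block number occupies the most significant 55 bits */
      return (block << BB_LEN_BITS) | (((uint64_t)len - 1) & BB_LEN_MASK);
  }

  static void bb_unpack(uint64_t entry, uint64_t *block, unsigned int *len)
  {
      *block = entry >> BB_LEN_BITS;
      *len = (unsigned int)(entry & BB_LEN_MASK) + 1;
  }

  int main(void)
  {
      uint64_t block;
      unsigned int len;
      uint64_t e = bb_pack(123456789ULL, 8);   /* 8 bad blocks from 123456789 */

      bb_unpack(e, &block, &len);
      printf("entry=%#llx block=%llu len=%u\n",
             (unsigned long long)e, (unsigned long long)block, len);
      return e == BB_PAD;    /* a valid entry is never all 1's */
  }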

The bad block list is exposed through sysfs via a directory called
'badblocks' containing several attribute files.

"shift" stores the 'shift' number described above and can be set as
long as the bad block list is empty.

"all" and "unacknowledged" each contains a list of bad ranges, the
start (in blocks, not sectors) and the length (1-512).  Each can also
be written to with a string of the same format as is read out.  This
can be used to add bad blocks to the list or to acknowledge bad
blocks.  Writing effectively say "this bad range is securely recorded
on stable storage".

All bad blocks appear in the "badblocks/all" file.  Only unacknowledged
bad blocks appear in "badblocks/unacknowledged".  These are ranges
which appear to be bad but are not yet known to be recorded on stable
storage.

When md detects a write error or a read error which it cannot correct,
it adds the block to the list and marks the range that it is part of as
'unacknowledged'.  Any write that depends on this block is then
blocked until the range is acknowledged.  This ensures that an
application isn't told that a write has succeeded until the data
really is safe.

If the bad block list is being managed by v1.x metadata internally,
then the bad block list will be written out and the ranges will be
acknowledged and writes unblocked automatically.

If the bad block list is being managed externally, then the bad ranges
will be reported in "badblocks/unacknowledged".  The metadata handler
should read this, update the on-disk metadata and write the range back
to "badblocks/all".  This completes the acknowledgment handshake and
writes can continue.
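
A rough user-space sketch, in C, of the handshake just described for an
externally managed metadata handler: read the unacknowledged ranges,
persist them (not shown), and write each range back to acknowledge it.
The path shown (a 'badblocks' directory under a per-device dev-XXXX
directory) and the exact read/write format are assumptions based on the
text, not a definitive interface.

  /* Sketch of the acknowledge handshake for externally managed metadata.
   * Path and "start length" format are assumptions for illustration. */
  #include <stdio.h>

  #define BB_DIR "/sys/block/md0/md/dev-sda/badblocks"   /* hypothetical */

  int main(void)
  {
      char line[128];
      unsigned long long start, len;
      FILE *in = fopen(BB_DIR "/unacknowledged", "r");
      FILE *out = fopen(BB_DIR "/all", "w");

      if (!in || !out)
          return 1;
      while (fgets(line, sizeof(line), in)) {
          if (sscanf(line, "%llu %llu", &start, &len) != 2)
              continue;
          /* ...update the on-disk metadata here, then acknowledge... */
          fprintf(out, "%llu %llu\n", start, len);
          fflush(out);
      }
      fclose(in);
      fclose(out);
      return 0;
  }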

RAID1, RAID10 and RAID456 should all support bad blocks.  Every read
or write should perform a lookup of the bad block list.  If a read
finds a bad block, that device should be treated as failed for that
read.  This includes reads that are part of resync or recovery.

If a write finds a bad block there are two possible responses.  Either
the block can be ignored as with reads, or we can try to write the
data in the hope that it will fix the error.  Always taking the second
action would seem best as it allows blocks to be removed from the
bad-block list, but as a failing write can take a long time, there are
plenty of cases where it would not be good.

To choose between these we make the simple decision that once we see a
write error we never try to write to bad blocks on that device again.
This may not always be the perfect strategy, but it will effectively
address common scenarios.  So if a block was marked bad due to a
read error when the array was degraded, then a write (presumably from
the filesystem) will have the opportunity to correct the error.
However if it was marked bad due to a write error we don't risk paying
the penalty of more write errors.

This 'have seen a write error' status is not stored in the array
metadata.  So when restarting an array with some bad blocks, each
device will have one chance to prove that it can correctly handle
writes to a bad block.  If it can, the bad block will be removed from
the list and the data is that little bit safer.  If it cannot, no
further writes to bad blocks will be tried on the device until the
next array restart.
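
The policy described in the last few paragraphs boils down to a small
amount of per-device state.  A sketch, with invented names, of the
decision logic (not the actual md code):

  /* Illustrative decision logic for I/O that hits a known-bad range.
   * 'seen_write_error' is deliberately kept only in memory, so each
   * device gets one fresh chance to fix its bad blocks per restart. */
  #include <stdbool.h>
  #include <stdio.h>

  struct dev_state {              /* hypothetical per-device state */
      bool seen_write_error;
  };

  /* Reads never use a bad range: the device is failed for that read. */
  static bool may_read_bad_range(const struct dev_state *dev)
  {
      (void)dev;
      return false;
  }

  /* Writes re-try a bad range only until the first write error. */
  static bool may_write_bad_range(const struct dev_state *dev)
  {
      return !dev->seen_write_error;
  }

  static void note_write_result(struct dev_state *dev, bool ok)
  {
      if (!ok)
          dev->seen_write_error = true;
      /* on success the range would be removed from the bad block list */
  }

  int main(void)
  {
      struct dev_state dev = { .seen_write_error = false };

      printf("read: %d write: %d\n",
             may_read_bad_range(&dev), may_write_bad_range(&dev));
      note_write_result(&dev, false);           /* a re-write just failed */
      printf("write after failure: %d\n", may_write_bad_range(&dev));
      return 0;
  }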


Hot Replace
-----------

"Hot replace" is my name for the process of replacing one device in an
array by another one without first failing the one device.  Thus there
can be two devices in an array filling the same 'role'.  One device
will contain all of the data, the other device will only contain some
of it and will be undergoing a 'recovery' process.  Once the second
device is fully recovered it is expected that the first device will be
removed from the array.

This can be useful whenever you want to replace a working device with
another device, without letting the array go degraded.  Two obvious
cases are:
 1/ when you want to replace a smaller device with a larger device
 2/ when you have a device with a number of bad blocks and want to
    replace it with a more reliable device.

For '2' to be realised, the bad block log described above must be
implemented, so it should be completed before this feature.

Hot replace is really only needed for RAID10 and RAID456.  For RAID1,
simply increasing the number of devices in the array while the new
device recovers, then failing the old device and decreasing the number
of devices in the array is sufficient.

For RAID0 or LINEAR it would be sufficient to:
 - stop the array
 - make a RAID1 without superblocks for the old and new device
 - re-assemble the array using the RAID1 in place of the old device.

This is certainly not as convenient but is sufficient for a case that
is not likely to be commonly needed.

So for both the RAID10 and RAID456 modules we need:
 - the ability to add a device as a hot-replace device for a specific
   slot
 - the ability to record hot-replace status in the metadata.
 - a 'recovery' process to rebuild a device, preferably only reading
   from the device to be replaced, though reading from elsewhere when
   needed
 - writes to go to both primary and secondary device.
 - Reads to come from either if the second has recovered far enough.
 - to promote a secondary device to primary when the primary device
   (that has a hot-replace device) fails.

It is not clear whether the primary should be automatically failed
when the rebuild of the secondary completes.  Commonly this would be
ideal, but if the secondary experienced any write errors (that were
recorded in the bad block log) then it would be best to leave both in
place until the sysadmin resolves the situation.   So in the first
implementation this failing should not be automatic.

The identification of a spare as a 'hot-replace' device is achieved
through the 'md/dev-XXXX/slot' sysfs attribute.  This is usually
'none' or a small integer identifying which slot in the array is
filled by this device.  If a number followed by a plus (e.g. '1+') is
written, then the device takes the role of a hot-replace device for
that slot.  This syntax requires there be at most one hot-replace
device per slot.  This is a
deliberate decision to manage complexity in the code.  Allowing more
would be of minimal value but require substantial extra complexity.
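
A sketch of how the 'slot' syntax above might be parsed ("none", a
plain number, or a number followed by '+'); this is only an
illustration of the idea, not the actual kernel or mdadm parser.

  /* Sketch: interpret a value written to md/dev-XXXX/slot.
   * "none" -> no slot; "3" -> ordinary member of slot 3;
   * "3+"  -> hot-replace device for slot 3.  Illustrative only. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  struct slot_request {
      int slot;           /* -1 means "none" */
      int hot_replace;    /* non-zero if the '+' suffix was present */
  };

  static int parse_slot(const char *buf, struct slot_request *req)
  {
      char *end;

      if (strncmp(buf, "none", 4) == 0) {
          req->slot = -1;
          req->hot_replace = 0;
          return 0;
      }
      req->slot = (int)strtol(buf, &end, 10);
      if (end == buf || req->slot < 0)
          return -1;                     /* not a valid slot number */
      req->hot_replace = (*end == '+');  /* at most one hot-replace per slot */
      return 0;
  }

  int main(void)
  {
      struct slot_request r;

      if (parse_slot("1+", &r) == 0)
          printf("slot=%d hot_replace=%d\n", r.slot, r.hot_replace);
      return 0;
  }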

v0.90 metadata is not supported.  v1.x sets a 'feature bit' on the
superblock of any 'hot-replace' device and naturally records in
'recover_offset' how far recovery has progressed.  Externally managed
metadata can support this, or not, as they choose.


Reversible Reshape
------------------

It is possible to start a reshape that cannot be reversed until the
reshape has completed.  This is occasionally problematic.  While we
might hope that users would never make errors, we should try to be as
forgiving as possible.

Reversing a reshape that changes the number of data-devices is
possible as we support both growing and shrinking and these happen in
opposite directions so one is the reverse of the other.  Thus at worst,
such a reshape can be reversed by:
 - stopping the array
 - re-writing the metadata so it looks like the change is going in the
   other direction
 - restarting the array.

However for a reshape that doesn't change the number of data devices,
such as a RAID5->RAID6 conversion or a change of chunk-size, reversal
is currently not possible as the change always goes in the same
direction.

This is currently only meaningful for RAID456, though at some later
date it might be relevant for RAID10.

A future change will make it possible to move the data_offset while
performing a reshape, and that will sometimes require the reshape to
progress in a certain direction.  It is only when the data_offset is
unchanged and the number of data disks is unchanged that there is any
doubt about direction.  In that case it needs to be explicitly stated.

We need:
 - some way to record in the metadata the direction of the reshape
 - some way to ask for a reshape to be started in the reverse
   direction
 - some way to reverse a reshape that is currently happening.

We have a new sysfs attribute "reshape_direction" which is
"low-to-high" or "high-to-low".  This defaults to "low-to-high" but
will be forced to "high-to-low" if the particular reshape requires it,
or can be explicitly set by a 'write' before the reshape commences.

Once the reshape has commenced, writing a new value to this field can
flip the reshape causing it to be reverted.

In both v0.90 and v1.x metadata we record a reversing reshape by
setting the most significant bit in reshape_position.  For v0.90 we
also increase the minor number to 91.  For v1.x we set a feature bit
as well.
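
A small sketch of how a reversed reshape could be recorded in the most
significant bit of reshape_position, as described above.  The helper
names are invented and the exact encoding is illustrative; the real
metadata code may differ in detail.

  /* Sketch: flag "reshape running high-to-low" in the top bit of a
   * 64-bit reshape_position field.  Illustrative names only. */
  #include <stdint.h>
  #include <stdio.h>

  #define RESHAPE_BACKWARDS (1ULL << 63)

  static uint64_t encode_reshape_pos(uint64_t sector, int backwards)
  {
      return backwards ? (sector | RESHAPE_BACKWARDS) : sector;
  }

  static uint64_t decode_reshape_pos(uint64_t raw, int *backwards)
  {
      *backwards = (raw & RESHAPE_BACKWARDS) != 0;
      return raw & ~RESHAPE_BACKWARDS;
  }

  int main(void)
  {
      int rev;
      uint64_t raw = encode_reshape_pos(1048576, 1);   /* high-to-low */
      uint64_t pos = decode_reshape_pos(raw, &rev);

      printf("position=%llu reversed=%d\n", (unsigned long long)pos, rev);
      return 0;
  }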


Change data offset during reshape
---------------------------------

One of the biggest problems with reshape currently is the need for the
backup file.  This is a management problem as it cannot easily be
found at restart, and it is a performance problem as the extra writing
is expensive.

In some cases we can avoid the need for a backup file completely by
changing the data-offset.  i.e. the location on the devices where the
array data starts.

For reshapes that increase the number of devices, only a single backup
is required at the start.  If the data_offset is moved just one chunk
earlier we can do without a separate backup.  This obviously requires
that space was left when the array was first created.  Recent versions
of mdadm do leave some space with the default metadata, though more
would probably be good.

For reshapes that decrease the number of devices, only a small backup
is required right at the end of the process (at the beginning of the
devices).  If we move the data_offset forward by one chunk that backup
too can be avoided.  As we are normally reducing the size of the array
in this process, we just need to reduce it a little bit more.

For reshapes that neither increase nor decrease the number of devices a
somewhat larger change in data_offset is needed to get reasonable
performance.  A single chunk (of the larger chunk size) would work,
but would require updating the metadata after each chunk which would
be prohibitively slow unless chunks were very large.  A few megabytes
is probably sufficient for reasonable performance, though testing
would be helpful to be sure.  Current mdadm leaves no space at the
start of 1.0, and about 1Meg at the start of 1.1 and 1.2 arrays.

This will generally not be enough space.  In these cases it will
probably be best to perform the reshape in the reverse direction
(helped by the previous feature).  This will probably require
shrinking the filesystem and the array slightly first.  Future
versions of mdadm should aim to leave a few megabytes free at start
and end to make these reshapes work better.

Moving the data offset is not possible for 0.90 metadata as it does
not record a data offset.

For 1.x metadata it is possible to have a different data_offset on
each device.  However for simplicity we will only support changing the
data offset by the same amount on each device.  This amount will be
stored in currently-unused space in the 1.x metadata.  There will be a
sysfs attribute "delta_data_offset" which can be set to a number of
sectors - positive or negative - to request a change in the data
offset and thus avoid the need for a backup file.
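
Setting the proposed 'delta_data_offset' attribute from user space
could look like the sketch below.  The attribute name comes from the
text; the sysfs path prefix and the example value are assumptions.

  /* Sketch: request a signed data_offset change (in sectors) before a
   * reshape so no backup file is needed.  Path is an assumption. */
  #include <stdio.h>

  static int set_delta_data_offset(const char *md_dir, long long sectors)
  {
      char path[256];
      FILE *f;

      snprintf(path, sizeof(path), "%s/delta_data_offset", md_dir);
      f = fopen(path, "w");
      if (!f)
          return -1;
      fprintf(f, "%lld\n", sectors);     /* positive or negative */
      return fclose(f);
  }

  int main(void)
  {
      /* e.g. move the data offset back by 1024 sectors (512K) */
      return set_delta_data_offset("/sys/block/md0/md", -1024) ? 1 : 0;
  }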


Bitmap of non-sync regions.
---------------------------

There are a couple of reasons for having regions of an array that are known
not to contain important data and are known to not necessarily be
in-sync.

1/ When an array is first created it normally contains no valid data.
   The normal process of a 'resync' to make all parity/copies correct
   is largely a waste of time.
2/ When the filesystem uses a "discard" command to report that a
   region of the device is no-longer used it would be good to be able
   to pass this down to the underlying devices.  To do this safely we
   need to record at the md level that the region is unused so we
   don't complain about inconsistencies and don't try to re-sync the
   region after a crash.

If we record which regions are not in-sync in a bitmap then we can meet
both of these needs.

A read to a non-in-sync region would always return 0s.
A 'write' to a non-in-sync region should cause that region to be
resynced.  Writing zeros would in some sense be ideal, but to do that
we would have to block the write, which would be unfortunate.  As the
fs should not be reading from that area anyway, it shouldn't really
matter.

The granularity of each bit is probably quite hard to get right.
Having it match the block size would mean that no resync would be
needed and that every discard request could be handled exactly.
However it could result in a very large bitmap - 30 Megabytes for a 1
terabyte device with a 4K block size.  This would need to be kept in
memory and looked up for every access, which could be problematic.

Having a very coarse granularity would make storage and lookups more
efficient.  If we make sure the bitmap would fit in 4K, we would have
about 32 megabytes per bit.  This would mean that each time we
triggered a resync it would resync for a second or two which is
probably a reasonable time as it wouldn't happen very often.  But it
would also mean that we can only service a 'discard' request if it
covers whole blocks of 32 megabytes, and I really don't know how
likely that is.  Actually I'm not sure anyone knows; the jury still
seems to be out on how 'discard' will work long-term.

So probably aiming for a few K to a few hundred K seems reasonable.
That means that the in-memory representation will have to be a
two-level array.  A page of pointers to other pages can cover (on a
64bit system) 512 pages or 2Meg of bitmap space which should be
enough.
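
A minimal in-memory sketch of that two-level arrangement: one page of
pointers, each referring to a lazily allocated 4K page of bits, where a
set bit means "not in sync".  Sizes and names are illustrative only,
not the kernel implementation.

  /* Sketch of a two-level non-sync bitmap: a single page of pointers to
   * 4K bitmap pages, allocated on demand.  On a 64-bit system one
   * pointer page covers 512 bitmap pages, i.e. 2Meg of bitmap, as in
   * the text. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define PAGE_SIZE     4096
  #define PTRS_PER_PAGE (PAGE_SIZE / sizeof(void *))   /* 512 on 64-bit */
  #define BITS_PER_PAGE (PAGE_SIZE * 8)                /* 32768 bits */

  struct nonsync_bitmap {
      unsigned char *pages[PTRS_PER_PAGE];   /* lazily allocated 4K pages */
  };

  static int ns_set(struct nonsync_bitmap *bm, unsigned long chunk)
  {
      unsigned long page = chunk / BITS_PER_PAGE;
      unsigned long bit  = chunk % BITS_PER_PAGE;

      if (page >= PTRS_PER_PAGE)
          return -1;
      if (!bm->pages[page]) {
          bm->pages[page] = calloc(1, PAGE_SIZE);
          if (!bm->pages[page])
              return -1;
      }
      bm->pages[page][bit / 8] |= 1u << (bit % 8);
      return 0;
  }

  static int ns_test(const struct nonsync_bitmap *bm, unsigned long chunk)
  {
      unsigned long page = chunk / BITS_PER_PAGE;
      unsigned long bit  = chunk % BITS_PER_PAGE;

      if (page >= PTRS_PER_PAGE || !bm->pages[page])
          return 0;                /* unallocated page: treated as in-sync */
      return (bm->pages[page][bit / 8] >> (bit % 8)) & 1;
  }

  int main(void)
  {
      struct nonsync_bitmap bm;

      memset(&bm, 0, sizeof(bm));
      ns_set(&bm, 123456);         /* mark one chunk as not in sync */
      printf("%d %d\n", ns_test(&bm, 123456), ns_test(&bm, 7));
      return 0;
  }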

As always we need a way to:
 - record the location and size of the bitmap in the metadata
 - allow the granularity to be set via sysfs
 - allow bits to be set via sysfs, and allow the current bitmap to
   be read via sysfs.

For v0.90 metadata we won't support this as there is no room.  We
could possibly store about 32 bytes directly in the superblock
allowing for 4Gig sections but this is unlikely to be really useful.

For v1.x metadata we use 8 bytes from the 'array state info'.  4 bytes
give an offset from the metadata of the start of the bitmap, 2 bytes
give the space reserved for the bitmap (max 32Meg) and 2 bytes give a
shift value from sectors to in-sync chunks.  The actual size of the
bitmap must be computed from the known size of the array and the size
of the chunks.
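
Given those fields, the required bitmap size is simple arithmetic; a
short sketch with invented names, shown only to make the shift-based
sizing concrete (the numbers are examples, not real metadata values):

  /* Sketch: bytes of non-sync bitmap needed for 'array_sectors' sectors
   * with the given sector-to-chunk shift, checked against the space
   * reserved in the metadata. */
  #include <stdint.h>
  #include <stdio.h>

  static uint64_t nonsync_bitmap_bytes(uint64_t array_sectors, unsigned shift)
  {
      uint64_t chunks = (array_sectors + (1ULL << shift) - 1) >> shift;
      return (chunks + 7) / 8;                 /* one bit per chunk */
  }

  int main(void)
  {
      uint64_t sectors = 2ULL * 1000 * 1000 * 1000;  /* ~1TB of 512B sectors */
      unsigned shift = 16;                           /* 32Meg (65536-sector) chunks */
      uint64_t need = nonsync_bitmap_bytes(sectors, shift);
      uint64_t reserved = 4096;                      /* example reserved space */

      printf("need %llu bytes, reserved %llu: %s\n",
             (unsigned long long)need, (unsigned long long)reserved,
             need <= reserved ? "ok" : "too small");
      return 0;
  }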

We present the bitmap in sysfs similarly to the way we present the bad
block list.  A file 'non-sync/regions' contains the start and size of
regions (measured in sectors) that are known to not be in-sync.  A file
'non-sync/now-in-sync' lists ranges that actually are in sync but are
still recorded as not-in-sync.  User-space reads 'now-in-sync', updates
the metadata, and writes the ranges to 'regions'.

Another file 'non-sync/to-discard' lists ranges for which a discard
request has been made.  These need to be recorded in the metadata.
They are then written back to the file which allows the discard
request to complete.

The granularity can be set via sysfs by writing to
'non-sync/chunksize'.


Assume-clean when increasing array --size
-----------------------------------------

When a RAID1 is created, --assume-clean can be given so that the
largely-unnecessary initial resync can be avoided.  When extending the
size of an array with --grow --size=, there is no way to specify
--assume-clean.

If a non-sync bitmap (see above) is configured this doesn't matter,
as the extra space will simply be marked as non-in-sync.
However if a non-sync bitmap is not supported by the metadata or is
not configured it would be good if md/raid1 can be told not to sync
the extra space - to assume that it is in-sync.

So when a non-sync bitmap is not configured (the chunk-size is zero),
writing to the non-sync/regions file tells md that we don't care about the
region being in-sync.  So the sequence:
 - freeze sync_action
 - update size
 - write range to non-sync/regions
 - unfreeze sync_action

will effect a "--grow --size=bigger --assume-clean" reshape.
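
A sketch of that sequence driven from user space through sysfs.
'sync_action' is the attribute named in the text; the 'component_size'
and 'non-sync/regions' names, the path prefix and the example values
are assumptions or placeholders, not a definitive interface.

  /* Sketch of "--grow --size=bigger --assume-clean" via sysfs.
   * Paths, attribute values and the range written are illustrative. */
  #include <stdio.h>

  static int write_attr(const char *dir, const char *attr, const char *val)
  {
      char path[256];
      FILE *f;

      snprintf(path, sizeof(path), "%s/%s", dir, attr);
      f = fopen(path, "w");
      if (!f)
          return -1;
      fprintf(f, "%s\n", val);
      return fclose(f);
  }

  int main(void)
  {
      const char *md = "/sys/block/md0/md";        /* hypothetical array */

      write_attr(md, "sync_action", "frozen");     /* freeze sync_action */
      write_attr(md, "component_size", "976762584"); /* new size (placeholder) */
      /* mark the newly added region (placeholder "start length" in
       * sectors) as not needing a resync */
      write_attr(md, "non-sync/regions", "3907028992 1953514496");
      write_attr(md, "sync_action", "idle");       /* unfreeze */
      return 0;
  }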


Enable 'reshape' to also perform 'recovery'.
--------------------------------------------

As a 'reshape' re-writes all the data in the array it can quite easily
be used to recover to a spare device.  Normally these two operations
would happen separately.  However if a device fails during a reshape
and a spare is available it makes sense to combine them.

Currently if a device fails during a reshape (leaving the array
degraded but functional) the reshape will continue and complete.  Then
if a spare is available it will be recovered.  This means a longer
total time until the array is optimal.

When the device fails, the reshape actually aborts, and then restarts
from where it left off.  If instead we allow spares to be added
between the abort and the restart, and cause the 'reshape' to actually
do a recovery until it reaches the point where it was already up to,
then we minimise the time to getting an optimal array.


When reshaping an array to fewer devices, allow 'size' to be increased
--------------------------------------------------------------------

The 'size' of an array is the amount of space on each device which is
used by the array.  Normally the 'size' of an array cannot be set
beyond the amount of space available on the smallest device.

However when reshaping an array to have fewer devices it can be useful
to be able to set the 'size' to be the smallest of the remaining
devices - those that will still be in use after the reshape.

Normally reshaping an array to have fewer devices will make the array
size smaller.  However if we can simultaneously increase the size of
the remaining devices, the array size can stay unchanged or even grow.

This can be used after replacing (ideally using hot-replace) a few
devices in the array with larger devices.  The net result will be a
similar amount of storage using fewer drives, each larger than before.

This should simply be a case of allowing size to be set larger when
delta_disks is negative.  It also requires that when converting the
excess devices to spares, we fail them if they are smaller than the new
size.

As a reshape can be reversed, we must make sure to revert the size
change when reversing a reshape.

Allow write-intent-bitmap to be added to an array during reshape/recovery.
--------------------------------------------------------------------------

Currently it is not possible to add a write-intent-bitmap to an array
that is being reshaped/resynced/recovered.  There is no real
justification for this, it was just easier at the time.

Implementing this requires a review of all code relating to the
bitmap, checking that a bitmap appearing - or disappearing - during
these processes will not be a problem.  As the array is quiescent when
the bitmap is added, no IO will actually be happening so it *should*
be safe.

This should also allow a reshape to be started while a bitmap is
present, as long as the reshape doesn't change the implied size of the
bitmap.

Support resizing of write-intent-bitmap prior to reshape
--------------------------------------------------------

When we increase the 'size' of an array (the amount of the device
used), that implies a change in size of the bitmap.  However the
kernel cannot unilaterally resize the bitmap as there may not be room.

Rather, mdadm needs to be able to resize the bitmap first.  This
requires the sysfs interface to expose the size of the bitmap - which
is currently implicit.

Whether the bitmap coverage is increased by increasing the number of
bits or increasing the chunk size, some updating of the bitmap storage
will be necessary (particularly in the second case).

So it makes sense to allow user-space to remove the bitmap then add a
new bitmap with a different configuration.  If there is concern about
a crash between these two, writes could be suspended for the (short)
duration.

Currently the 'sync_size' stored in the bitmap superblock is not used.
We could stop updating that, and could allow the bitmap to
automatically extend up to that boundary.

So: we have a well defined 'sync_size' which can be set via the
superblock or via sysfs.  A resize is permitted as long as there is no
bitmap, or the existing bitmap has a sufficiently large sync_size.

Support reshape of RAID10 arrays.
---------------------------------

RAID10 arrays currently cannot be reshaped at all.  It is possible to
convert a 'near' mode RAID10 to RAID0, but that is about all.   Some
real reshape is possible and should be implemented.

1/ A 'near' or 'offset' layout can have the device size changed quite
   easily.

2/ Device size of 'far' arrays cannot be changed easily.  Increasing device
   size of 'far' would require re-laying out a lot of data.  We would
   need to record the 'old' and 'new' sizes which metadata doesn't
   currently allow.  If we spent 8 bytes on this we could possibly
   manage a 'reverse reshape' style conversion here.

3/ Increasing the number of devices is much the same for all layouts.
   The data needs to be copied to the new location.  As we currently
   block IO while recovery is actually happening, we could just do
   that for reshape as well, and make sure reshape happens in whole
   chunks at a time (or whatever turns out to be the minimum
   recordable unit).  We switch to 'clean' before doing any reshape so
   a write will switch to 'dirty' and update the metadata.

4/ decreasing the number of devices is very much the reverse of
   increasing..
   Here is a weird thought:  We have introduced the idea that we can
   increase the size of remaining devices when we decrease the number
   of devices in the array.  For 'raid10-far', the re-layout for
   increasing the device size is very much like that for decreasing
   the number of devices - just that the number doesn't actually
   decrease.

5/ changing layouts between 'near' and 'offset' should be manageable
   providing enough 'backup' space is available.  We simply copy
   a few chunks worth of data and move reshape_position.

6/ changing layout to or from 'far' is nearly impossible...
   With a change in data_offset it might be possible to move one
   stripe at a time, always into the place just vacated.
   However keeping track of where we are and where it is safe to read
   from would be a major headache - unless it fell out with some
   really neat maths, which I don't think it does.
   So this option will be left out.


So the only 'instant' conversion possible is to increase the device
size for 'near' and 'offset' arrays.

'reshape' conversions can modify chunk size, increase/decrease number of
devices and swap between 'near' and 'offset' layout providing a
suitable number of chunks of backup space is available.

The device-size of a 'far' layout can also be changed by a reshape
providing the number of devices is not increased.


Better reporting of inconsistencies.
------------------------------------

When a 'check' finds a data inconsistency it would be useful if it
was reported.   That would allow a sysadmin to try to understand the
cause and possibly fix it.

One simple approach would be to log all inconsistencies through
the kernel logs.  This would have to be limited to 'check' and
possibly 'repair' passes, as logging during a 'sync' pass (which also
finds inconsistencies) can be expected to be very noisy.

Another approach is to use a sysfs file to export a list of
addresses.  This would place some upper limit on the number of
addresses that could be listed, but if there are more inconsistencies
than that limit, then the details probably aren't all that important.

It makes sense to follow both of these paths.
 - some easy-to-parse logging of inconsistencies found.
 - a sysfs file that lists as many inconsistencies as possible.

Each inconsistency is listed as a simple sector offset.  For
RAID4/5/6, it is an offset from the start of data on the individual
devices.  For RAID1 and RAID10 it is an offset from the start of the
array.  So this can only be interpreted with a full understanding of
the array layout.

The actual inconsistency may be in some sector immediately following
the given sector as md performs checks in blocks larger than one
sector and doesn't bother refining.  So a process that uses this
information should read forward from the address to make sure it has
found all of the inconsistency.  For striped arrays, at most 1 chunk
need be examined.  For non-striped (i.e. RAID1) the window size is
currently 64K.  The actual size can be found by dividing
'mismatch_cnt' by the number of entries in the mismatch list.
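
A tiny sketch of that calculation from a user-space tool's point of
view; the numbers are placeholders for illustration only.

  /* Sketch: derive how many sectors to examine after each reported
   * mismatch address, as described above.  Purely illustrative. */
  #include <stdio.h>

  int main(void)
  {
      unsigned long long mismatch_cnt = 384;   /* e.g. from md/mismatch_cnt */
      unsigned long long entries = 3;          /* lines in the mismatch list */
      unsigned long long addr = 104857600;     /* one reported address */
      unsigned long long window;

      if (entries == 0)
          return 0;
      window = mismatch_cnt / entries;         /* sectors per reported address */
      printf("examine sectors %llu .. %llu\n", addr, addr + window - 1);
      return 0;
  }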

This has no dependencies on other features.  It relates slightly to
the bad-block list as one way of dealing with an inconsistency is to
tell md that a selected block in the stripe is 'bad'.


* Re: md road-map: 2011
  2011-02-16 10:27 md road-map: 2011 NeilBrown
@ 2011-02-16 11:28 ` Giovanni Tessore
  2011-02-16 13:40   ` Roberto Spadim
  2011-02-16 14:13 ` Joe Landman
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 52+ messages in thread
From: Giovanni Tessore @ 2011-02-16 11:28 UTC (permalink / raw)
  To: linux-raid

Hi Neil,
I appreciate the Bad Block Log feature very much, as I have had big
troubles with read errors during recovery of a degraded RAID-5 array.
It seems to me a very good idea to just fail a stripe or even a single
block (the smallest possible unit of information) if the read error is
unrecoverable, leaving the remaining 99.99..% of the device still
online and available (that is, returning the unrecoverable read error
to the 'caller' as a single disk would).
Also having the list of bad blocks available in sysfs is a very useful
feature.

Still regarding correctable read errors, how are they currently
managed with RAID-1?  If a read error occurs on sector XYZ of disk A, is
the same sector XYZ read from another disk (chosen randomly) in the same
array and rewritten to disk A?  (for RAID456 it's reconstructed from
parity, and that's clearly much safer).

Regards.


On 02/16/2011 11:27 AM, NeilBrown wrote:
> I all,
>   I wrote this today and posted it at
> http://neil.brown.name/blog/20110216044002
>
> I thought it might be worth posting it here too...
>
> NeilBrown


-- 
Cordiali saluti.
Yours faithfully.

Giovanni Tessore




* Re: md road-map: 2011
  2011-02-16 11:28 ` Giovanni Tessore
@ 2011-02-16 13:40   ` Roberto Spadim
  2011-02-16 14:00     ` Robin Hill
  0 siblings, 1 reply; 52+ messages in thread
From: Roberto Spadim @ 2011-02-16 13:40 UTC (permalink / raw)
  To: Giovanni Tessore; +Cc: linux-raid

I agree with Giovanni.  Another question: since we will make a lot of
changes to the mirror-based arrays (raid1, raid10) with the bad-block
list, could we:

option1) remove the raid1 code, change raid10 to work without the
raid0 'service', and change raid10 to work with more than one mirror
(like raid1 does)?
option2) port the raid10 layouts to raid1?

raid10 can do the same job as raid1 if we don't use the 0 (stripe)
feature.  raid1 has the write-behind and 'many mirrors' features, but
doesn't have layout/offset.

A raid1 with an offset layout could improve read performance of a
raid1 array a lot.  I'm not good at English, so I will explain with an
example:

raid1 (/dev/md0) with 2 mirrors (/dev/sda,/dev/sdb)

/dev/md0 sector1 on /dev/sda = sector1
/dev/md0 sector1 on /dev/sdb = sector2 (or another offset)

reading sector 1 and 2 from /dev/md0:

considering current disks (/dev/sda,/dev/sdb) head positions=0
read sector1 from /dev/sda (distance from sda=0, distance from sdb=1)
read sector2 from /dev/sdb (distance from sda=1, distance from sdb=0)

Note that I don't need more size (raid0); I just need a layout/offset
to make reads faster.
Other layouts could help too: odd sectors at the start of disk1, even
sectors at the end of disk1; even sectors at the start of disk2, odd
sectors at the end of disk2.

I don't know which is more time consuming (in the short and long
term), option1 or option2?



=================
ps1:
I made some benchmarks with an SSD-only array; a round-robin read
balance is faster than near head.  Let me explain my conclusion:
    near head isn't good on devices where sequential/non-sequential
reads have the same speed; it doesn't know anything about device read
speed
    the near_head algorithm picks one disk and uses it for a big
sequential read (some devices aren't good at sequential reads, or the
time for sequential/non-sequential can be the same for some SSD
devices)
    summary: a mixed-speed SSD-only array is poorly optimized with
near head, because:
        1) near head doesn't know anything about the read rate of the
devices; a round robin with a per-mirror max counter resolves this
problem
        2) for some SSDs the access time for sequential/non-sequential
reads is the same (near 0.1ms)

For a mixed array (Andreas Korn's email) using time-based balancing
with a Corsair SSD (~100MB/s, <0.1ms access time) and 2 Barracuda
7200rpm hard disks (~130MB/s, 0 access time for sequential, <=8ms
access time for non-sequential) we don't gain a lot, since the
performance benefit of raid0 is just due to the layout/offset feature;
there was only about a +1% read speed improvement using time-based
reads (maybe within the margin of error, maybe not; iozone takes a lot
of time to benchmark and we didn't have more time to test).


==============
ps2/explanations:
'time based' isn't a read_balance algorithm in today's kernel; it's a
patch that I'm testing (www.spadim.com.br/raid1 for kernel 2.6.37).
Time based uses the near_head idea with some more information to
select the best disk:

time to move head ( (near_head distance * per-mirror head move speed)
+ (fixed sequential speed if sequential) + (fixed non-sequential speed
if non-sequential) )
+
time to read (sectors to read * read_rate, something like: 130MB/s =
3.7560096153e-6 seconds / sector)
+
time to end queue (sum of reads*read_rate + sum of writes*write_rate +
time to move head(first read/write sector - last read/write sector),
assuming the disk queue (scheduler/elevators) does a good job and
moves the head just once; in future versions, when the elevator can
report a time estimate, we could just use it and remove this math
from the md code), not yet implemented

time based works like near_head if:
read_rate=0, write_rate=0, head_move_speed=1,
fixed_sequencial_speed=0, fixed_nonsequencial_speed=0
on all mirrors
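
A rough C sketch of the cost formula described above (not the actual
patch; the names are invented and the not-yet-implemented queue term is
omitted):

  /* Illustrative per-mirror cost estimate following the formula above.
   * Field names are invented; this is not the real read_balance code. */
  #include <stdio.h>

  struct mirror_params {
      double head_move_speed;    /* seconds per sector of head distance */
      double fixed_seq_speed;    /* fixed term added for sequential reads */
      double fixed_nonseq_speed; /* fixed term added for non-sequential reads */
      double read_rate;          /* seconds per sector read */
  };

  static double estimate_cost(const struct mirror_params *p,
                              long long head_distance, int sequential,
                              long long sectors)
  {
      double seek = (double)head_distance * p->head_move_speed
                    + (sequential ? p->fixed_seq_speed : p->fixed_nonseq_speed);
      double read = (double)sectors * p->read_rate;

      return seek + read;        /* queue term omitted: not yet implemented */
  }

  int main(void)
  {
      /* with these values the estimate reduces to plain near_head */
      struct mirror_params near_head = { 1.0, 0.0, 0.0, 0.0 };

      printf("%f\n", estimate_cost(&near_head, 42, 1, 8));
      return 0;
  }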





2011/2/16 Giovanni Tessore <giotex@texsoft.it>:
> Hi Neil,
> I apreciate very much the Bad Block Log feature, as I had big troubles with
> read errors during recovery of a degraded RAID-5 array.
> It seems to me a very good idea to just fail a stripe or even a single block
> (the smallest possible unit of information possibly) if the read error is
> unrecoverable, letting the remainig 99.99..% of the device still online and
> available (that is, return the unrecoverable read error to the 'caller' as
> would do a single disk).
> Also having the list of bad block availabe into sysfs is a very useful
> feature.
>
> Still regarding to correctable read errors, how are they currently managed
> with RAID-1? If a read error occurs on sector XZY of disk A, the same sector
> XYZ is get from another disk (ramdomly) in the same array and rewritten to
> disk A? (for RAID456 it's reconstructed from parity, and it's clearly much
> safer).
>
> Regards.
>
>
> On 02/16/2011 11:27 AM, NeilBrown wrote:
>>
>> I all,
>>  I wrote this today and posted it at
>> http://neil.brown.name/blog/20110216044002
>>
>> I thought it might be worth posting it here too...
>>
>> NeilBrown
>
>
> --
> Cordiali saluti.
> Yours faithfully.
>
> Giovanni Tessore
>
>
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial


* Re: md road-map: 2011
  2011-02-16 13:40   ` Roberto Spadim
@ 2011-02-16 14:00     ` Robin Hill
  2011-02-16 14:09       ` Roberto Spadim
  0 siblings, 1 reply; 52+ messages in thread
From: Robin Hill @ 2011-02-16 14:00 UTC (permalink / raw)
  To: Roberto Spadim; +Cc: Giovanni Tessore, linux-raid


On Wed Feb 16, 2011 at 10:40:40AM -0300, Roberto Spadim wrote:

> i agree with giovanni, another question since we will make a lot of
> change on mirrors based arrays (raid1, raid10) with the badblock list,
> could we:
> 
> option1) remove raid1 code, change raid10 to work without raid0
> 'service', change raid10 to work with more than 1mirror (like raid1
> do)?
> option2) port raid10 layout to raid1?
> 
You can already do option1 (if I'm understanding you correctly).  A
RAID10 array can use as many mirrors as you like (--layout n2 is the
default, meaning a near layout with 2 replicas, using n4 would give 4
replicas), and as long as the number of replicas is equal to the number
of devices, there should be no striping involved in the process.

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |



* Re: md road-map: 2011
  2011-02-16 14:00     ` Robin Hill
@ 2011-02-16 14:09       ` Roberto Spadim
  2011-02-16 14:21         ` Roberto Spadim
  0 siblings, 1 reply; 52+ messages in thread
From: Roberto Spadim @ 2011-02-16 14:09 UTC (permalink / raw)
  To: Roberto Spadim, Giovanni Tessore, linux-raid; +Cc: Robin Hill

hummm nice =)
near layout is the key for many mirrors?
i will check more layouts

2011/2/16 Robin Hill <robin@robinhill.me.uk>:
> On Wed Feb 16, 2011 at 10:40:40AM -0300, Roberto Spadim wrote:
>
>> i agree with giovanni, another question since we will make a lot of
>> change on mirrors based arrays (raid1, raid10) with the badblock list,
>> could we:
>>
>> option1) remove raid1 code, change raid10 to work without raid0
>> 'service', change raid10 to work with more than 1mirror (like raid1
>> do)?
>> option2) port raid10 layout to raid1?
>>
> You can already do option1 (if I'm understanding you correctly).  A
> RAID10 array can use as many mirrors as you like (--layout n2 is the
> default, meaning a near layout with 2 replicas, using n4 would give 4
> replicas), and as long as the number of replicas is equal to the number
> of devices, there should be no striping involved in the process.
>
> Cheers,
>    Robin
> --
>     ___
>    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
>   / / )      | Little Jim says ....                            |
>  // !!       |      "He fallen in de water !!"                 |
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: md road-map: 2011
  2011-02-16 10:27 md road-map: 2011 NeilBrown
  2011-02-16 11:28 ` Giovanni Tessore
@ 2011-02-16 14:13 ` Joe Landman
  2011-02-16 21:24   ` NeilBrown
  2011-02-16 15:42 ` David Brown
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 52+ messages in thread
From: Joe Landman @ 2011-02-16 14:13 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 02/16/2011 05:27 AM, NeilBrown wrote:
>
> I all,
>   I wrote this today and posted it at
> http://neil.brown.name/blog/20110216044002
>
> I thought it might be worth posting it here too...


Any possibility of getting a hook for a read/write/compare checksum in 
this?  I'd be happy to commit some time to this if I knew where to begin.

Also, very interested in hooks to do RAID and similar computations in 
user space (so we can play with functionality without causing problems 
with a kernel).  Does this external capability exist now?  Would it be 
hard to include?  Again, something we'd be interested in committing some 
time to if we could get a shove in the right direction.

Thanks!

Joe


* Re: md road-map: 2011
  2011-02-16 14:09       ` Roberto Spadim
@ 2011-02-16 14:21         ` Roberto Spadim
  2011-02-16 21:55           ` NeilBrown
  0 siblings, 1 reply; 52+ messages in thread
From: Roberto Spadim @ 2011-02-16 14:21 UTC (permalink / raw)
  To: Roberto Spadim, Giovanni Tessore, linux-raid

Since option1 is already possible, why continue with the raid1 code?
Could we port write-behind to the raid10 code?
Another thing: could raid10 work without replicas, like a raid0?

Why?  Just to remove many files with the same function (raid1 and
raid0); if raid10 does the same work, maybe some mdadm changes would
allow --level=1 to mean raid10 without striping, and --level=0 to mean
raid10 without mirrors.

2011/2/16 Roberto Spadim <roberto@spadim.com.br>:
> hummm nice =)
> near layout is the key for many mirrors?
> i will check more layouts
>
> 2011/2/16 Robin Hill <robin@robinhill.me.uk>:
>> On Wed Feb 16, 2011 at 10:40:40AM -0300, Roberto Spadim wrote:
>>
>>> i agree with giovanni, another question since we will make a lot of
>>> change on mirrors based arrays (raid1, raid10) with the badblock list,
>>> could we:
>>>
>>> option1) remove raid1 code, change raid10 to work without raid0
>>> 'service', change raid10 to work with more than 1mirror (like raid1
>>> do)?
>>> option2) port raid10 layout to raid1?
>>>
>> You can already do option1 (if I'm understanding you correctly).  A
>> RAID10 array can use as many mirrors as you like (--layout n2 is the
>> default, meaning a near layout with 2 replicas, using n4 would give 4
>> replicas), and as long as the number of replicas is equal to the number
>> of devices, there should be no striping involved in the process.
>>
>> Cheers,
>>    Robin
>> --
>>     ___
>>    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
>>   / / )      | Little Jim says ....                            |
>>  // !!       |      "He fallen in de water !!"                 |
>>
>
>
>
> --
> Roberto Spadim
> Spadim Technology / SPAEmpresarial
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial


* Re: md road-map: 2011
  2011-02-16 10:27 md road-map: 2011 NeilBrown
  2011-02-16 11:28 ` Giovanni Tessore
  2011-02-16 14:13 ` Joe Landman
@ 2011-02-16 15:42 ` David Brown
  2011-02-16 21:35   ` NeilBrown
  2011-02-16 17:20 ` Joe Landman
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 52+ messages in thread
From: David Brown @ 2011-02-16 15:42 UTC (permalink / raw)
  To: linux-raid

On 16/02/2011 11:27, NeilBrown wrote:
>
> I all,
>   I wrote this today and posted it at
> http://neil.brown.name/blog/20110216044002
>
> I thought it might be worth posting it here too...
>
> NeilBrown
>


The bad block log will be a huge step up for reliability by making 
failures fine-grained.  Occasional failures are a serious risk, 
especially with very large disks.  The bad block log, especially 
combined with the "hot replace" idea, will make md raid a lot safer 
because you avoid running the array in degraded mode (except for a few 
stripes).

When a block is marked as bad on a disk, is it possible to inform the 
file system that the whole stripe is considered bad?  Then the 
filesystem will (I hope) add that stripe to its own bad block list, move 
the data out to another stripe (or block, from the fs's viewpoint), thus 
restoring the raid redundancy for that data.

Can a "hot spare" automatically turn into a "hot replace" based on some 
criteria (such as a certain number of bad blocks)?  Can the replaced 
drive then become a "hot spare" again?  It may not be perfect, but it is 
still better than nothing, and useful if the admin can't replace the 
drive quickly.

It strikes me that "hot replace" is much like one of the original disks 
out of the array and replacing it with a RAID 1 pair using the original 
disk and a missing second.  The new disk is then added to the pair and 
they are sync'ed.  Finally, you remove the old disk from the RAID 1 
pair, then re-assign the drive from the RAID 1 "pair" to the original RAID.

I may be missing something, but I think that using the bad-block list
and the non-sync bitmaps, the only thing needed to support hot replace 
is a way to turn a member drive into a degraded RAID 1 set in an atomic 
action, and to reverse this action afterwards.  This may also give extra 
flexibility - it is conceivable that someone would want to keep the RAID 
1 set afterwards as a reshape (turning a RAID 5 into a RAID 1+0, for 
example).

For your non-sync bitmap, would it make sense to have a two-level 
bitmap?  Perhaps a coarse bitmap in blocks of 32 MB, with each entry 
showing a state of in sync, out of sync, partially synced, or never 
synced.  Partially synced coarse blocks would have their own fine bitmap 
at the 4K block size (or perhaps a bit bigger - maybe 32K or 64K would 
fit well with SSD block sizes).  Partially synced and out of sync blocks 
would be gradually brought into sync when the disks are otherwise free, 
while never synced blocks would not need to be synced at all.

This would let you efficiently store the state during initial builds 
(everything is marked "never synced" until it is used), and rebuilds are 
done by marking everything as "out of sync" on the new device.  The 
two-level structure would let you keep fine-grained sync information 
from file system discards without taking up unreasonable space.






* Re: md road-map: 2011
  2011-02-16 10:27 md road-map: 2011 NeilBrown
                   ` (2 preceding siblings ...)
  2011-02-16 15:42 ` David Brown
@ 2011-02-16 17:20 ` Joe Landman
  2011-02-16 21:36   ` NeilBrown
  2011-02-16 19:37 ` Phil Turmel
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 52+ messages in thread
From: Joe Landman @ 2011-02-16 17:20 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 02/16/2011 05:27 AM, NeilBrown wrote:
>
> I all,
>   I wrote this today and posted it at
> http://neil.brown.name/blog/20110216044002
>
> I thought it might be worth posting it here too...

Another request would be an incremental on-demand build of the RAID. 
That is, when we set up a RAID6, that it only computes the blocks as 
they are allocated and used.  This helps with things like thin 
provisioning on remote target devices (among other nice things).

>
> NeilBrown
>
>
> -------------------------
>
>
> It is about 2 years since I last published a road-map[1] for md/raid
> so I thought it was time for another one.  Unfortunately quite a few
> things on the previous list remain undone, but there has been some
> progress.
>
> I think one of the problems with some to-do lists is that they aren't
> detailed enough.  High-level design, low level design, implementation,
> and testing are all very different sorts of tasks that seem to require
> different styles of thinking and so are best done separately.  As
> writing up a road-map is a high-level design task it makes sense to do
> the full high-level design at that point so that the tasks are
> detailed enough to be addressed individually with little reference to
> the other tasks in the list (except what is explicit in the road map).
>
> A particular need I am finding for this road map is to make explicit
> the required ordering and interdependence of certain tasks.  Hopefully
> that will make it easier to address them in an appropriate order, and
> mean that I waste less time saying "this is too hard, I might go read
> some email instead".
>
> So the following is a detailed road-map for md raid for the coming
> months.
>
> [1] http://neil.brown.name/blog/20090129234603
>
> Bad Block Log
> -------------
>
> As devices grow in capacity, the chance of finding a bad block
> increases, and the time taken to recover to a spare also increases.
> So the practice of ejecting a device from the array as soon as a
> write-error is detected is getting more and more problematic.
>
> For some time we have avoided ejecting devices for read errors, by
> computing the expected data from elsewhere and writing back to the
> device - hopefully fixing the read error.  However this cannot help
> degraded arrays and they will still eject a device (and hence fail the
> whole array) on a single read error.  This is not good.
>
> A particular problem is that when a device does fail and we need to
> recover the data, we typically read all of the other blocks on all
> arrays.  If we are going to hit any read errors, this is the most
> likely time, and also this is the worst possible time and it will mean
> that the recovery doesn't complete and the array gets stuck in a
> degraded state and is very susceptible to substantial loss if another
> failure happens.
>
> Part of the answer to this is to implement a "bad block log".  This is
> a record of blocks that are known to be bad.  i.e. either a read or a
> write has recently failed.  Doing this allows us to just eject that
> block from the array rather than the whole devices.  Similarly instead
> of failing the whole array, we can fail just one stripe.  Certainly
> this can mean data loss, but the loss of a few K is much less
> traumatic than the loss of a terabyte.
>
> But using a bad block list isn't just about keeping the data loss
> small, it can be about keeping it to zero.  If we get a write error on
> a block in a non-degraded array, then recording the bad block means we
> lose redundancy in just that stripe rather than losing it across the
> whole array.  If we then lose a different block on a different drive,
> the ability to record the bad block means that we can continue without
> data loss.  Had we needed to eject both whole drives from the array we
> would have lost access to all of our data.
>
> The bad block list must be recorded to stable storage to be useful, so
> it really needs to be on the same drives that store the data.  The
> bad-block list for a particular device is only of any interest to that
> device.  Keeping information about one device on another is pointless.
> So we don't have a bad block list for the whole array, we keep
> multiple lists, one for each device.
>
> It would be best to keep at least two copies of the bad block list so
> that if the place where the list goes bad we can keep working with the
> device.  The same logic applies to other metadata which currently
> cannot be duplicated.  So implementing this feature will not address
> metadata redundancy.  A separate feature should address metadata
> redundancy and it can duplicate the bad block list as well as other
> metadata.
>
> There are doubtlessly lots of ways that the bad block list could be
> stored, but we need to settle on one.  For externally managed metadata
> we need to make the list accessible via sysfs in a generic way so that
> a user-space program can store is as appropriate.
>
> So: for v0.90 we choose not to store a bad block list.  There isn't
> anywhere convenient to store it and new installations of v0.90 are not
> really encouraged.
>
> For v1.x metadata we record in the metadata an offset (from the
> superblock) and a size for a table, and a 'shift' value which can be
> used to shift from sector addresses to block numbers.  Thus the unit
> that is failed when an error is detected can be larger than one
> sector.
>
> Each entry in the table is 64bits in little-endian.   The most
> significant 55 bits store a block number which allows for 16 exbibytes
> with 512byte blocks, or more if a larger shift size is used.  The
> remaining 9 bits store a length of the bad range which can range from
> 1 to 512.  As bad blocks can often be consecutive, this is expected to
> allow the list to be quite efficient.  A value of all 1's cannot
> correctly identify a bad range of blocks and so it is used to pad out
> the tail of the list.
>
> The bad block list is exposed through sysfs via a directory called
> 'badblocks' containing several attribute files.
>
> "shift" stores the 'shift' number described above and can be set as
> long as the bad block list is empty.
>
> "all" and "unacknowledged" each contains a list of bad ranges, the
> start (in blocks, not sectors) and the length (1-512).  Each can also
> be written to with a string of the same format as is read out.  This
> can be used to add bad blocks to the list or to acknowledge bad
> blocks.  Writing effectively say "this bad range is securely recorded
> on stable storage".
>
> All bad blocks appear in the "badblocks/all" file.  Only "acknowledged"
> bad blocks appear in "badblocks/unacknowledged".  These are ranges
> which appear to be bad but are not known to be stored on stable
> storage.
>
> When md detects a write error or a read error which it cannot correct
> it added the block and marks the range that it was part of as
> 'unacknowledged'.  Any write that depends on this block is then
> blocked until the range is acknowledged.  This ensures that an
> application isn't told that a write has succeeded until the data
> really is safe.
>
> If the bad block list is being managed by v1.x metadata internally,
> then the bad block list will be written out and the ranges will be
> acknowledged and writes unblocked automatically.
>
> If the bad block list is being managed externally, then the bad ranges
> will be reported in "unacknowledged_bad_blocks".  The metadata handler
> should read this, update the on-disk metadata and write the range back
> to "bad_blocks".  This completes the acknowledgment handshake and
> writes can continue.
>
> RAID1, RAID10 and RAID456 should all support bad blocks.  Every read
> or write should perform a lookup of the bad block list.  If a read
> finds a bad block, that device should be treated as failed for that
> read.  This includes reads that are part of resync or recovery.
>
> If a write finds a bad block there are two possible responses.  Either
> the block can be ignored as with reads, or we can try to write the
> data in the hope that it will fix the error.  Always taking the second
> action would seem best as it allows blocks to be removed from the
> bad-block list, but as a failing write can take a long time, there are
> plenty of cases where it would not be good.
>
> To choose between these we make the simple decision that once we see a
> write error we never try to write to bad blocks on that device again.
> This may not always be the perfect strategy, but it will effectively
> address common scenarios.  So if a bad block is marked bad due to a
> read error when the array was degraded, then a write (presumably from
> the filesystem) will have the opportunity to correct the error.
> However if it was marked bad due to a write error we don't risk paying
> the penalty of more write errors.
>
> This 'have seen a write error' status is not stored in the array
> metadata.  So when restarting an array with some bad blocks, each
> device will have one chance to prove that it can correctly handle
> writes to a bad block.  If it can, the bad block will be removed from
> the list and the data is that little bit safer.  If it cannot, no
> further writes to bad blocks will be tried on the device until the
> next array restart.
>
>
> Hot Replace
> -----------
>
> "Hot replace" is my name for the process of replacing one device in an
> array by another one without first failing the one device.  Thus there
> can be two devices in an array filling the same 'role'.  One device
> will contain all of the data, the other device will only contain some
> of it and will be undergoing a 'recovery' process.  Once the second
> device is fully recovered it is expected that the first device will be
> removed from the array.
>
> This can be useful whenever you want to replace a working device with
> another device, without letting the array go degraded.  Two obvious
> cases are:
>   1/ when you want to replace a smaller device with a larger device
>   2/ when you have a device with a number of bad blocks and want to
>      replace it with a more reliable device.
>
> For '2' to be realised, the bad block log described above must be
> implemented, so it should be completed before this feature.
>
> Hot replace is really only needed for RAID10 and RAID456.  For RAID1,
> simply increasing the number of devices in the array while the new
> device recovers, then failing the old device and decreasing the number
> of devices in the array is sufficient.
>
> For RAID0 or LINEAR it would be sufficient to:
>   - stop the array
>   - make a RAID1 without superblocks for the old and new device
>   - re-assemble the array using the RAID1 in place of the old device.
>
> This is certainly not as convenient but is sufficient for a case that
> is not likely to be commonly needed.
>
> So for both the RAID10 and RAID456 modules we need:
>   - the ability to add a device as a hot-replace device for a specific
>     slot
>   - the ability to record hot-replace status in the metadata.
>   - a 'recovery' process to rebuild a device, preferably only reading
>     from the device to be replaced, though reading from elsewhere when
>     needed
>   - writes to go to both the primary and secondary devices.
>   - reads to come from either if the secondary has recovered far enough.
>   - to promote a secondary device to primary when the primary device
>     (that has a hot-replace device) fails.
>
> It is not clear whether the primary should be automatically failed
> when the rebuild of the secondary completes.  Commonly this would be
> ideal, but if the secondary experienced any write errors (that were
> recorded in the bad block log) then it would be best to leave both in
> place until the sysadmin resolves the situation.   So in the first
> implementation this failing should not be automatic.
>
> The identification of a spare as a 'hot-replace' device is achieved
> through the 'md/dev-XXXX/slot' sysfs attribute.  This is usually
> 'none' or a small integer identifying which slot in the array is
> filled by this device.  If a number followed by a plus (e.g. '1+') is
> written, then the device takes the role of a hot-replace device for
> that slot.  This syntax requires there be at most one hot-replace
> device per slot.  This is a deliberate decision to manage complexity
> in the code.  Allowing more would be of minimal value but require
> substantial extra complexity.
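>
> As a trivial sketch, nominating a spare as the hot-replace device for
> slot 1 might look like this (the device path is an assumed example):
>
> #include <stdio.h>
>
> /* Sketch: nominate the device behind this (assumed) sysfs path as the
>  * hot-replace device for slot 1 by writing "1+" to its 'slot' file. */
> int main(void)
> {
>         FILE *f = fopen("/sys/block/md0/md/dev-sdc1/slot", "w");
>
>         if (!f)
>                 return 1;
>         fprintf(f, "1+\n");
>         return fclose(f) != 0;
> }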
>
> v0.90 metadata is not supported.  v1.x sets a 'feature bit' on the
> superblock of any 'hot-replace' device and naturally records in
> 'recover_offset' how far recovery has progressed.  Externally managed
> metadata can support this, or not, as they choose.
>
>
> Reversible Reshape
> ------------------
>
> It is possible to start a reshape that cannot be reversed until the
> reshape has completed.  This is occasionally problematic.  While we
> might hope that users would never make errors, we should try to be as
> forgiving as possible.
>
> Reversing a reshape that changes the number of data-devices is
> possible as we support both growing and shrinking and these happen in
> opposite directions so one is the reverse of the other.  Thus at worst,
> such a reshape can be reversed by:
>   - stopping the array
>   - re-writing the metadata so it looks like the change is going in the
>     other direction
>   - restarting the array.
>
> However for a reshape that doesn't change the number of data devices,
> such as a RAID5->RAID6 conversion or a change of chunk-size, reversal
> is currently not possible as the change always goes in the same
> direction.
>
> This is currently only meaningful for RAID456, though at some later
> date it might be relevant for RAID10.
>
> A future change will make it possible to move the data_offset while
> performing a reshape, and that will sometimes require the reshape to
> progress in a certain direction.  It is only when the data_offset is
> unchanged and the number of data disks is unchanged that there is any
> doubt about direction.  In that case it needs to be explicitly stated.
>
> We need:
>   - some way to record in the metadata the direction of the reshape
>   - some way to ask for a reshape to be started in the reverse
>     direction
>   - some way to reverse a reshape that is currently happening.
>
> We have a new sysfs attribute "reshape_direction" which is
> "low-to-high" or "high-to-low".  This defaults to "low-to-high" but
> will be forced to "high-to-low" if the particular reshape requires it,
> or can be explicitly set by a 'write' before the reshape commences.
>
> Once the reshape has commenced, writing a new value to this field can
> flip the reshape causing it to be reverted.
>
> In both v0.90 and v1.x metadata we record a reversing reshape by
> setting the most significant bit in reshape_position.  For v0.90 we
> also increase the minor number to 91.  For v1.x we set a feature bit
> as well.
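>
> Illustratively, handling that flag might look something like the
> following sketch (using the 64-bit v1.x field; purely an assumption
> about how the bit would be manipulated):
>
> #include <stdint.h>
>
> /* Sketch of the proposed flag, using the 64-bit v1.x field for
>  * illustration: the top bit of reshape_position marks a reshape that
>  * is running high-to-low. */
> #define MD_RESHAPE_BACKWARDS (1ULL << 63)
>
> static inline uint64_t mark_backwards(uint64_t reshape_position)
> {
>         return reshape_position | MD_RESHAPE_BACKWARDS;
> }
>
> static inline int is_backwards(uint64_t reshape_position)
> {
>         return (reshape_position & MD_RESHAPE_BACKWARDS) != 0;
> }
>
> static inline uint64_t reshape_sector(uint64_t reshape_position)
> {
>         return reshape_position & ~MD_RESHAPE_BACKWARDS;
> }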
>
>
> Change data offset during reshape
> ---------------------------------
>
> One of the biggest problems with reshape currently is the need for the
> backup file.  This is a management problem as it cannot easily be
> found at restart, and it is a performance problem as the extra writing
> is expensive.
>
> In some cases we can avoid the need for a backup file completely by
> changing the data-offset.  i.e. the location on the devices where the
> array data starts.
>
> For reshapes that increase the number of devices, only a single backup
> is required at the start.  If the data_offset is moved just one chunk
> earlier we can do without a separate backup.  This obviously requires
> that space was left when the array was first created.  Recent versions
> of mdadm do leave some space with the default metadata, though more
> would probably be good.
>
> For reshapes that decrease the number of devices, only a small backup
> is required right at the end of the process (at the beginning of the
> devices).  If we move the data_offset forward by one chunk that backup
> too can be avoided.  As we are normally reducing the size of the array
> in this process, we just need to reduce it a little bit more.
>
> For reshapes that neither increase nor decrease the number of devices a
> somewhat larger change in data_offset is needed to get reasonable
> performance.  A single chunk (of the larger chunk size) would work,
> but would require updating the metadata after each chunk which would
> be prohibitively slow unless chunks were very large.  A few megabytes
> is probably sufficient for reasonable performance, though testing
> would be helpful to be sure.  Current mdadm leaves no space at the
> start of 1.0, and about 1Meg at the start of 1.1 and 1.2 arrays.
>
> This will generally not be enough space.  In these cases it will
> probably be best to perform the reshape in the reverse direction
> (helped by the previous feature).  This will probably require
> shrinking the filesystem and the array slightly first.  Future
> versions of mdadm should aim to leave a few megabytes free at the start
> and end to make these reshapes work better.
>
> Moving the data offset is not possible for 0.90 metadata as it does
> not record a data offset.
>
> For 1.x metadata it is possible to have a different data_offset on
> each device.  However for simplicity we will only support changing the
> data offset by the same amount on each device.  This amount will be
> stored in currently-unused space in the 1.x metadata.  There will be a
> sysfs attribute "delta_data_offset" which can be set to a number of
> sectors - positive or negative - to request a change in the data
> offset and thus avoid the need for a backup file.
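>
> A small sketch of requesting such a move before a grow-reshape (the
> sysfs path is an assumed example, and 2048 sectors stands in for a
> 1MiB chunk):
>
> #include <stdio.h>
>
> /* Sketch: ask md to move the data offset one chunk earlier so that a
>  * grow-reshape needs no backup file.  The path is an assumed example
>  * and 2048 sectors stands in for a 1MiB chunk. */
> int main(void)
> {
>         long chunk_sectors = 2048;
>         FILE *f = fopen("/sys/block/md0/md/delta_data_offset", "w");
>
>         if (!f)
>                 return 1;
>         fprintf(f, "%ld\n", -chunk_sectors);
>         return fclose(f) != 0;
> }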
>
>
> Bitmap of non-sync regions.
> ---------------------------
>
> There are a couple of reasons for having regions of an array that are known
> not to contain important data and are known to not necessarily be
> in-sync.
>
> 1/ When an array is first created it normally contains no valid data.
>     The normal process of a 'resync' to make all parity/copies correct
>     is largely a waste of time.
> 2/ When the filesystem uses a "discard" command to report that a
>     region of the device is no longer used, it would be good to be able
>     to pass this down to the underlying devices.  To do this safely we
>     need to record at the md level that the region is unused so we
>     don't complain about inconsistencies and don't try to re-sync the
>     region after a crash.
>
> If we record which regions are not in-sync in a bitmap then we can meet
> both of these needs.
>
> A read to a non-in-sync region would always return 0s.
> A 'write' to a non-in-sync region should cause that region to be
> resynced.  Writing zeros would in some sense be ideal, but to do that
> we would have to block the write, which would be unfortunate.  As the
> fs should not be reading from that area anyway, it shouldn't really
> matter.
>
> The granularity of the bitmap is probably quite hard to get right.
> Having it match the block size would mean that no resync would be
> needed and that every discard request could be handled exactly.
> However it could result in a very large bitmap - 30 Megabytes for a 1
> terabyte device with a 4K block size.  This would need to be kept in
> memory and looked up for every access, which could be problematic.
>
> Having a very coarse granularity would make storage and lookups more
> efficient.  If we make sure the bitmap would fit in 4K, we would have
> about 32 megabytes per bit.  This would mean that each time we
> triggered a resync it would resync for a second or two which is
> probably a reasonable time as it wouldn't happen very often.  But it
> would also mean that we can only service a 'discard' request if it
> covers whole blocks of 32 megabytes, and I really don't know how
> likely that is.  Actually I'm not sure if anyone knows, the jury seems
> to still be out on how 'discard' will work long-term.
>
> So probably aiming for a few K to a few hundred K seems reasonable.
> That means that the in-memory representation will have to be a
> two-level array.  A page of pointers to other pages can cover (on a
> 64bit system) 512 pages or 2Meg of bitmap space which should be
> enough.
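>
> Purely as an illustration of the in-memory side, a two-level lookup
> along those lines might look like this sketch (sizes follow the
> arithmetic above; this is not intended as the eventual kernel code):
>
> #include <stdlib.h>
>
> /* Minimal sketch of the two-level in-memory structure: a 4K page of
>  * 512 pointers, each pointing to a 4K page of bits, covers 2MiB of
>  * bitmap space as described above.  Purely illustrative. */
> #define PTRS_PER_PAGE 512
> #define BITS_PER_PAGE (4096 * 8)
>
> struct nonsync_bitmap {
>         unsigned char *pages[PTRS_PER_PAGE];   /* NULL until first needed */
> };
>
> /* Returns 1 if 'chunk' is marked not-in-sync, 0 otherwise. */
> static int nonsync_test(struct nonsync_bitmap *bm, unsigned long chunk)
> {
>         unsigned long page = chunk / BITS_PER_PAGE;
>         unsigned long bit  = chunk % BITS_PER_PAGE;
>
>         if (page >= PTRS_PER_PAGE || bm->pages[page] == NULL)
>                 return 0;
>         return (bm->pages[page][bit / 8] >> (bit % 8)) & 1;
> }
>
> /* Mark 'chunk' as not-in-sync; returns 0 on success, -1 on failure. */
> static int nonsync_set(struct nonsync_bitmap *bm, unsigned long chunk)
> {
>         unsigned long page = chunk / BITS_PER_PAGE;
>         unsigned long bit  = chunk % BITS_PER_PAGE;
>
>         if (page >= PTRS_PER_PAGE)
>                 return -1;
>         if (bm->pages[page] == NULL) {
>                 bm->pages[page] = calloc(1, 4096);
>                 if (bm->pages[page] == NULL)
>                         return -1;
>         }
>         bm->pages[page][bit / 8] |= 1 << (bit % 8);
>         return 0;
> }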
>
> As always we need a way to:
>   - record the location and size of the bitmap in the metadata
>   - allow the granularity to be set via sysfs
>   - allow bits to be set via sysfs, and allow the current bitmap to
>     be read via sysfs.
>
> For v0.90 metadata we won't support this as there is no room.  We
> could possibly store about 32 bytes directly in the superblock
> allowing for 4Gig sections but this is unlikely to be really useful.
>
> For v1.x metadata we use 8 bytes from the 'array state info'.  4 bytes
> give an offset from the metadata of the start of the bitmap, 2 bytes
> give the space reserved for the bitmap (max 32Meg) and 2 bytes give a
> shift value from sectors to in-sync chunks.  The actual size of the
> bitmap must be computed from the known size of the array and the size
> of the chunks.
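>
> As a rough sketch, those 8 bytes might be laid out as follows (the
> field names are invented here and the units are assumed to be
> sectors):
>
> #include <stdint.h>
>
> /* Sketch of those 8 bytes (names invented, units assumed to be sectors,
>  * little-endian on disk like the rest of the v1.x superblock):
>  *   nonsync_offset - offset from the superblock to the bitmap
>  *   nonsync_space  - space reserved for the bitmap (64K sectors = 32MiB)
>  *   nonsync_shift  - shift from sectors to in-sync chunks */
> struct nonsync_info {
>         uint32_t nonsync_offset;
>         uint16_t nonsync_space;
>         uint16_t nonsync_shift;
> };
>
> /* The bitmap size is not stored; it is derived from the array size. */
> static inline uint64_t nonsync_bits(uint64_t array_sectors, uint16_t shift)
> {
>         return (array_sectors + (1ULL << shift) - 1) >> shift;
> }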
>
> We present the bitmap in sysfs similarly to the way we present the bad
> block list.  A file 'non-sync/regions' contains the start and size of
> regions (measured in sectors) that are known to not be in-sync.  A file
> 'non-sync/now-in-sync' lists ranges that actually are in sync but are
> still recorded as non-in-sync in the metadata.  User-space reads
> 'now-in-sync', updates the metadata, and writes the ranges back to
> 'regions'.
>
> Another file 'non-sync/to-discard' lists ranges for which a discard
> request has been made.  These need to be recorded in the metadata.
> They are then written back to the file which allows the discard
> request to complete.
>
> The granularity can be set via sysfs by writing to
> 'non-sync/chunksize'.
>
>
> Assume-clean when increasing array --size
> -----------------------------------------
>
> When a RAID1 is created, --assume-clean can be given so that the
> largely-unnecessary initial resync can be avoided.  When extending the
> size of an array with --grow --size=, there is no way to specify
> --assume-clean.
>
> If a non-sync bitmap (see above) is configured this doesn't matter,
> as the extra space will simply be marked as non-in-sync.
> However if a non-sync bitmap is not supported by the metadata or is
> not configured it would be good if md/raid1 can be told not to sync
> the extra space - to assume that it is in-sync.
>
> So when a non-sync bitmap is not configured (the chunk-size is zero),
> writing to the non-sync/regions file tells md that we don't care about the
> region being in-sync.  So the sequence:
>   - freeze sync_action
>   - update size
>   - write range to non-sync/regions
>   - unfreeze sync_action
>
> will effect a "--grow --size=bigger --assume-clean" reshape.
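>
> A small sketch of that sequence ('sync_action' accepts
> "frozen"/"idle"; the 'non-sync/regions' path follows the proposal
> above, 'component_size' stands in for the size update, and all values
> are examples only):
>
> #include <stdio.h>
>
> /* Sketch of the sequence above.  'sync_action' takes "frozen"/"idle";
>  * the 'non-sync/regions' path follows the proposal, 'component_size'
>  * stands in for the size update, and all values are examples only. */
> static void sysfs_write(const char *path, const char *val)
> {
>         FILE *f = fopen(path, "w");
>
>         if (f) {
>                 fprintf(f, "%s\n", val);
>                 fclose(f);
>         }
> }
>
> int main(void)
> {
>         sysfs_write("/sys/block/md0/md/sync_action", "frozen");
>         sysfs_write("/sys/block/md0/md/component_size", "976762584");
>         /* start and length of the newly added space - example values */
>         sysfs_write("/sys/block/md0/md/non-sync/regions",
>                     "1953525168 1953525168");
>         sysfs_write("/sys/block/md0/md/sync_action", "idle");
>         return 0;
> }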
>
>
> Enable 'reshape' to also perform 'recovery'.
> --------------------------------------------
>
> As a 'reshape' re-writes all the data in the array it can quite easily
> be used to recover to a spare device.  Normally these two operations
> would happen separately.  However if a device fails during a reshape
> and a spare is available it makes sense to combine them.
>
> Currently if a device fails during a reshape (leaving the array
> degraded but functional) the reshape will continue and complete.  Then
> if a spare is available it will be recovered.  This means a longer
> total time until the array is optimal.
>
> When the device fails, the reshape actually aborts and then restarts
> from where it left off.  If instead we allow spares to be added
> between the abort and the restart, and cause the 'reshape' to also
> perform a recovery up to the point that the reshape had already
> reached, then we minimise the time until the array is optimal again.
>
>
> When reshaping an array to fewer devices, allow 'size' to be increased
> --------------------------------------------------------------------
>
> The 'size' of an array is the amount of space on each device which is
> used by the array.  Normally the 'size' of an array cannot be set
> beyond the amount of space available on the smallest device.
>
> However when reshaping an array to have fewer devices it can be useful
> to be able to set the 'size' to be the smallest of the remaining
> devices - those that will still be in use after the reshape.
>
> Normally reshaping an array to have fewer devices will make the array
> size smaller.  However if we can simultaneously increase the size of
> the remaining devices, the array size can stay unchanged or even grow.
>
> This can be used after replacing (ideally using hot-replace) a few
> devices in the array with larger devices.  The net result will be a
> similar amount of storage using fewer drives, each larger than before.
>
> This should simply be a case of allowing size to be set larger when
> delta_disks is negative.  It also requires that when converting the
> excess devices to spares, we fail them if they are smaller than the new
> size.
>
> As a reshape can be reversed, we must make sure to revert the size
> change when reversing a reshape.
>
> Allow write-intent-bitmap to be added to an array during reshape/recovery.
> --------------------------------------------------------------------------
>
> Currently it is not possible to add a write-intent-bitmap to an array
> that is being reshaped/resynced/recovered.  There is no real
> justification for this, it was just easier at the time.
>
> Implementing this requires a review of all code relating to the
> bitmap, checking that a bitmap appearing - or disappearing - during
> these processes will not be a problem.  As the array is quiescent when
> the bitmap is added, no IO will actually be happening so it *should*
> be safe.
>
> This should also allow a reshape to be started while a bitmap is
> present, as long as the reshape doesn't change the implied size of the
> bitmap.
>
> Support resizing of write-intent-bitmap prior to reshape
> --------------------------------------------------------
>
> When we increase the 'size' of an array (the amount of the device
> used), that implies a change in size of the bitmap.  However the
> kernel cannot unilaterally resize the bitmap as there may not be room.
>
> Rather, mdadm needs to be able to resize the bitmap first.  This
> requires the sysfs interface to expose the size of the bitmap - which
> is currently implicit.
>
> Whether the bitmap coverage is increased by increasing the number of
> bits or increasing the chunk size, some updating of the bitmap storage
> will be necessary (particularly in the second case).
>
> So it makes sense to allow user-space to remove the bitmap then add a
> new bitmap with a different configuration.  If there is concern about
> a crash between these two, writes could be suspended for the (short)
> duration.
>
> Currently the 'sync_size' stored in the bitmap superblock is not used.
> We could stop updating that, and could allow the bitmap to
> automatically extend up to that boundary.
>
> So: we have a well defined 'sync_size' which can be set via the
> superblock or via sysfs.  A resize is permitted as long as there is no
> bitmap, or the existing bitmap has a sufficiently large sync_size.
>
> Support reshape of RAID10 arrays.
> ---------------------------------
>
> RAID10 arrays currently cannot be reshaped at all.  It is possible to
> convert a 'near' mode RAID10 to RAID0, but that is about all.   Some
> real reshape is possible and should be implemented.
>
> 1/ A 'near' or 'offset' layout can have the device size changed quite
>     easily.
>
> 2/ Device size of 'far' arrays cannot be changed easily.  Increasing device
>     size of 'far' would require re-laying out a lot of data.  We would
>     need to record the 'old' and 'new' sizes which metadata doesn't
>     currently allow.  If we spent 8 bytes on this we could possibly
>     manage a 'reverse reshape' style conversion here.
>
> 3/ Increasing the number of devices is much the same for all layouts.
>     The data needs to be copied to the new location.  As we currently
>     block IO while recovery is actually happening, we could just do
>     that for reshape as well, and make sure reshape happens in whole
>     chunks at a time (or whatever turns out to be the minimum
>     recordable unit).  We switch to 'clean' before doing any reshape so
>     a write will switch to 'dirty' and update the metadata.
>
> 4/ decreasing the number of devices is very much the reverse of
>     increasing..
>     Here is a weird thought:  We have introduced the idea that we can
>     increase the size of remaining devices when we decrease the number
>     of devices in the array.  For 'raid10-far', the re-layout for
>     increasing the device size is very much like that for decreasing
>     the number of devices - just that the number doesn't actually
>     decrease.
>
> 5/ changing layouts between 'near' and 'offset' should be manageable
>     providing enough 'backup' space is available.  We simply copy
>     a few chunks worth of data and move reshape_position.
>
> 6/ changing layout to or from 'far' is nearly impossible...
>     With a change in data_offset it might be possible to move one
>     stripe at a time, always into the place just vacated.
>     However keeping track of where we are and where it is safe to read
>     from would be a major headache - unless it fell out with some
>     really neat maths, which I don't think it does.
>     So this option will be left out.
>
>
> So the only 'instant' conversion possible is to increase the device
> size for 'near' and 'offset' arrays.
>
> 'reshape' conversions can modify chunk size, increase/decrease number of
> devices and swap between 'near' and 'offset' layout providing a
> suitable number of chunks of backup space is available.
>
> The device-size of a 'far' layout can also be changed by a reshape
> providing the number of devices is not increased.
>
>
> Better reporting of inconsistencies.
> ------------------------------------
>
> When a 'check' finds a data inconsistency it would be useful if it
> was reported.   That would allow a sysadmin to try to understand the
> cause and possibly fix it.
>
> One simple approach would be to log all inconsistencies through
> the kernel logs.  This would have to be limited to 'check' and
> possibly 'repair' passes, as logging a 'sync' pass (which also finds
> inconsistencies) can be expected to be very noisy.
>
> Another approach is to use a sysfs file to export a list of
> addresses.  This would place some upper limit on the number of
> addresses that could be listed, but if there are more inconsistencies
> than that limit, then the details probably aren't all that important.
>
> It makes sense to follow both of these paths.
>   - some easy-to-parse logging of inconsistencies found.
>   - a sysfs file that lists as many inconsistencies as possible.
>
> Each inconsistency is listed as a simple sector offset.  For
> RAID4/5/6, it is an offset from the start of data on the individual
> devices.  For RAID1 and RAID10 it is an offset from the start of the
> array.  So this can only be interpreted with a full understanding of
> the array layout.
>
> The actual inconsistency may be in some sector immediately following
> the given sector, as md performs checks in blocks larger than one
> sector and doesn't bother refining.  So a process that uses this
> information should read forward from the address to make sure it has
> found all of the inconsistency.  For striped arrays, at most 1 chunk
> need be examined.  For non-striped (i.e. RAID1) the window size is
> currently 64K.  The actual size can be found by dividing
> 'mismatch_cnt' by the number of entries in the mismatch list.
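>
> As a trivial sketch of that calculation for a user-space checker
> (names and units are illustrative):
>
> /* Sketch: size of the window a user-space checker should scan around
>  * each reported address; names and units are illustrative. */
> static unsigned long long
> mismatch_window(unsigned long long mismatch_cnt,
>                 unsigned long long reported_entries)
> {
>         if (reported_entries == 0)
>                 return 0;
>         return mismatch_cnt / reported_entries;
> }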
>
> This has no dependencies on other features.  It relates slightly to
> the bad-block list as one way of dealing with an inconsistency is to
> tell md that a selected block in the stripe is 'bad'.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-16 10:27 md road-map: 2011 NeilBrown
                   ` (3 preceding siblings ...)
  2011-02-16 17:20 ` Joe Landman
@ 2011-02-16 19:37 ` Phil Turmel
  2011-02-16 21:44   ` NeilBrown
  2011-02-16 20:29 ` Piergiorgio Sartor
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 52+ messages in thread
From: Phil Turmel @ 2011-02-16 19:37 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Hi Neil,

On 02/16/2011 05:27 AM, NeilBrown wrote:
> 
> I all,
>  I wrote this today and posted it at
> http://neil.brown.name/blog/20110216044002
> 
> I thought it might be worth posting it here too...
> 
> NeilBrown
> 
> 
> -------------------------
> 
> 
> It is about 2 years since I last published a road-map[1] for md/raid
> so I thought it was time for another one.  Unfortunately quite a few
> things on the previous list remain undone, but there has been some
> progress.
> 
> I think one of the problems with some to-do lists is that they aren't
> detailed enough.  High-level design, low level design, implementation,
> and testing are all very different sorts of tasks that seem to require
> different styles of thinking and so are best done separately.  As
> writing up a road-map is a high-level design task it makes sense to do
> the full high-level design at that point so that the tasks are
> detailed enough to be addressed individually with little reference to
> the other tasks in the list (except what is explicit in the road map).
> 
> A particular need I am finding for this road map is to make explicit
> the required ordering and interdependence of certain tasks.  Hopefully
> that will make it easier to address them in an appropriate order, and
> mean that I waste less time saying "this is too hard, I might go read
> some email instead".
> 
> So the following is a detailed road-map for md raid for the coming
> months.
> 
> [1] http://neil.brown.name/blog/20090129234603
> 
> Bad Block Log
> -------------
[trim /]
> Bitmap of non-sync regions.
> ---------------------------
[trim /]

It occurred to me that if you go to the trouble (and space and performance)
to create and maintain metadata for lists of bad blocks, and separate
metadata for sync status aka "trim", or hot-replace status, or reshape-status,
or whatever features are dreamt up later, why not create an infrastructure to
carry all of it efficiently?

David Brown suggested a multi-level metadata structure.  I concur, but somewhat
more generic:
	Level 1:  Coarse bitmap, set bit indicates 'look at level 2'
	Level 2:  Fine bitmap, set bit indicates 'look at level 3'
	Level 3:  Extent list, with starting block, length, and feature payload

The bitmap levels are purely for hot-path performance.
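
Purely to illustrate the shape I have in mind (all names and sizes
below are arbitrary placeholders):

#include <stdint.h>

/* Arbitrary illustration of the three levels; all sizes are placeholders. */
struct md_meta_extent {
        uint64_t start;         /* starting block */
        uint32_t length;        /* length in blocks */
        uint32_t feature;       /* payload: bad-block, non-sync, trim, ... */
};

struct md_meta_region {
        uint64_t coarse_bitmap[64];     /* level 1: set bit => check level 2 */
        uint64_t fine_bitmap[4096];     /* level 2: set bit => check level 3 */
        struct md_meta_extent extents[];        /* level 3: extent list */
};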

As an option, it should be possible to spread the detailed metadata through the
data area, possibly in chunk-sized areas spread out at some user-defined
interval.  "meta-span", perhaps.  Then resizing partitions that compose an
array would be less likely to bump up against metadata size limits.  The coarse
bitmap should stay near the superblock, of course.

Personally, I'd like to see the bad-block feature actually perform block
remapping, much like hard drives themselves do, but with the option to unmap the
block if a later write succeeds.  Using one retry per array restart as you
described makes a lot of sense.  In any case, remapping would retain redundancy
where applicable short of full drive failure or remap overflow.

My $0.02, of course.

Phil

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-16 10:27 md road-map: 2011 NeilBrown
                   ` (4 preceding siblings ...)
  2011-02-16 19:37 ` Phil Turmel
@ 2011-02-16 20:29 ` Piergiorgio Sartor
  2011-02-16 21:48   ` NeilBrown
  2011-02-16 22:50 ` Keld Jørn Simonsen
  2011-02-23  5:06 ` Daniel Reurich
  7 siblings, 1 reply; 52+ messages in thread
From: Piergiorgio Sartor @ 2011-02-16 20:29 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Hi Neil,

> I all,
>  I wrote this today and posted it at
> http://neil.brown.name/blog/20110216044002
> 
> I thought it might be worth posting it here too...
[...] 
> So the following is a detailed road-map for md raid for the coming
> months.

Question: is this for information purposes, or are we
being called to a "brainstorming"?

[...]
> Hot Replace
> -----------
> 
> "Hot replace" is my name for the process of replacing one device in an
> array by another one without first failing the one device.  Thus there

Didn't we also name it "proactive replacement"? :-)

> It is not clear whether the primary should be automatically failed
> when the rebuild of the secondary completes.  Commonly this would be
> ideal, but if the secondary experienced any write errors (that were
> recorded in the bad block log) then it would be best to leave both in
> place until the sysadmin resolves the situation.   So in the first
> implementation this failing should not be automatic.

Maybe put the primary back as a "spare", i.e. neither failed nor
working, unless the "migration" was not successful. In that
case the secondary device should be failed.

My use case here is disk "rotation" :-). That is, for example, a
RAID-5/6 with n disks + 1 spare. Each X months/weeks/days/hours
one disk is pulled out of the array and the spare one takes over.
The pulled out disk will be the new spare (and powered down, possibly).
The idea here is to have n disks which will have, after some time,
different (increasing) power-on hours, so as to minimize the possibility
of multiple failures.

> Better reporting of inconsistencies.
> ------------------------------------
> 
> When a 'check' finds a data inconsistency it would be useful if it
> was reported.   That would allow a sysadmin to try to understand the
> cause and possibly fix it.

Could you please consider adding, for RAID-6, the
capability to also report which device potentially
has the problem? Thanks!

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-16 14:13 ` Joe Landman
@ 2011-02-16 21:24   ` NeilBrown
  2011-02-16 21:44     ` Roman Mamedov
  0 siblings, 1 reply; 52+ messages in thread
From: NeilBrown @ 2011-02-16 21:24 UTC (permalink / raw)
  To: Joe Landman; +Cc: linux-raid

On Wed, 16 Feb 2011 09:13:24 -0500 Joe Landman <joe.landman@gmail.com> wrote:

> On 02/16/2011 05:27 AM, NeilBrown wrote:
> >
> > I all,
> >   I wrote this today and posted it at
> > http://neil.brown.name/blog/20110216044002
> >
> > I thought it might be worth posting it here too...
> 
> 
> Any possibility of getting a hook for a read/write/compare checksum in 
> this?  I'd be happy to commit some time to this if I knew where to begin.

"read/write/compare checksum" is not a lot of words so I may well not be
understanding exactly what you mean, but I guess you are suggesting that we
could store (say) a 64bit hash of each 4K block somewhere.
e.g. Use 513 4K blocks to store 512 4K blocks of data with checksums.
When reading a block, read the checksum too and report an error if they
don't match.  When writing the block, calculate and write the checksum too.
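
Just to make that guess concrete, the address arithmetic for such a
layout would be something like this (purely illustrative):

#include <stdint.h>

/* Purely illustrative arithmetic for that guess: each group of 513 4K
 * blocks holds 512 data blocks followed by one block of 512 8-byte
 * checksums. */
static inline uint64_t data_to_physical(uint64_t data_block)
{
        return (data_block / 512) * 513 + (data_block % 512);
}

static inline uint64_t checksum_block(uint64_t data_block)
{
        return (data_block / 512) * 513 + 512;
}

static inline unsigned int checksum_byte_offset(uint64_t data_block)
{
        return (data_block % 512) * 8;
}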

This is already done by the disk drive - I'm not sure what you hope to gain
by doing it in the RAID layer as well.

Doing it in the filesystem as well does make sense and then you get an
end-to-end checksum which can be useful.  But the RAID layer doesn't really
give you that.

And doing it in the RAID layer is problematic because you cannot commit both
checksum and data at the same time (like you can in the hardware).
In the filesystem you could use whatever journalling mechanism you use to
commit both effectively at the same time.

Doing this in the RAID layer could possibly piggy-back off the write-intent
bitmap so that we re-generate all checksums for all blocks which have a bit
set - which means all blocks that were being written when the power went off.
But that is probably the most likely time for corruption to occur, and it is
the one time when we wouldn't detect it...

So I'm not sure I see real value in doing this in the RAID layer.  But maybe
I misunderstand you, or maybe you can see a better design than me, which
actually works well.

So if you have more words, I'd be keen to read them :-)



> 
> Also, very interested in hooks to do RAID and similar computations in 
> user space (so we can play with functionality without causing problems 
> with a kernel).  Does this external capability exist now?  Would it be 
> hard to include?  Again, something we'd be interested in committing some 
> time to if we could get a shove in the right direction.

Again, I'm not really sure what you are suggesting.  But if you want to
experiment with some other RAID computations, I suggest either experimenting
entirely in userspace, or entirely in the kernel.  Making interfaces between
the two tends to be quite challenging.


Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-16 15:42 ` David Brown
@ 2011-02-16 21:35   ` NeilBrown
  2011-02-16 22:34     ` David Brown
  0 siblings, 1 reply; 52+ messages in thread
From: NeilBrown @ 2011-02-16 21:35 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

On Wed, 16 Feb 2011 16:42:26 +0100 David Brown <david@westcontrol.com> wrote:

> On 16/02/2011 11:27, NeilBrown wrote:
> >
> > I all,
> >   I wrote this today and posted it at
> > http://neil.brown.name/blog/20110216044002
> >
> > I thought it might be worth posting it here too...
> >
> > NeilBrown
> >
> 
> 
> The bad block log will be a huge step up for reliability by making 
> failures fine-grained.  Occasional failures are a serious risk, 
> especially with very large disks.  The bad block log, especially 
> combined with the "hot replace" idea, will make md raid a lot safer 
> because you avoid running the array in degraded mode (except for a few 
> stripes).
> 
> When a block is marked as bad on a disk, is it possible to inform the 
> file system that the whole stripe is considered bad?  Then the 
> filesystem will (I hope) add that stripe to its own bad block list, move 
> the data out to another stripe (or block, from the fs's viewpoint), thus 
> restoring the raid redundancy for that data.

There is no in-kernel mechanism to do this.  You could possibly write a tool
which examined the bad-block-lists exported by md, and told a filesystem
about them.

It might be good to have a feature whereby, when the filesystem requests a
'read', it gets told 'here is the data, but I had trouble getting it so you
should try to save it elsewhere and never write here again'.   If you can
find a filesystem developer interested in using the information I'd be
interested in trying to provide it.


> 
> Can a "hot spare" automatically turn into a "hot replace" based on some 
> criteria (such as a certain number of bad blocks)?  Can the replaced 
> drive then become a "hot spare" again?  It may not be perfect, but it is 
> still better than nothing, and useful if the admin can't replace the 
> drive quickly.

Possibly.  This would be a job for user-space though.  Maybe "mdadm --monitor"
could be given some policy such as you describe.  Then it could activate a
spare as appropriate.

> 
> It strikes me that "hot replace" is much like one of the original disks 
> out of the array and replacing it with a RAID 1 pair using the original 
> disk and a missing second.  The new disk is then added to the pair and 
> they are sync'ed.  Finally, you remove the old disk from the RAID 1 
> pair, then re-assign the drive from the RAID 1 "pair" to the original RAID.

Very much.  However if that process finds an unreadable block, there is
nothing it can do.  By integrating into the parent array, we can easily find
that data from elsewhere.

> 
> I may be missing something, but if I think that using the bad-block list 
> and the non-sync bitmaps, the only thing needed to support hot replace 
> is a way to turn a member drive into a degraded RAID 1 set in an atomic 
> action, and to reverse this action afterwards.  This may also give extra 
> flexibility - it is conceivable that someone would want to keep the RAID 
> 1 set afterwards as a reshape (turning a RAID 5 into a RAID 1+0, for 
> example).

You could do that .... the raid1 resync would need to record bad-blocks in
the new device where badblocks are found in the old device.  Then you need
the parent array to find and reconstruct all those bad blocks.  It would be
do-able.  I'm not sure the complexity of doing it that way is less than the
complexity of directly implementing hot-replace.  But I'll keep it in mind if
the code gets too hairy.

> 
> For your non-sync bitmap, would it make sense to have a two-level 
> bitmap?  Perhaps a coarse bitmap in blocks of 32 MB, with each entry 
> showing a state of in sync, out of sync, partially synced, or never 
> synced.  Partially synced coarse blocks would have their own fine bitmap 
> at the 4K block size (or perhaps a bit bigger - maybe 32K or 64K would 
> fit well with SSD block sizes).  Partially synced and out of sync blocks 
> would be gradually brought into sync when the disks are otherwise free, 
> while never synced blocks would not need to be synced at all.
> 
> This would let you efficiently store the state during initial builds 
> (everything is marked "never synced" until it is used), and rebuilds are 
> done by marking everything as "out of sync" on the new device.  The 
> two-level structure would let you keep fine-grained sync information 
> from file system discards without taking up unreasonable space.

I cannot see that this gains anything.
I need to allocate all the disk space that I might ever need for bitmaps at
the beginning.  There is no sense in which I can allocate some when needed
and free it up later (like there might be in a filesystem).
So whatever granularity I need - the space must be pre-allocated.

Certainly a two-level table might be appropriate for the in-memory copy of
the bitmap.  Maybe even 3 level.  But I think you are talking about storing
data on disk, and I think there - only one bitmap makes sense.

??

NeilBrown


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-16 17:20 ` Joe Landman
@ 2011-02-16 21:36   ` NeilBrown
  0 siblings, 0 replies; 52+ messages in thread
From: NeilBrown @ 2011-02-16 21:36 UTC (permalink / raw)
  To: Joe Landman; +Cc: linux-raid

On Wed, 16 Feb 2011 12:20:32 -0500 Joe Landman <joe.landman@gmail.com> wrote:

> On 02/16/2011 05:27 AM, NeilBrown wrote:
> >
> > I all,
> >   I wrote this today and posted it at
> > http://neil.brown.name/blog/20110216044002
> >
> > I thought it might be worth posting it here too...
> 
> Another request would be an incremental on-demand build of the RAID. 
> That is, when we set up a RAID6, that it only computes the blocks as 
> they are allocated and used.  This helps with things like thin 
> provisioning on remote target devices (among other nice things).
> 

That is exactly what the non-sync bitmap is supposed to do.  On the first
write to a region that is marked as not-in-sync, a resync for just that
region is triggered.

Thanks,
NeilBrown


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-16 21:24   ` NeilBrown
@ 2011-02-16 21:44     ` Roman Mamedov
  2011-02-16 21:59       ` NeilBrown
  2011-02-16 22:12       ` Joe Landman
  0 siblings, 2 replies; 52+ messages in thread
From: Roman Mamedov @ 2011-02-16 21:44 UTC (permalink / raw)
  To: NeilBrown; +Cc: Joe Landman, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1104 bytes --]

On Thu, 17 Feb 2011 08:24:12 +1100
NeilBrown <neilb@suse.de> wrote:

> "read/write/compare checksum" is not a lot of words so I may well not be
> understanding exactly what you mean, but I guess you are suggesting that we
> could store (say) a 64bit hash of each 4K block somewhere.
> e.g. Use 513 4K blocks to store 512 4K blocks of data with checksums.
> When reading a block, read the checksum too and report an error if they
> don't match.  When writing the block, calculate and write the checksum too.
> 
> This is already done by the disk drive - I'm not sure what you hope to gain
> by doing it in the RAID layer as well.

Consider RAID1/RAID10/RAID5/RAID6, where one or more members are returning bad
data for some reason (e.g. are failing or have written garbage to disk during
a sudden power loss). Having per-block checksums would allow to determine
which members have correct data and which do not, and would help the RAID
layer recover from that situation in the smartest way possible (with absolutely
no loss or corruption of the user data).

-- 
With respect,
Roman

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-16 19:37 ` Phil Turmel
@ 2011-02-16 21:44   ` NeilBrown
  2011-02-17  0:11     ` Phil Turmel
  0 siblings, 1 reply; 52+ messages in thread
From: NeilBrown @ 2011-02-16 21:44 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

On Wed, 16 Feb 2011 14:37:26 -0500 Phil Turmel <philip@turmel.org> wrote:

> Hi Neil,
> 
> On 02/16/2011 05:27 AM, NeilBrown wrote:
> > 
> > I all,
> >  I wrote this today and posted it at
> > http://neil.brown.name/blog/20110216044002
> > 
> > I thought it might be worth posting it here too...
> > 
> > NeilBrown
> > 
> > 
> > -------------------------
> > 
> > 
> > It is about 2 years since I last published a road-map[1] for md/raid
> > so I thought it was time for another one.  Unfortunately quite a few
> > things on the previous list remain undone, but there has been some
> > progress.
> > 
> > I think one of the problems with some to-do lists is that they aren't
> > detailed enough.  High-level design, low level design, implementation,
> > and testing are all very different sorts of tasks that seem to require
> > different styles of thinking and so are best done separately.  As
> > writing up a road-map is a high-level design task it makes sense to do
> > the full high-level design at that point so that the tasks are
> > detailed enough to be addressed individually with little reference to
> > the other tasks in the list (except what is explicit in the road map).
> > 
> > A particular need I am finding for this road map is to make explicit
> > the required ordering and interdependence of certain tasks.  Hopefully
> > that will make it easier to address them in an appropriate order, and
> > mean that I waste less time saying "this is too hard, I might go read
> > some email instead".
> > 
> > So the following is a detailed road-map for md raid for the coming
> > months.
> > 
> > [1] http://neil.brown.name/blog/20090129234603
> > 
> > Bad Block Log
> > -------------
> [trim /]
> > Bitmap of non-sync regions.
> > ---------------------------
> [trim /]
> 
> It occurred to me that if you go to the trouble (and space and performance)
> to create and maintain metadata for lists of bad blocks, and separate
> metadata for sync status aka "trim", or hot-replace status, or reshape-status,
> or whatever features are dreamt up later, why not create an infrastructure to
> carry all of it efficiently?
> 
> David Brown suggested a multi-level metadata structure.  I concur, but somewhat
> more generic:
> 	Level 1:  Coarse bitmap, set bit indicates 'look at level 2'
> 	Level 2:  Fine bitmap, set bit indicates 'look at level 3'
> 	Level 3:  Extent list, with starting block, length, and feature payload
> 
> The bitmap levels are purely for hot-path performance.
> 
> As an option, it should be possible to spread the detailed metadata through the
> data area, possibly in chunk-sized areas spread out at some user-defined
> interval.  "meta-span", perhaps.  Then resizing partitions that compose an
> array would be less likely to bump up against metadata size limits.  The coarse
> bitmap should stay near the superblock, of course.

This is starting to sound a lot more like a filesystem than a RAID system.

I really don't want there to be so much metadata that I am tempted to spread
it out among the data.  I think that implies too much complexity.

Maybe that is a good place to draw the line:  If some metadata doesn't fit
easily at the start or end of the devices, it has no place in RAID - you
should add it to a filesystem instead.


> 
> Personally, I'd like to see the bad-block feature actually perform block
> remapping, much like hard drives themselves do, but with the option to unmap the
> block if a later write succeeds.  Using one retry per array restart as you
> described makes a lot of sense.  In any case, remapping would retain redundancy
> where applicable short of full drive failure or remap overflow.

If the hard drives already do this, why should md try to do it as well??
If a hard drive has had so many write errors that it has used up all of its
spare space, then it is long past time to replace it.


> 
> My $0.02, of course.

Here in .au, the smallest legal tender is $0.05 - but thanks anyway :-)

NeilBrown

> 
> Phil


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-16 20:29 ` Piergiorgio Sartor
@ 2011-02-16 21:48   ` NeilBrown
  2011-02-16 22:53     ` Piergiorgio Sartor
  2011-02-17  0:24     ` Phil Turmel
  0 siblings, 2 replies; 52+ messages in thread
From: NeilBrown @ 2011-02-16 21:48 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: linux-raid

On Wed, 16 Feb 2011 21:29:39 +0100 Piergiorgio Sartor
<piergiorgio.sartor@nexgo.de> wrote:

> Hi Neil,
> 
> > I all,
> >  I wrote this today and posted it at
> > http://neil.brown.name/blog/20110216044002
> > 
> > I thought it might be worth posting it here too...
> [...] 
> > So the following is a detailed road-map for md raid for the coming
> > months.
> 
> Question, is this for information purpose or are we
> called to a "brainstorming"?

Primarily for information, but I'm always happy to hear other people's ideas.
Some of them help...
Or maybe it was really a task list for all of you budding programmers out
there ...  I can always hope!

> 
> [...]
> > Hot Replace
> > -----------
> > 
> > "Hot replace" is my name for the process of replacing one device in an
> > array by another one without first failing the one device.  Thus there
> 
> Didn't we named it also "proactive replacement"? :-)

Probably - but too many syllables, so I cannot remember that so well.

> 
> > It is not clear whether the primary should be automatically failed
> > when the rebuild of the secondary completes.  Commonly this would be
> > ideal, but if the secondary experienced any write errors (that were
> > recorded in the bad block log) then it would be best to leave both in
> > place until the sysadmin resolves the situation.   So in the first
> > implementation this failing should not be automatic.
> 
> Maybe putting the primary as "spare", i.e. not failed nor
> working, unless the "migration" was not successful. In that
> case the secondary device should be failed.

Maybe ... but what if both primary and secondary have bad blocks on them?
What do I do then?

> 
> My use case here is disk "rotation" :-). That is, for example, a
> RAID-5/6 with n disks + 1 spare. Each X months/weeks/days/hours
> one disk is pulled out of the array and the spare one takes over.
> The pulled out disk will be the new spare (and powered down, possibly).
> The idea here is to have n disks which will have, after some time,
> different (increasing) power on hours, so to minimize the possibility
> of multiple failures.

Interesting idea.  This could be managed with some user-space tool that
initiates the 'hot-replace' and 'fail' from time to time and keeps track of
ages.


> 
> > Better reporting of inconsistencies.
> > ------------------------------------
> > 
> > When a 'check' finds a data inconsistency it would be useful if it
> > was reported.   That would allow a sysadmin to try to understand the
> > cause and possibly fix it.
> 
> Could you, please, consider to add, for RAID-6, the
> capability to report also which device, potentially,
> has the problem? Thanks!

I would rather leave that to user-space.  If I report where the problem is, a
tool could directly read all the blocks in that stripe and perform any fancy
calculations you like.  I may even write that tool (but no promises).

> 
> bye,
> 

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-16 14:21         ` Roberto Spadim
@ 2011-02-16 21:55           ` NeilBrown
  2011-02-17  1:30             ` Roberto Spadim
  0 siblings, 1 reply; 52+ messages in thread
From: NeilBrown @ 2011-02-16 21:55 UTC (permalink / raw)
  To: Roberto Spadim; +Cc: Giovanni Tessore, linux-raid

On Wed, 16 Feb 2011 11:21:50 -0300 Roberto Spadim <roberto@spadim.com.br>
wrote:

> since we have the option1 done, why continue with raid1 code? could we
> port write-behind to raid10 code?

No.  write-behind depends on write-mostly, and write-mostly only really makes
sense for RAID1.  I much prefer to keep these two code bases separate.

> another thing, could raid10 work without replica? like a raid0?

Why don't you try it?  Choose a layout that asks for only 1 copy of the data.
It should work.

> 
> why? just to remove many files with the same function (raid1and raid0,
> if raid10 do the same work, many some mdadm changes allow us to
> --level=1 to understand that's raid10 without stripe, --level=0 is
> raid10 without mirrors)

Again, RAID0 has some features that RAID10 doesn't and cannot have.  I suggest
you read the man pages (e.g. 'man md') to find out the details.

Also the RAID0 code is much simpler and hence possibly faster.

NeilBrown


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-16 21:44     ` Roman Mamedov
@ 2011-02-16 21:59       ` NeilBrown
  2011-02-17  0:48         ` Phil Turmel
  2011-02-16 22:12       ` Joe Landman
  1 sibling, 1 reply; 52+ messages in thread
From: NeilBrown @ 2011-02-16 21:59 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Joe Landman, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1691 bytes --]

On Thu, 17 Feb 2011 02:44:02 +0500 Roman Mamedov <rm@romanrm.ru> wrote:

> On Thu, 17 Feb 2011 08:24:12 +1100
> NeilBrown <neilb@suse.de> wrote:
> 
> > "read/write/compare checksum" is not a lot of words so I may well not be
> > understanding exactly what you mean, but I guess you are suggesting that we
> > could store (say) a 64bit hash of each 4K block somewhere.
> > e.g. Use 513 4K blocks to store 512 4K blocks of data with checksums.
> > When reading a block, read the checksum too and report an error if they
> > don't match.  When writing the block, calculate and write the checksum too.
> > 
> > This is already done by the disk drive - I'm not sure what you hope to gain
> > by doing it in the RAID layer as well.
> 
> Consider RAID1/RAID10/RAID5/RAID6, where one or more members are returning bad
> data for some reason (e.g. are failing or have written garbage to disk during
> a sudden power loss). Having per-block checksums would allow to determine
> which members have correct data and which do not, and would help the RAID
> layer recover from that situation in the smartest way possible (with absolutely
> no loss or corruption of the user data).
> 

Why do you think that md would be able to reliably write consistent data and
checksum to a device in a circumstance (power failure) where the hard drive
is not able to do it itself?

i.e. I would need to see a clear threat-model which can cause data corruption
that the hard drive itself would not be able to reliably report, but that
checksums provided by md would be able to reliably report.
Powerfail does not qualify (without sophisticated journalling on the part of
md).

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-16 21:44     ` Roman Mamedov
  2011-02-16 21:59       ` NeilBrown
@ 2011-02-16 22:12       ` Joe Landman
  1 sibling, 0 replies; 52+ messages in thread
From: Joe Landman @ 2011-02-16 22:12 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: NeilBrown, linux-raid

On 02/16/2011 04:44 PM, Roman Mamedov wrote:
> On Thu, 17 Feb 2011 08:24:12 +1100
> NeilBrown<neilb@suse.de>  wrote:
>
>> "read/write/compare checksum" is not a lot of words so I may well not be
>> understanding exactly what you mean, but I guess you are suggesting that we
>> could store (say) a 64bit hash of each 4K block somewhere.
>> e.g. Use 513 4K blocks to store 512 4K blocks of data with checksums.
>> When reading a block, read the checksum too and report an error if they
>> don't match.  When writing the block, calculate and write the checksum too.
>>
>> This is already done by the disk drive - I'm not sure what you hope to gain
>> by doing it in the RAID layer as well.
>
> Consider RAID1/RAID10/RAID5/RAID6, where one or more members are returning bad
> data for some reason (e.g. are failing or have written garbage to disk during
> a sudden power loss). Having per-block checksums would allow to determine
> which members have correct data and which do not, and would help the RAID
> layer recover from that situation in the smartest way possible (with absolutely
> no loss or corruption of the user data).

I wasn't specifically thinking about bad data from a power loss, but the 
more general case of something in the pathway causing bad bits to have 
been committed or read back from the storage.  I am after being able to 
detect bad reads (silent corruption) and bad writes (by flushing then 
reading recently written blocks to compare).

Suppose, for example, we have a RAID1, and we read block N.  As a sanity 
check on the data, we can compare the data read from one device to 
another.  This doesn't tell us if the data is correct, just whether or 
not the same data was returned.  So neither the RAID layer nor the disks 
themselves would return an error in the event of the data being silently 
corrupted.  But if we computed a simple checksum and compared it to a 
stored checksum, we could likely detect corruption on read.

Similar to this would be computing/comparing the RAIDn (n>1) checksum on 
every read.  It would cost somewhat more processing power, but I believe 
that in most cases, the disk performance would be the rate limiting process.

It might make more sense to push some of this up to the file system 
layers (ala btrfs), but I am thinking that it would be nice to have some 
elements of this functionality in the RAID layers, that the upper level 
file systems can use as a service.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-16 21:35   ` NeilBrown
@ 2011-02-16 22:34     ` David Brown
  2011-02-16 23:01       ` NeilBrown
  0 siblings, 1 reply; 52+ messages in thread
From: David Brown @ 2011-02-16 22:34 UTC (permalink / raw)
  To: linux-raid

On 16/02/11 22:35, NeilBrown wrote:
> On Wed, 16 Feb 2011 16:42:26 +0100 David Brown<david@westcontrol.com>  wrote:
>
>> On 16/02/2011 11:27, NeilBrown wrote:
>>>
>>> I all,
>>>    I wrote this today and posted it at
>>> http://neil.brown.name/blog/20110216044002
>>>
>>> I thought it might be worth posting it here too...
>>>
>>> NeilBrown
>>>
>>
>>
>> The bad block log will be a huge step up for reliability by making
>> failures fine-grained.  Occasional failures are a serious risk,
>> especially with very large disks.  The bad block log, especially
>> combined with the "hot replace" idea, will make md raid a lot safer
>> because you avoid running the array in degraded mode (except for a few
>> stripes).
>>
>> When a block is marked as bad on a disk, is it possible to inform the
>> file system that the whole stripe is considered bad?  Then the
>> filesystem will (I hope) add that stripe to its own bad block list, move
>> the data out to another stripe (or block, from the fs's viewpoint), thus
>> restoring the raid redundancy for that data.
>
> There is no in-kernel mechanism to do this.  You could possibly write a tool
> which examined the bad-block-lists exported by md, and told a filesystem
> about them.
>
> It might be good to have a feature where by when the filesystem requests a
> 'read', it gets told 'here is the data, but I had trouble getting it so you
> should try to save it elsewhere and never write here again'.   If you can
> find a filesystem developer interested in using the information I'd be
> interested in trying to provide it.
>

I thought there was some mechanism for block devices to report bad 
blocks back to the file system, and that file systems tracked bad block 
lists.  Modern drives automatically relocate bad blocks (at least, they 
do if they can), but there was a time when they did not and it was up to 
the file system to track these.  Whether that still applies to modern 
file systems, I do not know - the only file system I have studied in 
low-level detail is FAT16.

If we were talking about changes to the md layer only, then my idea 
could make sense.  But if every file system needs to be adapted, then it 
would be much less practical (sometimes having lots of choice is a 
disadvantage!).

>
>>
>> Can a "hot spare" automatically turn into a "hot replace" based on some
>> criteria (such as a certain number of bad blocks)?  Can the replaced
>> drive then become a "hot spare" again?  It may not be perfect, but it is
>> still better than nothing, and useful if the admin can't replace the
>> drive quickly.
>
> Possibly.  This would be a job for user-space though.  Maybe "mdadm --monitor"
> could be given some policy such as you describe.  Then it could activate a
> spare as appropriate.
>

Yes, I can see this as a user-space feature.  It might be better 
implemented as a cron job (or an external program called by "mdadm 
--monitor") for flexibility.

>>
>> It strikes me that "hot replace" is much like taking one of the original
>> disks out of the array and replacing it with a RAID 1 pair using the original
>> disk and a missing second.  The new disk is then added to the pair and
>> they are sync'ed.  Finally, you remove the old disk from the RAID 1
>> pair, then re-assign the drive from the RAID 1 "pair" to the original RAID.
>
> Very much.  However if that process finds an unreadable block, there is
> nothing it can do.  By integrating into the parent array, we can easily find
> that data from elsewhere.
>

There is nothing that can be done at the RAID 1 pair level.  At some 
point, the problem blocks need to be marked as not synced at the upper 
raid level - either while still doing the rebuild (which would perhaps 
be the safest) or when the RAID 1 was broken down again and the disk 
re-assigned to the original raid (which would perhaps be the easiest).

>>
>> I may be missing something, but I think that using the bad-block list
>> and the non-sync bitmaps, the only thing needed to support hot replace
>> is a way to turn a member drive into a degraded RAID 1 set in an atomic
>> action, and to reverse this action afterwards.  This may also give extra
>> flexibility - it is conceivable that someone would want to keep the RAID
>> 1 set afterwards as a reshape (turning a RAID 5 into a RAID 1+0, for
>> example).
>
> You could do that .... the raid1 resync would need to record bad-blocks in
> the new device where badblocks are found in the old device.  Then you need
> the parent array to find and reconstruct all those bad blocks.  It would be
> do-able.  I'm not sure the complexity of doing it that way is less than the
> complexity of directly implementing hot-replace.  But I'll keep it in mind if
> the code gets too hairy.
>

It's just an alternative idea.  I haven't thought through the details 
enough - I just think that it might let you re-use existing (or planned) 
features in layers rather than implementing hot replace as a separate 
feature.  But I can see there could be challenges here - keeping track 
of the metadata for bad block lists and sync lists at both levels might 
make it more complex.

>>
>> For your non-sync bitmap, would it make sense to have a two-level
>> bitmap?  Perhaps a coarse bitmap in blocks of 32 MB, with each entry
>> showing a state of in sync, out of sync, partially synced, or never
>> synced.  Partially synced coarse blocks would have their own fine bitmap
>> at the 4K block size (or perhaps a bit bigger - maybe 32K or 64K would
>> fit well with SSD block sizes).  Partially synced and out of sync blocks
>> would be gradually brought into sync when the disks are otherwise free,
>> while never synced blocks would not need to be synced at all.
>>
>> This would let you efficiently store the state during initial builds
>> (everything is marked "never synced" until it is used), and rebuilds are
>> done by marking everything as "out of sync" on the new device.  The
>> two-level structure would let you keep fine-grained sync information
>> from file system discards without taking up unreasonable space.
>
> I cannot see that this gains anything.
> I need to allocate all the disk space that I might ever need for bitmaps at
> the beginning.  There is no sense in which I can allocate some when needed
> and free it up later (like there might be in a filesystem).
> So whatever granularity I need - the space must be pre-allocated.
>
> Certainly a two-level table might be appropriate for the in-memory copy of
> the bitmap.  Maybe even 3 level.  But I think you are talking about storing
> data on disk, and I think there - only one bitmap makes sense.
>

You mean you need to reserve enough disk space for a worst-case 
scenario, so you need the disk space for a full bitmap anyway?  I 
suppose that's true.

For the in-memory copy, such multi-level tables would be more 
appropriate.  32 MB might not sound like much for a modern server, but since 
the non-sync information must be kept for each disk, it will quickly 
become significant for large arrays.
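
For what it's worth, a minimal sketch of the kind of two-level in-memory
structure being discussed might look like this; the 32MB/4K granularities,
the state names and the lazy allocation are illustrative assumptions only:

#include <stdint.h>
#include <stdlib.h>

/*
 * Illustrative two-level in-memory sync map: a coarse entry covers 32MB;
 * a fine bitmap (one bit per 4K block) is only allocated for coarse
 * regions that are partially synced.  Freeing is omitted for brevity.
 */
enum coarse_state { IN_SYNC, OUT_OF_SYNC, NEVER_SYNCED, PARTIAL };

struct coarse_entry {
    enum coarse_state state;
    unsigned long *fine;      /* 8192 bits = 32MB / 4K, allocated lazily */
};

struct sync_map {
    uint64_t nr_coarse;       /* device size / 32MB */
    struct coarse_entry *coarse;
};

/* Mark one 4K block as out of sync, allocating the fine map on demand. */
static int mark_out_of_sync(struct sync_map *m, uint64_t block4k)
{
    struct coarse_entry *c = &m->coarse[block4k / 8192];
    uint64_t bit = block4k % 8192;

    if (!c->fine) {
        c->fine = calloc(8192 / (8 * sizeof(unsigned long)),
                         sizeof(unsigned long));
        if (!c->fine)
            return -1;
        c->state = PARTIAL;
    }
    c->fine[bit / (8 * sizeof(unsigned long))] |=
        1UL << (bit % (8 * sizeof(unsigned long)));
    return 0;
}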

mvh.,

David Brown

> ??
>
> NeilBrown
>



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-16 10:27 md road-map: 2011 NeilBrown
                   ` (5 preceding siblings ...)
  2011-02-16 20:29 ` Piergiorgio Sartor
@ 2011-02-16 22:50 ` Keld Jørn Simonsen
  2011-02-23  5:06 ` Daniel Reurich
  7 siblings, 0 replies; 52+ messages in thread
From: Keld Jørn Simonsen @ 2011-02-16 22:50 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Wed, Feb 16, 2011 at 09:27:51PM +1100, NeilBrown wrote:
> 
> RAID1, RAID10 and RAID456 should all support bad blocks.  Every read
> or write should perform a lookup of the bad block list.  If a read
> finds a bad block, that device should be treated as failed for that
> read.  This includes reads that are part of resync or recovery.
> 
> If a write finds a bad block there are two possible responses.  Either
> the block can be ignored as with reads, or we can try to write the
> data in the hope that it will fix the error.  Always taking the second
> action would seem best as it allows blocks to be removed from the
> bad-block list, but as a failing write can take a long time, there are
> plenty of cases where it would not be good.

I was thinking of a further refinement, namely that if there is a bad
block on one drive, then the corresponding good block from another drive
should be read and written to a bad block recovery area on the
erroneous drive. In that way the erroneous drive would still hold
the complete data. The bad block list would then hold both the bad
block and the corresponding good block in the bad block recovery
area. Given that the number of bad blocks would be small,
this would not really hurt performance.

The bad block recovery area could be handled like other metadata on the
drive. I think this reflects much of what is currently done in most disk
hardware, except that the corresponding good block is copied from another
drive.
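
A minimal sketch of the remap table implied by this proposal, purely for
illustration (the entry count, layout and linear lookup are assumptions):

#include <stdint.h>

/*
 * Each entry pairs a bad sector with its replacement in a reserved
 * recovery area near the metadata.  The list is expected to stay small,
 * so a linear search is used here for clarity.
 */
struct bb_remap {
    uint64_t bad_sector;       /* original (unreadable/unwritable) sector */
    uint64_t recovery_sector;  /* replacement sector in the recovery area */
};

struct bb_remap_table {
    unsigned int count;
    struct bb_remap entry[64];
};

/* Redirect a sector through the remap table before issuing the I/O. */
static uint64_t remap_sector(const struct bb_remap_table *t, uint64_t sector)
{
    unsigned int i;

    for (i = 0; i < t->count; i++)
        if (t->entry[i].bad_sector == sector)
            return t->entry[i].recovery_sector;
    return sector;             /* not remapped: use the original location */
}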

> Support reshape of RAID10 arrays.
> ---------------------------------
> 
> 6/ changing layout to or from 'far' is nearly impossible...
>    With a change in data_offset it might be possible to move one
>    stripe at a time, always into the place just vacated.
>    However keeping track of where we are and where it is safe to read
>    from would be a major headache - unless it falls out with some
>    really neat maths, which I don't think it does.
>    So this option will be left out.

I think this can easily be done for some of the more common cases of
"far", e.g. a 2- or 4-drive raid10 - possibly all layouts involving an
even number of drives. You can just keep, say, one complete copy of the data 
intact and then rewrite the whole other copy in the new layout. 
Please note that there may be two versions of the layouts of "near" and
"far", one looking like a raid 1+0 and one looking like a raid 0+1, giving
distinctly different survival characteristics when more than
one drive fails. In a 4-drive raid10, one layout will have a 66 % chance of
surviving a 2-drive crash, while the other will have a 33 %
chance of surviving 2 disks crashing.

I am not sure this can be generalized to all combinations of drives and
layouts. However, the simple cases are common enough and simple enough
to do to warrant the implementation, IMHO.
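
To make the 66 % / 33 % figures concrete, here is a small enumeration of all
two-drive failures on a four-drive array, modelling the 1+0-style layout as
two mirror pairs and the 0+1-style layout as two striped halves of a mirror
(this models only the two layouts named above):

#include <stdio.h>

/*
 * Drives 0,1 form group A and drives 2,3 form group B.  In a 1+0-style
 * layout each group is a mirror pair, so the array survives unless both
 * failures land in the same group.  In a 0+1-style layout each group is
 * a striped half of a mirror, so the array survives only if both
 * failures land in the same group.
 */
int main(void)
{
    int survive_10 = 0, survive_01 = 0, total = 0;
    int a, b;

    for (a = 0; a < 4; a++) {
        for (b = a + 1; b < 4; b++) {
            int same_group = (a / 2) == (b / 2);

            total++;
            if (!same_group)
                survive_10++;   /* 1+0: each mirror pair keeps one member */
            else
                survive_01++;   /* 0+1: one complete striped half survives */
        }
    }
    printf("raid1+0 survives %d/%d, raid0+1 survives %d/%d\n",
           survive_10, total, survive_01, total);
    return 0;   /* prints 4/6 (~66 %) and 2/6 (~33 %) */
}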

> So the only 'instant' conversion possible is to increase the device
> size for 'near' and 'offset' array.
> 
> 'reshape' conversions can modify chunk size, increase/decrease number of
> devices and swap between 'near' and 'offset' layout providing a
> suitable number of chunks of backup space is available.
> 
> The device-size of a 'far' layout can also be changed by a reshape
> providing the number of devices is not increased.

Given that most configurations of "far" can be reshaped into "near",
the addition of drives should be possible by: reshape far to near,
extend near, reshape near to far.

Other improvements
------------------

I would like to hear if you are considering other improvements:

1.  a layout version of raid10,far and raid10,near that has a better
survival ratio for failure of 2 disks or more. The current layout only
has properties of raid 0+1.

2. better performance of resync etc., by using bigger buffers, say 20 MB.

best regards
keld

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-16 21:48   ` NeilBrown
@ 2011-02-16 22:53     ` Piergiorgio Sartor
  2011-02-17  0:24     ` Phil Turmel
  1 sibling, 0 replies; 52+ messages in thread
From: Piergiorgio Sartor @ 2011-02-16 22:53 UTC (permalink / raw)
  To: NeilBrown; +Cc: Piergiorgio Sartor, linux-raid

> > > when the rebuild of the secondary completes.  Commonly this would be
> > > ideal, but if the secondary experienced any write errors (that were
> > > recorded in the bad block log) then it would be best to leave both in
> > > place until the sysadmin resolves the situation.   So in the first
> > > implementation this failing should not be automatic.
> > 
> > Maybe putting the primary as "spare", i.e. not failed nor
> > working, unless the "migration" was not successful. In that
> > case the secondary device should be failed.
> 
> Maybe ... but what if both primary and secondary have bad blocks on them?
> What do I do then?

IMHO this means migration was not successful, so
you return to the original state, with the
primary disk up and running.

Assuming you realize the secondary has bad blocks,
otherwise I do not think there are any possibilities.
 
> > My use case here is disk "rotation" :-). That is, for example, a
> > RAID-5/6 with n disks + 1 spare. Each X months/weeks/days/hours
> > one disk is pulled out of the array and the spare one takes over.
> > The pulled out disk will be the new spare (and powered down, possibly).
> > The idea here is to have n disks which will have, after some time,
> > different (increasing) power on hours, so to minimize the possibility
> > of multiple failures.
> 
> Interesting idea.  This could be managed with some user-space tool that
> initiates the 'hot-replace' and 'fail' from time to time and keeps track of
> ages.

Exactly, my idea was to have a daemon which, from time to time, perhaps
reading the power-on hours from the SMART information, would remove
the oldest disk, replacing it with the youngest.
There could be other policies, of course.
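
A rough sketch of how such a daemon might read the power-on hours; parsing
smartctl -A output this way is an assumption for illustration (formats vary
by drive and smartmontools version), and the actual swap would still be done
through mdadm:

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read a drive's power-on hours from the Power_On_Hours SMART attribute. */
static long power_on_hours(const char *dev)
{
    char cmd[128], line[256];
    long hours = -1;
    FILE *p;

    snprintf(cmd, sizeof(cmd), "smartctl -A %s", dev);
    p = popen(cmd, "r");
    if (!p)
        return -1;
    while (fgets(line, sizeof(line), p)) {
        if (strstr(line, "Power_On_Hours")) {
            char *last = strrchr(line, ' ');   /* raw value: last field */
            if (last)
                hours = atol(last + 1);
        }
    }
    pclose(p);
    return hours;
}

int main(void)
{
    const char *members[] = { "/dev/sda", "/dev/sdb", "/dev/sdc" }; /* placeholders */
    long oldest = -1;
    int i, idx = -1;

    for (i = 0; i < 3; i++) {
        long h = power_on_hours(members[i]);
        if (h > oldest) {
            oldest = h;
            idx = i;
        }
    }
    if (idx >= 0)
        printf("%s: %ld power-on hours - candidate for rotation\n",
               members[idx], oldest);
    return 0;
}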
 
> > > Better reporting of inconsistencies.
> > > ------------------------------------
> > > 
> > > When a 'check' finds a data inconsistency it would be useful if it
> > > was reported.   That would allow a sysadmin to try to understand the
> > > cause and possibly fix it.
> > 
> > Could you, please, consider to add, for RAID-6, the
> > capability to report also which device, potentially,
> > has the problem? Thanks!
> 
> I would rather leave that to user-space.  If I report where the problem is, a
> tool could directly read all the blocks in that stripe and perform any fancy
> calculations you like.  I may even write that tool (but no promises).

I guess you already have the tool, don't you remember? :-)

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-16 22:34     ` David Brown
@ 2011-02-16 23:01       ` NeilBrown
  2011-02-17  0:30         ` David Brown
  0 siblings, 1 reply; 52+ messages in thread
From: NeilBrown @ 2011-02-16 23:01 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

On Wed, 16 Feb 2011 23:34:43 +0100 David Brown <david.brown@hesbynett.no>
wrote:

> I thought there was some mechanism for block devices to report bad 
> blocks back to the file system, and that file systems tracked bad block 
> lists.  Modern drives automatically relocate bad blocks (at least, they 
> do if they can), but there was a time when they did not and it was up to 
> the file system to track these.  Whether that still applies to modern 
> file systems, I do not know - they only file system I have studied in 
> low-level detail is FAT16.

When the block device reports an error the filesystem can certainly record
that information in a bad-block list, and possibly does.

However I thought you were suggesting a situation where the block device
could succeed with the request, but knew that area of the device was of low
quality.
e.g. IO to a block on a stripe which had one 'bad block'.  The IO should
succeed, but the data isn't as safe as elsewhere.  It would be nice if we
could tell the filesystem that fact, and if it could make use of it. But we
currently cannot.   We can say "success" or "failure", but we cannot say
"success, but you might not be so lucky next time".



NeilBrown

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-16 21:44   ` NeilBrown
@ 2011-02-17  0:11     ` Phil Turmel
  0 siblings, 0 replies; 52+ messages in thread
From: Phil Turmel @ 2011-02-17  0:11 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 02/16/2011 04:44 PM, NeilBrown wrote:
[trim /]
> On Wed, 16 Feb 2011 14:37:26 -0500 Phil Turmel <philip@turmel.org> wrote:
>> It occurred to me that if you go to the trouble (and space and performance)
>> to create and maintain metadata for lists of bad blocks, and separate
>> metadata for sync status aka "trim", or hot-replace status, or reshape-status,
>> or whatever features are dreamt up later, why not create an infrastructure to
>> carry all of it efficiently?
>>
>> David Brown suggested a multi-level metadata structure.  I concur, but somewhat
>> more generic:
>> 	Level 1:  Coarse bitmap, set bit indicates 'look at level 2'
>> 	Level 2:  Fine bitmap, set bit indicates 'look at level 3'
>> 	Level 3:  Extent list, with starting block, length, and feature payload
>>
>> The bitmap levels are purely for hot-path performance.
>>
>> As an option, it should be possible to spread the detailed metadata through the
>> data area, possibly in chunk-sized areas spread out at some user-defined
>> interval.  "meta-span", perhaps.  Then resizing partitions that compose an
>> array would be less likely to bump up against metadata size limits.  The coarse
>> bitmap should stay near the superblock, of course.
> 
> This is starting to sound a lot more like a filesystem than a RAID system.

Heh.  But if you are going to start adding block and/or block-extent metadata for
a variety of features, common code and storage for it should be an all-around win.

> I really don't want there to be so much metadata that I am tempted to spread
> it out among the data.  I think that implies too much complexity.

It would be complex, yes.  Same math as computing block locations within raid 5
stripes, though.

> Maybe that is a good place to draw the line:  If some metadata doesn't fit
> easily at the start of end of the devices, it has no place in RAID - you
> should add it to a filesystem instead.

I think that's arbitrary, but it's moot until someone tries to implement it.

>> Personally, I'd like to see the bad-block feature actually perform block
>> remapping, much like hard drives themselves do, but with the option to unmap the
>> block if a later write succeeds.  Using one retry per array restart as you
>> described makes a lot of sense.  In any case, remapping would retain redundancy
>> where applicable short of full drive failure or remap overflow.
> 
> If the hard drives already do this, why should md try to do it as well??
> If a hard drive has had so many write errors that it has used up all of its
> spare space, then it is long past time to replace it.

True enough.

>> My $0.02, of course.
> 
> Here in .au, the smallest legal tender is $0.05 - but thanks anyway :-)

I guess the offer of "a penny for your thoughts" doesn't work down under ;)

Phil

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-16 21:48   ` NeilBrown
  2011-02-16 22:53     ` Piergiorgio Sartor
@ 2011-02-17  0:24     ` Phil Turmel
  2011-02-17  0:52       ` NeilBrown
  1 sibling, 1 reply; 52+ messages in thread
From: Phil Turmel @ 2011-02-17  0:24 UTC (permalink / raw)
  To: NeilBrown; +Cc: Piergiorgio Sartor, linux-raid

On 02/16/2011 04:48 PM, NeilBrown wrote:
> On Wed, 16 Feb 2011 21:29:39 +0100 Piergiorgio Sartor
>>
>>> Better reporting of inconsistencies.
>>> ------------------------------------
>>>
>>> When a 'check' finds a data inconsistency it would be useful if it
>>> was reported.   That would allow a sysadmin to try to understand the
>>> cause and possibly fix it.
>>
>> Could you, please, consider to add, for RAID-6, the
>> capability to report also which device, potentially,
>> has the problem? Thanks!
> 
> I would rather leave that to user-space.  If I report where the problem is, a
> tool could directly read all the blocks in that stripe and perform any fancy
> calculations you like.  I may even write that tool (but no promises).

Hmmm.  The existing "check" code, if it encounters a read error, will use
available redundancy to recover that data and rewrite it on the spot.

Without a read error, or with multiple redundancy, the calculations to
check consistency are performed and reported.  With all the data "hot", and half
the calculation to pinpoint an inconsistency done, it seems a shame to have
userspace redo it.

Are you adamantly opposed to the kernel doing this?  (For Raid6)  Code talks,
of course, but I'd rather not start if I'm only going to be shot down.

Phil

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-16 23:01       ` NeilBrown
@ 2011-02-17  0:30         ` David Brown
  2011-02-17  0:55           ` NeilBrown
  2011-02-17  1:04           ` Keld Jørn Simonsen
  0 siblings, 2 replies; 52+ messages in thread
From: David Brown @ 2011-02-17  0:30 UTC (permalink / raw)
  To: linux-raid

On 17/02/11 00:01, NeilBrown wrote:
> On Wed, 16 Feb 2011 23:34:43 +0100 David Brown<david.brown@hesbynett.no>
> wrote:
>
>> I thought there was some mechanism for block devices to report bad
>> blocks back to the file system, and that file systems tracked bad block
>> lists.  Modern drives automatically relocate bad blocks (at least, they
>> do if they can), but there was a time when they did not and it was up to
>> the file system to track these.  Whether that still applies to modern
>> file systems, I do not know - they only file system I have studied in
>> low-level detail is FAT16.
>
> When the block device reports an error the filesystem can certainly record
> that information in a bad-block list, and possibly does.
>
> However I thought you were suggesting a situation where the block device
> could succeed with the request, but knew that area of the device was of low
> quality.

I guess that is what I was trying to suggest, though not very clearly.

> e.g. IO to a block on a stripe which had one 'bad block'.  The IO should
> succeed, but the data isn't as safe as elsewhere.  It would be nice if we
> could tell the filesystem that fact, and if it could make use of it. But we
> currently cannot.   We can say "success" or "failure", but we cannot say
> "success, but you might not be so lucky next time".
>

Do filesystems re-try reads when there is a failure?  Could you return 
fail on one read, then success on a re-read, which could be interpreted 
as "dying, but not yet dead" by the file system?


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-16 21:59       ` NeilBrown
@ 2011-02-17  0:48         ` Phil Turmel
  0 siblings, 0 replies; 52+ messages in thread
From: Phil Turmel @ 2011-02-17  0:48 UTC (permalink / raw)
  To: NeilBrown; +Cc: Roman Mamedov, Joe Landman, linux-raid

On 02/16/2011 04:59 PM, NeilBrown wrote:
> On Thu, 17 Feb 2011 02:44:02 +0500 Roman Mamedov <rm@romanrm.ru> wrote:
> 
>> On Thu, 17 Feb 2011 08:24:12 +1100
>> NeilBrown <neilb@suse.de> wrote:
>>
>>> "read/write/compare checksum" is not a lot of words so I may well not be
>>> understanding exactly what you mean, but I guess you are suggesting that we
>>> could store (say) a 64bit hash of each 4K block somewhere.
>>> e.g. Use 513 4K blocks to store 512 4K blocks of data with checksums.
>>> When reading a block, read the checksum too and report an error if they
>>> don't match.  When writing the block, calculate and write the checksum too.
>>>
>>> This is already done by the disk drive - I'm not sure what you hope to gain
>>> by doing it in the RAID layer as well.
>>
>> Consider RAID1/RAID10/RAID5/RAID6, where one or more members are returning bad
>> data for some reason (e.g. are failing or have written garbage to disk during
>> a sudden power loss). Having per-block checksums would allow to determine
>> which members have correct data and which do not, and would help the RAID
>> layer recover from that situation in the smartest way possible (with absolutely
>> no loss or corruption of the user data).
>>
> 
> Why do you think that md would be able to reliably write consistent data and
> checksum to a device in a circumstance (power failure) where the hard drive
> is not able to do it itself?

It wouldn't have to be a power failure.  A kernel panic wouldn't be recoverable,
either.

> i.e. I would need to see a clear threat-model which can cause data corruption
> that the hard drive itself would not be able to reliably report, but that
> checksums provided by md would be able to reliably report.
> Powerfail does not qualify (without sophisticated journalling on the part of
> md).

I agree that the hash itself is insufficient, but I don't think a full journal
is needed either.  If each hash had a timestamp and short sequence number, and
was stored with copies of its siblings' sequence numbers, which data was out of
sync could be worked out.  I admit that quantity of meta-data would be
exorbitant for 512B sectors, but might be acceptable for 4K blocks.  It does
vary with the number of raid devices, though.  I'll have to think about ways to
minimize that.

It would work for any situation where data in an MD member device's queue didn't
make it to the platter, and the platter retained the old data.  Of course, if the
number of devices with stale data in one stripe exceeds the failure tolerance
of the array, it still can't be fixed.  The algorithm could *revert* to old data
if the number of devices with new data was within the failure tolerance.  That
might be valuable.
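
As a sketch of the kind of per-block metadata this implies (field widths, the
fixed sibling count and the wrap-around sequence comparison are illustrative
assumptions only):

#include <stdint.h>

/*
 * Per-4K-block tag: a short hash of the data plus a write sequence
 * number, together with the sequence numbers the writer believed its
 * stripe siblings had at the same time.  A recovery pass could use this
 * to work out which member(s) hold stale data after a crash.
 */
#define MAX_SIBLINGS 6                    /* other members of the stripe */

struct block_tag {
    uint64_t data_hash;                   /* hash of the 4K data block */
    uint32_t seq;                         /* this member's write sequence */
    uint32_t sibling_seq[MAX_SIBLINGS];   /* sequences expected of peers */
};

/* A member is suspected stale if a peer recorded a newer sequence for it. */
static int looks_stale(uint32_t my_seq, uint32_t seq_recorded_by_peer)
{
    return (int32_t)(seq_recorded_by_peer - my_seq) > 0;
}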

Phil

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-17  0:24     ` Phil Turmel
@ 2011-02-17  0:52       ` NeilBrown
  2011-02-17  1:14         ` Phil Turmel
  0 siblings, 1 reply; 52+ messages in thread
From: NeilBrown @ 2011-02-17  0:52 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Piergiorgio Sartor, linux-raid

On Wed, 16 Feb 2011 19:24:15 -0500 Phil Turmel <philip@turmel.org> wrote:

> On 02/16/2011 04:48 PM, NeilBrown wrote:
> > On Wed, 16 Feb 2011 21:29:39 +0100 Piergiorgio Sartor
> >>
> >>> Better reporting of inconsistencies.
> >>> ------------------------------------
> >>>
> >>> When a 'check' finds a data inconsistency it would be useful if it
> >>> was reported.   That would allow a sysadmin to try to understand the
> >>> cause and possibly fix it.
> >>
> >> Could you, please, consider to add, for RAID-6, the
> >> capability to report also which device, potentially,
> >> has the problem? Thanks!
> > 
> > I would rather leave that to user-space.  If I report where the problem is, a
> > tool could directly read all the blocks in that stripe and perform any fancy
> > calculations you like.  I may even write that tool (but no promises).
> 
> Hmmm.  The existing "check" code, if it encounters a read error, will use
> available redundancy to recover that data and rewrite it on the spot.
> 
> Without a read error, or with multiple redundancy, the calculations to
> check consistency are performed and reported.  With all the data "hot", and half
> the calculation to pinpoint an inconsistency done, it seems a shame to have
> userspace redo it.
> 
> Are you adamantly opposed to the kernel doing this?  (For Raid6)  Code talks,
> of course, but I'd rather not start if I'm only going to be shot down.
> 

I like to think I remain open-minded to any compelling arguments.

However putting code into the kernel which *only* tells user-space something
that it could figure out for itself doesn't sound sensible - though it
depends a bit on how much code.

Also - as I understand it - the RAID6 code works on a byte-by-byte basis.
This the P and Q bytes are computed from the N data bytes, and collections of
these bytes form blocks.

The "which block is bad calculation" take the  data bytes and the P and Q
bytes and produces a new byte.  If that byte is < N, it means that just
changing data byte N can make P and Q consistent.  (if it is N, the the P
bytes is bad, if it is N+1 then the Q byte is bad).  If it is >N+1, then
... possibly multiple bytes are bad .. my knowledge gets hazy here.

So when you do the computation on all of the bytes in all of the blocks you
get a block full of answers.
If the answers are all the same - that tells you something fairly strong.
If they are a "all different" then that is also a fairly strong statement.
But what if most are the same, but a few are different?  How do you interpret
that?

The point I'm trying to get to is that the result of this RAID6 calculation
isn't a simple "that device is bad".  It is a block of data that needs to be
interpreted.

I'd rather have user-space do that interpretation, so it may as well do the
calculation too.

If you wanted to do it in the kernel, you would need to be very clear about
what information you provide, what it means exactly, and why it is sufficient.
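
For reference, a user-space sketch of the per-byte calculation described
above, using the same GF(2^8) generator (2) and polynomial (0x11d) as the
kernel raid6 code; this is only an illustration of the maths, not kernel code:

#include <stdint.h>

/* Multiply in GF(2^8) with polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;

    while (b) {
        uint8_t hi = a & 0x80;

        if (b & 1)
            r ^= a;
        a <<= 1;
        if (hi)
            a ^= 0x1d;
        b >>= 1;
    }
    return r;
}

/*
 * Given one byte from each of the n data blocks plus the stored P and Q
 * bytes, return the suspect position: 0..n-1 for a data block, n for P,
 * n+1 for Q, or -1 if this byte position is consistent or cannot be
 * explained by a single bad position.
 */
static int raid6_locate_byte(const uint8_t *d, int n, uint8_t p, uint8_t q)
{
    uint8_t pe = p, qe = q, g = 1, pw = 1;
    int i;

    /* Fold the stored P/Q with the recomputed ones to get the syndromes. */
    for (i = 0; i < n; i++) {
        pe ^= d[i];
        qe ^= gf_mul(g, d[i]);
        g = gf_mul(g, 2);          /* next data block uses the next power of 2 */
    }

    if (pe == 0 && qe == 0)
        return -1;                 /* this byte position is consistent */
    if (pe != 0 && qe == 0)
        return n;                  /* only P disagrees: P is the suspect */
    if (pe == 0)
        return n + 1;              /* only Q disagrees: Q is the suspect */

    /* Both disagree: the suspect data block z satisfies qe == 2^z * pe. */
    for (i = 0; i < n; i++) {
        if (gf_mul(pw, pe) == qe)
            return i;
        pw = gf_mul(pw, 2);
    }
    return -1;                     /* not explainable by a single bad block */
}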

NeilBrown

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-17  0:30         ` David Brown
@ 2011-02-17  0:55           ` NeilBrown
  2011-02-17  1:04           ` Keld Jørn Simonsen
  1 sibling, 0 replies; 52+ messages in thread
From: NeilBrown @ 2011-02-17  0:55 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

On Thu, 17 Feb 2011 01:30:49 +0100 David Brown <david.brown@hesbynett.no>
wrote:

> Do filesystems re-try reads when there is a failure?  Could you return 
> fail on one read, then success on a re-read, which could be interpreted 
> as "dying, but not yet dead" by the file system?
> 

Not normally.  The underlying device is assumed to perform all retries that
are reasonable.  Retrying again at the FS level would be pointless.

It certainly would be possible to return some sort of "data not very safe"
indicator, which a disk drive could set if it needed to retry the read.
However you need to get buy-in from some FS developer before it is worth the
effort.

NeilBrown


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-17  0:30         ` David Brown
  2011-02-17  0:55           ` NeilBrown
@ 2011-02-17  1:04           ` Keld Jørn Simonsen
  2011-02-17 10:45             ` David Brown
  1 sibling, 1 reply; 52+ messages in thread
From: Keld Jørn Simonsen @ 2011-02-17  1:04 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

On Thu, Feb 17, 2011 at 01:30:49AM +0100, David Brown wrote:
> On 17/02/11 00:01, NeilBrown wrote:
> >On Wed, 16 Feb 2011 23:34:43 +0100 David Brown<david.brown@hesbynett.no>
> >wrote:
> >
> >>I thought there was some mechanism for block devices to report bad
> >>blocks back to the file system, and that file systems tracked bad block
> >>lists.  Modern drives automatically relocate bad blocks (at least, they
> >>do if they can), but there was a time when they did not and it was up to
> >>the file system to track these.  Whether that still applies to modern
> >>file systems, I do not know - they only file system I have studied in
> >>low-level detail is FAT16.
> >
> >When the block device reports an error the filesystem can certainly record
> >that information in a bad-block list, and possibly does.
> >
> >However I thought you were suggesting a situation where the block device
> >could succeed with the request, but knew that area of the device was of low
> >quality.
> 
> I guess that is what I was trying to suggest, though not very clearly.
> 
> >e.g. IO to a block on a stripe which had one 'bad block'.  The IO should
> >succeed, but the data isn't as safe as elsewhere.  It would be nice if we
> >could tell the filesystem that fact, and if it could make use of it. But we
> >currently cannot.   We can say "success" or "failure", but we cannot say
> >"success, but you might not be so lucky next time".
> >
> 
> Do filesystems re-try reads when there is a failure?  Could you return 
> fail on one read, then success on a re-read, which could be interpreted 
> as "dying, but not yet dead" by the file system?

This should not be a file system feature. The file system is built upon
the raid, and in mirrored raid types like raid1 and raid10, and also
other raid types, you cannot be sure which specific drive and sector the
data was read from - it could be one out of many (typically two) places.
So the bad blocks of a raid are a feature of the raid and its individual
drives, not of the file system. If it were a property of the file system,
then the fs would have to be aware of the underlying raid topology, and know if
this was a parity block or data block of raid5 or raid6, or which
mirror instance of a raid1/10 type was involved. 

Best regards
keld

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-17  0:52       ` NeilBrown
@ 2011-02-17  1:14         ` Phil Turmel
  2011-02-17  3:10           ` NeilBrown
  2011-02-17 19:56           ` Piergiorgio Sartor
  0 siblings, 2 replies; 52+ messages in thread
From: Phil Turmel @ 2011-02-17  1:14 UTC (permalink / raw)
  To: NeilBrown; +Cc: Piergiorgio Sartor, linux-raid

On 02/16/2011 07:52 PM, NeilBrown wrote:
> On Wed, 16 Feb 2011 19:24:15 -0500 Phil Turmel <philip@turmel.org> wrote:
> 
>> On 02/16/2011 04:48 PM, NeilBrown wrote:
>>> On Wed, 16 Feb 2011 21:29:39 +0100 Piergiorgio Sartor
>>>>
>>>>> Better reporting of inconsistencies.
>>>>> ------------------------------------
>>>>>
>>>>> When a 'check' finds a data inconsistency it would be useful if it
>>>>> was reported.   That would allow a sysadmin to try to understand the
>>>>> cause and possibly fix it.
>>>>
>>>> Could you, please, consider to add, for RAID-6, the
>>>> capability to report also which device, potentially,
>>>> has the problem? Thanks!
>>>
>>> I would rather leave that to user-space.  If I report where the problem is, a
>>> tool could directly read all the blocks in that stripe and perform any fancy
>>> calculations you like.  I may even write that tool (but no promises).
>>
>> Hmmm.  The existing "check" code, if it encounters a read error, will use
>> available redundancy to recover that data and rewrite it on the spot.
>>
>> Without a read error, or with multiple redundancy, the calculations to
>> check consistency are performed and reported.  With all the data "hot", and half
>> the calculation to pinpoint an inconsistency done, it seems a shame to have
>> userspace redo it.
>>
>> Are you adamantly opposed to the kernel doing this?  (For Raid6)  Code talks,
>> of course, but I'd rather not start if I'm only going to be shot down.
>>
> 
> I like to think I remain open-minded to any compelling arguments.
> 
> However putting code into the kernel which *only* tells user-space something
> that it could figure out for itself doesn't sound sensible - though it
> depends a bit on how much code.
> 
> Also - as I understand it - the RAID6 code works on a byte-by-byte basis.
> This the P and Q bytes are computed from the N data bytes, and collections of
> these bytes form blocks.
> 
> The "which block is bad calculation" take the  data bytes and the P and Q
> bytes and produces a new byte.  If that byte is < N, it means that just
> changing data byte N can make P and Q consistent.  (if it is N, the the P
> bytes is bad, if it is N+1 then the Q byte is bad).  If it is >N+1, then
> ... possibly multiple bytes are bad .. my knowledge gets hazy here.
> 
> So when you do the computation on all of the bytes in all of the blocks you
> get a block full of answers.
> If the answers are all the same - that tells you something fairly strong.
> If they are a "all different" then that is also a fairly strong statement.
> But what if most are the same, but a few are different?  How do you interpret
> that?

Actually, I was thinking about that.  (You suckered me into reading that PDF
some weeks ago.)  I would be inclined to allow the kernel to make corrections
where "all the same" covers individual sectors, per the sector size reported
by the underlying device.

Also, the comparison would have to ignore "neutral bytes", where P & Q
happened to be correct for that byte position.

> The point I'm trying to get to is that the result of this RAID6 calculation
> isn't a simple "that device is bad".  It is a block of data that needs to be
> interpreted.
> 
> I'd rather have user-space do that interpretation, so it may as well do the
> calculation too.
> 
> If you wanted to do it in the kernel, you would need to be very clear about
> what information you provide, what it means exactly, and why it is sufficient.

Given that the hardware is going to do error correction and checking at a
sector size granularity, and the kernel would in fact rewrite that sector using
this calculation if the hardware made a "fairly strong" statement that it can't
be trusted, I'd argue that rewriting the sector is appropriate.

Any corrective action that isn't consistent at the sector level should be punted.
I'm very curious what percentage that would be in production environments.

Phil

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-16 21:55           ` NeilBrown
@ 2011-02-17  1:30             ` Roberto Spadim
  0 siblings, 0 replies; 52+ messages in thread
From: Roberto Spadim @ 2011-02-17  1:30 UTC (permalink / raw)
  To: NeilBrown; +Cc: Giovanni Tessore, linux-raid

=] I agree with you on all cases :) the idea of a generic raid10 for
raid1/0/10 isn't good for low cpu/ram (in other words, for increasing
performance).
Now I understand why raid10, raid1 and raid0 are different files while
raid456 is only one.

Let's try option 2:
could we implement layout/offset for raid1?
It's a read performance improvement (maybe a write problem).
For example, odd sectors on the start of disk1 and the end of disk2, even
sectors on the end of disk1 and the start of disk2, or other layouts like
raid10.

2011/2/16 NeilBrown <neilb@suse.de>:
> On Wed, 16 Feb 2011 11:21:50 -0300 Roberto Spadim <roberto@spadim.com.br>
> wrote:
>
>> since we have the option1 done, why continue with raid1 code? could we
>> port write-behind to raid10 code?
>
> No.  write-behind depends on write-mostly, and write-mostly only really makes
> sense for RAID1.  I much prefer to keep these two code bases separate.
>
>> another thing, could raid10 work without replica? like a raid0?
>
> Why don't you try it?  Choose a layout that asks for only 1 copy of the data.
> It should work.
>
>>
>> why? just to remove many files with the same function (raid1and raid0,
>> if raid10 do the same work, many some mdadm changes allow us to
>> --level=1 to understand that's raid10 without stripe, --level=0 is
>> raid10 without mirrors)
>
> Again, RAID0 has some features that RAID10 doesn't and cannot.  I suggest you
> read man pages (e.g. 'man md') to find out the details.
>
> Also the RAID0 code is much simpler and hence possibly faster.
>
> NeilBrown
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-17  1:14         ` Phil Turmel
@ 2011-02-17  3:10           ` NeilBrown
  2011-02-17 18:46             ` Phil Turmel
  2011-02-17 21:04             ` Mr. James W. Laferriere
  2011-02-17 19:56           ` Piergiorgio Sartor
  1 sibling, 2 replies; 52+ messages in thread
From: NeilBrown @ 2011-02-17  3:10 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Piergiorgio Sartor, linux-raid

On Wed, 16 Feb 2011 20:14:50 -0500 Phil Turmel <philip@turmel.org> wrote:

> On 02/16/2011 07:52 PM, NeilBrown wrote:

> > So when you do the computation on all of the bytes in all of the blocks you
> > get a block full of answers.
> > If the answers are all the same - that tells you something fairly strong.
> > If they are a "all different" then that is also a fairly strong statement.
> > But what if most are the same, but a few are different?  How do you interpret
> > that?
> 
> Actually, I was thinking about that.  (You suckered me into reading that PDF
> some weeks ago.)  I would be inclined to allow the kernel to make corrections
> where "all the same" covers individual sectors, per the sector size reported
> by the underlying device.

To see why I am strongly against having the kernel make automatic
corrections like this, see

    http://neil.brown.name/blog/20100211050355


> 
> Also, the comparison would have to ignore "neutral bytes", where P & Q
> happened to be correct for that byte position.
> 
> > The point I'm trying to get to is that the result of this RAID6 calculation
> > isn't a simple "that device is bad".  It is a block of data that needs to be
> > interpreted.
> > 
> > I'd rather have user-space do that interpretation, so it may as well do the
> > calculation too.
> > 
> > If you wanted to do it in the kernel, you would need to be very clear about
> > what information you provide, what it means exactly, and why it is sufficient.
> 
> Given that the hardware is going to do error correction and checking at a
> sector size granularity, and the kernel would in fact rewrite that sector using
> this calculation if the hardware made a "fairly strong" statement that it can't
> be trusted, I'd argue that rewriting the sector is appropriate.

All the RAID6 calculation tells you is that something cannot be trusted.  It
doesn't tell you what.  It could be the controller, the cable, the drive
logic, or the rust on the media.  Without that knowledge, correction can be
dangerous.

NeilBrown



> 
> Any corrective action that isn't consistent at the sector level should be punted.
> I'm very curious what percentage that would be in production environments.
> 
> Phil


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-17  1:04           ` Keld Jørn Simonsen
@ 2011-02-17 10:45             ` David Brown
  2011-02-17 10:58               ` Keld Jørn Simonsen
  0 siblings, 1 reply; 52+ messages in thread
From: David Brown @ 2011-02-17 10:45 UTC (permalink / raw)
  To: linux-raid

On 17/02/2011 02:04, Keld Jørn Simonsen wrote:
> On Thu, Feb 17, 2011 at 01:30:49AM +0100, David Brown wrote:
>> On 17/02/11 00:01, NeilBrown wrote:
>>> On Wed, 16 Feb 2011 23:34:43 +0100 David Brown<david.brown@hesbynett.no>
>>> wrote:
>>>
>>>> I thought there was some mechanism for block devices to report bad
>>>> blocks back to the file system, and that file systems tracked bad block
>>>> lists.  Modern drives automatically relocate bad blocks (at least, they
>>>> do if they can), but there was a time when they did not and it was up to
>>>> the file system to track these.  Whether that still applies to modern
>>>> file systems, I do not know - they only file system I have studied in
>>>> low-level detail is FAT16.
>>>
>>> When the block device reports an error the filesystem can certainly record
>>> that information in a bad-block list, and possibly does.
>>>
>>> However I thought you were suggesting a situation where the block device
>>> could succeed with the request, but knew that area of the device was of low
>>> quality.
>>
>> I guess that is what I was trying to suggest, though not very clearly.
>>
>>> e.g. IO to a block on a stripe which had one 'bad block'.  The IO should
>>> succeed, but the data isn't as safe as elsewhere.  It would be nice if we
>>> could tell the filesystem that fact, and if it could make use of it. But we
>>> currently cannot.   We can say "success" or "failure", but we cannot say
>>> "success, but you might not be so lucky next time".
>>>
>>
>> Do filesystems re-try reads when there is a failure?  Could you return
>> fail on one read, then success on a re-read, which could be interpreted
>> as "dying, but not yet dead" by the file system?
>
> This should not be a file system feature. The file system is built upon
> the raid, and in mirrorred rait types like raid1 and raid10, and also
> other raid types, you cannot be sure which specific drive and sector the
> data was read from - it could be one out of many (typically two) places.
> So the bad blocks of a raid is a feature of the raid and its individual
> drives, not the file system. If it was a property of the file system,
> then the fs should be aware of the underlying raid topology, and know if
> this was a parity block or data block of raid5 or raid6, or which
> mirror instance of a raid1/10 type which  was involved.
>

Thanks for the explanation.

I guess my worry is that if the md layer has tracked a bad block on a disk, 
then that stripe will be in a degraded mode.  It's great that it will 
still work, and it's great that the bad block list means that it is 
/only/ that stripe that is degraded - not the whole raid.

But I'm hoping there can be some sort of relocation somewhere 
(ultimately it doesn't matter if it is handled by the file system, or by 
md for the whole stripe, or by md for just that disk block, or by the 
disk itself), so that you can get raid protection again for that stripe.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-17 10:45             ` David Brown
@ 2011-02-17 10:58               ` Keld Jørn Simonsen
  2011-02-17 11:45                 ` Giovanni Tessore
  0 siblings, 1 reply; 52+ messages in thread
From: Keld Jørn Simonsen @ 2011-02-17 10:58 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

On Thu, Feb 17, 2011 at 11:45:35AM +0100, David Brown wrote:
> On 17/02/2011 02:04, Keld Jørn Simonsen wrote:
> >On Thu, Feb 17, 2011 at 01:30:49AM +0100, David Brown wrote:
> >>On 17/02/11 00:01, NeilBrown wrote:
> >>>On Wed, 16 Feb 2011 23:34:43 +0100 David Brown<david.brown@hesbynett.no>
> >>>wrote:
> >>>
> >>>>I thought there was some mechanism for block devices to report bad
> >>>>blocks back to the file system, and that file systems tracked bad block
> >>>>lists.  Modern drives automatically relocate bad blocks (at least, they
> >>>>do if they can), but there was a time when they did not and it was up to
> >>>>the file system to track these.  Whether that still applies to modern
> >>>>file systems, I do not know - they only file system I have studied in
> >>>>low-level detail is FAT16.
> >>>
> >>>When the block device reports an error the filesystem can certainly 
> >>>record
> >>>that information in a bad-block list, and possibly does.
> >>>
> >>>However I thought you were suggesting a situation where the block device
> >>>could succeed with the request, but knew that area of the device was of 
> >>>low
> >>>quality.
> >>
> >>I guess that is what I was trying to suggest, though not very clearly.
> >>
> >>>e.g. IO to a block on a stripe which had one 'bad block'.  The IO should
> >>>succeed, but the data isn't as safe as elsewhere.  It would be nice if we
> >>>could tell the filesystem that fact, and if it could make use of it. But 
> >>>we
> >>>currently cannot.   We can say "success" or "failure", but we cannot say
> >>>"success, but you might not be so lucky next time".
> >>>
> >>
> >>Do filesystems re-try reads when there is a failure?  Could you return
> >>fail on one read, then success on a re-read, which could be interpreted
> >>as "dying, but not yet dead" by the file system?
> >
> >This should not be a file system feature. The file system is built upon
> >the raid, and in mirrorred raid types like raid1 and raid10, and also
> >other raid types, you cannot be sure which specific drive and sector the
> >data was read from - it could be one out of many (typically two) places.
> >So the bad blocks of a raid is a feature of the raid and its individual
> >drives, not the file system. If it was a property of the file system,
> >then the fs should be aware of the underlying raid topology, and know if
> >this was a parity block or data block of raid5 or raid6, or which
> >mirror instance of a raid1/10 type which  was involved.
> >
> 
> Thanks for the explanation.
> 
> I guess my worry is that if md layer has tracked a bad block on a disk, 
> then that stripe will be in a degraded mode.  It's great that it will 
> still work, and it's great that the bad block list means that it is 
> /only/ that stripe that is degraded - not the whole raid.

I am proposing that the stripe not be degraded, using a recovery area for bad
blocks on the disk, that goes together with the metadata area.

> But I'm hoping there can be some sort of relocation somewhere 
> (ultimately it doesn't matter if it is handled by the file system, or by 
> md for the whole stripe, or by md for just that disk block, or by the 
> disk itself), so that you can get raid protection again for that stripe.

I think we agree in hoping:-)

best regards
keld

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-17 10:58               ` Keld Jørn Simonsen
@ 2011-02-17 11:45                 ` Giovanni Tessore
  2011-02-17 15:44                   ` Keld Jørn Simonsen
  0 siblings, 1 reply; 52+ messages in thread
From: Giovanni Tessore @ 2011-02-17 11:45 UTC (permalink / raw)
  To: linux-raid

On 02/17/2011 11:58 AM, Keld Jørn Simonsen wrote:
> On Thu, Feb 17, 2011 at 11:45:35AM +0100, David Brown wrote:
>> On 17/02/2011 02:04, Keld Jørn Simonsen wrote:
>>> On Thu, Feb 17, 2011 at 01:30:49AM +0100, David Brown wrote:
>>>> On 17/02/11 00:01, NeilBrown wrote:
>>>>> On Wed, 16 Feb 2011 23:34:43 +0100 David Brown<david.brown@hesbynett.no>
>>>>> wrote:
>>>>>
>>>>>> I thought there was some mechanism for block devices to report bad
>>>>>> blocks back to the file system, and that file systems tracked bad block
>>>>>> lists.  Modern drives automatically relocate bad blocks (at least, they
>>>>>> do if they can), but there was a time when they did not and it was up to
>>>>>> the file system to track these.  Whether that still applies to modern
>>>>>> file systems, I do not know - they only file system I have studied in
>>>>>> low-level detail is FAT16.
>>>>> When the block device reports an error the filesystem can certainly
>>>>> record
>>>>> that information in a bad-block list, and possibly does.
>>>>>
>>>>> However I thought you were suggesting a situation where the block device
>>>>> could succeed with the request, but knew that area of the device was of
>>>>> low
>>>>> quality.
>>>> I guess that is what I was trying to suggest, though not very clearly.
>>>>
>>>>> e.g. IO to a block on a stripe which had one 'bad block'.  The IO should
>>>>> succeed, but the data isn't as safe as elsewhere.  It would be nice if we
>>>>> could tell the filesystem that fact, and if it could make use of it. But
>>>>> we
>>>>> currently cannot.   We can say "success" or "failure", but we cannot say
>>>>> "success, but you might not be so lucky next time".
>>>>>
>>>> Do filesystems re-try reads when there is a failure?  Could you return
>>>> fail on one read, then success on a re-read, which could be interpreted
>>>> as "dying, but not yet dead" by the file system?
>>> This should not be a file system feature. The file system is built upon
>>> the raid, and in mirrorred raid types like raid1 and raid10, and also
>>> other raid types, you cannot be sure which specific drive and sector the
>>> data was read from - it could be one out of many (typically two) places.
>>> So the bad blocks of a raid is a feature of the raid and its individual
>>> drives, not the file system. If it was a property of the file system,
>>> then the fs should be aware of the underlying raid topology, and know if
>>> this was a parity block or data block of raid5 or raid6, or which
>>> mirror instance of a raid1/10 type which  was involved.
>>>
>> Thanks for the explanation.
>>
>> I guess my worry is that if md layer has tracked a bad block on a disk,
>> then that stripe will be in a degraded mode.  It's great that it will
>> still work, and it's great that the bad block list means that it is
>> /only/ that stripe that is degraded - not the whole raid.
> I am proposing that the stripe not be degraded, using a recovery area for bad
> blocks on the disk, that goes together with the metadata area.
>
>> But I'm hoping there can be some sort of relocation somewhere
>> (ultimately it doesn't matter if it is handled by the file system, or by
>> md for the whole stripe, or by md for just that disk block, or by the
>> disk itself), so that you can get raid protection again for that stripe.
> I think we agree in hoping:-)

IMHO the point is that this feature (Bad Block Log) is a GREAT feature 
as it helps in keeping track of the health status of the underlying 
disks, and helps A LOT in recovering data from the array when an 
unrecoverable read error occurs (currently the full array goes offline).  Then 
something must be done proactively to repair the situation, as it means 
that a disk of the array has problems and should be replaced.  So, first 
it's worth making a backup of the still-alive array (getting some read 
errors when the bad blocks/stripes are encountered [maybe using ddrescue 
or similar]), then replacing the disk and reconstructing the array; after 
that a fsck on the filesystem may repair the situation.

You may argue that the unrecoverable read errors come from just a very few 
sectors of the disk, and that it's not worth replacing it (personally I would 
replace it even for very few), as there are still many reserved 
sectors for relocation on the disk.  Then a simple solution would just be 
to zero-write the bad blocks in the Bad Block Log (the data is gone 
already): if the write succeeds (the disk uses reserved sectors for 
relocation), the blocks are removed from the log (now they are ok); then 
fsck (hopefully) may repair the filesystem.  At this point there are no 
more md read errors, maybe just filesystem errors (the array is clean, 
the filesystem may not be, but notice that nothing can be done to avoid 
filesystem problems, as there has been data loss; only fsck may help).
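
A minimal sketch of the zero-write step described above; the device path,
block number and 4K block size are placeholders, and since this deliberately
overwrites the affected region it should only ever be run after a backup:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *dev = "/dev/md0";          /* placeholder array device */
    long long bad_block = 123456789LL;     /* placeholder 4K block number */
    size_t blksz = 4096;
    void *buf;
    int fd;

    /* O_DIRECT needs an aligned buffer of zeros. */
    if (posix_memalign(&buf, blksz, blksz))
        return 1;
    memset(buf, 0, blksz);

    fd = open(dev, O_WRONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Overwrite the unreadable block so the drive can reallocate it. */
    if (pwrite(fd, buf, blksz, (off_t)(bad_block * 4096)) != (ssize_t)blksz)
        perror("pwrite");
    close(fd);
    free(buf);
    return 0;
}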

Regards

-- 
Cordiali saluti.
Yours faithfully.

Giovanni Tessore



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-17 11:45                 ` Giovanni Tessore
@ 2011-02-17 15:44                   ` Keld Jørn Simonsen
  2011-02-17 16:22                     ` Roberto Spadim
  2011-02-18  0:13                     ` Giovanni Tessore
  0 siblings, 2 replies; 52+ messages in thread
From: Keld Jørn Simonsen @ 2011-02-17 15:44 UTC (permalink / raw)
  To: Giovanni Tessore; +Cc: linux-raid

On Thu, Feb 17, 2011 at 12:45:42PM +0100, Giovanni Tessore wrote:
> On 02/17/2011 11:58 AM, Keld Jørn Simonsen wrote:
> >On Thu, Feb 17, 2011 at 11:45:35AM +0100, David Brown wrote:
> >>On 17/02/2011 02:04, Keld Jørn Simonsen wrote:
> >>>On Thu, Feb 17, 2011 at 01:30:49AM +0100, David Brown wrote:
> >>>>On 17/02/11 00:01, NeilBrown wrote:
> >>>>>On Wed, 16 Feb 2011 23:34:43 +0100 David 
> >>>>>Brown<david.brown@hesbynett.no>
> >>>>>wrote:
> >>>>>
> >>>>>>I thought there was some mechanism for block devices to report bad
> >>>>>>blocks back to the file system, and that file systems tracked bad 
> >>>>>>block
> >>>>>>lists.  Modern drives automatically relocate bad blocks (at least, 
> >>>>>>they
> >>>>>>do if they can), but there was a time when they did not and it was up 
> >>>>>>to
> >>>>>>the file system to track these.  Whether that still applies to modern
> >>>>>>file systems, I do not know - they only file system I have studied in
> >>>>>>low-level detail is FAT16.
> >>>>>When the block device reports an error the filesystem can certainly
> >>>>>record
> >>>>>that information in a bad-block list, and possibly does.
> >>>>>
> >>>>>However I thought you were suggesting a situation where the block 
> >>>>>device
> >>>>>could succeed with the request, but knew that area of the device was of
> >>>>>low
> >>>>>quality.
> >>>>I guess that is what I was trying to suggest, though not very clearly.
> >>>>
> >>>>>e.g. IO to a block on a stripe which had one 'bad block'.  The IO 
> >>>>>should
> >>>>>succeed, but the data isn't as safe as elsewhere.  It would be nice if 
> >>>>>we
> >>>>>could tell the filesystem that fact, and if it could make use of it. 
> >>>>>But
> >>>>>we
> >>>>>currently cannot.   We can say "success" or "failure", but we cannot 
> >>>>>say
> >>>>>"success, but you might not be so lucky next time".
> >>>>>
> >>>>Do filesystems re-try reads when there is a failure?  Could you return
> >>>>fail on one read, then success on a re-read, which could be interpreted
> >>>>as "dying, but not yet dead" by the file system?
> >>>This should not be a file system feature. The file system is built upon
> >>>the raid, and in mirrorred raid types like raid1 and raid10, and also
> >>>other raid types, you cannot be sure which specific drive and sector the
> >>>data was read from - it could be one out of many (typically two) places.
> >>>So the bad blocks of a raid is a feature of the raid and its individual
> >>>drives, not the file system. If it was a property of the file system,
> >>>then the fs should be aware of the underlying raid topology, and know if
> >>>this was a parity block or data block of raid5 or raid6, or which
> >>>mirror instance of a raid1/10 type which  was involved.
> >>>
> >>Thanks for the explanation.
> >>
> >>I guess my worry is that if md layer has tracked a bad block on a disk,
> >>then that stripe will be in a degraded mode.  It's great that it will
> >>still work, and it's great that the bad block list means that it is
> >>/only/ that stripe that is degraded - not the whole raid.
> >I am proposing that the stripe not be degraded, using a recovery area for 
> >bad
> >blocks on the disk, that goes together with the metadata area.
> >
> >>But I'm hoping there can be some sort of relocation somewhere
> >>(ultimately it doesn't matter if it is handled by the file system, or by
> >>md for the whole stripe, or by md for just that disk block, or by the
> >>disk itself), so that you can get raid protection again for that stripe.
> >I think we agree in hoping:-)
> 
> IMHO the point is that this feature (Bad Block Log) is a GREAT feature 
> as it just helps in keeping track of the health status of the underlying 
> disks, and helps A LOT in recovering data from the array when a 
> unrecoverable read error occurs (now the full array goes offline). Then 
> something must be done proactively to repair the situation, as it means 
> that a disk of the array has problems and should be replaced. So, first 
> it's worth to make a backup of the still alive array (getting some read 
> error when the bad blocks/stripes are encountered [maybe using ddrescue 
> or similar]), then replace the disk, and reconstruct the array; after 
> that a fsck on the filesystem may repair the situation.
> 
> You may argue that the unrecoverable read error come from just very few 
> sector of the disk, and it's not worth to replace it (personally I would 
> replace also on very few ones), as there are still many reserverd 
> sectors for relocation on the disk. Then a simple solution would just be 
> to zero-write the bad blocks in the Bad Block Log (the data is gone 
> already): if the write succedes (disk uses reserved sectors for 
> relocation), the blocks are removed from the log (now they are ok); then 
> fsck (hopefully) may repair the filesystem. At this point there are no 
> more md read erros, maybe just filesystem errors (the array is clean, 
> the filesystem may be not, but notice that nothing can be done to avoid 
> filesystem problems, as there has been a data loss; only fsck may help).

another way around, if the bad block recovery area does not fly with
Neil or the other implementors.

It should be possible to run a periodic check for whether any bad sectors
have occurred in an array. The half-damaged file should then be moved away
from the area with the bad block by copying it and relinking it, and
before relinking it to its proper place the good block corresponding to
the bad block should be marked on the healthy disk drive, so that it is
not allocated again. This action could even be triggered by the event of
detecting the bad block. It would probably mean that there needs to be a
system call to mark a corresponding good block. The whole thing should be
able to run in userland and be somewhat independent of the file system
type, except for the lookup of the corresponding file from a damaged
block.
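
In rough C, the helper I have in mind would look something like the
sketch below. Every function in it is hypothetical - nothing like them
exists today, they are only there to name the steps; on ext2/3/4 the last
step could perhaps be approximated with the filesystem's bad blocks
inode (cf. e2fsck -l).

/* Hypothetical userland helper for the flow above; every function here
 * is a stub that only names a step, no such interfaces exist today. */
#include <stdio.h>

/* 1. reverse-map the bad array sector to a file (filesystem specific,
 *    e.g. debugfs icheck/ncheck on ext*) */
static int find_owner(unsigned long long sector, char *path, size_t len)
{
        (void)sector;
        snprintf(path, len, "/data/some-file");   /* placeholder result */
        return 0;
}

/* 2. copy the file and rename the copy over the original, so the data
 *    that is still readable moves onto freshly allocated blocks */
static int copy_and_relink(const char *path) { (void)path; return 0; }

/* 3. the "system call to mark a corresponding good block": tell the
 *    filesystem never to hand that block out again (on ext* this might
 *    map onto the bad blocks inode, cf. e2fsck -l) */
static int mark_block_unusable(unsigned long long sector)
{
        (void)sector;
        return 0;
}

int main(void)
{
        unsigned long long bad = 123456;   /* reported by the md event */
        char path[4096];

        if (find_owner(bad, path, sizeof(path)) != 0)
                return 0;            /* block is in free space, nothing to move */
        if (copy_and_relink(path) != 0 || mark_block_unusable(bad) != 0)
                return 1;
        return 0;
}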

best regards
Keld

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-17 15:44                   ` Keld Jørn Simonsen
@ 2011-02-17 16:22                     ` Roberto Spadim
  2011-02-18  0:13                     ` Giovanni Tessore
  1 sibling, 0 replies; 52+ messages in thread
From: Roberto Spadim @ 2011-02-17 16:22 UTC (permalink / raw)
  To: Keld Jørn Simonsen; +Cc: Giovanni Tessore, linux-raid

another question, apart from bad blocks, but related to write-behind.
In MySQL NDB I can run many machines in a cluster; on a write, once 2
machines return commit OK, NDB puts all the other machines into
asynchronous write. It's nice, because speed is improved and I still have
one machine of redundancy. Could we implement a different write-behind
method? I was talking about it in another email thread.
Something like:
select which disks must be write-mostly (only read from if all other mirrors have failed)
select which disks MUST be committed (sync)
select which disks MUST be write-behind (async)
select which disks can be automatic (sync/async: if X disks have
committed, these disks automatically become write-behind, and after the
write completes they go back to non-write-behind; I don't see a solution
in userspace, only in kernel space), as sketched below
It would help a mix of RAID1 with slow and fast disks; maybe the
access-time problem can be reduced for hard disks, and RAID1 writes are
no longer bound by the slowest disk.
Note that write-mostly affects read_balance, while write-behind affects
the write path.
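
Something like this as a sketch of the policy (just the idea - no such
option exists in md today, the names are made up):

/* sketch of the proposed per-mirror write policy; nothing like this
 * exists in md today */
#include <stdio.h>

enum wmode { WM_SYNC, WM_BEHIND, WM_AUTO };

/* may the write to mirror 'i' complete asynchronously?  'committed' is
 * how many mirrors have already acknowledged this write */
static int may_write_behind(const enum wmode *mode, int i,
                            int committed, int min_sync)
{
        if (mode[i] == WM_SYNC)
                return 0;
        if (mode[i] == WM_BEHIND)
                return 1;
        return committed >= min_sync;      /* WM_AUTO */
}

int main(void)
{
        enum wmode mirrors[] = { WM_SYNC, WM_AUTO, WM_AUTO, WM_BEHIND };
        int committed = 2, min_sync = 2, i;

        for (i = 0; i < 4; i++)
                printf("mirror %d: %s\n", i,
                       may_write_behind(mirrors, i, committed, min_sync)
                       ? "write-behind" : "sync");
        return 0;
}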

2011/2/17 Keld Jørn Simonsen <keld@keldix.com>:
> On Thu, Feb 17, 2011 at 12:45:42PM +0100, Giovanni Tessore wrote:
>> On 02/17/2011 11:58 AM, Keld Jørn Simonsen wrote:
>> >On Thu, Feb 17, 2011 at 11:45:35AM +0100, David Brown wrote:
>> >>On 17/02/2011 02:04, Keld Jørn Simonsen wrote:
>> >>>On Thu, Feb 17, 2011 at 01:30:49AM +0100, David Brown wrote:
>> >>>>On 17/02/11 00:01, NeilBrown wrote:
>> >>>>>On Wed, 16 Feb 2011 23:34:43 +0100 David
>> >>>>>Brown<david.brown@hesbynett.no>
>> >>>>>wrote:
>> >>>>>
>> >>>>>>I thought there was some mechanism for block devices to report bad
>> >>>>>>blocks back to the file system, and that file systems tracked bad
>> >>>>>>block
>> >>>>>>lists.  Modern drives automatically relocate bad blocks (at least,
>> >>>>>>they
>> >>>>>>do if they can), but there was a time when they did not and it was up
>> >>>>>>to
>> >>>>>>the file system to track these.  Whether that still applies to modern
>> >>>>>>file systems, I do not know - they only file system I have studied in
>> >>>>>>low-level detail is FAT16.
>> >>>>>When the block device reports an error the filesystem can certainly
>> >>>>>record
>> >>>>>that information in a bad-block list, and possibly does.
>> >>>>>
>> >>>>>However I thought you were suggesting a situation where the block
>> >>>>>device
>> >>>>>could succeed with the request, but knew that area of the device was of
>> >>>>>low
>> >>>>>quality.
>> >>>>I guess that is what I was trying to suggest, though not very clearly.
>> >>>>
>> >>>>>e.g. IO to a block on a stripe which had one 'bad block'.  The IO
>> >>>>>should
>> >>>>>succeed, but the data isn't as safe as elsewhere.  It would be nice if
>> >>>>>we
>> >>>>>could tell the filesystem that fact, and if it could make use of it.
>> >>>>>But
>> >>>>>we
>> >>>>>currently cannot.   We can say "success" or "failure", but we cannot
>> >>>>>say
>> >>>>>"success, but you might not be so lucky next time".
>> >>>>>
>> >>>>Do filesystems re-try reads when there is a failure?  Could you return
>> >>>>fail on one read, then success on a re-read, which could be interpreted
>> >>>>as "dying, but not yet dead" by the file system?
>> >>>This should not be a file system feature. The file system is built upon
>> >>>the raid, and in mirrorred raid types like raid1 and raid10, and also
>> >>>other raid types, you cannot be sure which specific drive and sector the
>> >>>data was read from - it could be one out of many (typically two) places.
>> >>>So the bad blocks of a raid is a feature of the raid and its individual
>> >>>drives, not the file system. If it was a property of the file system,
>> >>>then the fs should be aware of the underlying raid topology, and know if
>> >>>this was a parity block or data block of raid5 or raid6, or which
>> >>>mirror instance of a raid1/10 type which  was involved.
>> >>>
>> >>Thanks for the explanation.
>> >>
>> >>I guess my worry is that if md layer has tracked a bad block on a disk,
>> >>then that stripe will be in a degraded mode.  It's great that it will
>> >>still work, and it's great that the bad block list means that it is
>> >>/only/ that stripe that is degraded - not the whole raid.
>> >I am proposing that the stripe not be degraded, using a recovery area for
>> >bad
>> >blocks on the disk, that goes together with the metadata area.
>> >
>> >>But I'm hoping there can be some sort of relocation somewhere
>> >>(ultimately it doesn't matter if it is handled by the file system, or by
>> >>md for the whole stripe, or by md for just that disk block, or by the
>> >>disk itself), so that you can get raid protection again for that stripe.
>> >I think we agree in hoping:-)
>>
>> IMHO the point is that this feature (Bad Block Log) is a GREAT feature
>> as it just helps in keeping track of the health status of the underlying
>> disks, and helps A LOT in recovering data from the array when a
>> unrecoverable read error occurs (now the full array goes offline). Then
>> something must be done proactively to repair the situation, as it means
>> that a disk of the array has problems and should be replaced. So, first
>> it's worth to make a backup of the still alive array (getting some read
>> error when the bad blocks/stripes are encountered [maybe using ddrescue
>> or similar]), then replace the disk, and reconstruct the array; after
>> that a fsck on the filesystem may repair the situation.
>>
>> You may argue that the unrecoverable read error come from just very few
>> sector of the disk, and it's not worth to replace it (personally I would
>> replace also on very few ones), as there are still many reserverd
>> sectors for relocation on the disk. Then a simple solution would just be
>> to zero-write the bad blocks in the Bad Block Log (the data is gone
>> already): if the write succedes (disk uses reserved sectors for
>> relocation), the blocks are removed from the log (now they are ok); then
>> fsck (hopefully) may repair the filesystem. At this point there are no
>> more md read erros, maybe just filesystem errors (the array is clean,
>> the filesystem may be not, but notice that nothing can be done to avoid
>> filesystem problems, as there has been a data loss; only fsck may help).
>
> another way around, if the badblocks recovery area does not fly with
> Neil or other implementors.
>
> It should be possible to run a periodic check of if any bad sectors have
> occurred in an array. Then the half-damaged file should be moved away from
> this area with the bad block by copying it and relinking it, and before
> relinking it to the proper place the good block corresponding to the bad
> block should be marked as a corresponding good block on the healthy disk
> drive, so that it not be allocated again. This action could even be
> triggered by the event of the detection of the bad block. This would
> probably meean that ther need to be a system call to mark a
> corresponding good block. The whole thing should be able to run in
> userland and somewhat independent of the file system type, except for
> the lookup of the corresponding file fram a damaged block.
>
> best regards
> Keld
>
> best regards
> keld
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-17  3:10           ` NeilBrown
@ 2011-02-17 18:46             ` Phil Turmel
  2011-02-17 21:04             ` Mr. James W. Laferriere
  1 sibling, 0 replies; 52+ messages in thread
From: Phil Turmel @ 2011-02-17 18:46 UTC (permalink / raw)
  To: NeilBrown; +Cc: Piergiorgio Sartor, linux-raid

On 02/16/2011 10:10 PM, NeilBrown wrote:
> On Wed, 16 Feb 2011 20:14:50 -0500 Phil Turmel <philip@turmel.org> wrote:
> 
>> On 02/16/2011 07:52 PM, NeilBrown wrote:
> 
>>> So when you do the computation on all of the bytes in all of the blocks you
>>> get a block full of answers.
>>> If the answers are all the same - that tells you something fairly strong.
>>> If they are a "all different" then that is also a fairly strong statement.
>>> But what if most are the same, but a few are different?  How do you interpret
>>> that?
>>
>> Actually, I was thinking about that.  (You suckered me into reading that PDF
>> some weeks ago.)  I would be inclined to allow the kernel to make corrections
>> where "all the same" covers individual sectors, per the sector size reported
>> by the underlying device.
> 
> To see what I am strongly against having the kernel make automatic
> corrections like this, see
> 
>     http://neil.brown.name/blog/20100211050355

I read it, and slept on it, and my gut wants to argue.  But I have no data to
back me up.  I think I'll take a stab at reporting inconsistencies via simple
printk with a sysfs on/off switch.
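
Something along these lines is what I have in mind - a fragment of how it
might sit in drivers/md/raid5.c, not a tested patch, and the parameter
and function names are invented:

/* Sketch: a module parameter gives the on/off switch (it would show up
 * as /sys/module/raid456/parameters/report_inconsistencies when raid456
 * is built as a module). */
#include <linux/module.h>
#include "md.h"

static bool report_inconsistencies;
module_param(report_inconsistencies, bool, 0644);
MODULE_PARM_DESC(report_inconsistencies,
                 "log stripes whose P/Q recomputation does not match what is on disk");

static void report_bad_stripe(mddev_t *mddev, sector_t sector)
{
        if (!report_inconsistencies)
                return;
        pr_warn("md/%s: parity inconsistency at sector %llu\n",
                mdname(mddev), (unsigned long long)sector);
}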

>> Also, the comparison would have to ignore "neutral bytes", where P & Q
>> happened to be correct for that byte position.
>>
>>> The point I'm trying to get to is that the result of this RAID6 calculation
>>> isn't a simple "that device is bad".  It is a block of data that needs to be
>>> interpreted.
>>>
>>> I'd rather have user-space do that interpretation, so it may as well do the
>>> calculation too.
>>>
>>> If you wanted to do it in the kernel, you would need to be very clear about
>>> what information you provide, what it means exactly, and why it is sufficient.
>>
>> Given that the hardware is going to do error correction and checking at a
>> sector size granularity, and the kernel would in fact rewrite that sector using
>> this calculation if the hardware made a "fairly strong" statement that it can't
>> be trusted, I'd argue that rewriting the sector is appropriate.
> 
> All the RAID6 calculation tells you is that something cannot be trusted.  It
> doesn't tell you what.  It could be the controller, the cable, the drive
> logic, or the rust on the media.  Without that knowledge, correction can be
> dangerous.

True, but inconsistent data is also dangerous, as traffic on this list shows.  The
question is, "When is it safer to correct than to leave alone?"  I don't think
there's enough data to answer that, unless you have some pointers to studies that
address it.

Either way, a reporting method is needed, and might give us some numbers to work
with.

Phil

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-17  1:14         ` Phil Turmel
  2011-02-17  3:10           ` NeilBrown
@ 2011-02-17 19:56           ` Piergiorgio Sartor
  1 sibling, 0 replies; 52+ messages in thread
From: Piergiorgio Sartor @ 2011-02-17 19:56 UTC (permalink / raw)
  To: Phil Turmel; +Cc: NeilBrown, Piergiorgio Sartor, linux-raid

> > So when you do the computation on all of the bytes in all of the blocks you
> > get a block full of answers.
> > If the answers are all the same - that tells you something fairly strong.
> > If they are a "all different" then that is also a fairly strong statement.
> > But what if most are the same, but a few are different?  How do you interpret
> > that?
> 
> Actually, I was thinking about that.  (You suckered me into reading that PDF
> some weeks ago.)  I would be inclined to allow the kernel to make corrections
> where "all the same" covers individual sectors, per the sector size reported
> by the underlying device.

I do agree with Neil on this.
User space should collect the data, perform statistics
and give suggestions.
After that there should be a mechanism, at this point in
kernel space, I guess, capable of correcting one single
chunk of one device.
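
Just to make explicit what the "block full of answers" contains: for each
byte position, the difference between stored and recomputed P and Q points
at a single data disk (the standard result from the raid6 paper). A small
user-space toy of the arithmetic, GF(2^8) with generator 2 - only an
illustration, not the kernel's code:

/* toy illustration: for one byte position, pd = P_stored ^ P_recomputed
 * and qd = Q_stored ^ Q_recomputed.  If exactly one data disk is wrong,
 * qd = g^z * pd, so z = log(qd) - log(pd).  Doing this for every byte
 * position of a chunk gives the "block full of answers". */
#include <stdio.h>
#include <stdint.h>

static uint8_t gf_log[256];
static uint8_t gf_exp[510];

static void gf_init(void)          /* GF(2^8), polynomial 0x11d, generator 2 */
{
        int i, x = 1;

        for (i = 0; i < 255; i++) {
                gf_exp[i] = gf_exp[i + 255] = x;
                gf_log[x] = i;
                x <<= 1;
                if (x & 0x100)
                        x ^= 0x11d;
        }
}

/* index of the data disk that must be wrong, or -1 if this byte says
 * nothing (a "neutral byte", or a P/Q-only mismatch) */
static int suspect_data_disk(uint8_t pd, uint8_t qd)
{
        if (pd == 0 || qd == 0)
                return -1;
        return (gf_log[qd] - gf_log[pd] + 255) % 255;
}

int main(void)
{
        uint8_t pd, qd;

        gf_init();
        pd = 0x5a;                          /* pretend data disk 3 flipped a byte */
        qd = gf_exp[gf_log[pd] + 3];
        printf("suspect disk: %d\n", suspect_data_disk(pd, qd));
        return 0;
}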

> Also, the comparison would have to ignore "neutral bytes", where P & Q
> happened to be correct for that byte position.

<shameless advertisement>
Have a look at the patch I submitted to "restripe.c", it
should cover the interesting cases.
Even if more statistics could be applied.
</shameless advertisement>
 
> Given that the hardware is going to do error correction and checking at a
> sector size granularity, and the kernel would in fact rewrite that sector using
> this calculation if the hardware made a "fairly strong" statement that it can't
> be trusted, I'd argue that rewriting the sector is appropriate.

The problem could be in the interface (it happened to me)
and not in the disk. So, there will be no error correction,
at this point, from the device.
 
> Any corrective action that isn't consistent at the sector level should be punted.
> I'm very curious what percentage that would be in production environments.

Yeah, me too.

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-17  3:10           ` NeilBrown
  2011-02-17 18:46             ` Phil Turmel
@ 2011-02-17 21:04             ` Mr. James W. Laferriere
  2011-02-18  1:48               ` NeilBrown
  1 sibling, 1 reply; 52+ messages in thread
From: Mr. James W. Laferriere @ 2011-02-17 21:04 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid maillist

 	Hello Neil ,

On Thu, 17 Feb 2011, NeilBrown wrote:
> On Wed, 16 Feb 2011 20:14:50 -0500 Phil Turmel <philip@turmel.org> wrote:
>> On 02/16/2011 07:52 PM, NeilBrown wrote:
>>> So when you do the computation on all of the bytes in all of the blocks you
>>> get a block full of answers.
>>> If the answers are all the same - that tells you something fairly strong.
>>> If they are a "all different" then that is also a fairly strong statement.
>>> But what if most are the same, but a few are different?  How do you interpret
>>> that?
>>
>> Actually, I was thinking about that.  (You suckered me into reading that PDF
>> some weeks ago.)  I would be inclined to allow the kernel to make corrections
>> where "all the same" covers individual sectors, per the sector size reported
>> by the underlying device.
>
> To see what I am strongly against having the kernel make automatic
> corrections like this, see
>
>    http://neil.brown.name/blog/20100211050355
 	Paraphrasing from the above - mind you, all I did was skim the article -
but this statement from your conclusions leaves me a tad lost:

"... It could even be done entirely in user-space by suspending IO to the 
affected stripe (md already supports that), making the required update, then 
resuming IO. ..."

 	Hopefully the quantity of 'required updating' would be extremely small.
 	Though if not, then we'll start seeing reports of lock-ups at the user
level. Of course the process should be completely kill'able and nice'able in
userspace, as long as there is documentation available to our 'informed admin'
on how to limit the process's possible impact on his primary runtime
service(s), which will probably be file sharing to other servers or
workstations.
 	I am quite sure you have thought through that particular (and many other)
scenarios before making that proposal. Would you please, either here or in the
original document above, tell us a little more about how you would like to
approach the problem of a 'large update'? If a 'large update' should even be
possible - I'm not sure whether it can happen or not.


 	I also find myself agreeing with your 'to conclude the conclusion' :-)
that automatic action should in general be avoided. An aware and informed
admin is the best method.


 	And finally, but not least: thank you. Even though we have our
differences of opinion on other aspects of the MD structure.

 		Tia ,  JimL
-- 
+------------------------------------------------------------------+
| James   W.   Laferriere | System    Techniques | Give me VMS     |
| Network&System Engineer | 3237     Holden Road |  Give me Linux  |
| babydr@baby-dragons.com | Fairbanks, AK. 99709 |   only  on  AXP |
+------------------------------------------------------------------+

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-17 15:44                   ` Keld Jørn Simonsen
  2011-02-17 16:22                     ` Roberto Spadim
@ 2011-02-18  0:13                     ` Giovanni Tessore
  2011-02-18  2:56                       ` Keld Jørn Simonsen
  1 sibling, 1 reply; 52+ messages in thread
From: Giovanni Tessore @ 2011-02-18  0:13 UTC (permalink / raw)
  To: Keld Jørn Simonsen; +Cc: linux-raid

On 02/17/2011 04:44 PM, Keld Jørn Simonsen wrote:
> On Thu, Feb 17, 2011 at 12:45:42PM +0100, Giovanni Tessore wrote:
>> On 02/17/2011 11:58 AM, Keld Jørn Simonsen wrote:
>>> On Thu, Feb 17, 2011 at 11:45:35AM +0100, David Brown wrote:
>>>> On 17/02/2011 02:04, Keld Jørn Simonsen wrote:
>>>>> On Thu, Feb 17, 2011 at 01:30:49AM +0100, David Brown wrote:
>>>>>> On 17/02/11 00:01, NeilBrown wrote:
>>>>>>> On Wed, 16 Feb 2011 23:34:43 +0100 David
>>>>>>> Brown<david.brown@hesbynett.no>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I thought there was some mechanism for block devices to report bad
>>>>>>>> blocks back to the file system, and that file systems tracked bad
>>>>>>>> block
>>>>>>>> lists.  Modern drives automatically relocate bad blocks (at least,
>>>>>>>> they
>>>>>>>> do if they can), but there was a time when they did not and it was up
>>>>>>>> to
>>>>>>>> the file system to track these.  Whether that still applies to modern
>>>>>>>> file systems, I do not know - they only file system I have studied in
>>>>>>>> low-level detail is FAT16.
>>>>>>> When the block device reports an error the filesystem can certainly
>>>>>>> record
>>>>>>> that information in a bad-block list, and possibly does.
>>>>>>>
>>>>>>> However I thought you were suggesting a situation where the block
>>>>>>> device
>>>>>>> could succeed with the request, but knew that area of the device was of
>>>>>>> low
>>>>>>> quality.
>>>>>> I guess that is what I was trying to suggest, though not very clearly.
>>>>>>
>>>>>>> e.g. IO to a block on a stripe which had one 'bad block'.  The IO
>>>>>>> should
>>>>>>> succeed, but the data isn't as safe as elsewhere.  It would be nice if
>>>>>>> we
>>>>>>> could tell the filesystem that fact, and if it could make use of it.
>>>>>>> But
>>>>>>> we
>>>>>>> currently cannot.   We can say "success" or "failure", but we cannot
>>>>>>> say
>>>>>>> "success, but you might not be so lucky next time".
>>>>>>>
>>>>>> Do filesystems re-try reads when there is a failure?  Could you return
>>>>>> fail on one read, then success on a re-read, which could be interpreted
>>>>>> as "dying, but not yet dead" by the file system?
>>>>> This should not be a file system feature. The file system is built upon
>>>>> the raid, and in mirrorred raid types like raid1 and raid10, and also
>>>>> other raid types, you cannot be sure which specific drive and sector the
>>>>> data was read from - it could be one out of many (typically two) places.
>>>>> So the bad blocks of a raid is a feature of the raid and its individual
>>>>> drives, not the file system. If it was a property of the file system,
>>>>> then the fs should be aware of the underlying raid topology, and know if
>>>>> this was a parity block or data block of raid5 or raid6, or which
>>>>> mirror instance of a raid1/10 type which  was involved.
>>>>>
>>>> Thanks for the explanation.
>>>>
>>>> I guess my worry is that if md layer has tracked a bad block on a disk,
>>>> then that stripe will be in a degraded mode.  It's great that it will
>>>> still work, and it's great that the bad block list means that it is
>>>> /only/ that stripe that is degraded - not the whole raid.
>>> I am proposing that the stripe not be degraded, using a recovery area for
>>> bad
>>> blocks on the disk, that goes together with the metadata area.
>>>
>>>> But I'm hoping there can be some sort of relocation somewhere
>>>> (ultimately it doesn't matter if it is handled by the file system, or by
>>>> md for the whole stripe, or by md for just that disk block, or by the
>>>> disk itself), so that you can get raid protection again for that stripe.
>>> I think we agree in hoping:-)
>> IMHO the point is that this feature (Bad Block Log) is a GREAT feature
>> as it just helps in keeping track of the health status of the underlying
>> disks, and helps A LOT in recovering data from the array when a
>> unrecoverable read error occurs (now the full array goes offline). Then
>> something must be done proactively to repair the situation, as it means
>> that a disk of the array has problems and should be replaced. So, first
>> it's worth to make a backup of the still alive array (getting some read
>> error when the bad blocks/stripes are encountered [maybe using ddrescue
>> or similar]), then replace the disk, and reconstruct the array; after
>> that a fsck on the filesystem may repair the situation.
>>
>> You may argue that the unrecoverable read error come from just very few
>> sector of the disk, and it's not worth to replace it (personally I would
>> replace also on very few ones), as there are still many reserverd
>> sectors for relocation on the disk. Then a simple solution would just be
>> to zero-write the bad blocks in the Bad Block Log (the data is gone
>> already): if the write succedes (disk uses reserved sectors for
>> relocation), the blocks are removed from the log (now they are ok); then
>> fsck (hopefully) may repair the filesystem. At this point there are no
>> more md read erros, maybe just filesystem errors (the array is clean,
>> the filesystem may be not, but notice that nothing can be done to avoid
>> filesystem problems, as there has been a data loss; only fsck may help).
> another way around, if the badblocks recovery area does not fly with
> Neil or other implementors.
>
> It should be possible to run a periodic check of if any bad sectors have
> occurred in an array. Then the half-damaged file should be moved away from
> this area with the bad block by copying it and relinking it, and before
> relinking it to the proper place the good block corresponding to the bad
> block should be marked as a corresponding good block on the healthy disk
> drive, so that it not be allocated again. This action could even be
> triggered by the event of the detection of the bad block. This would
> probably meean that ther need to be a system call to mark a
> corresponding good block. The whole thing should be able to run in
> userland and somewhat independent of the file system type, except for
> the lookup of the corresponding file fram a damaged block.

I don't follow this... if a file has some damaged blocks, they are gone; 
moving it elsewhere does not help.

Anyway, this is a task of the filesystem.

md is just a block device (more reliable than a single disk due to some 
level of redundancy), and it should be independent of the kind of file 
system on it (as the file system should be independent of the kind of 
block device it resides on [md, hd, flash, iscsi, ...]).

Then what you suggest should be done for every block device that can 
have bad blocks (that is, every block device). Again, this is a 
filesystem issue. And for which file system type, as there are many?

The Bad Block Log allows md to behave 'like' a real hard disk does with 
its SMART data:
- unreadable blocks/stripes are recorded into the log, as unreadable 
sectors are recorded in SMART data
- unrecoverable read errors are reported to the caller in both cases
- the device still works even if it has unrecoverable read errors, in 
both cases (today the whole md device fails, and this is the problem)
- if a block/stripe is rewritten with success, the block/stripe is 
removed from the Bad Block Log (and the counter of relocated 
blocks/stripes is incremented); just as, when a sector is rewritten with 
success on a disk, the sector is removed from the list of unreadable 
sectors and the counter of relocated sectors is incremented (SMART data)

A filesystem on a disk does not know what the firmware of the disk does 
about sector relocation.
The same applies to the firmware of a hardware (not fake) raid controller.
The same should apply to md. It is transparent to the filesystem.
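
Just to illustrate the bookkeeping such a log needs, here is a tiny 
user-space sketch: a sorted list of bad sector ranges per device, and the 
check a read path would do against it. The encoding is only my 
assumption, not Neil's design.

/* Toy per-device bad block list: sorted, non-overlapping sector ranges. */
#include <stdio.h>
#include <stdint.h>

struct bad_range {
        uint64_t start;    /* first bad sector */
        uint32_t len;      /* number of bad sectors */
};

/* returns 1 if any sector in [sector, sector+len) is listed as bad */
static int is_bad(const struct bad_range *bb, int count,
                  uint64_t sector, uint32_t len)
{
        int lo = 0, hi = count;

        while (lo < hi) {          /* first range ending after 'sector' */
                int mid = (lo + hi) / 2;
                if (bb[mid].start + bb[mid].len <= sector)
                        lo = mid + 1;
                else
                        hi = mid;
        }
        return lo < count && bb[lo].start < sector + len;
}

int main(void)
{
        struct bad_range bb[] = { { 1000, 8 }, { 5000, 16 } };

        /* prints "1 0": the first read overlaps a bad range, the second does not */
        printf("%d %d\n", is_bad(bb, 2, 1004, 4), is_bad(bb, 2, 2000, 8));
        return 0;
}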

IMHO a more interesting issue would be: a write error occurs on a disk 
participating in an already degraded array; failing the disk would fail 
the whole array. What to do? Put the array into read-only mode, still 
allowing read access to the data on it for easy backup? In such a 
situation, what would a hardware raid controller do?

Hm, yes... how do hardware raid controllers behave with uncorrectable 
read errors?
And how do they behave with a write error on a disk of an already degraded array?
I guess md should replicate these behaviours.

... Neil?

Regards.

-- 
Cordiali saluti.
Yours faithfully.

Giovanni Tessore


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-17 21:04             ` Mr. James W. Laferriere
@ 2011-02-18  1:48               ` NeilBrown
  0 siblings, 0 replies; 52+ messages in thread
From: NeilBrown @ 2011-02-18  1:48 UTC (permalink / raw)
  To: Mr. James W. Laferriere; +Cc: linux-raid maillist

On Thu, 17 Feb 2011 12:04:54 -0900 (AKST) "Mr. James W. Laferriere"
<babydr@baby-dragons.com> wrote:

>  	Hello Neil ,
> 
> On Thu, 17 Feb 2011, NeilBrown wrote:
> > On Wed, 16 Feb 2011 20:14:50 -0500 Phil Turmel <philip@turmel.org> wrote:
> >> On 02/16/2011 07:52 PM, NeilBrown wrote:
> >>> So when you do the computation on all of the bytes in all of the blocks you
> >>> get a block full of answers.
> >>> If the answers are all the same - that tells you something fairly strong.
> >>> If they are a "all different" then that is also a fairly strong statement.
> >>> But what if most are the same, but a few are different?  How do you interpret
> >>> that?
> >>
> >> Actually, I was thinking about that.  (You suckered me into reading that PDF
> >> some weeks ago.)  I would be inclined to allow the kernel to make corrections
> >> where "all the same" covers individual sectors, per the sector size reported
> >> by the underlying device.
> >
> > To see what I am strongly against having the kernel make automatic
> > corrections like this, see
> >
> >    http://neil.brown.name/blog/20100211050355
>  	Paraphrasing from the above ,  Mind all I did was skim the article . 
> But this statement from your conclusions leaves me a tad lost .
> 
> "... It could even be done entirely in user-space by suspending IO to the 
> affected stripe (md already supports that), making the required update, then 
> resuming IO. ..."
> 
>  	Hopefully the quantity of 'required update'ng would be extremely small .
>  	Tho if not ,  Then we'll start seeing locked at user level reports .  Of 
> course the process should be completely kill'able & nice'able in userspace as 
> long as there is documantation available to our 'informed admin' on howto limit 
> the processes possible abuses to his primary runtime service(s) .  Which will 
> probably be file sharing to other servers or workstations .
>  	I am quite sure you have thought thru that particular (and many other) 
> scenarios before making that proposal .  Would you please either here OR at the 
> original document above inform us a little more on how you feel you'd like to 
> approach the problem of a 'large update' ?  if a 'large update' should even be 
> possible not even sure of can or not .

If a 'large update' were needed then there is something seriously wrong and
you probably want to take your array offline before all the corruption in it
causes other problems.

I really think this is largely a theoretical issue with little practical
significance, so I'm not interested in putting a lot of thought/planning/effort
into it.

I think logging inconsistencies is important so people can find out what is
going on.
I think having a tool that can help interpret those inconsistencies would
certainly be valuable.
I don't think there is any point going anywhere beyond that until there is
some sort of information available about what sort of inconsistencies
actually happen.


> 
> 
>  	I also find myself agreeing with your 'to conclude the conclusion'-) 
> that automaticity should in 'general' be avoided .  An Aware & informed admin 
> is the best method to be used .
> 
> 
>  	And finally BUT not the least ,  Thank you .  Even tho we have our 
> differences of opinions on other aspects of the MD structure .

And thank you too!

NeilBrown


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-18  0:13                     ` Giovanni Tessore
@ 2011-02-18  2:56                       ` Keld Jørn Simonsen
  2011-02-18  4:27                         ` Roberto Spadim
  2011-02-18  9:47                         ` Giovanni Tessore
  0 siblings, 2 replies; 52+ messages in thread
From: Keld Jørn Simonsen @ 2011-02-18  2:56 UTC (permalink / raw)
  To: Giovanni Tessore; +Cc: Keld Jørn Simonsen, linux-raid

On Fri, Feb 18, 2011 at 01:13:32AM +0100, Giovanni Tessore wrote:
> On 02/17/2011 04:44 PM, Keld Jørn Simonsen wrote:
> >It should be possible to run a periodic check of if any bad sectors have
> >occurred in an array. Then the half-damaged file should be moved away from
> >this area with the bad block by copying it and relinking it, and before
> >relinking it to the proper place the good block corresponding to the bad
> >block should be marked as a corresponding good block on the healthy disk
> >drive, so that it not be allocated again. This action could even be
> >triggered by the event of the detection of the bad block. This would
> >probably meean that ther need to be a system call to mark a
> >corresponding good block. The whole thing should be able to run in
> >userland and somewhat independent of the file system type, except for
> >the lookup of the corresponding file fram a damaged block.
> 
> I don't follow this.. if a file has some damaged blocks, they are gone, 
> moving it elsewhere does not help.

Remember the file is in a RAID. So you can lose one disk drive and your
data is still intact.

> And however, this is a task of the filesystem.

No, it is the task of the raid, as it is the raid that gives the
functionality that you can lose a drive and still have your data intact.
The raid level knows what is lost, what is still good, and where
this stuff is.

If we are operating on the file level, then doing something clever could
be a cooperation between the raid level and the filesystem level, as
described above.


> md is just a block device (more reliable than a single disk due to some 
> level of redundancy), and it should be indipendent from the kind of file 
> system on it (as the file system should be indipendent from the kind of 
> block device it resides on [md, hd, flash, iscsi, ...]).

true

> Then what you suggest should be done for every block device that can 
> have bad blocks (that is, every block device). Again, this is a 
> filesystem issue. And of which file system type, as there are many?

Yes, it is a cooperation between the file system layer and the raid
layer; I propose this be done in userland.

> The Bad Block Log allows md to behave 'like' a read hard disk would do 
> with smart data:
> - unreadable blocks/stripes are recorded into the log, as unreadable 
> sectors are recorder into smart data
> - unrecoverable read errors are reported to the caller for both
> - the device still works if it has unrecoverable read errors for both 
> (now the whole md device fails, this is the problem)
> - if a block/stripe if rewritten with success  the block/stripe is 
> removed from Bad Block Log (and the counter of relocated blocks/stripes 
> is incremented); as if a sector is rewritten with succes on a disk the 
> sector is removed from list of unreadable sector, and the counter of 
> relocated sector is incremented (smart data)

Smart drives also reallocate bad blocks, hiding the errors from the SW
level.

> A filesystem on a disk does not know what the firmware of the disk does 
> about sectors relocation.
> The same applies for a hardware (not fake) raid controller firmware.
> The same should apply for md. It is transparent to the filesystem.

Yes, normally the raid layer and the fs layer are independent.

But you can add better recovery with what I suggest.

> IMHO a more interesting issue whould be: a write error occurs on a disk 
> participating to an already degraded array; failing the disk would fail 
> the whole array. What to do? Put the array into read only mode, still 
> allowing read access to data on it for easy backup? In such situation, 
> what would do a hardware raid controller?
> 
> Hm, yes.... how do behave hardware raid controllers with uncorrectable 
> read errors?
> And how they behave with write error on a disk of an already degraded array?
> I guess md should replicate these behaviours.

I think we should be more intelligent than ordinary HW RAID:-)

Best regards
keld
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-18  2:56                       ` Keld Jørn Simonsen
@ 2011-02-18  4:27                         ` Roberto Spadim
  2011-02-18  9:47                         ` Giovanni Tessore
  1 sibling, 0 replies; 52+ messages in thread
From: Roberto Spadim @ 2011-02-18  4:27 UTC (permalink / raw)
  To: Keld Jørn Simonsen; +Cc: Giovanni Tessore, linux-raid

> I think we should be more intelligent than ordinary HW RAID:-)
That's why SW RAID is better =)
==========
some IDEAS for bad blocks:
==========
The point here is: MD is a virtual hard disk, and must operate like one
hard disk (or SSD, or a mix of SSD+HD).
raid0 = many disks (inside a hard disk we have many platters and many
heads, right? it's something like that, without the control of head
positioning and the SATA interface)
raid1 = something that doesn't exist in hard disks: mirrors (maybe some
disks use it internally as a bad block solution and we don't know, but
mirrors are used today for device redundancy)
raid456 = an ECC or checksum of the 'disk'?! maybe something like that...

The bad block problem/solution:
many disks handle it internally, some disks have online reallocation,
and some report that the block has failed so the filesystem must work
around it (disks report the block as failed because they couldn't
reallocate it or don't have this feature; the filesystem must stop or
write somewhere else, maybe an out-of-space problem at the application
level... in most cases the user must say what to do, or it is reported
to the kernel log or a user space log).

My opinion: since a bad block is a device block problem, md must handle
it (today by marking the mirror as failed, in the near future with bad
block lists).
The filesystem must know that if a bad block exists the device will get
smaller (less space).
Some filesystems know about the bad block problem and try to store the
information in another sector.
(Filesystems should report unused sectors to the device with a TRIM
command, but they don't; today just ext4 with discard, and swap, have
it.)

For sw raid bad blocks...
maybe we need not only a bad block list, but online reallocation (for
the first step a bad block list is good, for a second step online
reallocation).

Types of reallocation (in all cases the md array will get smaller; if
not, mark the smaller mirror as 'badblocked'):
1) device realloc: reallocate the device block just on the bad mirror
2) md mirror realloc: if one mirror has a bad block, md will use
another mirror to read that sector (maybe the first option is better,
but when the first fails we use this option and mark the md array as
'in sync with bad blocks'; maybe a per-mirror flag about bad blocks is a
nice feature, and maybe a per-mirror exported bad block list is nice too,
plus at the md level a list of virtual bad blocks (all mirrors with the
same sector bad, like a single hard disk without a mirror))

Note that we must implement this layout in all raid levels (bad block
reallocation is a dynamic layout).

Note that for reallocation we should use a TRIM-like command (here our
friend said that bad blocks are a filesystem problem; I don't see it as
a filesystem problem, but as a device and filesystem problem, since an
SSD can use non-allocated sectors to optimize the speed and lifetime of
NAND cells).

The TRIM information tells us whether a sector with an all-zero value is
in use or not.
When is it in use? When we write to the sector.
When is it not in use? At array startup, or when the filesystem sends a
TRIM command to an MD device and MD marks the block as not in use. Note
that MD must not send TRIM commands for data blocks; it can send TRIM for
parity devices (raid456). Not in use = 0 in the TRIM bit + zeroes in the
sector bytes.

How to implement the TRIM tracking?
Internally we need a 0/1 bit value that tells us whether the block is in
use or not. The problem: for a file system that needs a block size of
4096 bytes, md would need 4096 bytes + 1 bit.
The first solution (a filesystem problem): use a 4095-byte block size for
the filesystem and use the remaining byte for the TRIM information.
The second solution (an md problem): group many bits in one block.
4096 bytes = 32768 bits, so one 4096-byte TRIM block tracks 32768 data
blocks; after every 32768 blocks we have a TRIM block.
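
In C, the mapping from a data block to its bit would be something like
this (just a sketch of the arithmetic):

/* sketch: where does the "in use" bit of data block N live?
 * 4096-byte blocks, so one bitmap block covers 4096*8 = 32768 data blocks */
#include <stdio.h>
#include <stdint.h>

#define BLOCK_SIZE      4096u
#define BITS_PER_BLOCK  (BLOCK_SIZE * 8)      /* 32768 */

struct trim_bit {
        uint64_t bitmap_block;   /* which on-disk bitmap block */
        uint32_t byte;           /* byte inside that block */
        uint8_t  mask;           /* bit inside that byte */
};

static struct trim_bit locate(uint64_t data_block)
{
        struct trim_bit t;
        uint32_t bit = data_block % BITS_PER_BLOCK;

        t.bitmap_block = data_block / BITS_PER_BLOCK;
        t.byte = bit / 8;
        t.mask = 1u << (bit % 8);
        return t;
}

int main(void)
{
        struct trim_bit t = locate(100000);

        printf("bitmap block %llu, byte %u, mask 0x%02x\n",
               (unsigned long long)t.bitmap_block, t.byte, t.mask);
        return 0;
}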

Note that the TRIM block (bit) can itself be in a bad block, hehehe =P
Check these for more ideas:
http://en.wikipedia.org/wiki/TRIM_%28SSD_command%29
http://t13.org/Documents/UploadedDocuments/docs2008/e07154r6-Data_Set_Management_Proposal_for_ATA-ACS2.doc

When should the filesystem know that a block has a problem (a bad block)?
When all disks have the bad block and it can't be reallocated - in other
words, when the device gets smaller (less space).



2011/2/18 Keld Jørn Simonsen <keld@keldix.com>:
> On Fri, Feb 18, 2011 at 01:13:32AM +0100, Giovanni Tessore wrote:
>> On 02/17/2011 04:44 PM, Keld Jørn Simonsen wrote:
>> >It should be possible to run a periodic check of if any bad sectors have
>> >occurred in an array. Then the half-damaged file should be moved away from
>> >this area with the bad block by copying it and relinking it, and before
>> >relinking it to the proper place the good block corresponding to the bad
>> >block should be marked as a corresponding good block on the healthy disk
>> >drive, so that it not be allocated again. This action could even be
>> >triggered by the event of the detection of the bad block. This would
>> >probably meean that ther need to be a system call to mark a
>> >corresponding good block. The whole thing should be able to run in
>> >userland and somewhat independent of the file system type, except for
>> >the lookup of the corresponding file fram a damaged block.
>>
>> I don't follow this.. if a file has some damaged blocks, they are gone,
>> moving it elsewhere does not help.
>
> Remember the file is in a RAID. So you can lose one disk drive and your
> data is still intact.
>
>> And however, this is a task of the filesystem.
>
> No, it is the task of the raid, as it is the raid that gives the
> functionality that you can lose a drive and still have your data intact.
> the raid level knows what is lost, and  what is still good, and where
> this stuff is.
>
> If we are then operating on the file level, then doing something clever could
> be a cooperation between the raid leven ald the filesystem level, as
> described above.
>
>
>> md is just a block device (more reliable than a single disk due to some
>> level of redundancy), and it should be indipendent from the kind of file
>> system on it (as the file system should be indipendent from the kind of
>> block device it resides on [md, hd, flash, iscsi, ...]).
>
> true
>
>> Then what you suggest should be done for every block device that can
>> have bad blocks (that is, every block device). Again, this is a
>> filesystem issue. And of which file system type, as there are many?
>
> yes, it is a cooperation between the file system layer, and the raid
> layer, I propose this be done in userland.
>
>> The Bad Block Log allows md to behave 'like' a read hard disk would do
>> with smart data:
>> - unreadable blocks/stripes are recorded into the log, as unreadable
>> sectors are recorder into smart data
>> - unrecoverable read errors are reported to the caller for both
>> - the device still works if it has unrecoverable read errors for both
>> (now the whole md device fails, this is the problem)
>> - if a block/stripe if rewritten with success  the block/stripe is
>> removed from Bad Block Log (and the counter of relocated blocks/stripes
>> is incremented); as if a sector is rewritten with succes on a disk the
>> sector is removed from list of unreadable sector, and the counter of
>> relocated sector is incremented (smart data)
>
> Smart drives also reallocate bad blocks, hiding the errors from the SW
> level.
>
>> A filesystem on a disk does not know what the firmware of the disk does
>> about sectors relocation.
>> The same applies for a hardware (not fake) raid controller firmware.
>> The same should apply for md. It is transparent to the filesystem.
>
> Yes, normally the raid layer and the fs layer are independent.
>
> But you can add better recovery with what I suggest.
>
>> IMHO a more interesting issue whould be: a write error occurs on a disk
>> participating to an already degraded array; failing the disk would fail
>> the whole array. What to do? Put the array into read only mode, still
>> allowing read access to data on it for easy backup? In such situation,
>> what would do a hardware raid controller?
>>
>> Hm, yes.... how do behave hardware raid controllers with uncorrectable
>> read errors?
>> And how they behave with write error on a disk of an already degraded array?
>> I guess md should replicate these behaviours.
>
> I think we should be more intelligent than ordinary HW RAID:-)
>
> Best regards
> keld
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-18  2:56                       ` Keld Jørn Simonsen
  2011-02-18  4:27                         ` Roberto Spadim
@ 2011-02-18  9:47                         ` Giovanni Tessore
  2011-02-18 18:43                           ` Keld Jørn Simonsen
  1 sibling, 1 reply; 52+ messages in thread
From: Giovanni Tessore @ 2011-02-18  9:47 UTC (permalink / raw)
  To: Keld Jørn Simonsen; +Cc: linux-raid

On 02/18/2011 03:56 AM, Keld Jørn Simonsen wrote:
> On Fri, Feb 18, 2011 at 01:13:32AM +0100, Giovanni Tessore wrote:
>> On 02/17/2011 04:44 PM, Keld Jørn Simonsen wrote:
>>> It should be possible to run a periodic check of if any bad sectors have
>>> occurred in an array. Then the half-damaged file should be moved away from
>>> this area with the bad block by copying it and relinking it, and before
>>> relinking it to the proper place the good block corresponding to the bad
>>> block should be marked as a corresponding good block on the healthy disk
>>> drive, so that it not be allocated again. This action could even be
>>> triggered by the event of the detection of the bad block. This would
>>> probably meean that ther need to be a system call to mark a
>>> corresponding good block. The whole thing should be able to run in
>>> userland and somewhat independent of the file system type, except for
>>> the lookup of the corresponding file fram a damaged block.
>> I don't follow this.. if a file has some damaged blocks, they are gone,
>> moving it elsewhere does not help.
> Remember the file is in a RAID. So you can lose one disk drive and your
> data is still intact.
>
>> And however, this is a task of the filesystem.
> No, it is the task of the raid, as it is the raid that gives the
> functionality that you can lose a drive and still have your data intact.
> the raid level knows what is lost, and  what is still good, and where
> this stuff is.
>
> If we are then operating on the file level, then doing something clever could
> be a cooperation between the raid leven ald the filesystem level, as
> described above.

Raid of course has this functionality, but at block level; it's agnostic 
of the filesystem on it (there may be no filesystem at all actually, as 
for raid over raid); it does not know the word 'file'.

Raid adds SOME level of redundancy, not infinite redundancy. If the 
underlying hardware has damaged sectors beyond the redundancy level of 
the raid configuration, data in the stripe is lost, and the hardware 
probably should be replaced.

Unrecoverable read errors FROM MD (those addressed by the Bad Block Log 
feature) only appear when this redundancy level is not enough; for example:
- raid 1 in degraded mode with only 1 disk active, read error on the 
remaining disk
- raid 5 in degraded mode, read error on one of the active disks
- raid 6 in degraded mode missing 2 disks, read error on one of the 
active disks
- raid 5, read error on the same sector on more than 1 disk
- raid 6, read error on the same sector on more than 2 disks
- etc ...

In these situations nothing can be done, neither at the md level nor at 
the filesystem level: data on the block/stripe is lost.

Remember that the Bad Block Log keeps track of the blocks/stripes that 
gave an unrecoverable read error at the md level. It has nothing to do 
with the unreadable sector list of the underlying disks: if raid gets a 
read error from a disk, it tries to reconstruct the data from the other 
disks and to rewrite the sector; if it succeeds, all is OK for md (it 
just increments the counter of corrected read errors, which is 
persistent for 1.x superblocks); otherwise there is a write error, and 
the disk is marked as failed.


>
>> md is just a block device (more reliable than a single disk due to some
>> level of redundancy), and it should be indipendent from the kind of file
>> system on it (as the file system should be indipendent from the kind of
>> block device it resides on [md, hd, flash, iscsi, ...]).
> true
>
>> Then what you suggest should be done for every block device that can
>> have bad blocks (that is, every block device). Again, this is a
>> filesystem issue. And of which file system type, as there are many?
> yes, it is a cooperation between the file system layer, and the raid
> layer, I propose this be done in userland.
>
>> The Bad Block Log allows md to behave 'like' a read hard disk would do
>> with smart data:
>> - unreadable blocks/stripes are recorded into the log, as unreadable
>> sectors are recorder into smart data
>> - unrecoverable read errors are reported to the caller for both
>> - the device still works if it has unrecoverable read errors for both
>> (now the whole md device fails, this is the problem)
>> - if a block/stripe if rewritten with success  the block/stripe is
>> removed from Bad Block Log (and the counter of relocated blocks/stripes
>> is incremented); as if a sector is rewritten with succes on a disk the
>> sector is removed from list of unreadable sector, and the counter of
>> relocated sector is incremented (smart data)
> Smart drives also reallocate bad blocks, hiding the errors from the SW
> level.

And that is the only natural place where this operation should be done. 
Suppose you get an unrecoverable read error from md on a block. It means 
that some sector on one (or more) of the underlying disks gave a read 
error. If you try to rewrite the md block, the sectors are rewritten to 
the underlying disks, so either:
- all disks write correctly because they could solve the problem (it's a 
matter of their firmware, maybe relocating the sector to a reserved 
area): block relocated, all OK;
- some disks give an error on write (no more space for relocations, or 
other hw problems): then the disk(s) is (are) marked failed, and must be 
replaced.
There is no need for reserved blocks anywhere other than those of the 
underlying disks.
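
As a rough illustration of that rewrite step, this is more or less what 
an admin tool would do once the bad block log reports an offset (only a 
sketch: a real tool would take the offsets from the log and would not 
blindly zero data that might still be recoverable from elsewhere):

/* sketch: push a zeroed block back through the md device at the bad
 * offset and let the member disks sort it out.  If every member accepts
 * the write (or relocates the sector itself), the block is good again;
 * if a member rejects the write, md marks that disk failed. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        void *buf;
        int fd;

        if (argc != 3) {
                fprintf(stderr, "usage: %s /dev/mdX byte-offset (4K aligned)\n",
                        argv[0]);
                return 1;
        }
        if (posix_memalign(&buf, 4096, 4096))
                return 1;
        memset(buf, 0, 4096);

        fd = open(argv[1], O_WRONLY | O_DIRECT);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (pwrite(fd, buf, 4096, atoll(argv[2])) != 4096)
                perror("pwrite");
        close(fd);
        return 0;
}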

Having reserved relocatable blocks at the raid level would be useful to 
address another situation: uncorrectable errors on write. But this is 
another story.

>> A filesystem on a disk does not know what the firmware of the disk does
>> about sectors relocation.
>> The same applies for a hardware (not fake) raid controller firmware.
>> The same should apply for md. It is transparent to the filesystem.
> Yes, normally the raid layer and the fs layer are independent.
>
> But you can add better recovery with what I suggest.
>
>> IMHO a more interesting issue whould be: a write error occurs on a disk
>> participating to an already degraded array; failing the disk would fail
>> the whole array. What to do? Put the array into read only mode, still
>> allowing read access to data on it for easy backup? In such situation,
>> what would do a hardware raid controller?
>>
>> Hm, yes.... how do behave hardware raid controllers with uncorrectable
>> read errors?
>> And how they behave with write error on a disk of an already degraded array?
>> I guess md should replicate these behaviours.
> I think we should be more intelligent than ordinary HW RAID:-)

I think it would be a good thing if the software raid had the same 
features and reliability as those mission-critical hw controllers ;-)

Regards

-- 
Cordiali saluti.
Yours faithfully.

Giovanni Tessore


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-18  9:47                         ` Giovanni Tessore
@ 2011-02-18 18:43                           ` Keld Jørn Simonsen
  2011-02-18 19:00                             ` Roberto Spadim
  0 siblings, 1 reply; 52+ messages in thread
From: Keld Jørn Simonsen @ 2011-02-18 18:43 UTC (permalink / raw)
  To: Giovanni Tessore; +Cc: Keld Jørn Simonsen, linux-raid

On Fri, Feb 18, 2011 at 10:47:28AM +0100, Giovanni Tessore wrote:
> On 02/18/2011 03:56 AM, Keld Jørn Simonsen wrote:
> >On Fri, Feb 18, 2011 at 01:13:32AM +0100, Giovanni Tessore wrote:
> >>On 02/17/2011 04:44 PM, Keld Jørn Simonsen wrote:
> >>>It should be possible to run a periodic check of if any bad sectors have
> >>>occurred in an array. Then the half-damaged file should be moved away 
> >>>from
> >>>this area with the bad block by copying it and relinking it, and before
> >>>relinking it to the proper place the good block corresponding to the bad
> >>>block should be marked as a corresponding good block on the healthy disk
> >>>drive, so that it not be allocated again. This action could even be
> >>>triggered by the event of the detection of the bad block. This would
> >>>probably meean that ther need to be a system call to mark a
> >>>corresponding good block. The whole thing should be able to run in
> >>>userland and somewhat independent of the file system type, except for
> >>>the lookup of the corresponding file fram a damaged block.
> >>I don't follow this.. if a file has some damaged blocks, they are gone,
> >>moving it elsewhere does not help.
> >Remember the file is in a RAID. So you can lose one disk drive and your
> >data is still intact.
> >
> >>And however, this is a task of the filesystem.
> >No, it is the task of the raid, as it is the raid that gives the
> >functionality that you can lose a drive and still have your data intact.
> >the raid level knows what is lost, and  what is still good, and where
> >this stuff is.
> >
> >If we are then operating on the file level, then doing something clever 
> >could
> >be a cooperation between the raid leven ald the filesystem level, as
> >described above.
> 
> Raid of course has this functionality, but at block level; it's agnostic 
> of the filesystem on it (there may be no filesystem at all actually, as 
> for raid over raid); it does not know the word 'file'.

true

> Raid adds SOME level of redundancy, not infinite. If the underlying 
> hardware has damaged sectors over the redundancy level of the raid 
> configuration, data in the stripe is lost; and the hardware probably 
> should be replaced.
> 
> Unrecoverable read errors FROM MD (those addressed by Bad Block Log 
> feature) only appear when this redudancy level is not enough; for example:
> - raid 1 in degraded mode with only 1 disk active, read error on the 
> remaning disk
> - raid 5 in degraded mode, read error on one of the active disks
> - raid 6 in degraded mode missing 2 disks, read error on one of the 
> active disks
> - raid 5, read error on the same sector on more than 1 disk
> - raid 6, read error on the same sector on more than 2 disks
> - etc ...
> 
> in this situation nothing can be done neither at md level, nor at 
> filesytem level: data on the block/stripe is lost.

true too.

My idea was to do something when the MD RAID shifts into the degraded
states listed above - not when the MD RAID is already in one of the states
listed above and gets yet another error.

> 
> Remeber that the Bad Block Log keeps track of the block/stripes who gave 
> this unrecoverable read error at md level. It has nothing to do with the 
> unreadable sector list of the underlying disks: if raid gets a read 
> error from a disk, it tries to reconstruct data from the other disks, 
> and to rewrite the sector; if it succedes, all is ok for md (it just 
> increments the counter of corrected read errors, which is persistent for 
> superblock > 1.x); otherwise there is a write error, and the disk is 
> marked as failed.

Yes, this is current behaviour. 

I propose that this be changed, in conjunction with a bad block raid
feature. Presumably the write (or read) error will become registered in
a new bad block log. And a report email, or some such notification of the
event, will be generated to the administrator, reporting the error on the
disk as a read or write error, at a specific disk drive and a specific
block.

I would then like a program in userland that, from the specified
information, looks up the semi-damaged file in the file system,
tries to copy the file, and then sets a flag on the healthy blocks
related to the newly identified bad block, in the bad block logs of the
healthy drives, so that an error would be generated if the block is
attempted to be used again.

Or alternatively, I would like a realloc of the bad block on the damaged
drive, given that an area of the RAID metadata is set aside for bad block
realloc (in a manner similar to what is done in much disk drive HW).
I think I prefer the latter solution.



> 
> >
> >>md is just a block device (more reliable than a single disk due to some
> >>level of redundancy), and it should be indipendent from the kind of file
> >>system on it (as the file system should be indipendent from the kind of
> >>block device it resides on [md, hd, flash, iscsi, ...]).
> >true
> >
> >>Then what you suggest should be done for every block device that can
> >>have bad blocks (that is, every block device). Again, this is a
> >>filesystem issue. And of which file system type, as there are many?
> >yes, it is a cooperation between the file system layer, and the raid
> >layer, I propose this be done in userland.
> >
> >>The Bad Block Log allows md to behave 'like' a read hard disk would do
> >>with smart data:
> >>- unreadable blocks/stripes are recorded into the log, as unreadable
> >>sectors are recorder into smart data
> >>- unrecoverable read errors are reported to the caller for both
> >>- the device still works if it has unrecoverable read errors for both
> >>(now the whole md device fails, this is the problem)
> >>- if a block/stripe if rewritten with success  the block/stripe is
> >>removed from Bad Block Log (and the counter of relocated blocks/stripes
> >>is incremented); as if a sector is rewritten with succes on a disk the
> >>sector is removed from list of unreadable sector, and the counter of
> >>relocated sector is incremented (smart data)
> >Smart drives also reallocate bad blocks, hiding the errors from the SW
> >level.
> 
> And that is the only natural place where this operation should be done. 
> Suppose you got a unrecoverable read error from md on a block. It means 
> that some sector on one (or more) of the underlying disks gave a read 
> error. If you try to rewrite the md block, the sectors are rewritten to 
> the underlying disk, so either:
> - all disks write correctly because they could solve the prolem (its a 
> matter of their firmware, maybe relocating the sector on reserved area): 
> block relocated, all OK.
> - some disks give an error on write (no more space for relocatable 
> errors, or other hw problems): then the disk(s) is(are) marked failed, 
> and must be replaced.
> There is no need for reserved blocks anywhere else than those of the 
> underlying disks.
> 
> Having reserved relocable blocks at raid level would be usefull to 
> address another situation: uncorrectable errors on write. But this is 
> another story.

I agree.

> >>A filesystem on a disk does not know what the firmware of the disk does
> >>about sectors relocation.
> >>The same applies for a hardware (not fake) raid controller firmware.
> >>The same should apply for md. It is transparent to the filesystem.
> >Yes, normally the raid layer and the fs layer are independent.
> >
> >But you can add better recovery with what I suggest.
> >
> >>IMHO a more interesting issue whould be: a write error occurs on a disk
> >>participating to an already degraded array; failing the disk would fail
> >>the whole array. What to do? Put the array into read only mode, still
> >>allowing read access to data on it for easy backup? In such situation,
> >>what would do a hardware raid controller?
> >>
> >>Hm, yes.... how do behave hardware raid controllers with uncorrectable
> >>read errors?
> >>And how they behave with write error on a disk of an already degraded 
> >>array?
> >>I guess md should replicate these behaviours.
> >I think we should be more intelligent than ordinary HW RAID:-)
> 
> I think it is a good point if the software raid had the same features 
> and reliability of those mission critical hw controllers ;-)

yes we can hope for such implementation.

Best regards
keld
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-18 18:43                           ` Keld Jørn Simonsen
@ 2011-02-18 19:00                             ` Roberto Spadim
  2011-02-18 19:18                               ` Keld Jørn Simonsen
  0 siblings, 1 reply; 52+ messages in thread
From: Roberto Spadim @ 2011-02-18 19:00 UTC (permalink / raw)
  To: Keld Jørn Simonsen; +Cc: Giovanni Tessore, linux-raid

again... for realloc we need TRIM command or reserved sectors just for
bad block realloc, TRIM command tell MD what sector isn´t in use, at
WRITE command MD set the sector as inuse, at array creation md set
sector as inuse too. this will only work with ext4 and swap, others
filesystem don´t have TRIM. the solution of others filesystem are
based on not used block, but it´s a internal logic of each filesystem.
i don´t know what is best, TRIM command is nice (we can send TRIM to
disks, this help to make their life bigger) a bad block is a disk
getting smaller and smaller, the disk can realloc badblock. if it
cant, filesystem should realloc it (it have more information about
logic device, it shouldn´t, TRIM command is the information that disk
should have to discart blocks, not a filesystem logic, but... it´s a
option, filesystem can realloc)

2011/2/18 Keld Jørn Simonsen <keld@keldix.com>:
> On Fri, Feb 18, 2011 at 10:47:28AM +0100, Giovanni Tessore wrote:
>> On 02/18/2011 03:56 AM, Keld Jørn Simonsen wrote:
>> >On Fri, Feb 18, 2011 at 01:13:32AM +0100, Giovanni Tessore wrote:
>> >>On 02/17/2011 04:44 PM, Keld Jørn Simonsen wrote:
>> >>>It should be possible to run a periodic check of if any bad sectors have
>> >>>occurred in an array. Then the half-damaged file should be moved away
>> >>>from
>> >>>this area with the bad block by copying it and relinking it, and before
>> >>>relinking it to the proper place the good block corresponding to the bad
>> >>>block should be marked as a corresponding good block on the healthy disk
>> >>>drive, so that it not be allocated again. This action could even be
>> >>>triggered by the event of the detection of the bad block. This would
>> >>>probably meean that ther need to be a system call to mark a
>> >>>corresponding good block. The whole thing should be able to run in
>> >>>userland and somewhat independent of the file system type, except for
>> >>>the lookup of the corresponding file fram a damaged block.
>> >>I don't follow this.. if a file has some damaged blocks, they are gone,
>> >>moving it elsewhere does not help.
>> >Remember the file is in a RAID. So you can lose one disk drive and your
>> >data is still intact.
>> >
>> >>And however, this is a task of the filesystem.
>> >No, it is the task of the raid, as it is the raid that gives the
>> >functionality that you can lose a drive and still have your data intact.
>> >the raid level knows what is lost, and  what is still good, and where
>> >this stuff is.
>> >
>> >If we are then operating on the file level, then doing something clever
>> >could
>> >be a cooperation between the raid leven ald the filesystem level, as
>> >described above.
>>
>> Raid of course has this functionality, but at block level; it's agnostic
>> of the filesystem on it (there may be no filesystem at all actually, as
>> for raid over raid); it does not know the word 'file'.
>
> true
>
>> Raid adds SOME level of redundancy, not infinite. If the underlying
>> hardware has damaged sectors over the redundancy level of the raid
>> configuration, data in the stripe is lost; and the hardware probably
>> should be replaced.
>>
>> Unrecoverable read errors FROM MD (those addressed by Bad Block Log
>> feature) only appear when this redudancy level is not enough; for example:
>> - raid 1 in degraded mode with only 1 disk active, read error on the
>> remaning disk
>> - raid 5 in degraded mode, read error on one of the active disks
>> - raid 6 in degraded mode missing 2 disks, read error on one of the
>> active disks
>> - raid 5, read error on the same sector on more than 1 disk
>> - raid 6, read error on the same sector on more than 2 disks
>> - etc ...
>>
>> in this situation nothing can be done neither at md level, nor at
>> filesytem level: data on the block/stripe is lost.
>
> true too.
>
> My idea was to do something when the MD RAID shifts into the degraded
> states listed above. Not when the MD RAID is in the stats listed above,
> and getting yet another error.
>
>>
>> Remeber that the Bad Block Log keeps track of the block/stripes who gave
>> this unrecoverable read error at md level. It has nothing to do with the
>> unreadable sector list of the underlying disks: if raid gets a read
>> error from a disk, it tries to reconstruct data from the other disks,
>> and to rewrite the sector; if it succedes, all is ok for md (it just
>> increments the counter of corrected read errors, which is persistent for
>> superblock > 1.x); otherwise there is a write error, and the disk is
>> marked as failed.
>
> Yes, this is current behaviour.
>
> I propose that this be changed, in conjunctio with a badblock raid
> feature. Supposedly the write (or read) error wil become registered with
> a new badblock log. And there will be generated a report email to the
> administrator or some such with notification of the event, repoting the
> errpr on the disk as a read or write error, at a specific disk drive and
> a specific block.
>
> I would then like a program in userland that from the specified
> information looks up the semi-damaged file in the file system,
> tries to copy the file, and then sets a flag on other healthy blocks
> related the the newly identified badblock for the related badblogs logs
> for the healthy drives, so that it would generate an error if the block
> is attempetd to be used again.
>
> Or alternatively, I would like reallloc of the badblock in the damaged
> drive, given that there be set aside an area of the RAID metadata
> foor badblock realloc (in a manner similar to what is done for many disk
> drive HW. I think I prefer the latter solution.
>
>
>
>>
>> >
>> >>md is just a block device (more reliable than a single disk due to some
>> >>level of redundancy), and it should be indipendent from the kind of file
>> >>system on it (as the file system should be indipendent from the kind of
>> >>block device it resides on [md, hd, flash, iscsi, ...]).
>> >true
>> >
>> >>Then what you suggest should be done for every block device that can
>> >>have bad blocks (that is, every block device). Again, this is a
>> >>filesystem issue. And of which file system type, as there are many?
>> >yes, it is a cooperation between the file system layer, and the raid
>> >layer, I propose this be done in userland.
>> >
>> >>The Bad Block Log allows md to behave 'like' a read hard disk would do
>> >>with smart data:
>> >>- unreadable blocks/stripes are recorded into the log, as unreadable
>> >>sectors are recorder into smart data
>> >>- unrecoverable read errors are reported to the caller for both
>> >>- the device still works if it has unrecoverable read errors for both
>> >>(now the whole md device fails, this is the problem)
>> >>- if a block/stripe if rewritten with success  the block/stripe is
>> >>removed from Bad Block Log (and the counter of relocated blocks/stripes
>> >>is incremented); as if a sector is rewritten with succes on a disk the
>> >>sector is removed from list of unreadable sector, and the counter of
>> >>relocated sector is incremented (smart data)
>> >Smart drives also reallocate bad blocks, hiding the errors from the SW
>> >level.
>>
>> And that is the only natural place where this operation should be done.
>> Suppose you got a unrecoverable read error from md on a block. It means
>> that some sector on one (or more) of the underlying disks gave a read
>> error. If you try to rewrite the md block, the sectors are rewritten to
>> the underlying disk, so either:
>> - all disks write correctly because they could solve the prolem (its a
>> matter of their firmware, maybe relocating the sector on reserved area):
>> block relocated, all OK.
>> - some disks give an error on write (no more space for relocatable
>> errors, or other hw problems): then the disk(s) is(are) marked failed,
>> and must be replaced.
>> There is no need for reserved blocks anywhere else than those of the
>> underlying disks.
>>
>> Having reserved relocable blocks at raid level would be usefull to
>> address another situation: uncorrectable errors on write. But this is
>> another story.
>
> I agree.
>
>> >>A filesystem on a disk does not know what the firmware of the disk does
>> >>about sectors relocation.
>> >>The same applies for a hardware (not fake) raid controller firmware.
>> >>The same should apply for md. It is transparent to the filesystem.
>> >Yes, normally the raid layer and the fs layer are independent.
>> >
>> >But you can add better recovery with what I suggest.
>> >
>> >>IMHO a more interesting issue whould be: a write error occurs on a disk
>> >>participating to an already degraded array; failing the disk would fail
>> >>the whole array. What to do? Put the array into read only mode, still
>> >>allowing read access to data on it for easy backup? In such situation,
>> >>what would do a hardware raid controller?
>> >>
>> >>Hm, yes.... how do behave hardware raid controllers with uncorrectable
>> >>read errors?
>> >>And how they behave with write error on a disk of an already degraded
>> >>array?
>> >>I guess md should replicate these behaviours.
>> >I think we should be more intelligent than ordinary HW RAID:-)
>>
>> I think it is a good point if the software raid had the same features
>> and reliability of those mission critical hw controllers ;-)
>
> yes we can hope for such implementation.
>
> Best regards
> keld
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-18 19:00                             ` Roberto Spadim
@ 2011-02-18 19:18                               ` Keld Jørn Simonsen
  2011-02-18 19:22                                 ` Roberto Spadim
  0 siblings, 1 reply; 52+ messages in thread
From: Keld Jørn Simonsen @ 2011-02-18 19:18 UTC (permalink / raw)
  To: Roberto Spadim; +Cc: Keld Jørn Simonsen, Giovanni Tessore, linux-raid

On Fri, Feb 18, 2011 at 05:00:27PM -0200, Roberto Spadim wrote:
> again... for realloc we need TRIM command or reserved sectors just for
> bad block realloc, TRIM command tell MD what sector isn?t in use, at
> WRITE command MD set the sector as inuse, at array creation md set
> sector as inuse too. this will only work with ext4 and swap, others
> filesystem don?t have TRIM. the solution of others filesystem are
> based on not used block, but it?s a internal logic of each filesystem.
> i don?t know what is best, TRIM command is nice (we can send TRIM to
> disks, this help to make their life bigger) a bad block is a disk
> getting smaller and smaller, the disk can realloc badblock. if it
> cant, filesystem should realloc it (it have more information about
> logic device, it shouldn?t, TRIM command is the information that disk
> should have to discart blocks, not a filesystem logic, but... it?s a
> option, filesystem can realloc)

I think I prefer a realloc area in the raid metadata area. And the
metadata area could be contaning a not-too-small realloc area, with an
option of enlarging the realloc area at a later time. This could be done
by shrinking the related file system, and then adding the freed space to
the realloc area in the raid metadata. 

Some MBs set aside for this would not be noticeable in todays TB disks
regime. I think current disk hardware allows relocation of under 1000
blocks a 512 byte = under 512 kB. So no problem sizewise.
Performance may be a bigger problem.  Maybe some binary searcheable list
built at MD RAID assembly time.

best regards
keld

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-18 19:18                               ` Keld Jørn Simonsen
@ 2011-02-18 19:22                                 ` Roberto Spadim
  0 siblings, 0 replies; 52+ messages in thread
From: Roberto Spadim @ 2011-02-18 19:22 UTC (permalink / raw)
  To: Keld Jørn Simonsen; +Cc: Giovanni Tessore, linux-raid

yeah, disk with badblock = disk with dynamic layout
i think with badblock we could port layout to all raid systems
(including raid1 hehe, i like raid1)

the area for badblock, we could clena then at startup with: write
000000, send TRIM command to sectors used by badblock area, this will
help disks with internal realloc functions (faster than md software)

2011/2/18 Keld Jørn Simonsen <keld@keldix.com>:
> On Fri, Feb 18, 2011 at 05:00:27PM -0200, Roberto Spadim wrote:
>> again... for realloc we need TRIM command or reserved sectors just for
>> bad block realloc, TRIM command tell MD what sector isn?t in use, at
>> WRITE command MD set the sector as inuse, at array creation md set
>> sector as inuse too. this will only work with ext4 and swap, others
>> filesystem don?t have TRIM. the solution of others filesystem are
>> based on not used block, but it?s a internal logic of each filesystem.
>> i don?t know what is best, TRIM command is nice (we can send TRIM to
>> disks, this help to make their life bigger) a bad block is a disk
>> getting smaller and smaller, the disk can realloc badblock. if it
>> cant, filesystem should realloc it (it have more information about
>> logic device, it shouldn?t, TRIM command is the information that disk
>> should have to discart blocks, not a filesystem logic, but... it?s a
>> option, filesystem can realloc)
>
> I think I prefer a realloc area in the raid metadata area. And the
> metadata area could be contaning a not-too-small realloc area, with an
> option of enlarging the realloc area at a later time. This could be done
> by shrinking the related file system, and then adding the freed space to
> the realloc area in the raid metadata.
>
> Some MBs set aside for this would not be noticeable in todays TB disks
> regime. I think current disk hardware allows relocation of under 1000
> blocks a 512 byte = under 512 kB. So no problem sizewise.
> Performance may be a bigger problem.  Maybe some binary searcheable list
> built at MD RAID assembly time.
>
> best regards
> keld
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: md road-map: 2011
  2011-02-16 10:27 md road-map: 2011 NeilBrown
                   ` (6 preceding siblings ...)
  2011-02-16 22:50 ` Keld Jørn Simonsen
@ 2011-02-23  5:06 ` Daniel Reurich
  7 siblings, 0 replies; 52+ messages in thread
From: Daniel Reurich @ 2011-02-23  5:06 UTC (permalink / raw)
  To: linux-raid, Neil Brown

On Wed, 2011-02-16 at 21:27 +1100, NeilBrown wrote:

> Bitmap of non-sync regions.
> ---------------------------

> The granularity of the bit is probably quite hard to get right.
> Having it match the block size would mean that no resync would be
> needed and that every discard request could be handled exactly.
> However it could result in a very large bitmap - 30 Megabytes for a 1
> terabyte device with a 4K block size.  This would need to be kept in
> memory and looked up for every access, which could be problematic.
> 
Why not store the map as a list of regions defined by:
<start address><finish address>.  This may provide a better performance
vs (storage+memory) cost implementation when compared with a bitmap
which has a granularity vs storage problem.

It may well be more efficient to store a range list then a bitmap and
makes granularity a non-issue as granularity will be at blocksize.   The
limitation with this scheme is in choosing the size of the map, and the
larger the map the more regions that can be stored before no longer
being able to add new discards or splits (due to a write somewhere in
the middle of a non-sync region).  However this could be handled to
retain the best performance by ensuring that the largest non-sync
regions are always in the list

If we used full LBA48 addressing we could count on for each entry
in the map to 12bytes (2x48bit).  (Perhaps this could be reduced for
smaller devices that need less address bits.)  This would mean 85.3
entries per kB, or 87381.33 per Mb of map size on disk (excluding
possible headers).  In the case of a 1Tb raid volume a 1Mb map provide
roughly 1 entry for every 13Mb of disk space.  This sounds coarse but
when you consider you are setting regions based in units of the media's
block size it's not.  Furthermore once the filesystem is that fragmented
that you've exhausted the map space, the unhandled non-sync|discarded
regions would be so small that you'd gain little benefit from it.  A bit
of logic could ensure that large regions take precedence over smaller
regions, as this will provide the best performance for resync/check
passes.

Another benefit is that it makes it easy for md to be-able to pass
TRIM instructions down to media that support this feature whenever a
region/stripe is marked as non-sync.  In the case of raid levels
0 and linear there would be no need for a map and TRIM could be passed
through to the media.  For Raid1,10 a TRIM would be issued to the media
whenever a chunk is contained entirely within a non-sync region. With
raid456, a TRIM would only be issued when a whole stripe is contained
within a non-sync region.

The real beauty of this region map is that creation of a new raid volume
could (unless --assume-clean is set) mark the entire volume as non-sync
with a single entry in the list.

Of course this suggestion is only theoretical, and I might be way off on
the implementation cost vs benefits and feasability.

Regards,
-- 
Daniel Reurich.

Centurion Computer Technology (2005) Ltd
Mobile 021 797 722





^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2011-02-23  5:06 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-02-16 10:27 md road-map: 2011 NeilBrown
2011-02-16 11:28 ` Giovanni Tessore
2011-02-16 13:40   ` Roberto Spadim
2011-02-16 14:00     ` Robin Hill
2011-02-16 14:09       ` Roberto Spadim
2011-02-16 14:21         ` Roberto Spadim
2011-02-16 21:55           ` NeilBrown
2011-02-17  1:30             ` Roberto Spadim
2011-02-16 14:13 ` Joe Landman
2011-02-16 21:24   ` NeilBrown
2011-02-16 21:44     ` Roman Mamedov
2011-02-16 21:59       ` NeilBrown
2011-02-17  0:48         ` Phil Turmel
2011-02-16 22:12       ` Joe Landman
2011-02-16 15:42 ` David Brown
2011-02-16 21:35   ` NeilBrown
2011-02-16 22:34     ` David Brown
2011-02-16 23:01       ` NeilBrown
2011-02-17  0:30         ` David Brown
2011-02-17  0:55           ` NeilBrown
2011-02-17  1:04           ` Keld Jørn Simonsen
2011-02-17 10:45             ` David Brown
2011-02-17 10:58               ` Keld Jørn Simonsen
2011-02-17 11:45                 ` Giovanni Tessore
2011-02-17 15:44                   ` Keld Jørn Simonsen
2011-02-17 16:22                     ` Roberto Spadim
2011-02-18  0:13                     ` Giovanni Tessore
2011-02-18  2:56                       ` Keld Jørn Simonsen
2011-02-18  4:27                         ` Roberto Spadim
2011-02-18  9:47                         ` Giovanni Tessore
2011-02-18 18:43                           ` Keld Jørn Simonsen
2011-02-18 19:00                             ` Roberto Spadim
2011-02-18 19:18                               ` Keld Jørn Simonsen
2011-02-18 19:22                                 ` Roberto Spadim
2011-02-16 17:20 ` Joe Landman
2011-02-16 21:36   ` NeilBrown
2011-02-16 19:37 ` Phil Turmel
2011-02-16 21:44   ` NeilBrown
2011-02-17  0:11     ` Phil Turmel
2011-02-16 20:29 ` Piergiorgio Sartor
2011-02-16 21:48   ` NeilBrown
2011-02-16 22:53     ` Piergiorgio Sartor
2011-02-17  0:24     ` Phil Turmel
2011-02-17  0:52       ` NeilBrown
2011-02-17  1:14         ` Phil Turmel
2011-02-17  3:10           ` NeilBrown
2011-02-17 18:46             ` Phil Turmel
2011-02-17 21:04             ` Mr. James W. Laferriere
2011-02-18  1:48               ` NeilBrown
2011-02-17 19:56           ` Piergiorgio Sartor
2011-02-16 22:50 ` Keld Jørn Simonsen
2011-02-23  5:06 ` Daniel Reurich

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.