* Extra write mode to close RAID5 write hole (kind of)
@ 2016-10-26 15:20 James Pharaoh
  2016-10-26 22:31 ` Vojtech Pavlik
  2016-10-28 11:59 ` Kent Overstreet
  0 siblings, 2 replies; 13+ messages in thread
From: James Pharaoh @ 2016-10-26 15:20 UTC (permalink / raw)
  To: linux-bcache

Hi all,

I'm creating an elaborate storage system and using bcache, with great 
success, to combine SSDs with smallish (500GB) network mounted block 
devices, with RAID5 in between.

I believe this should allow me to use RAID5 at large scale without high 
risk of data loss, because I can very quickly rebuild the small number 
of devices efficiently, across a distributed system.

I am using separate filesystems on each and abstracting their 
combination at a higher level, and I have redundant copies of their data 
in different locations (different countries in fact), so even if I lose 
one it can be recreated efficiently.

I believe this addresses the issue of two devices failing 
simultaneously, because it would affect an even smaller proportion of 
the total data than a single failure, which would simply trigger a 
number of RAID5 rebuilds.

I have high faith in SSD storage, especially given drives' SMART 
capabilities to report failure well in advance of it happening, so it 
occurs to me that bcache is going to close the RAID5 write hole for me, 
assuming certain things.

I am making assumptions about the ordering of writes that RAID5 makes, 
and will post to the appropriate list about that, with the possibility 
of another option. However, I also note that bcache "optimises" 
sequential writes directly to the underlying device:

 > Since random IO is what SSDs excel at, there generally won't be much
 > benefit to caching large sequential IO. Bcache detects sequential IO
 > and skips it; it also keeps a rolling average of the IO sizes per
 > task, and as long as the average is above the cutoff it will skip all
 > IO from that task - instead of caching the first 512k after every
 > seek. Backups and large file copies should thus entirely bypass the
 > cache.

Since I want my bcache device to essentially be a "journal", and to 
close the RAID5 write hole, I would prefer to disable this behaviour.
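
As far as I can tell this heuristic can at least be switched off per 
backing device via sysfs; a minimal sketch of what I mean, assuming the 
device is registered as bcache0:

	echo 0 > /sys/block/bcache0/bcache/sequential_cutoff

That only changes the caching heuristic, though, not the ordering 
guarantee I am really after.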

I propose, therefore, a further write mode, in which data is always 
written to the cache first, and synced, before it is written to the 
underlying device. This could be called "journal" perhaps, or something 
similar.

I am optimistic that this would be a relatively small change to the 
code, since it only requires always choosing the cache as the place to 
write data first. Perhaps the sync behaviour is more complex; I am not 
familiar with the internals.

So, does anyone have any idea if this is practical, if it would 
genuinely close the write hole, or any other thoughts?

I am prepared to write up what I am designing in detail and open source 
it, I believe it would be a useful method of managing this kind of high 
scale storage in general.

James

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Extra write mode to close RAID5 write hole (kind of)
  2016-10-26 15:20 Extra write mode to close RAID5 write hole (kind of) James Pharaoh
@ 2016-10-26 22:31 ` Vojtech Pavlik
  2016-10-27 21:46   ` James Pharaoh
  2016-10-28 11:52   ` Kent Overstreet
  2016-10-28 11:59 ` Kent Overstreet
  1 sibling, 2 replies; 13+ messages in thread
From: Vojtech Pavlik @ 2016-10-26 22:31 UTC (permalink / raw)
  To: James Pharaoh; +Cc: linux-bcache

On Wed, Oct 26, 2016 at 04:20:38PM +0100, James Pharaoh wrote:
> Hi all,
> 
> I'm creating an elaborate storage system and using bcache, with
> great success, to combine SSDs with smallish (500GB) network mounted
> block devices, with RAID5 in between.
> 
> I believe this should allow me to use RAID5 at large scale without
> high risk of data loss, because I can very quickly rebuild the small
> number of devices efficiently, across a distributed system.
> 
> I am using separate filesystems on each and abstracting their
> combination at a higher level, and I have redundant copies of their
> data in different locations (different countries in fact), so even
> if I lose one it can be recreated efficiently.
> 
> I believe this addresses the issue of two devices failing
> simultaneously, because it would affect an even smaller proportion
> of the total data than a single failure, which would simply trigger
> a number of RAID5 rebuilds.
> 
> I have high faith in SSD storage, especially given drives' SMART
> capabilities to report failure well in advance of it happening, so
> it occurs to me that bcache is going to close the RAID5 write hole
> for me, assuming certain things.

I believe your faith in SSDs is somewhat misplaced: they not
infrequently die before any SMART warning, and when they do, they don't
just get bad sectors, the whole device is gone.

If you want to protect your data, either use RAID for your cache
devices too, run bcache in writethrough mode, or run it in writeback
mode with a zero dirty data target.
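
For the latter two options, a minimal sketch (assuming the bcache device
is registered as bcache0; adjust to your setup):

	# write-through: no write is ever dirty only on the cache
	echo writethrough > /sys/block/bcache0/bcache/cache_mode

	# or write-back with a zero dirty data target
	echo writeback    > /sys/block/bcache0/bcache/cache_mode
	echo 0            > /sys/block/bcache0/bcache/writeback_percent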

> I am making assumptions about the ordering of writes that RAID5
> makes, and will post to the appropriate list about that, with the
> possibility of another option. However, I also note that bcache
> "optimises" sequential writes directly to the underlying device:

In case you're using mdraid for the RAID part on a reasonably recent
Linux kernel, there is no write hole. Linux mdraid implements barriers
properly even on RAID5, at the cost of performance - mdraid waits for a
barrier to complete on all drives before submitting more i/o.

Any journalling, log or cow filesystem that relies on i/o barriers for
consistency will be consistent in Linux even on mdraid RAID5.

> > Since random IO is what SSDs excel at, there generally won't be much
> > benefit to caching large sequential IO. Bcache detects sequential IO
> > and skips it; it also keeps a rolling average of the IO sizes per
> > task, and as long as the average is above the cutoff it will skip all
> > IO from that task - instead of caching the first 512k after every
> > seek. Backups and large file copies should thus entirely bypass the
> > cache.

> Since I want my bcache device to essentially be a "journal", and to
> close the RAID5 write hole, I would prefer to disable this
> behaviour.
> 
> I propose, therefore, a further write mode, in which data is always
> written to the cache first, and synced, before it is written to the
> underlying device. This could be called "journal" perhaps, or
> something similar.

Using bcache to accelerate a RAID using a SSD is a fairly common use
case. What you're asking for can likely be achieved by:

	echo writeback	> cache_mode
	echo 0     	> writeback_percent
	echo 10240  	> writeback_rate
	echo 5		> writeback_delay
	echo 0     	> readahead
	echo 0      	> sequential_cutoff
	echo 0      	> cache/congested_read_threshold_us
	echo 0      	> cache/congested_write_threshold_us

This is what I use personally on my system with success.

It enables writeback to optimize writing whole RAID stripes and sets a
writeback delay to make sure whole stripes are collected before writing them
out.

It sets a fixed writeback rate such that reads aren't significantly
delayed even during heavy writes - the dirty data will grow instead. 

It disables readahead, disallows skipping the cache for sequential writes
and disables cache device congestion control to make sure that writes always
go through the cache device.

As a result, if the cached device is busy with writes, only full stripes
ever get written to the raid. When the device is idle, the remaining dirty
data gets written to the raid as well.
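
To see whether it is doing what you expect, you can watch the per-device
dirty data and writeback state while the array is busy; a quick sketch
(assuming the device shows up as bcache0):

	watch grep . /sys/block/bcache0/bcache/dirty_data \
	             /sys/block/bcache0/bcache/state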

> I am optimistic that this would be a relatively small change to the
> code, since it only requires to always choose the cache to write
> data to first. Perhaps the sync behaviour is also more complex, I am
> not familiar with the internals.
> 
> So, does anyone have any idea if this is practical, if it would
> genuinely close the write hole, or any other thoughts?

It works without code changes, properly implements barriers throughout the
whole stack, doesn't get corrupted on pulling the cord if using a modern fs,
is fast and doesn't leave dirty data on the SSD unless the cord is pulled in
a busy period.

> I am prepared to write up what I am designing in detail and open
> source it, I believe it would be a useful method of managing this
> kind of high scale storage in general.

-- 
Vojtech Pavlik
Director SuSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Extra write mode to close RAID5 write hole (kind of)
  2016-10-26 22:31 ` Vojtech Pavlik
@ 2016-10-27 21:46   ` James Pharaoh
  2016-10-28 11:52   ` Kent Overstreet
  1 sibling, 0 replies; 13+ messages in thread
From: James Pharaoh @ 2016-10-27 21:46 UTC (permalink / raw)
  To: Vojtech Pavlik; +Cc: linux-bcache

On 26/10/16 23:31, Vojtech Pavlik wrote:

> I believe your faith in SSDs is somewhat misplaced, they do not so
> infrequently die ahead of their SMART announcement and if they do, they
> don't just get bad sectors, the whole device is gone.

In my experience they are extremely reliable compared to traditional 
drives, and much faster, which further improves resilience because a 
rebuild/restore completes much more quickly.

And of course I have redundant backups, stored in systems with 
significantly distinct designs, locations, access controls, etc.

> In case you want to protect your data, either use a RAID for your cache
> devices, too, use it in write through mode, or in write-back mode with
> zero dirty data target.

Ok, I'm not sure what this means, but it sounds like it is something I 
might want to use. Have you got a link for this?

>> I am making assumptions about the ordering of writes that RAID5
>> makes, and will post to the appropriate list about that, with the
>> possibility of another option. However, I also note that bcache
>> "optimises" sequential writes directly to the underlying device:
>
> In case you're using mdraid for the RAID part on a reasonably recent
> Linux kernel, there is no write hole. Linux mdraid implements barriers
> properly even on RAID5, at the cost of performance - mdraid waits for a
> barrier to complete on all drives before submitting more i/o.
>
> Any journalling, log or cow filesystem that relies on i/o barriers for
> consistency will be consistent in Linux even on mdraid RAID5.

Ok, wow, I did not know this. Again, have you got a link to any 
documentation about this? Unfortunately these kinds of low-level systems 
tend to be quite hard to find information about...

> Using bcache to accelerate a RAID using a SSD is a fairly common use
> case. What you're asking for can likely be achieved by:
>
> 	echo writeback	> cache_mode
> 	echo 0     	> writeback_percent
> 	echo 10240  	> writeback_rate
> 	echo 5		> writeback_delay
> 	echo 0     	> readahead
> 	echo 0      	> sequential_cutoff
> 	echo 0      	> cache/congested_read_threshold_us
> 	echo 0      	> cache/congested_write_threshold_us
>
> This is what I use personally on my system with success.

Thanks, I'll look at this. Genuinely a much more helpful response than 
I could ever have hoped for ;-)

James

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Extra write mode to close RAID5 write hole (kind of)
  2016-10-26 22:31 ` Vojtech Pavlik
  2016-10-27 21:46   ` James Pharaoh
@ 2016-10-28 11:52   ` Kent Overstreet
  2016-10-28 13:07     ` Vojtech Pavlik
  2016-10-28 17:07     ` James Pharaoh
  1 sibling, 2 replies; 13+ messages in thread
From: Kent Overstreet @ 2016-10-28 11:52 UTC (permalink / raw)
  To: Vojtech Pavlik; +Cc: James Pharaoh, linux-bcache

On Thu, Oct 27, 2016 at 12:31:58AM +0200, Vojtech Pavlik wrote:
> In case you're using mdraid for the RAID part on a reasonably recent
> Linux kernel, there is no write hole. Linux mdraid implements barriers
> properly even on RAID5, at the cost of performance - mdraid waits for a
> barrier to complete on all drives before submitting more i/o.

That's not what the raid 5 hole is. The raid 5 hole comes from the fact that
it's not possible to update the p/q blocks atomically with the data blocks, thus
there is a point in time when they are _inconsistent_ with the rest of the
stripe, and if used will lead to reconstructing incorrect data. There's no way
to fix this with just flushes.
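
A toy illustration with single-byte "blocks" and plain xor parity (the
numbers are made up, but the arithmetic is the point):

	d0=$((0x12)); d1=$((0x34))
	p=$(( d0 ^ d1 ))             # parity as it sits on disk
	d0=$((0x56))                 # new data for d0 reaches its disk...
	# ...crash before p is updated, then the disk holding d1 dies
	printf 'd1 was 0x34, reconstructs as 0x%02x\n' $(( d0 ^ p ))

The block that comes back wrong (d1) is one that was never written to,
and no ordering or flushing of the two writes changes that.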

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Extra write mode to close RAID5 write hole (kind of)
  2016-10-26 15:20 Extra write mode to close RAID5 write hole (kind of) James Pharaoh
  2016-10-26 22:31 ` Vojtech Pavlik
@ 2016-10-28 11:59 ` Kent Overstreet
  2016-10-28 17:02   ` James Pharaoh
  1 sibling, 1 reply; 13+ messages in thread
From: Kent Overstreet @ 2016-10-28 11:59 UTC (permalink / raw)
  To: James Pharaoh; +Cc: linux-bcache

On Wed, Oct 26, 2016 at 04:20:38PM +0100, James Pharaoh wrote:
> Since I want my bcache device to essentially be a "journal", and to close
> the RAID5 write hole, I would prefer to disable this behaviour.
> 
> I propose, therefore, a further write mode, in which data is always written
> to the cache first, and synced, before it is written to the underlying
> device. This could be called "journal" perhaps, or something similar.
> 
> I am optimistic that this would be a relatively small change to the code,
> since it only requires to always choose the cache to write data to first.
> Perhaps the sync behaviour is also more complex, I am not familiar with the
> internals.
> 
> So, does anyone have any idea if this is practical, if it would genuinely
> close the write hole, or any other thoughts?

It's not a crazy idea - bcache already has some stripe awareness code that could
be used as a starting point.

The main thing you'd need to do is ensure that
 - all writes are writeback, not writethrough (as you noted)
 - when the writeback thread is flushing dirty data, only flush entire stripes -
   reading more data into the cache if necessary and marking it dirty, then
   ensure that the entire stripe is marked dirty until the entire stripe is
   flushed.

This would basically be using bcache to do full data journalling.

I'm not going to do the work myself - I'd rather spend my time working on adding
erasure coding to bcachefs - but I could help out if you or someone else wanted
to work on adding this to bcache.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Extra write mode to close RAID5 write hole (kind of)
  2016-10-28 11:52   ` Kent Overstreet
@ 2016-10-28 13:07     ` Vojtech Pavlik
  2016-10-28 13:13       ` Kent Overstreet
  2016-10-28 16:58       ` James Pharaoh
  2016-10-28 17:07     ` James Pharaoh
  1 sibling, 2 replies; 13+ messages in thread
From: Vojtech Pavlik @ 2016-10-28 13:07 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: James Pharaoh, linux-bcache

On Fri, Oct 28, 2016 at 03:52:49AM -0800, Kent Overstreet wrote:
> On Thu, Oct 27, 2016 at 12:31:58AM +0200, Vojtech Pavlik wrote:
> > In case you're using mdraid for the RAID part on a reasonably recent
> > Linux kernel, there is no write hole. Linux mdraid implements barriers
> > properly even on RAID5, at the cost of performance - mdraid waits for a
> > barrier to complete on all drives before submitting more i/o.
> 
> That's not what the raid 5 hole is. The raid 5 hole comes from the fact that
> it's not possible to update the p/q blocks atomically with the data blocks, thus
> there is a point in time when they are _inconsistent_ with the rest of the
> stripe, and if used will lead to reconstructing incorrect data. There's no way
> to fix this with just flushes.

Indeed. However, together with the write intent bitmap, and filesystems
ensuring consistency through barriers, it's still greatly mitigated. 

Mdraid will mark areas of disk dirty in the write intent bitmap before
writing to them. When the system comes up after a power outage, all
areas marked dirty are scanned and the xor block written where it
doesn't match the rest.
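
If an array doesn't have a write intent bitmap yet, one can be added to
a running array; a sketch, assuming the array is /dev/md0:

	mdadm --grow /dev/md0 --bitmap=internal
	mdadm --detail /dev/md0 | grep -i bitmap    # confirm it is active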

Thanks to the strict ordering using barriers, the damage to the
consistency of the RAID can only be in requests since the last
successfully written barrier.

As such, the filesystem will always see a consistent state, and the raid
will also always recover to a consistent state.

The only situation where data damage can happen is a power outage that
comes together with a loss of one of the drives. In such a case, the
content of any blocks written past the last barrier is undefined. It
then depends on the filesystem whether it can revert to the last sane
state. Not sure about others, but btrfs will do so.

-- 
Vojtech Pavlik
Director SuSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Extra write mode to close RAID5 write hole (kind of)
  2016-10-28 13:07     ` Vojtech Pavlik
@ 2016-10-28 13:13       ` Kent Overstreet
  2016-10-28 16:55         ` Vojtech Pavlik
  2016-10-28 16:58       ` James Pharaoh
  1 sibling, 1 reply; 13+ messages in thread
From: Kent Overstreet @ 2016-10-28 13:13 UTC (permalink / raw)
  To: Vojtech Pavlik; +Cc: James Pharaoh, linux-bcache

On Fri, Oct 28, 2016 at 03:07:20PM +0200, Vojtech Pavlik wrote:
> The only situation where data damage can happen is a power outage that
> comes together with a loss of one of the drives. In such a case, the
> content of any blocks written past the last barrier is undefined. It
> then depends on the filesystem whether it can revert to the last sane
> state. Not sure about others, but btrfs will do so.

It's not any data written since the last barrier - in a non COW filesystem,
potentially the entire stripe is toast, which means existing unrelated data gets
corrupted. There's nothing really a non COW filesystem can do about it.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Extra write mode to close RAID5 write hole (kind of)
  2016-10-28 13:13       ` Kent Overstreet
@ 2016-10-28 16:55         ` Vojtech Pavlik
  0 siblings, 0 replies; 13+ messages in thread
From: Vojtech Pavlik @ 2016-10-28 16:55 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: James Pharaoh, linux-bcache

On Fri, Oct 28, 2016 at 05:13:10AM -0800, Kent Overstreet wrote:
> On Fri, Oct 28, 2016 at 03:07:20PM +0200, Vojtech Pavlik wrote:
> > The only situation where data damage can happen is a power outage that
> > comes together with a loss of one of the drives. In such a case, the
> > content of any blocks written past the last barrier is undefined. It
> > then depends on the filesystem whether it can revert to the last sane
> > state. Not sure about others, but btrfs will do so.
> 
> It's not any data written since the last barrier - in a non COW filesystem,
> potentially the entire stripe is toast, which means existing unrelated data gets
> corrupted. There's nothing really a non COW filesystem can do about it.

Again, you're right, if a drive is lost during a power outage, there can
be damage even outside of the blocks that were written if plain data was
written and xor wasn't. I don't think there is a filesystem that can
handle damage to untouched data cleanly.

An additional journal that works closely with the RAID device and tracks
what has been written to all devices is required to close this remaining gap.

But then, if the journal device is lost during a power outage ... ;)

-- 
Vojtech Pavlik
Director SuSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Extra write mode to close RAID5 write hole (kind of)
  2016-10-28 13:07     ` Vojtech Pavlik
  2016-10-28 13:13       ` Kent Overstreet
@ 2016-10-28 16:58       ` James Pharaoh
  1 sibling, 0 replies; 13+ messages in thread
From: James Pharaoh @ 2016-10-28 16:58 UTC (permalink / raw)
  To: Vojtech Pavlik, Kent Overstreet; +Cc: linux-bcache

On 28/10/16 14:07, Vojtech Pavlik wrote:
> On Fri, Oct 28, 2016 at 03:52:49AM -0800, Kent Overstreet wrote:

> Indeed. However, together with the write intent bitmap, and filesystems
> ensuring consistency through barriers, it's still greatly mitigated.
>
> Mdraid will mark areas of disk dirty in the write intent bitmap before
> writing to them. When the system comes up after a power outage, all
> areas marked dirty are scanned and the xor block written where it
> doesn't match the rest.
>
> Thanks to the strict ordering using barriers, the damage to the
> consistency of the RAID can only be in requests since the last
> successfully written barrier.

Ok so, without my having to post to the mdraid list, you are confident 
that, assuming the disks (etc) correctly order writes, RAID5 as 
implemented by a modern Linux kernel does not suffer from a write hole. 
If so, this is great news.

I understand that there is a clear issue in the case of a drive failure, 
but that's specifically why I think that bcache can be of use, because 
it should be able to mitigate some of this.

I have a feeling I would need to bcache the individual backing devices, 
rather than the array itself, to make this work, since a drive failure, 
specifically the loss of a data stripe as opposed to a parity one, 
cannot be protected against by write ordering alone. But I think that a 
bcache layer on each backing device, assuming of course that the bcache 
cache device is consistent, would provide this level of assurance.
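
To be concrete, the layering I have in mind is roughly the following (a
sketch only; the device names are placeholders, and one cache per member
is just one way to carve it up):

	# one bcache cache/backing pair per RAID member...
	make-bcache -C /dev/ssd1 -B /dev/disk1
	make-bcache -C /dev/ssd2 -B /dev/disk2
	make-bcache -C /dev/ssd3 -B /dev/disk3
	# ...then RAID5 over the resulting bcache devices
	mdadm --create /dev/md0 --level=5 --raid-devices=3 \
		/dev/bcache0 /dev/bcache1 /dev/bcache2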

> The only situation where data damage can happen is a power outage that
> comes together with a loss of one of the drives. In such a case, the
> content of any blocks written past the last barrier is undefined. It
> then depends on the filesystem whether it can revert to the last sane
> state. Not sure about others, but btrfs will do so.

Yes, and of course I've mentioned this above. But... I feel that this is 
something that bcache could help with, and I also have several redundant 
backups so that, in the unlikely event of a drive failure which causes 
corruption, I can easily restore the files in question.

I do feel like I would like to understand a little more about how Linux 
mdraid behaves in this respect, but it sounds like it does a pretty good 
job, and that my bcache layer, and redundant backups, provide a good 
layer of data security.

I am mostly using this to store zbackup repositories, which store the 
majority of data in 256 directories, which I currently map to 16 backing 
devices, and could, of course, easily map to as many as 256. In this use 
case, with the redundant backups, and of course some automatic testing 
and verification of the data, I am fairly confident that I won't be 
losing any backups.

James

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Extra write mode to close RAID5 write hole (kind of)
  2016-10-28 11:59 ` Kent Overstreet
@ 2016-10-28 17:02   ` James Pharaoh
  0 siblings, 0 replies; 13+ messages in thread
From: James Pharaoh @ 2016-10-28 17:02 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcache

On 28/10/16 12:59, Kent Overstreet wrote:
> On Wed, Oct 26, 2016 at 04:20:38PM +0100, James Pharaoh wrote:
>> Since I want my bcache device to essentially be a "journal", and to close
>> the RAID5 write hole, I would prefer to disable this behaviour.
>>
>> I propose, therefore, a further write mode, in which data is always written
>> to the cache first, and synced, before it is written to the underlying
>> device. This could be called "journal" perhaps, or something similar.
>>
>> I am optimistic that this would be a relatively small change to the code,
>> since it only requires to always choose the cache to write data to first.
>> Perhaps the sync behaviour is also more complex, I am not familiar with the
>> internals.
>>
>> So, does anyone have any idea if this is practical, if it would genuinely
>> close the write hole, or any other thoughts?
>
> It's not a crazy idea - bcache already has some stripe awareness code that could
> be used as a starting point.
>
> The main thing you'd need to do is ensure that
>  - all writes are writeback, not writethrough (as you noted)
>  - when the writeback thread is flushing dirty data, only flush entire stripes -
>    reading more data into the cache if necessary and marking it dirty, then
>    ensure that the entire stripe is marked dirty until the entire stripe is
>    flushed.
>
> This would basically be using bcache to do full data journalling.
>
> I'm not going to do the work myself - I'd rather spend my time working on adding
> erasure coding to bcachefs - but I could help out if you or someone else wanted
> to work on adding this to bcache.

I don't expect anyone to do the work, nor do I expect to do this 
myself, although if I have the funds, and I may do soon, I would be 
prepared to pay someone to do it.

At the moment, I'm trying to check my facts/assumptions while designing 
a complex system which won't be fully operational for a while. I'd like 
to be sure that it is genuinely scalable, as in the design is valid, 
before I continue working in this way.

For what it's worth, I have recently set up a lot of this, taking 
advantage of extremely cheap servers set up in a "novel" way, and the 
performance is pretty good. As I've mentioned, I would like to write up 
what I've done, why, and perhaps create an open source management suite 
for people to repeat it.

James

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Extra write mode to close RAID5 write hole (kind of)
  2016-10-28 11:52   ` Kent Overstreet
  2016-10-28 13:07     ` Vojtech Pavlik
@ 2016-10-28 17:07     ` James Pharaoh
  2016-10-29  0:58       ` Kent Overstreet
  1 sibling, 1 reply; 13+ messages in thread
From: James Pharaoh @ 2016-10-28 17:07 UTC (permalink / raw)
  To: Kent Overstreet, Vojtech Pavlik; +Cc: linux-bcache

On 28/10/16 12:52, Kent Overstreet wrote:

> That's not what the raid 5 hole is. The raid 5 hole comes from the fact that
> it's not possible to update the p/q blocks atomically with the data blocks, thus
> there is a point in time when they are _inconsistent_ with the rest of the
> stripe, and if used will lead to reconstructing incorrect data. There's no way
> to fix this with just flushes.

Yes, I understand this, but if the kernel strictly orders writing mdraid 
data blocks before parity ones, then it closes part of the hole, 
especially if I have a "journal" in a higher layer, and of course ensure 
that this journal is reliable.

I think that this will fail in the case of a failure of a drive 
containing data blocks which have been written but whose parity blocks 
have not been.

I also think, however, that putting bcache /under/ mdraid, (again) 
ensuring that the bcache layer is reliable, and requiring bcache to 
"journal" all writes would provide an extremely reliable storage layer, 
even at a very large scale.

James

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Extra write mode to close RAID5 write hole (kind of)
  2016-10-28 17:07     ` James Pharaoh
@ 2016-10-29  0:58       ` Kent Overstreet
  2016-10-29 19:58         ` James Pharaoh
  0 siblings, 1 reply; 13+ messages in thread
From: Kent Overstreet @ 2016-10-29  0:58 UTC (permalink / raw)
  To: James Pharaoh; +Cc: Vojtech Pavlik, linux-bcache

On Fri, Oct 28, 2016 at 06:07:21PM +0100, James Pharaoh wrote:
> On 28/10/16 12:52, Kent Overstreet wrote:
> 
> > That's not what the raid 5 hole is. The raid 5 hole comes from the fact that
> > it's not possible to update the p/q blocks atomically with the data blocks, thus
> > there is a point in time when they are _inconsistent_ with the rest of the
> > stripe, and if used will lead to reconstructing incorrect data. There's no way
> > to fix this with just flushes.
> 
> Yes, I understand this, but if the kernel strictly orders writing mdraid
> data blocks before parity ones, then it closes part of the hole, especially
> if I have a "journal" in a higher layer, and of course ensure that this
> journal is reliable.

Ordering cannot help you here. Whichever order you do the writes in, there is a
point in time where the p/q blocks are inconsistent with the data blocks, thus
if you do a reconstruct you will reconstruct incorrect data. Unless you were
writing to the entire stripe, this affects data you were _not_ writing to.

> 
> I also think, however, that by putting bcache /under/ mdraid, and (again)
> ensuring that the bcache layer is reliable, along with the requirement for
> bcache to "journal" all writes, would provide an extremely reliable storage
> layer, even at a very large scale.

What? No, putting bcache under md wouldn't do anything, it couldn't do anything
about the atomicity issue there.

Also - Vojtech - btrfs _is_ subject to the raid5 hole, it would have to be doing
copygc to not be affected.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Extra write mode to close RAID5 write hole (kind of)
  2016-10-29  0:58       ` Kent Overstreet
@ 2016-10-29 19:58         ` James Pharaoh
  0 siblings, 0 replies; 13+ messages in thread
From: James Pharaoh @ 2016-10-29 19:58 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: Vojtech Pavlik, linux-bcache

Okay... So I think the situation is that:

- Currently there is no facility to atomically write out more than one 
block at a time.

- Mdraid orders writes to ensure that data blocks are updated 
atomically, and these are used for reads.

- If a data block is updated, but the parity is not, and there is a 
failure of any of the devices containing a data block with inconsistent 
parity, then the other blocks which share the parity block, effectively 
"random" blocks from the point of view of the filesystem, will be corrupted.

- Some kind of journal, and of course I'm proposing that bcache could 
serve this purpose, could potentially close the write hole.

The main missing functionality is the first point above: if the block 
layer could communicate that multiple block writes need to be made 
together or not at all, ie that multiple blocks can be written 
atomically, then, assuming there is a journal present, this would fix 
the problem.

Has this been discussed before? As always, I find it hard to find good 
information about this kind of low-level stuff, and think that asking 
the people who have written it is the only way to get anywhere.

Obviously a change to the device mapper API is not something that would 
be done without significant consideration, although a POC would of 
course be welcomed, I think.

I think the gains to be made here are substantial, and that bcache is a 
very good candidate for the journal implementation. I also think that 
this implementation is relatively simple, compared to other options. I 
also have read many opinions on the problems of scaling up RAID5 and 
RAID6 as drives become larger, so I think there's definitely an urgent 
interest in finding a solution to this.

So, I would propose to add this kind of atomic write in the kernel's 
device mapper API, presumably with some way to detect if it is going to 
be honoured or not. I'm not familiar enough with it to know if this is 
more complicated than I make it sound...

The mdraid layer would need to use this API, perhaps as an option, but 
arguably, if it can detect the presence of this facility, it would be 
easy to recommend as the default, presumably after a period of testing.

Bcache would need to implement this API, and ensure that the "journal" 
contains either all or none of the atomically updated blocks.

I'm also assuming that the cache device is reliable, of course, and I've 
said I'm simply trusting a single SSD (or potentially a RAID0 array of 
backing devices with LVM), but I think that simply using RAID1 for the 
cache device would give a reasonable level of reliability for the bcache 
cache/journal.
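
A sketch of that last part, with example device names only:

	# mirror two SSDs and use the mirror as the bcache cache device
	mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/ssd1 /dev/ssd2
	make-bcache -C /dev/md1
	# attach it to an existing backing device (uuid from bcache-super-show)
	echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach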

I assume it uses some kind of COW tree with an atomic update at the 
root, and ordering, so that updates to the data can be ordered behind a 
single update which "commits" the changes, and that when this is read 
back, it is able to confirm if the critical commit has been made or not. 
Perhaps another API extension to the block layer, to perform a read 
which can check with a lower layer (RAID1 in this case) that the block 
is genuinely consistent.

In my main use case, where I am storing backups which are redundantly 
stored elsewhere, and my belief that an SSD array, even a RAID0 one, is 
quite reliable, I still think this is good enough. That said, SSDs are 
cheap enough for me to use RAID1 even in this case.

I also have other use cases, for example where I would RAID0 several 
bcache+RAID5 devices into a single LVM volume group. In this case, I'd 
definitely want the extra protection on the cache device, because an 
error would potentially affect a large filesystem built on top of it.

I think that there is a further opportunity for optimisation as well. 
If, as I am led to believe, mdraid strictly orders writes to data blocks 
before parity ones, to "partially" close the write hole, then being able 
to atomically write out all the blocks that change, ie two at minimum, 
could replace the strict ordering. This would improve performance, 
because it removes the round trip of confirming the first write before 
performing the second.

Does this all make sense? Is this interesting for anyone else? Is there 
any other work that attempts to solve this problem?

James

On 29/10/16 02:58, Kent Overstreet wrote:
> On Fri, Oct 28, 2016 at 06:07:21PM +0100, James Pharaoh wrote:
>> On 28/10/16 12:52, Kent Overstreet wrote:
>>
>>> That's not what the raid 5 hole is. The raid 5 hole comes from the fact that
>>> it's not possible to update the p/q blocks atomically with the data blocks, thus
>>> there is a point in time when they are _inconsistent_ with the rest of the
>>> stripe, and if used will lead to reconstructing incorrect data. There's no way
>>> to fix this with just flushes.
>>
>> Yes, I understand this, but if the kernel strictly orders writing mdraid
>> data blocks before parity ones, then it closes part of the hole, especially
>> if I have a "journal" in a higher layer, and of course ensure that this
>> journal is reliable.
>
> Ordering cannot help you here. Whichever order you do the writes in, there is a
> point in time where the p/q blocks are inconsistent with the data blocks, thus
> if you do a reconstruct you will reconstruct incorrect data. Unless you were
> writing to the entire stripe, this affects data you were _not_ writing to.
>
>>
>> I also think, however, that by putting bcache /under/ mdraid, and (again)
>> ensuring that the bcache layer is reliable, along with the requirement for
>> bcache to "journal" all writes, would provide an extremely reliable storage
>> layer, even at a very large scale.
>
> What? No, putting bcache under md wouldn't do anything, it couldn't do anything
> about the atomicity issue there.
>
> Also - Vojtech - btrfs _is_ subject to the raid5 hole, it would have to be doing
> copygc to not be affected.
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2016-10-29 19:58 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-26 15:20 Extra write mode to close RAID5 write hole (kind of) James Pharaoh
2016-10-26 22:31 ` Vojtech Pavlik
2016-10-27 21:46   ` James Pharaoh
2016-10-28 11:52   ` Kent Overstreet
2016-10-28 13:07     ` Vojtech Pavlik
2016-10-28 13:13       ` Kent Overstreet
2016-10-28 16:55         ` Vojtech Pavlik
2016-10-28 16:58       ` James Pharaoh
2016-10-28 17:07     ` James Pharaoh
2016-10-29  0:58       ` Kent Overstreet
2016-10-29 19:58         ` James Pharaoh
2016-10-28 11:59 ` Kent Overstreet
2016-10-28 17:02   ` James Pharaoh
