* Split RAID: Proposal for archival RAID using incremental batch checksum
@ 2014-11-21 10:15 Anshuman Aggarwal
  2014-11-21 11:41 ` Greg Freemyer
  0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-11-21 10:15 UTC (permalink / raw)
  To: kernelnewbies

I'd appreciate any help/pointers on implementing the proposal below,
including the right path to get this into the kernel itself.
----------------------------------
I'm outlining below a proposal for a RAID device mapper virtual block
device for the kernel which adds "split raid" functionality on an
incremental batch basis, aimed at a home media server with archived
content that is rarely accessed.

Given a set of N+X block devices (ideally of the same size; otherwise
the smallest common size wins), the SplitRAID device mapper device
generates virtual devices which are pass-through for the N data
devices and write a batched/delayed checksum to the X devices, so as
to allow offline recovery of blocks on the N devices in case of a
single disk failure.
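
For the single parity case (X = 1) the checksum is a plain XOR parity:
a lost block on any one data device can be rebuilt offline by XOR-ing
the corresponding parity block with the surviving data blocks. A
minimal user-space C sketch of that rebuild (the block size and the
buffer-based interface are arbitrary assumptions, not part of the
proposal itself):

#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE 4096              /* assumed block size */

/*
 * Rebuild one block of a failed data device:
 *   D(failed) = P ^ XOR of all surviving data blocks
 * 'data' holds the N per-device block buffers; the entry at 'failed'
 * is ignored and the recovered contents are written to 'out'.
 */
static void rebuild_block(uint8_t *out, const uint8_t *parity,
                          uint8_t *data[], size_t n, size_t failed)
{
    size_t i, b;

    for (b = 0; b < BLOCK_SIZE; b++) {
        uint8_t v = parity[b];

        for (i = 0; i < n; i++)
            if (i != failed)
                v ^= data[i][b];
        out[b] = v;
    }
}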

Advantages over conventional RAID:

- Disks can be spun down, reducing wear and tear compared to MD RAID
levels (such as 1, 10, 5, 6) in the case of rarely accessed archival
content.

- Prevents catastrophic data loss on multiple device failure: each
block device is independent, so unlike MD RAID the array only loses
data incrementally.

- Performance degradation for writes can be avoided by keeping the
checksum update asynchronous and delaying the fsync to the checksum
block device.

In the event of an improper shutdown the checksum may not have all the
updated data, but it will be mostly up to date, which is often
acceptable for home media server requirements. A flag can be set when
the checksum block device was shut down properly, indicating that a
full checksum rebuild is not required.
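
One possible shape for that flag, assuming a small metadata header
kept on the checksum device (the names and layout below are purely
hypothetical):

#include <stdint.h>

/* Hypothetical header stored at a fixed offset on the checksum device. */
struct splitraid_sb {
    uint32_t magic;           /* identifies a SplitRAID checksum device */
    uint32_t version;
    uint64_t generation;      /* bumped on every completed parity flush */
    uint8_t  clean_shutdown;  /* 1 = parity fully synced at shutdown */
};

/*
 * On assembly: if clean_shutdown is 0 the parity may lag the data
 * devices and a full checksum rebuild is needed; if it is 1 the
 * parity can be trusted as-is.
 */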

Existing solutions considered:

- SnapRAID (http://snapraid.sourceforge.net/), which is a
snapshot-based scheme. Its advantages are that it is in user space and
has cross-platform support, but it has the huge disadvantage that
every checksum is done from scratch, slowing the system and causing
immense wear and tear on every snapshot, and also losing protection
for any updates made since the last snapshot point.

I'd like to get opinions on the pros and cons of this proposal from
more experienced people on the list, and to be redirected suitably on
the following questions:

- Can this perhaps already be done using the block device drivers
available in the kernel?

- If not, is device mapper the right API to use? (I think so)

- What would be the best existing block device code to look at as a
starting point for the implementation?


Regards,

Anshuman Aggarwal


* Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-21 10:15 Split RAID: Proposal for archival RAID using incremental batch checksum Anshuman Aggarwal
@ 2014-11-21 11:41 ` Greg Freemyer
  2014-11-21 18:48   ` Anshuman Aggarwal
  0 siblings, 1 reply; 44+ messages in thread
From: Greg Freemyer @ 2014-11-21 11:41 UTC (permalink / raw)
  To: kernelnewbies



On November 21, 2014 5:15:43 AM EST, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
> [...]

I think I understand the proposal.

You say N pass-through drives.  I assume concatenated?

If the N drives were instead in a Raid-0 stripe set and your X was just a single parity drive, then you would have described Raid-4.

There are use cases for raid 4 and you have described a good one (rarely used data where random write performance is not key).

I don't know if mdraid supports raid-4 or not.  If not, adding raid-4 support is something else you might want to look at.

Anyway, at a minimum add raid-4 to the existing solutions considered section.

Greg


-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.


* Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-21 11:41 ` Greg Freemyer
@ 2014-11-21 18:48   ` Anshuman Aggarwal
  2014-11-22 13:17     ` Greg Freemyer
  0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-11-21 18:48 UTC (permalink / raw)
  To: kernelnewbies

N pass-through devices, but each with its own filesystem.
Concatenation is via some kind of union-fs solution, not at the block
level. Data is not supposed to be striped (this is critical, to avoid
requiring all drives to be accessed for consecutive data).

The idea is that each drive can work independently and the last drive
stores parity, to recover data in case of failure of any one drive.

Any suggestions from anyone on where to start with such a driver? It
seems like a block driver for the parity drive, but one which depends
on intercepting the writes to the other drives.

On 21 November 2014 17:11, Greg Freemyer <greg.freemyer@gmail.com> wrote:
> [...]


* Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-21 18:48   ` Anshuman Aggarwal
@ 2014-11-22 13:17     ` Greg Freemyer
  2014-11-22 13:22       ` Anshuman Aggarwal
  0 siblings, 1 reply; 44+ messages in thread
From: Greg Freemyer @ 2014-11-22 13:17 UTC (permalink / raw)
  To: kernelnewbies

Top posting is strongly discouraged on all kernel related mailing lists including this one.  I've moved your reply to the bottom and then replied after that.  In future I will ignore replies that are top posted.


On November 21, 2014 1:48:57 PM EST, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>N pass through but with their own filesystems. Concatenation is via
>some kind of union fs solution not at the block level. Data is not
>supposed to be striped (this is critical so as to prevent all drives
>to be required to be accessed for consecutive data)

I'm ignorant of how unionfs works, so I can offer no feedback about it.

I see no real issue doing it with a block level solution with device mapper (dm) as the implementation.  I'm going to ignore implementation for the rest of this email and discuss the goal.

Can you detail what you see happening for a single page write to D1?

You talked about batching / delaying the checksum writes, but I didn't understand how that made things more efficient, nor the reason for the delay.

I assume you know raid 4 and 5 work like this:

Read D1old
Read Pold
Pnew=(Pold^D1old)^D1new
Write Pnew
Write D1new

Ie. 2 page reads and 2 page writes to update a single page.

The 2 reads and the 2 writes take place in parallel, so if the disks are otherwise idle, then the time involved is one disk seek and 2 disk rotations.  Let's say 25 msecs for the seek and 12 msecs per rotation.  That is 49 msecs total.  I think that is about right for a low performance rotating drive, but I didn't pull out any specs to double check my memory.

While that is a lot of i/o overhead (4x), it is how raid 4 and 5 work and I assume your split raid would have to do something similar.  With a normal non raided disk a single block write requires a seek and a rotation, so 37 msecs, thus very little clock time overhead for raid 4 or 5 for small random i/o block writes.
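
In code, that single page update is just an XOR of the three pages involved (a sketch only; the 4 KiB page size is an assumption):

#include <stdint.h>
#include <stddef.h>

#define PAGE_SZ 4096   /* assumed page size */

/*
 * Raid 4/5 style read-modify-write of one page:
 *   Pnew = (Pold ^ D1old) ^ D1new
 * 'parity' holds Pold on entry and Pnew on return; D1new and Pnew are
 * then written back to their respective drives.
 */
static void parity_rmw(uint8_t *parity, const uint8_t *d1_old,
                       const uint8_t *d1_new)
{
    size_t i;

    for (i = 0; i < PAGE_SZ; i++)
        parity[i] ^= d1_old[i] ^ d1_new[i];
}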

Is that also true of your split raid?  The delayed checksum writes confuse me.
---

Where I'm concerned about your solution for performance is with a full stride write.  Let's look at how a 4 disk raid 4 would write a full stride:

Pnew = D1new ^ D2new ^ D3new
Write D1
Write D2
Write D3
Write P

So only 4 writes to write 3 data blocks.  Even better all take place in parallel so you can accomplish 3x the data writes to disk that a single non-raided disk can.

Thus for streaming writes, raid 4 or 5 see a performance boost over a single drive.
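
The full stride parity, in contrast, needs no reads at all; it is just the XOR of the new data pages (a sketch, same assumptions as above):

#include <stdint.h>
#include <stddef.h>

#define PAGE_SZ 4096   /* assumed page size */

/* Full stride write: Pnew = D1new ^ D2new ^ ... computed purely from
 * the new data, then data and parity pages are written in parallel. */
static void parity_full_stride(uint8_t *parity, uint8_t *const d_new[],
                               size_t ndata)
{
    size_t i, b;

    for (b = 0; b < PAGE_SZ; b++) {
        uint8_t p = 0;

        for (i = 0; i < ndata; i++)
            p ^= d_new[i][b];
        parity[b] = p;
    }
}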

I see nothing similar in your split raid.

The same is true of streaming reads, raid 4 and 5 get performance gains from reading from the drives in parallel.  I don't see any ability for that same gain in your split raid.

In the real world raid 4 is rarely used because having a static parity drive offers no advantage I know of over having the parity handled as raid 5 does it.

===
Thus if your split raid was in kernel and I was setting up a streaming media server the choice would be between raid 5 and your split raid.  Raid 5 I believe would have superior performance, but split raid would have a less catastrophic failure mode if 2 drives failed at once.

Do I have that right?

Greg

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.


* Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-22 13:17     ` Greg Freemyer
@ 2014-11-22 13:22       ` Anshuman Aggarwal
  2014-11-22 14:03         ` Greg Freemyer
  0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-11-22 13:22 UTC (permalink / raw)
  To: kernelnewbies

On 22 November 2014 at 18:47, Greg Freemyer <greg.freemyer@gmail.com> wrote:
> [...]


You have the motivation and goal quite the opposite of what is intended.

In a home media server, the RAID6 mdadm setup that I currently have
keeps all the disks spinning and running for writes which could be
done just to the last disk while the others are in sleep mode (heads
parked, etc.).

It's not about performance at all; it's about longevity of the HDDs.
The entire proposal is focused on extending the life of the drives.

By not using stripes, we restrict writes to just one drive plus the
XOR output to the parity drive, which is what motivates the delayed
and batched checksum (resulting in fewer writes to the parity drive).
The intention is that if a drive fails then maybe we lose 1 or 2
movies, but the rest is restorable from parity.

Another advantage over RAID5 or RAID6 is that in the event of multiple
drive failure we only lose the content on the failed drive, not the
whole cluster/RAID.

Did I clarify better this time around?


* Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-22 13:22       ` Anshuman Aggarwal
@ 2014-11-22 14:03         ` Greg Freemyer
  2014-11-22 14:43           ` Anshuman Aggarwal
  0 siblings, 1 reply; 44+ messages in thread
From: Greg Freemyer @ 2014-11-22 14:03 UTC (permalink / raw)
  To: kernelnewbies

On Sat, Nov 22, 2014 at 8:22 AM, Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:
> [...]

I still don't understand the delayed checksum/parity.

With classic raid 4, writing 1 GB of data to just D1 would require 1
GB of data first be read from D1 and 1 GB read from P then 1 GB
written to both D1 and P.  4 GB worth of I/O total.

With your proposal, if you stream 1 GB of data to a file on D1:

- Does the old/previous data on D1 have to be read?

-  How much data goes to the parity drive?

- Does the old data on the parity drive have to be read?

-  Why does delaying it reduce that volume compared to Raid 4?

-  In the event drive 1 fails, can its content be re-created from the
other drives?

Greg
--
Greg Freemyer


* Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-22 14:03         ` Greg Freemyer
@ 2014-11-22 14:43           ` Anshuman Aggarwal
  2014-11-22 14:54             ` Greg Freemyer
  0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-11-22 14:43 UTC (permalink / raw)
  To: kernelnewbies

On 22 November 2014 at 19:33, Greg Freemyer <greg.freemyer@gmail.com> wrote:
> [...]

Two things:
Delayed writes are basically to allow the parity drive to spin down:
the parity is written out as one batch instead of spinning up the
drive for every write (obviously the data drive has to be spun up).
Delays will be both time and size constrained. For a large write, such
as 1 GB of data to a file, a configurable maximum delay limit would be
hit, which would then dump to the parity drive immediately, preventing
memory overuse.
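
A rough user-space sketch of that batching policy (the thresholds, the
flush hook and all the names here are hypothetical):

#include <stdint.h>
#include <stddef.h>
#include <time.h>

#define MAX_PENDING_BYTES (64 * 1024 * 1024)  /* assumed size limit */
#define MAX_PENDING_SECS  30                  /* assumed time limit */

struct parity_batch {
    size_t pending_bytes;  /* parity deltas accumulated in memory */
    time_t first_pending;  /* when the oldest unflushed delta arrived */
};

/*
 * Called for every data-drive write; decides whether the parity drive
 * must be spun up now so the accumulated parity deltas can be written
 * out, or whether it can keep sleeping.
 */
static int parity_batch_add(struct parity_batch *b, size_t bytes)
{
    time_t now = time(NULL);

    if (b->pending_bytes == 0)
        b->first_pending = now;
    b->pending_bytes += bytes;

    /* Flush when either the size cap or the age cap is exceeded. */
    if (b->pending_bytes >= MAX_PENDING_BYTES ||
        now - b->first_pending >= MAX_PENDING_SECS) {
        b->pending_bytes = 0;
        return 1;  /* caller issues the batched parity writes */
    }
    return 0;      /* keep batching; parity drive stays spun down */
}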

This again ties in to the fact that the content is not 'critical', so
if the parity was not yet dumped when a drive fails, in the worst case
you only lose the latest file.

Delayed writes may be done via bcache or a similar implementation
which caches the writes in memory and need not be part of the split
raid driver at all.


* Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-22 14:43           ` Anshuman Aggarwal
@ 2014-11-22 14:54             ` Greg Freemyer
  2014-11-24  5:36               ` SandeepKsinha
  0 siblings, 1 reply; 44+ messages in thread
From: Greg Freemyer @ 2014-11-22 14:54 UTC (permalink / raw)
  To: kernelnewbies



On November 22, 2014 9:43:23 AM EST, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
> [...]

That provided little clarity.

File systems like xfs queue (delay) significant amounts of actual data before writing it to disk.  The same is true of journal data.  If all you are doing is caching the parity up until there is enough to bother with, then a filesystem designed for streamed data already does that for the data drive, thus you don't need to do anything new for the parity drive, just run it in sync with the data drive.

At this point I interpret your proposal to be:

Implement a Raid 4 like setup, but instead of striping the data drives, concatenate them.

That is something I haven't seen done, but I can see why you would want it.  Implementing via unionfs I don't understand, but as a new device mapper mechanism it seems very logical.

Obviously, I'm not a device mapper maintainer, so I'm not saying it would be accepted, but if I'm right you can now frame the discussion with just a few sentences which explain your goal.

Greg
-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.


* Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-22 14:54             ` Greg Freemyer
@ 2014-11-24  5:36               ` SandeepKsinha
  2014-11-24  6:48                 ` Anshuman Aggarwal
  0 siblings, 1 reply; 44+ messages in thread
From: SandeepKsinha @ 2014-11-24  5:36 UTC (permalink / raw)
  To: kernelnewbies

On Sat, Nov 22, 2014 at 8:24 PM, Greg Freemyer <greg.freemyer@gmail.com>
wrote:

> [...]

RAID4 support does not exist in the mainline. Anshuman, you might want to
reach out to Neil Brown who is the maintainer for dmraid.
IIUC, your requirement can be well implemented by writing a new device
mapper target. That will make it modular and will help you make
improvements to it easily.








-- 
Regards,
Sandeep.






"To learn is to change. Education is a process that changes the learner."


* Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-24  5:36               ` SandeepKsinha
@ 2014-11-24  6:48                 ` Anshuman Aggarwal
  2014-11-24 13:19                   ` Greg Freemyer
  0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-11-24  6:48 UTC (permalink / raw)
  To: kernelnewbies

Sandeep,
 This isn't exactly RAID4 (only thing in common is a single parity
disk but the data is not striped at all). I did bring it up on the
linux-raid mailing list and have had a short conversation with Neil.
He wasn't too excited about device mapper but didn't indicate why or
why not.

I would like to have this as a layer for each block device on top of
the original block devices (intercepting write requests to the block
devices and updating the parity disk). Is device mapper the right
interface? What are the others? Also, if I don't store the metadata on
the block device itself (to allow the block device to be unaware of
the RAID4 on top), how would the kernel be informed of which devices
together form the Split RAID?

Appreciate the help.

Thanks,
Anshuman

On 24 November 2014 at 11:06, SandeepKsinha <sandeepksinha@gmail.com> wrote:
> [...]


* Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-24  6:48                 ` Anshuman Aggarwal
@ 2014-11-24 13:19                   ` Greg Freemyer
  2014-11-24 17:28                     ` Anshuman Aggarwal
  0 siblings, 1 reply; 44+ messages in thread
From: Greg Freemyer @ 2014-11-24 13:19 UTC (permalink / raw)
  To: kernelnewbies



On November 24, 2014 1:48:48 AM EST, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>Sandeep,
> This isn't exactly RAID4 (only thing in common is a single parity
>disk but the data is not striped at all). I did bring it up on the
>linux-raid mailing list and have had a short conversation with Neil.
>He wasn't too excited about device mapper but didn't indicate why or
>why not.

If it was early in your proposal it may simply be he didn't understand it.

The delayed writes to the parity disk you described would have been tough for device mapper to manage.  It doesn't typically maintain its own longer term buffers, so that would have been something that might have given him concern.  The only reason you provided was reduced wear and tear for the parity drive.

Reduced wear and tear in this case is a red herring.  The kernel already buffers writes to the data disk, so no need to separately buffer parity writes.

>I would like to have this as a layer for each block device on top of
>the original block devices (intercepting write requests to the block
>devices and updating the parity disk). Is device mapper the write
>interface?

I think yes, but dm and md are actually separate.  I think of dm as a subset of md, but if you are going to really do this you will need to learn the details better than I know them:

https://www.kernel.org/doc/Documentation/device-mapper/dm-raid.txt

You will need to add code to both the dm and md kernel code.

I assume you know that both mdraid (mdadm) and lvm userspace tools are used to manage device mapper, so you would have to add user space support to mdraid/lvm as well.

> What are the others? 

Well btrfs as an example incorporates a lot of raid capability into the filesystem.  Thus btrfs is a monolithic driver that has consumed much of the dm/md layer.  I can't speak to why they are doing that, but I find it troubling.  Having monolithic aspects to the kernel has always been something the Linux kernel avoided.

> Also if I don't store the metadata on
>the block device itself (to allow the block device to be unaware of
>the RAID4 on top...how would the kernel be informed of which devices
>together form the Split RAID.

I don't understand the question.

I haven't thought through the process, but with mdraid/lvm you would identify the physical drives as under dm control.  (mdadm for md, pvcreate for dm). Then configure the split raid setup.

Have you gone through the process of creating a raid5 with mdadm?  If not, at least read a howto about it.

https://raid.wiki.kernel.org/index.php/RAID_setup

I assume you would have mdadm form your multi-disk split raid volume composed of all the physical disks, then use lvm commands to define the block range on the first drive as a lv (logical volume).  Same for the other data drives.

Then use mkfs to put a filesystem on each lv.

The filesystem has no knowledge there is a split raid below it.  It simply reads/writes to the overall device; device mapper is layered below it and triggers the required i/o calls.

Ie. For a read, it is a straight passthrough.  For a write, the old data and old parity have to be read in, modified, written out.  Device mapper does this now for raid 4/5/6, so most of the code is in place.
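
For a sense of scale, a bare-bones pass-through target against the device mapper API of that era looks roughly like the sketch below.  The "splitraid" name, the missing parity path and the error handling are placeholders, and the exact callback signatures vary between kernel versions, so treat it as an illustration rather than working code:

#include <linux/module.h>
#include <linux/device-mapper.h>
#include <linux/slab.h>
#include <linux/bio.h>

/* One underlying data device per target instance. */
struct splitraid_ctx {
        struct dm_dev *data_dev;
};

/* Constructor: the table line passes the backing device as the only argument. */
static int splitraid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
{
        struct splitraid_ctx *sc;

        if (argc != 1) {
                ti->error = "Expected one argument: <data device>";
                return -EINVAL;
        }

        sc = kmalloc(sizeof(*sc), GFP_KERNEL);
        if (!sc)
                return -ENOMEM;

        if (dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
                          &sc->data_dev)) {
                kfree(sc);
                ti->error = "Device lookup failed";
                return -EINVAL;
        }

        ti->private = sc;
        return 0;
}

static void splitraid_dtr(struct dm_target *ti)
{
        struct splitraid_ctx *sc = ti->private;

        dm_put_device(ti, sc->data_dev);
        kfree(sc);
}

/* Reads pass straight through; a real Split RAID target would also
 * intercept writes here and queue the XOR delta for the parity device. */
static int splitraid_map(struct dm_target *ti, struct bio *bio)
{
        struct splitraid_ctx *sc = ti->private;

        bio->bi_bdev = sc->data_dev->bdev;
        return DM_MAPIO_REMAPPED;
}

static struct target_type splitraid_target = {
        .name    = "splitraid",
        .version = {0, 1, 0},
        .module  = THIS_MODULE,
        .ctr     = splitraid_ctr,
        .dtr     = splitraid_dtr,
        .map     = splitraid_map,
};

static int __init splitraid_init(void)
{
        return dm_register_target(&splitraid_target);
}

static void __exit splitraid_exit(void)
{
        dm_unregister_target(&splitraid_target);
}

module_init(splitraid_init);
module_exit(splitraid_exit);
MODULE_LICENSE("GPL");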

>Appreciate the help.
>
>Thanks,
>Anshuman

I just realized I replied to a top post.

Seriously, don't do that on kernel lists if you want to be taken seriously.  It immediately identifies you as unfamiliar with the kernel mailing list netiquette.

Greg
-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.


* Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-24 13:19                   ` Greg Freemyer
@ 2014-11-24 17:28                     ` Anshuman Aggarwal
  2014-11-24 18:10                       ` Valdis.Kletnieks at vt.edu
  2014-11-25  4:56                       ` Greg Freemyer
  0 siblings, 2 replies; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-11-24 17:28 UTC (permalink / raw)
  To: kernelnewbies

On 24 November 2014 at 18:49, Greg Freemyer <greg.freemyer@gmail.com> wrote:
>
>
> On November 24, 2014 1:48:48 AM EST, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>>Sandeep,
>> This isn't exactly RAID4 (only thing in common is a single parity
>>disk but the data is not striped at all). I did bring it up on the
>>linux-raid mailing list and have had a short conversation with Neil.
>>He wasn't too excited about device mapper but didn't indicate why or
>>why not.
>
> If it was early in your proposal it may simply be he didn't understand it.
>
> The delayed writes to the parity disk you described would have been tough for device mapper to manage.  It doesn't typically maintain its own longer term buffers, so that would have been something that might have given him concern.  The only reason you provided was reduced wear and tear for the parity drive.
>
> Reduced wear and tear in this case is a red herring.  The kernel already buffers writes to the data disk, so no need to separately buffer parity writes.

Fair enough, the delay in buffering for the parity writes is an
independent issue which can be deferred easily.

>
>>I would like to have this as a layer for each block device on top of
>>the original block devices (intercepting write requests to the block
>>devices and updating the parity disk). Is device mapper the write
>>interface?
>
> I think yes, but dm and md are actually separate.  I think of dm as a subset of md, but if you are going to really do this you will need to learn the details better than I know them:
>
> https://www.kernel.org/doc/Documentation/device-mapper/dm-raid.txt
>
> You will need to add code to both the dm and md kernel code.
>
> I assume you know that both mdraid (mdadm) and lvm userspace tools are used to manage device mapper, so you would have to add user space support to mdraid/lvm as well.
>
>> What are the others?
>
> Well btrfs as an example incorporates a lot of raid capability into the filesystem.  Thus btrfs is a monolithic driver that has consumed much of the dm/md layer.  I can't speak to why they are doing that, but I find it troubling.  Having monolithic aspects to the kernel has always been something the Linux kernel avoided.
>
>> Also if I don't store the metadata on
>>the block device itself (to allow the block device to be unaware of
>>the RAID4 on top...how would the kernel be informed of which devices
>>together form the Split RAID.
>
> I don't understand the question.

mdadm typically has a metadata superblock stored on the block device
which identifies the block device as part of the RAID and typically
prevents it from being directly recognized by file system code. I was
wondering if Split RAID block devices can be made unaware of the RAID
scheme on top and be fully mountable and usable without the raid
drivers (of course invalidating the parity if any of them are written
to). This would allow a parity disk to be added to existing block
devices without having to set up a superblock on the underlying
devices.

Hope that is clear now?
>
> I haven't thought through the process, but with mdraid/lvm you would identify the physical drives as under dm control.  (mdadm for md, pvcreate for dm). Then configure the split raid setup.
>
> Have you gone through the process of creating a raid5 with mdadm.  If not at least read a howto about it.
>
> https://raid.wiki.kernel.org/index.php/RAID_setup

Actually, I have maintained a 6-disk RAID5/RAID6 array with mdadm for
more than a few years and handled multiple failures. I am reasonably
familiar with md reconstruction too. It is the performance-oriented
but disk-intensive nature of mdadm that I would like to get away from
for a home media server.

>
> I assume you would have mdadm form your multi-disk split raid volume composed of all the physical disks, then use lvm commands to define the block range on the the first drive as a lv (logical volume).  Same for the other data drives.
>
> Then use mkfs to put a filesystem on each lv.

Maybe it can also be done via md raid by creating a partitionable
array where each partition corresponds to an underlying block device,
without any striping.

>
> The filesystem has no knowledge there is a split raid below it.  It simply reads/writes to the overall, device mapper is layered below it and triggers the required i/o calls.
>
> Ie. For a read, it is a straight passthrough.  For a write, the old data and old parity have to be read in, modified, written out.  Device mapper does this now for raid 4/5/6, so most of the code is in place.

Exactly. Reads are passthrough; writes lead to the parity write being
triggered. The only remaining concern for me is that the md superblock
will require the block devices to be initialized using mdadm. That can
be acceptable I suppose, but an ideal solution would be able to use
existing block devices (which would be untouched), put passthrough
block devices on top of them, and manage the parity updates on the
parity block device. The information about which block devices
comprise the array can be stored in a config file etc. and does not
need a superblock as badly as a regular raid setup does.

>
>>Appreciate the help.
>>
>>Thanks,
>>Anshuman
>
> I just realized I replied to a top post.
>
> Seriously, don't do that on kernel lists if you want to be taken seriously.  It immediately identifies you as unfamiliar with the kernel mailing list netiquette.
>
> Greg
> --
> Sent from my Android phone with K-9 Mail. Please excuse my brevity.

Sorry, just getting used to the kernel mailing lists; most tools put
the default reply at the top. Thanks for replying and reminding me.

Anshuman


* Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-24 17:28                     ` Anshuman Aggarwal
@ 2014-11-24 18:10                       ` Valdis.Kletnieks at vt.edu
  2014-11-25  4:56                       ` Greg Freemyer
  1 sibling, 0 replies; 44+ messages in thread
From: Valdis.Kletnieks at vt.edu @ 2014-11-24 18:10 UTC (permalink / raw)
  To: kernelnewbies

On Mon, 24 Nov 2014 22:58:08 +0530, Anshuman Aggarwal said:

> prevents it from directly recognized by file system code . I was
> wondering if Split RAID block devices can be made to be unaware to the
> RAID scheme on top and be fully mountable and usable without the raid
> drivers (of course invalidating the parity if any of them are written

Well, there are two basic cases:

1) You have one device and you're adding a parity device - which is
basically just creating a raid-1 mirror when you get down to it.

2) You have some collection of devices in a stripe/concat/whatever, and are
adding a parity device.  This only works if the existing stripe/concat
is already functional *without* the parity device (which implies that said
stripe or concat has to be an already-supported structure)


* Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-24 17:28                     ` Anshuman Aggarwal
  2014-11-24 18:10                       ` Valdis.Kletnieks at vt.edu
@ 2014-11-25  4:56                       ` Greg Freemyer
  2014-11-27 17:50                         ` Anshuman Aggarwal
  1 sibling, 1 reply; 44+ messages in thread
From: Greg Freemyer @ 2014-11-25  4:56 UTC (permalink / raw)
  To: kernelnewbies



On November 24, 2014 12:28:08 PM EST, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>On 24 November 2014 at 18:49, Greg Freemyer <greg.freemyer@gmail.com>
>wrote:
>>
>>
>> On November 24, 2014 1:48:48 AM EST, Anshuman Aggarwal
><anshuman.aggarwal@gmail.com> wrote:
>>>Sandeep,
>>> This isn't exactly RAID4 (only thing in common is a single parity
>>>disk but the data is not striped at all). I did bring it up on the
>>>linux-raid mailing list and have had a short conversation with Neil.
>>>He wasn't too excited about device mapper but didn't indicate why or
>>>why not.
>>
>> If it was early in your proposal it may simply be he didn't
>understand it.
>>
>> The delayed writes to the parity disk you described would have been
>tough for device mapper to manage.  It doesn't typically maintain its
>own longer term buffers, so that would have been something that might
>have given him concern.  The only reason you provided was reduced wear
>and tear for the parity drive.
>>
>> Reduced wear and tear in this case is a red herring.  The kernel
>already buffers writes to the data disk, so no need to separately
>buffer parity writes.
>
>Fair enough, the delay in buffering for the parity writes is an
>independent issue which can be deferred easily.
>
>>
>>>I would like to have this as a layer for each block device on top of
>>>the original block devices (intercepting write requests to the block
>>>devices and updating the parity disk). Is device mapper the write
>>>interface?
>>
>> I think yes, but dm and md are actually separate.  I think of dm as a
>subset of md, but if you are going to really do this you will need to
>learn the details better than I know them:
>>
>> https://www.kernel.org/doc/Documentation/device-mapper/dm-raid.txt
>>
>> You will need to add code to both the dm and md kernel code.
>>
>> I assume you know that both mdraid (mdadm) and lvm userspace tools
>are used to manage device mapper, so you would have to add user space
>support to mdraid/lvm as well.
>>
>>> What are the others?
>>
>> Well btrfs as an example incorporates a lot of raid capability into
>the filesystem.  Thus btrfs is a monolithic driver that has consumed
>much of the dm/md layer.  I can't speak to why they are doing that, but
>I find it troubling.  Having monolithic aspects to the kernel has
>always been something the Linux kernel avoided.
>>
>>> Also if I don't store the metadata on
>>>the block device itself (to allow the block device to be unaware of
>>>the RAID4 on top...how would the kernel be informed of which devices
>>>together form the Split RAID.
>>
>> I don't understand the question.
>
>mdadm typically has a metadata superblock stored on the block device
>which identifies the block device as part of the RAID and typically
>prevents it from directly recognized by file system code . I was
>wondering if Split RAID block devices can be made to be unaware to the
>RAID scheme on top and be fully mountable and usable without the raid
>drivers (of course invalidating the parity if any of them are written
>to). This allows a parity disk to be added to existing block devices
>without having to setup the superblock on the underlying devices.
>
>Hope that is clear now?

Thank you, I knew about the superblock, but didn't realize that was what you were talking about.

Does this address your desire?

https://raid.wiki.kernel.org/index.php/RAID_superblock_formats#mdadm_v3.0_--_Adding_the_Concept_of_User-Space_Managed_External_Metadata_Formats

FYI: I'm ignorant of any real details and I have not used the above new feature, but it seems to be what you are asking for.

>>
>> I haven't thought through the process, but with mdraid/lvm you would
>identify the physical drives as under dm control.  (mdadm for md,
>pvcreate for dm). Then configure the split raid setup.
>>
>> Have you gone through the process of creating a raid5 with mdadm.  If
>not at least read a howto about it.
>>
>> https://raid.wiki.kernel.org/index.php/RAID_setup
>
>Actually, I have maintained a RAID5, RAID6 6 disk cluster with mdadm
>for more than a few years and handled multiple failures. I am
>reasonably familiar with md reconstruction too. It is the performance
>oriented but disk intensive nature of mdadm that I would like to vary
>on for a home media server.
>
>>
>> I assume you would have mdadm form your multi-disk split raid volume
>composed of all the physical disks, then use lvm commands to define the
>block range on the the first drive as a lv (logical volume).  Same for
>the other data drives.
>>
>> Then use mkfs to put a filesystem on each lv.
>
>Maybe it can also be done via md raid creating a partitionable array
>where each partition corresponds to an underlying block device without
>any striping.
>

I think I agree.

>>
>> The filesystem has no knowledge there is a split raid below it.  It
>simply reads/writes to the overall, device mapper is layered below it
>and triggers the required i/o calls.
>>
>> Ie. For a read, it is a straight passthrough.  For a write, the old
>data and old parity have to be read in, modified, written out.  Device
>mapper does this now for raid 4/5/6, so most of the code is in place.
>
>Exactly. Reads are passthrough, writes lead to the parity write being
>triggered. Only remaining concern for me is that the md super block
>will require block device to be initialized using mdadm. That can be
>acceptable I suppose, but an ideal solution would be able to use
>existing block devices (which would be untouched)...put passthrough
>block device on top of them and manage the parity updation on the
>parity block device. The information about which block devices
>comprise the array can be stored in a config file etc and does not
>need a superblock as badly as a raid setup.

Hopefully the new user space feature does just that.

Greg

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-25  4:56                       ` Greg Freemyer
@ 2014-11-27 17:50                         ` Anshuman Aggarwal
  2014-11-27 18:31                           ` Greg Freemyer
  0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-11-27 17:50 UTC (permalink / raw)
  To: kernelnewbies

On 25 November 2014 at 10:26, Greg Freemyer <greg.freemyer@gmail.com> wrote:
>
>
> On November 24, 2014 12:28:08 PM EST, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>>On 24 November 2014 at 18:49, Greg Freemyer <greg.freemyer@gmail.com>
>>wrote:
>>>
>>>
>>> On November 24, 2014 1:48:48 AM EST, Anshuman Aggarwal
>><anshuman.aggarwal@gmail.com> wrote:
>>>>Sandeep,
>>>> This isn't exactly RAID4 (only thing in common is a single parity
>>>>disk but the data is not striped at all). I did bring it up on the
>>>>linux-raid mailing list and have had a short conversation with Neil.
>>>>He wasn't too excited about device mapper but didn't indicate why or
>>>>why not.
>>>
>>> If it was early in your proposal it may simply be he didn't
>>understand it.
>>>
>>> The delayed writes to the parity disk you described would have been
>>tough for device mapper to manage.  It doesn't typically maintain its
>>own longer term buffers, so that would have been something that might
>>have given him concern.  The only reason you provided was reduced wear
>>and tear for the parity drive.
>>>
>>> Reduced wear and tear in this case is a red herring.  The kernel
>>already buffers writes to the data disk, so no need to separately
>>buffer parity writes.
>>
>>Fair enough, the delay in buffering for the parity writes is an
>>independent issue which can be deferred easily.
>>
>>>
>>>>I would like to have this as a layer for each block device on top of
>>>>the original block devices (intercepting write requests to the block
>>>>devices and updating the parity disk). Is device mapper the write
>>>>interface?
>>>
>>> I think yes, but dm and md are actually separate.  I think of dm as a
>>subset of md, but if you are going to really do this you will need to
>>learn the details better than I know them:
>>>
>>> https://www.kernel.org/doc/Documentation/device-mapper/dm-raid.txt
>>>
>>> You will need to add code to both the dm and md kernel code.
>>>
>>> I assume you know that both mdraid (mdadm) and lvm userspace tools
>>are used to manage device mapper, so you would have to add user space
>>support to mdraid/lvm as well.
>>>
>>>> What are the others?
>>>
>>> Well btrfs as an example incorporates a lot of raid capability into
>>the filesystem.  Thus btrfs is a monolithic driver that has consumed
>>much of the dm/md layer.  I can't speak to why they are doing that, but
>>I find it troubling.  Having monolithic aspects to the kernel has
>>always been something the Linux kernel avoided.
>>>
>>>> Also if I don't store the metadata on
>>>>the block device itself (to allow the block device to be unaware of
>>>>the RAID4 on top...how would the kernel be informed of which devices
>>>>together form the Split RAID.
>>>
>>> I don't understand the question.
>>
>>mdadm typically has a metadata superblock stored on the block device
>>which identifies the block device as part of the RAID and typically
>>prevents it from directly recognized by file system code . I was
>>wondering if Split RAID block devices can be made to be unaware to the
>>RAID scheme on top and be fully mountable and usable without the raid
>>drivers (of course invalidating the parity if any of them are written
>>to). This allows a parity disk to be added to existing block devices
>>without having to setup the superblock on the underlying devices.
>>
>>Hope that is clear now?
>
> Thank you, I knew about the superblock, but didn't realize that was what you were talking about.
>
> Does this address your desire?
>
> https://raid.wiki.kernel.org/index.php/RAID_superblock_formats#mdadm_v3.0_--_Adding_the_Concept_of_User-Space_Managed_External_Metadata_Formats
>
> Fyi: I'm ignorant of any real details and I have not used the above new feature, but it seems to be what you asking for.
>

It doesn't seem to, because it appears that the unified container would
still need to be created before putting any data on the devices. Ideally,
the Split RAID could be added as an afterthought, by just adding a parity
disk (block device) to an existing set of disks (block devices).

>>>
>>> I haven't thought through the process, but with mdraid/lvm you would
>>identify the physical drives as under dm control.  (mdadm for md,
>>pvcreate for dm). Then configure the split raid setup.
>>>
>>> Have you gone through the process of creating a raid5 with mdadm.  If
>>not at least read a howto about it.
>>>
>>> https://raid.wiki.kernel.org/index.php/RAID_setup
>>
>>Actually, I have maintained a RAID5, RAID6 6 disk cluster with mdadm
>>for more than a few years and handled multiple failures. I am
>>reasonably familiar with md reconstruction too. It is the performance
>>oriented but disk intensive nature of mdadm that I would like to vary
>>on for a home media server.
>>
>>>
>>> I assume you would have mdadm form your multi-disk split raid volume
>>composed of all the physical disks, then use lvm commands to define the
>>block range on the the first drive as a lv (logical volume).  Same for
>>the other data drives.
>>>
>>> Then use mkfs to put a filesystem on each lv.
>>
>>Maybe it can also be done via md raid creating a partitionable array
>>where each partition corresponds to an underlying block device without
>>any striping.
>>
>
> I think I agree.
>
>>>
>>> The filesystem has no knowledge there is a split raid below it.  It
>>simply reads/writes to the overall, device mapper is layered below it
>>and triggers the required i/o calls.
>>>
>>> Ie. For a read, it is a straight passthrough.  For a write, the old
>>data and old parity have to be read in, modified, written out.  Device
>>mapper does this now for raid 4/5/6, so most of the code is in place.
>>
>>Exactly. Reads are passthrough, writes lead to the parity write being
>>triggered. Only remaining concern for me is that the md super block
>>will require block device to be initialized using mdadm. That can be
>>acceptable I suppose, but an ideal solution would be able to use
>>existing block devices (which would be untouched)...put passthrough
>>block device on top of them and manage the parity updation on the
>>parity block device. The information about which block devices
>>comprise the array can be stored in a config file etc and does not
>>need a superblock as badly as a raid setup.
>
> Hopefully the new user space feature does just that.
>
> Greg

Although the user-space feature doesn't seem to, Neil has suggested a way
to try out RAID-4 in a manner that creates a Split-RAID-like array. I will
post on this mailing list if it succeeds.
>
> --
> Sent from my Android phone with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-27 17:50                         ` Anshuman Aggarwal
@ 2014-11-27 18:31                           ` Greg Freemyer
  0 siblings, 0 replies; 44+ messages in thread
From: Greg Freemyer @ 2014-11-27 18:31 UTC (permalink / raw)
  To: kernelnewbies

On Thu, Nov 27, 2014 at 12:50 PM, Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:
> On 25 November 2014 at 10:26, Greg Freemyer <greg.freemyer@gmail.com> wrote:
>>
>>
>> On November 24, 2014 12:28:08 PM EST, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>>>On 24 November 2014 at 18:49, Greg Freemyer <greg.freemyer@gmail.com>
>>>wrote:
<snip>

>>>>> Also if I don't store the metadata on
>>>>>the block device itself (to allow the block device to be unaware of
>>>>>the RAID4 on top...how would the kernel be informed of which devices
>>>>>together form the Split RAID.
>>>>
>>>> I don't understand the question.
>>>
>>>mdadm typically has a metadata superblock stored on the block device
>>>which identifies the block device as part of the RAID and typically
>>>prevents it from directly recognized by file system code . I was
>>>wondering if Split RAID block devices can be made to be unaware to the
>>>RAID scheme on top and be fully mountable and usable without the raid
>>>drivers (of course invalidating the parity if any of them are written
>>>to). This allows a parity disk to be added to existing block devices
>>>without having to setup the superblock on the underlying devices.
>>>
>>>Hope that is clear now?
>>
>> Thank you, I knew about the superblock, but didn't realize that was what you were talking about.
>>
>> Does this address your desire?
>>
>> https://raid.wiki.kernel.org/index.php/RAID_superblock_formats#mdadm_v3.0_--_Adding_the_Concept_of_User-Space_Managed_External_Metadata_Formats
>>
>> Fyi: I'm ignorant of any real details and I have not used the above new feature, but it seems to be what you asking for.
>>
>
> It doesn't seem to because it appears that the unified container would
> still need to be the created before putting any data on the device.
> Ideally, the split raid can be added as an after thought by just
> adding a parity disk (block device) to an existing set of disks (block
> devices)

So what precisely does "creating a container" really do?

i.e. have you run strace on "mdadm --create --verbose /dev/md/imsm
/dev/sd[b-g] --raid-devices 4 --metadata=imsm"?
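
(Something along these lines would show which device nodes and files the
container create actually touches; the trace filter is only a suggestion:

    strace -f -e trace=open,openat,write mdadm --create --verbose \
        /dev/md/imsm /dev/sd[b-g] --raid-devices 4 --metadata=imsm
)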

I'm assuming that for your use case /etc/ could hold a metadata file that
defines the container, and then a second metadata file that defines the
splitRAID setup.

>>>>
>>>> The filesystem has no knowledge there is a split raid below it.  It
>>>simply reads/writes to the overall, device mapper is layered below it
>>>and triggers the required i/o calls.
>>>>
>>>> Ie. For a read, it is a straight passthrough.  For a write, the old
>>>data and old parity have to be read in, modified, written out.  Device
>>>mapper does this now for raid 4/5/6, so most of the code is in place.
>>>
>>>Exactly. Reads are passthrough, writes lead to the parity write being
>>>triggered. Only remaining concern for me is that the md super block
>>>will require block device to be initialized using mdadm. That can be
>>>acceptable I suppose, but an ideal solution would be able to use
>>>existing block devices (which would be untouched)...put passthrough
>>>block device on top of them and manage the parity updation on the
>>>parity block device. The information about which block devices
>>>comprise the array can be stored in a config file etc and does not
>>>need a superblock as badly as a raid setup.
>>
>> Hopefully the new user space feature does just that.
>>
>> Greg
>
> Although the user space feature doesn't seem to, Neil has suggested a
> way to try out using RAID-4 in a manner so as to create a split raid
> like array. Will post on this mailing list if it succeeds.

I've used a hardware RAID-1 setup that did what you want.  If needed, you
could pull out a drive, connect it straight to another computer, and
everything just worked (except the mirroring).

Since you're working with Neil you have the expert on the case, but don't
forget that most drives have unused space between sector 1 and the start of
the first partition.  Traditionally sectors 1-62 were unused/blank; newer
systems start the first partition at sector 2048, so sectors 1-2047 are
blank.

I don't recall off-hand which sectors a GPT setup uses, but I assume
you can find an area that is rarely used.
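
(For example, to see in sectors where the first partition actually starts
on a given disk; nothing here is specific to this proposal:

    parted /dev/sdb unit s print
)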

Greg
>> --
>> Sent from my Android phone with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-12-17  6:40                           ` Anshuman Aggarwal
@ 2015-01-06 11:40                             ` Anshuman Aggarwal
  0 siblings, 0 replies; 44+ messages in thread
From: Anshuman Aggarwal @ 2015-01-06 11:40 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mdadm

On 17 December 2014 at 12:10, Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:
> On 17 December 2014 at 03:19, NeilBrown <neilb@suse.de> wrote:
>> On Tue, 16 Dec 2014 21:55:15 +0530 Anshuman Aggarwal
>> <anshuman.aggarwal@gmail.com> wrote:
>>
>>> On 2 December 2014 at 17:26, Anshuman Aggarwal
>>> <anshuman.aggarwal@gmail.com> wrote:
>>> > It works! (Atleast on a sample 5 MB device with 5 x 1MB partitions :-)
>>> > will find more space on my drives and do a larger test but don't see
>>> > why it shouldn't work)
>>> > Here are the following caveats (and questions):
>>> > - Neil, like you pointed out, the power of 2 chunk size will probably
>>> > need a code change (in the kernel or only in the userspace tool?)
>>
>> In the kernel too.
>
> Is this something that you would consider implementing soon? Is there
> a performance/other impact to any other consideration to remove this
> limitation.. could you elaborate on the reason why it was there in the
> first place?
>
> If this is a case of patches are welcome, please guide on where to
> start looking/working even if its just
>
>>
>>> >     - Any performance or other reasons why a terabyte size chunk may
>>> > not be feasible?
>>
>> Not that I can think of.
>>
>>> > - Implications of safe_mode_delay
>>> >     - Would the metadata be updated on the block device be written to
>>> > and the parity device as well?
>>
>> Probably.  Hard to give a specific answer to vague question.
>
> I should clarify.
>
> For example in a 5 device RAID4, lets say block is being written to
> device 1 and parity is on device 5 and devices 2,3,4 are sleeping
> (spun down). If we set safe_mode_delay to 0 and md decides to update
> the parity without involving the blocks on the other 3 devices and
> just updates the parity by doing a read, compute, write to device 5
> will the metadata be updated on both device 1 and 5 even though
> safe_mode_delay is 0?
>
>>
>>> >     - If the drive  fails which is the same as the drive being written
>>> > to, would that lack of metadata updates to the other devices affect
>>> > reconstruction?
>>
>> Again, to give a precise answer, a detailed question is needed.  Obviously
>> any change would have to made in such a way to ensure that things which
>> needed to work, did work.
>
> Continuing from the previous example, lets say device 1 fails after a
> write which only updated metadata on 1 and 5 while 2,3,4 were
> sleeping. In that case to access the data from 1, md will use 2,3,4,5
> but will it then update the metadata from 5 onto 2,3,4? I hope I am
> making this clear.
>
>>
>>
>>> > - Adding new devices (is it possible to move the parity to the disk
>>> > being added? How does device addition work for RAID4 ...is it added as
>>> > a zero-ed out device with parity disk remaining the same)
>>
>> RAID5 or RAID6 with ALGORITHM_PARITY_0 puts the parity on the early devices.
>> Currently if you add a device to such an array ...... I'm not sure what it
>> will do.  It should be possible to make it just write zeros out.
>>
>
> Once again, is this something that can make its way to your roadmap?
> If so, great.. otherwise could you steer me towards where in the md
> kernel and mdadm source I should be looking to make these changes.
> Thanks again.
>
>>
>> NeilBrown
>>
>>
>>> >
>>> >
>>>
>>> Neil, sorry to try to bump this thread. Could you please look over the
>>> questions and address the points on the remaining items that can make
>>> it a working solution? Thanks
>>

Hi Neil,
 Could you please find a minute to give your input on the above? Your
guidance will go a long way towards making this a reality, and it may be
useful to the community at large given the new Seagate 8TB archival drives,
which seem to be geared towards occasional use but would still benefit from
RAID-like redundancy.

Many thanks,
Anshuman

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-12-16 21:49                         ` NeilBrown
@ 2014-12-17  6:40                           ` Anshuman Aggarwal
  2015-01-06 11:40                             ` Anshuman Aggarwal
  0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-12-17  6:40 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mdadm

On 17 December 2014 at 03:19, NeilBrown <neilb@suse.de> wrote:
> On Tue, 16 Dec 2014 21:55:15 +0530 Anshuman Aggarwal
> <anshuman.aggarwal@gmail.com> wrote:
>
>> On 2 December 2014 at 17:26, Anshuman Aggarwal
>> <anshuman.aggarwal@gmail.com> wrote:
>> > It works! (Atleast on a sample 5 MB device with 5 x 1MB partitions :-)
>> > will find more space on my drives and do a larger test but don't see
>> > why it shouldn't work)
>> > Here are the following caveats (and questions):
>> > - Neil, like you pointed out, the power of 2 chunk size will probably
>> > need a code change (in the kernel or only in the userspace tool?)
>
> In the kernel too.

Is this something that you would consider implementing soon? Is there a
performance impact, or any other consideration, to removing this
limitation? Could you elaborate on why it was there in the first place?

If this is a case of "patches are welcome", please guide me on where to
start looking/working, even if it's just a pointer to the relevant code.

>
>> >     - Any performance or other reasons why a terabyte size chunk may
>> > not be feasible?
>
> Not that I can think of.
>
>> > - Implications of safe_mode_delay
>> >     - Would the metadata be updated on the block device be written to
>> > and the parity device as well?
>
> Probably.  Hard to give a specific answer to vague question.

I should clarify.

For example, in a 5-device RAID4, let's say a block is being written to
device 1, the parity is on device 5, and devices 2, 3 and 4 are sleeping
(spun down). If we set safe_mode_delay to 0 and md decides to update the
parity without involving the blocks on the other 3 devices, just updating
the parity with a read, compute and write on device 5, will the metadata be
updated on both device 1 and device 5 even though safe_mode_delay is 0?

>
>> >     - If the drive  fails which is the same as the drive being written
>> > to, would that lack of metadata updates to the other devices affect
>> > reconstruction?
>
> Again, to give a precise answer, a detailed question is needed.  Obviously
> any change would have to made in such a way to ensure that things which
> needed to work, did work.

Continuing from the previous example, let's say device 1 fails after a
write that only updated the metadata on devices 1 and 5 while 2, 3 and 4
were sleeping. In that case, to access the data from device 1, md will use
devices 2, 3, 4 and 5, but will it then update the metadata from 5 onto
2, 3 and 4? I hope I am making this clear.

>
>
>> > - Adding new devices (is it possible to move the parity to the disk
>> > being added? How does device addition work for RAID4 ...is it added as
>> > a zero-ed out device with parity disk remaining the same)
>
> RAID5 or RAID6 with ALGORITHM_PARITY_0 puts the parity on the early devices.
> Currently if you add a device to such an array ...... I'm not sure what it
> will do.  It should be possible to make it just write zeros out.
>

Once again, is this something that can make its way onto your roadmap? If
so, great; otherwise, could you steer me towards where in the md kernel and
mdadm source I should be looking to make these changes? Thanks again.

>
> NeilBrown
>
>
>> >
>> >
>>
>> Neil, sorry to try to bump this thread. Could you please look over the
>> questions and address the points on the remaining items that can make
>> it a working solution? Thanks
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-12-16 16:25                       ` Anshuman Aggarwal
@ 2014-12-16 21:49                         ` NeilBrown
  2014-12-17  6:40                           ` Anshuman Aggarwal
  0 siblings, 1 reply; 44+ messages in thread
From: NeilBrown @ 2014-12-16 21:49 UTC (permalink / raw)
  To: Anshuman Aggarwal; +Cc: Mdadm


On Tue, 16 Dec 2014 21:55:15 +0530 Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:

> On 2 December 2014 at 17:26, Anshuman Aggarwal
> <anshuman.aggarwal@gmail.com> wrote:
> > It works! (Atleast on a sample 5 MB device with 5 x 1MB partitions :-)
> > will find more space on my drives and do a larger test but don't see
> > why it shouldn't work)
> > Here are the following caveats (and questions):
> > - Neil, like you pointed out, the power of 2 chunk size will probably
> > need a code change (in the kernel or only in the userspace tool?)

In the kernel too.

> >     - Any performance or other reasons why a terabyte size chunk may
> > not be feasible?

Not that I can think of.

> > - Implications of safe_mode_delay
> >     - Would the metadata be updated on the block device be written to
> > and the parity device as well?

Probably.  Hard to give a specific answer to a vague question.

> >     - If the drive  fails which is the same as the drive being written
> > to, would that lack of metadata updates to the other devices affect
> > reconstruction?

Again, to give a precise answer, a detailed question is needed.  Obviously
any change would have to be made in such a way as to ensure that things
which needed to work did work.


> > - Adding new devices (is it possible to move the parity to the disk
> > being added? How does device addition work for RAID4 ...is it added as
> > a zero-ed out device with parity disk remaining the same)

RAID5 or RAID6 with ALGORITHM_PARITY_0 puts the parity on the early devices.
Currently if you add a device to such an array ...... I'm not sure what it
will do.  It should be possible to make it just write zeros out.
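
(For what it's worth, a rough sketch of asking mdadm for such a
parity-on-the-first-device layout; the device names are made up:

    mdadm --create /dev/md0 --level=5 --layout=parity-first \
        --raid-devices=5 /dev/sd[b-f]
)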


NeilBrown


> >
> >
> 
> Neil, sorry to try to bump this thread. Could you please look over the
> questions and address the points on the remaining items that can make
> it a working solution? Thanks



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-12-02 11:56                     ` Anshuman Aggarwal
@ 2014-12-16 16:25                       ` Anshuman Aggarwal
  2014-12-16 21:49                         ` NeilBrown
  0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-12-16 16:25 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mdadm

On 2 December 2014 at 17:26, Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:
> It works! (Atleast on a sample 5 MB device with 5 x 1MB partitions :-)
> will find more space on my drives and do a larger test but don't see
> why it shouldn't work)
> Here are the following caveats (and questions):
> - Neil, like you pointed out, the power of 2 chunk size will probably
> need a code change (in the kernel or only in the userspace tool?)
>     - Any performance or other reasons why a terabyte size chunk may
> not be feasible?
> - Implications of safe_mode_delay
>     - Would the metadata be updated on the block device be written to
> and the parity device as well?
>     - If the drive  fails which is the same as the drive being written
> to, would that lack of metadata updates to the other devices affect
> reconstruction?
> - Adding new devices (is it possible to move the parity to the disk
> being added? How does device addition work for RAID4 ...is it added as
> a zero-ed out device with parity disk remaining the same)
>
>

Neil, sorry to bump this thread. Could you please look over the questions
and address the remaining points that could make this a working solution?
Thanks.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-12-01 21:46                   ` NeilBrown
@ 2014-12-02 11:56                     ` Anshuman Aggarwal
  2014-12-16 16:25                       ` Anshuman Aggarwal
  0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-12-02 11:56 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mdadm

It works! (At least on a sample 5 MB device with 5 x 1 MB partitions :-) I
will find more space on my drives and do a larger test, but I don't see why
it shouldn't work; a sketch of the setup follows the questions below.)
Here are the caveats (and questions):
- Neil, like you pointed out, the power-of-2 chunk size will probably
need a code change (in the kernel or only in the userspace tool?)
    - Any performance or other reasons why a terabyte-size chunk may
not be feasible?
- Implications of safe_mode_delay
    - Would the metadata be updated on the block device being written to
and on the parity device as well?
    - If the drive that fails is the same as the drive being written
to, would that lack of metadata updates to the other devices affect
reconstruction?
- Adding new devices (is it possible to move the parity to the disk
being added? How does device addition work for RAID4: is it added as
a zeroed-out device with the parity disk remaining the same?)
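
(A minimal sketch of the kind of setup being described, for anyone wanting
to reproduce it; the backing file sizes, loop device names and chunk value
below are assumptions, not the exact commands that were run:

    # a few small backing files standing in for real disks
    for i in 0 1 2 3 4; do
        dd if=/dev/zero of=/tmp/d$i.img bs=1M count=128
        losetup /dev/loop$i /tmp/d$i.img
    done
    # RAID4 with the largest power-of-2 chunk that fits each member, so one
    # chunk covers a whole device and nothing useful is striped across disks
    mdadm --create /dev/md0 --level=4 --raid-devices=5 --chunk=65536 \
        /dev/loop[0-4]
    # then partition /dev/md0 on those 64 MiB chunk boundaries and
    # mkfs/mount each partition as an independent filesystem
)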


On 2 December 2014 at 03:16, NeilBrown <neilb@suse.de> wrote:
> On Mon, 1 Dec 2014 22:04:42 +0530 Anshuman Aggarwal
> <anshuman.aggarwal@gmail.com> wrote:
>
>> On 1 December 2014 at 21:30, Anshuman Aggarwal
>> <anshuman.aggarwal@gmail.com> wrote:
>> > On 26 November 2014 at 11:54, Anshuman Aggarwal
>> > <anshuman.aggarwal@gmail.com> wrote:
>> >> On 25 November 2014 at 04:20, NeilBrown <neilb@suse.de> wrote:
>> >>> On Mon, 24 Nov 2014 12:59:47 +0530 Anshuman Aggarwal
>> >>> <anshuman.aggarwal@gmail.com> wrote:
>> >>>
>> >>>> On 3 November 2014 at 11:22, NeilBrown <neilb@suse.de> wrote:
>> >>>> > On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
>> >>>> > <anshuman.aggarwal@gmail.com> wrote:
>> >>>> >
>> >>>> >> Would chunksize==disksize work? Wouldn't that lead to the entire
>> >>>> >> parity be invalidated for any write to any of the disks (assuming md
>> >>>> >> operates at a chunk level)...also please see my reply below
>> >>>> >
>> >>>> > Operating at a chunk level would be a very poor design choice.  md/raid5
>> >>>> > operates in units of 1 page (4K).
>> >>>>
>> >>>> It appears that my requirement may be met by a partitionable md raid 4
>> >>>> array where the partitions are all on individual underlying block
>> >>>> devices not striped across the block devices. Is that currently
>> >>>> possible with md raid? I dont' see how but such an enhancement could
>> >>>> do all that I had outlined earlier
>> >>>>
>> >>>> Is this possible to implement using RAID4 and MD already?
>> >>>
>> >>> Nearly.  RAID4 currently requires the chunk size to be a power of 2.
>> >>> Rounding down the size of your drives to match that could waste nearly half
>> >>> the space.  However it should work as a proof-of-concept.
>> >>>
>> >>> RAID0 supports non-power-of-2 chunk sizes.  Doing the same thing for
>> >>> RAID4/5/6 would be quite possible.
>> >>>
>> >>>>   can the
>> >>>> partitions be made to write to individual block devices such that
>> >>>> parity updates don't require reading all devices?
>> >>>
>> >>> md/raid4 will currently tries to minimize total IO requests when performing
>> >>> an update, but prefer spreading the IO over more devices if the total number
>> >>> of requests is the same.
>> >>>
>> >>> So for a 4-drive RAID4, Updating a single block can be done by:
>> >>>   read old data block, read parity, write data, write parity - 4 IO requests
>> >>> or
>> >>>   read other 2 data blocks, write data, write parity - 4 IO requests.
>> >>>
>> >>> In this case it will prefer the second, which is not what you want.
>> >>> With 5-drive RAID4, the second option will require 5 IO requests, so the first
>> >>> will be chosen.
>> >>> It is quite trivial to flip this default for testing
>> >>>
>> >>> -       if (rmw < rcw && rmw > 0) {
>> >>> +       if (rmw <= rcw && rmw > 0) {
>> >>>
>> >>>
>> >>> If you had 5 drives, you could experiment with no code changes.
>> >>> Make the chunk size the largest power of 2 that fits in the device, and then
>> >>> partition to align the partitions on those boundaries.
>> >>
>> >> If the chunk size is almost the same as the device size, I assume the
>> >> entire chunk is not invalidated for parity on writing to a single
>> >> block? i.e. if only 1 block is updated only that blocks parity will be
>> >> read and written and not for the whole chunk? If thats' the case, what
>> >> purpose does a chunk serve in md raid ? If that's not the case, it
>> >> wouldn't work because a single block updation would lead to parity
>> >> being written for the entire chunk, which is the size of the device
>> >>
>> >> I do have more than 5 drives though they are in use currently. I will
>> >> create a small testing partition on each device of the same size and
>> >> run the test on that after ensuring that the drives do go to sleep.
>> >>
>> >>>
>> >>> NeilBrown
>> >>>
>> >
>> > Wouldn't the meta data writes wake up all the disks in the cluster
>> > anyways (defeating the purpose)? This idea will require metadata to
>> > not be written out to each device (is that even possible or on the
>> > cards?)
>> >
>> > I am about to try out your suggestion with the chunk sizes anyways but
>> > thought about the metadata being a major stumbling block.
>> >
>>
>> And it seems to be confirmed that the metadata write is waking up the
>> other drives. On any write to a particular drive the metadata update
>> is accessing all the others.
>>
>> Am I correct in assuming that all metadata is currently written as
>> part of the block device itself and that the external metadata  is
>> still embedded in each of the block devices (only the format of the
>> metadata is defined externally?) I guess to implement this we would
>> need to store metadata elsewhere which may be a major development
>> work. Still that may be a flexibility desired in md raid for other
>> reasons...
>>
>> Neil, your thoughts.
>
> This is exactly why I suggested testing with existing code and seeing how far
> you can get.  Thanks.
>
> For a full solution we probably do need some code changes here, but for
> further testing you could:
> 1/ make sure there is no bitmap (mdadm --grow --bitmap=none)
> 2/ set the safe_mode_delay to 0
>      echo 0 > /sys/block/mdXXX/md/safe_mode_delay
>
> when it won't try to update the metadata until you stop the array, or a
> device fails.
>
> Longer term: it would probably be good to only update the bitmap on the
> devices that are being written to - and to merge all bitmaps when assembling
> the array.  Also when there is a bitmap, the safe_mode functionality should
> probably be disabled.
>
> NeilBrown
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-12-01 16:34                 ` Anshuman Aggarwal
@ 2014-12-01 21:46                   ` NeilBrown
  2014-12-02 11:56                     ` Anshuman Aggarwal
  0 siblings, 1 reply; 44+ messages in thread
From: NeilBrown @ 2014-12-01 21:46 UTC (permalink / raw)
  To: Anshuman Aggarwal; +Cc: Mdadm


On Mon, 1 Dec 2014 22:04:42 +0530 Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:

> On 1 December 2014 at 21:30, Anshuman Aggarwal
> <anshuman.aggarwal@gmail.com> wrote:
> > On 26 November 2014 at 11:54, Anshuman Aggarwal
> > <anshuman.aggarwal@gmail.com> wrote:
> >> On 25 November 2014 at 04:20, NeilBrown <neilb@suse.de> wrote:
> >>> On Mon, 24 Nov 2014 12:59:47 +0530 Anshuman Aggarwal
> >>> <anshuman.aggarwal@gmail.com> wrote:
> >>>
> >>>> On 3 November 2014 at 11:22, NeilBrown <neilb@suse.de> wrote:
> >>>> > On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
> >>>> > <anshuman.aggarwal@gmail.com> wrote:
> >>>> >
> >>>> >> Would chunksize==disksize work? Wouldn't that lead to the entire
> >>>> >> parity be invalidated for any write to any of the disks (assuming md
> >>>> >> operates at a chunk level)...also please see my reply below
> >>>> >
> >>>> > Operating at a chunk level would be a very poor design choice.  md/raid5
> >>>> > operates in units of 1 page (4K).
> >>>>
> >>>> It appears that my requirement may be met by a partitionable md raid 4
> >>>> array where the partitions are all on individual underlying block
> >>>> devices not striped across the block devices. Is that currently
> >>>> possible with md raid? I dont' see how but such an enhancement could
> >>>> do all that I had outlined earlier
> >>>>
> >>>> Is this possible to implement using RAID4 and MD already?
> >>>
> >>> Nearly.  RAID4 currently requires the chunk size to be a power of 2.
> >>> Rounding down the size of your drives to match that could waste nearly half
> >>> the space.  However it should work as a proof-of-concept.
> >>>
> >>> RAID0 supports non-power-of-2 chunk sizes.  Doing the same thing for
> >>> RAID4/5/6 would be quite possible.
> >>>
> >>>>   can the
> >>>> partitions be made to write to individual block devices such that
> >>>> parity updates don't require reading all devices?
> >>>
> >>> md/raid4 will currently tries to minimize total IO requests when performing
> >>> an update, but prefer spreading the IO over more devices if the total number
> >>> of requests is the same.
> >>>
> >>> So for a 4-drive RAID4, Updating a single block can be done by:
> >>>   read old data block, read parity, write data, write parity - 4 IO requests
> >>> or
> >>>   read other 2 data blocks, write data, write parity - 4 IO requests.
> >>>
> >>> In this case it will prefer the second, which is not what you want.
> >>> With 5-drive RAID4, the second option will require 5 IO requests, so the first
> >>> will be chosen.
> >>> It is quite trivial to flip this default for testing
> >>>
> >>> -       if (rmw < rcw && rmw > 0) {
> >>> +       if (rmw <= rcw && rmw > 0) {
> >>>
> >>>
> >>> If you had 5 drives, you could experiment with no code changes.
> >>> Make the chunk size the largest power of 2 that fits in the device, and then
> >>> partition to align the partitions on those boundaries.
> >>
> >> If the chunk size is almost the same as the device size, I assume the
> >> entire chunk is not invalidated for parity on writing to a single
> >> block? i.e. if only 1 block is updated only that blocks parity will be
> >> read and written and not for the whole chunk? If thats' the case, what
> >> purpose does a chunk serve in md raid ? If that's not the case, it
> >> wouldn't work because a single block updation would lead to parity
> >> being written for the entire chunk, which is the size of the device
> >>
> >> I do have more than 5 drives though they are in use currently. I will
> >> create a small testing partition on each device of the same size and
> >> run the test on that after ensuring that the drives do go to sleep.
> >>
> >>>
> >>> NeilBrown
> >>>
> >
> > Wouldn't the meta data writes wake up all the disks in the cluster
> > anyways (defeating the purpose)? This idea will require metadata to
> > not be written out to each device (is that even possible or on the
> > cards?)
> >
> > I am about to try out your suggestion with the chunk sizes anyways but
> > thought about the metadata being a major stumbling block.
> >
> 
> And it seems to be confirmed that the metadata write is waking up the
> other drives. On any write to a particular drive the metadata update
> is accessing all the others.
> 
> Am I correct in assuming that all metadata is currently written as
> part of the block device itself and that the external metadata  is
> still embedded in each of the block devices (only the format of the
> metadata is defined externally?) I guess to implement this we would
> need to store metadata elsewhere which may be a major development
> work. Still that may be a flexibility desired in md raid for other
> reasons...
> 
> Neil, your thoughts.

This is exactly why I suggested testing with existing code and seeing how far
you can get.  Thanks.

For a full solution we probably do need some code changes here, but for
further testing you could:
1/ make sure there is no bitmap (mdadm --grow --bitmap=none)
2/ set the safe_mode_delay to 0
     echo 0 > /sys/block/mdXXX/md/safe_mode_delay

Then it won't try to update the metadata until you stop the array or a
device fails.

Longer term: it would probably be good to only update the bitmap on the
devices that are being written to - and to merge all bitmaps when assembling
the array.  Also when there is a bitmap, the safe_mode functionality should
probably be disabled.

NeilBrown



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-12-01 16:00               ` Anshuman Aggarwal
@ 2014-12-01 16:34                 ` Anshuman Aggarwal
  2014-12-01 21:46                   ` NeilBrown
  0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-12-01 16:34 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mdadm

On 1 December 2014 at 21:30, Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:
> On 26 November 2014 at 11:54, Anshuman Aggarwal
> <anshuman.aggarwal@gmail.com> wrote:
>> On 25 November 2014 at 04:20, NeilBrown <neilb@suse.de> wrote:
>>> On Mon, 24 Nov 2014 12:59:47 +0530 Anshuman Aggarwal
>>> <anshuman.aggarwal@gmail.com> wrote:
>>>
>>>> On 3 November 2014 at 11:22, NeilBrown <neilb@suse.de> wrote:
>>>> > On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
>>>> > <anshuman.aggarwal@gmail.com> wrote:
>>>> >
>>>> >> Would chunksize==disksize work? Wouldn't that lead to the entire
>>>> >> parity be invalidated for any write to any of the disks (assuming md
>>>> >> operates at a chunk level)...also please see my reply below
>>>> >
>>>> > Operating at a chunk level would be a very poor design choice.  md/raid5
>>>> > operates in units of 1 page (4K).
>>>>
>>>> It appears that my requirement may be met by a partitionable md raid 4
>>>> array where the partitions are all on individual underlying block
>>>> devices not striped across the block devices. Is that currently
>>>> possible with md raid? I dont' see how but such an enhancement could
>>>> do all that I had outlined earlier
>>>>
>>>> Is this possible to implement using RAID4 and MD already?
>>>
>>> Nearly.  RAID4 currently requires the chunk size to be a power of 2.
>>> Rounding down the size of your drives to match that could waste nearly half
>>> the space.  However it should work as a proof-of-concept.
>>>
>>> RAID0 supports non-power-of-2 chunk sizes.  Doing the same thing for
>>> RAID4/5/6 would be quite possible.
>>>
>>>>   can the
>>>> partitions be made to write to individual block devices such that
>>>> parity updates don't require reading all devices?
>>>
>>> md/raid4 will currently tries to minimize total IO requests when performing
>>> an update, but prefer spreading the IO over more devices if the total number
>>> of requests is the same.
>>>
>>> So for a 4-drive RAID4, Updating a single block can be done by:
>>>   read old data block, read parity, write data, write parity - 4 IO requests
>>> or
>>>   read other 2 data blocks, write data, write parity - 4 IO requests.
>>>
>>> In this case it will prefer the second, which is not what you want.
>>> With 5-drive RAID4, the second option will require 5 IO requests, so the first
>>> will be chosen.
>>> It is quite trivial to flip this default for testing
>>>
>>> -       if (rmw < rcw && rmw > 0) {
>>> +       if (rmw <= rcw && rmw > 0) {
>>>
>>>
>>> If you had 5 drives, you could experiment with no code changes.
>>> Make the chunk size the largest power of 2 that fits in the device, and then
>>> partition to align the partitions on those boundaries.
>>
>> If the chunk size is almost the same as the device size, I assume the
>> entire chunk is not invalidated for parity on writing to a single
>> block? i.e. if only 1 block is updated only that blocks parity will be
>> read and written and not for the whole chunk? If thats' the case, what
>> purpose does a chunk serve in md raid ? If that's not the case, it
>> wouldn't work because a single block updation would lead to parity
>> being written for the entire chunk, which is the size of the device
>>
>> I do have more than 5 drives though they are in use currently. I will
>> create a small testing partition on each device of the same size and
>> run the test on that after ensuring that the drives do go to sleep.
>>
>>>
>>> NeilBrown
>>>
>
> Wouldn't the meta data writes wake up all the disks in the cluster
> anyways (defeating the purpose)? This idea will require metadata to
> not be written out to each device (is that even possible or on the
> cards?)
>
> I am about to try out your suggestion with the chunk sizes anyways but
> thought about the metadata being a major stumbling block.
>

And it seems to be confirmed that the metadata write is waking up the
other drives. On any write to a particular drive, the metadata update
touches all the others.

Am I correct in assuming that all metadata is currently written as part of
the block device itself, and that "external" metadata is still embedded in
each of the block devices (only the format of the metadata is defined
externally)? I guess that to implement this we would need to store the
metadata elsewhere, which may be major development work. Still, that may be
a flexibility desired in md raid for other reasons...
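
(For example, the per-device superblock can be inspected directly on each
member; the device name here is made up:

    mdadm --examine /dev/sdb
)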

Neil, your thoughts.

>>
>> Thanks,
>> Anshuman
>>>
>>>>
>>>> To illustrate:
>>>> -----------------RAID - 4 ---------------------
>>>> |
>>>> Device 1       Device 2       Device 3       Parity
>>>> A1                 B1                 C1                P1
>>>> A2                 B2                 C2                P2
>>>> A3                 B3                 C3                P3
>>>>
>>>> Each device gets written to independently (via a layer of block
>>>> devices)...so Data on Device 1 is written as A1, A2, A3 contiguous
>>>> blocks leading to updation of P1, P2 P3 (without causing any reads on
>>>> devices 2 and 3 using XOR for the parity).
>>>>
>>>> In RAID4, IIUC data gets striped and all devices become a single block device.
>>>>
>>>>
>>>> >
>>>> >
>>>> >>
>>>> >> On 29 October 2014 14:55, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>>>> >> > Right on most counts but please see comments below.
>>>> >> >
>>>> >> > On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>>>> >> >> Just to be sure I understand, you would have N + X devices.  Each of the N
>>>> >> >> devices contains an independent filesystem and could be accessed directly if
>>>> >> >> needed.  Each of the X devices contains some codes so that if at most X
>>>> >> >> devices in total died, you would still be able to recover all of the data.
>>>> >> >> If more than X devices failed, you would still get complete data from the
>>>> >> >> working devices.
>>>> >> >>
>>>> >> >> Every update would only write to the particular N device on which it is
>>>> >> >> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
>>>> >> >> than X for the spin-down to be really worth it.
>>>> >> >>
>>>> >> >> Am I right so far?
>>>> >> >
>>>> >> > Perfectly right so far. I typically have a N to X ratio of 4 (4
>>>> >> > devices to 1 data) so spin down is totally worth it for data
>>>> >> > protection but more on that below.
>>>> >> >
>>>> >> >>
>>>> >> >> For some reason the writes to X are delayed...  I don't really understand
>>>> >> >> that part.
>>>> >> >
>>>> >> > This delay is basically designed around archival devices which are
>>>> >> > rarely read from and even more rarely written to. By delaying writes
>>>> >> > on 2 criteria ( designated cache buffer filling up or preset time
>>>> >> > duration from last write expiring) we can significantly reduce the
>>>> >> > writes on the parity device. This assumes that we are ok to lose a
>>>> >> > movie or two in case the parity disk is not totally up to date but are
>>>> >> > more interested in device longevity.
>>>> >> >
>>>> >> >>
>>>> >> >> Sounds like multi-parity RAID6 with no parity rotation and
>>>> >> >>   chunksize == devicesize
>>>> >> > RAID6 would present us with a joint device and currently only allows
>>>> >> > writes to that directly, yes? Any writes will be striped.
>>>> >
>>>> > If the chunksize equals the device size, then you need a very large write for
>>>> > it to be striped.
>>>> >
>>>> >> > In any case would md raid allow the underlying device to be written to
>>>> >> > directly? Also how would it know that the device has been written to
>>>> >> > and hence parity has to be updated? What about the superblock which
>>>> >> > the FS would not know about?
>>>> >
>>>> > No, you wouldn't write to the underlying device.  You would carefully
>>>> > partition the RAID5 so each partition aligns exactly with an underlying
>>>> > device.  Then write to the partition.
>>>> >
>>>> >> >
>>>> >> > Also except for the delayed checksum writing part which would be
>>>> >> > significant if one of the objectives is to reduce the amount of
>>>> >> > writes. Can we delay that in the code currently for RAID6? I
>>>> >> > understand the objective of RAID6 is to ensure data recovery and we
>>>> >> > are looking at a compromise in this case.
>>>> >
>>>> > "simple matter of programming"
>>>> > Of course there would be a limit to how much data can be buffered in memory
>>>> > before it has to be flushed out.
>>>> > If you are mostly storing movies, then they are probably too large to
>>>> > buffer.  Why not just write them out straight away?
>>>> >
>>>> > NeilBrown
>>>> >
>>>> >
>>>> >
>>>> >> >
>>>> >> > If feasible, this can be an enhancement to MD RAID as well where N
>>>> >> > devices are presented instead of a single joint device in case of
>>>> >> > raid6 (maybe the multi part device can be individual disks?)
>>>> >> >
>>>> >> > It will certainly solve my problem of where to store the metadata. I
>>>> >> > was currently hoping to just store it as a configuration file to be
>>>> >> > read by the initramfs since in this case worst case scenario the
>>>> >> > checksum goes out of sync and is rebuilt from scratch.
>>>> >> >
>>>> >> >>
>>>> >> >> I wouldn't use device-mapper myself, but you are unlikely to get an entirely
>>>> >> >> impartial opinion from me on that topic.
>>>> >> >
>>>> >> > I haven't hacked around the kernel internals much so far so will have
>>>> >> > to dig out that history. I will welcome any particular links/mail
>>>> >> > threads I should look at for guidance (with both yours and opposing
>>>> >> > points of view)
>>>> >> --
>>>> >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>>> >> the body of a message to majordomo@vger.kernel.org
>>>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> >
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-26  6:24             ` Anshuman Aggarwal
@ 2014-12-01 16:00               ` Anshuman Aggarwal
  2014-12-01 16:34                 ` Anshuman Aggarwal
  0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-12-01 16:00 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mdadm

On 26 November 2014 at 11:54, Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:
> On 25 November 2014 at 04:20, NeilBrown <neilb@suse.de> wrote:
>> On Mon, 24 Nov 2014 12:59:47 +0530 Anshuman Aggarwal
>> <anshuman.aggarwal@gmail.com> wrote:
>>
>>> On 3 November 2014 at 11:22, NeilBrown <neilb@suse.de> wrote:
>>> > On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
>>> > <anshuman.aggarwal@gmail.com> wrote:
>>> >
>>> >> Would chunksize==disksize work? Wouldn't that lead to the entire
>>> >> parity be invalidated for any write to any of the disks (assuming md
>>> >> operates at a chunk level)...also please see my reply below
>>> >
>>> > Operating at a chunk level would be a very poor design choice.  md/raid5
>>> > operates in units of 1 page (4K).
>>>
>>> It appears that my requirement may be met by a partitionable md raid 4
>>> array where the partitions are all on individual underlying block
>>> devices not striped across the block devices. Is that currently
>>> possible with md raid? I dont' see how but such an enhancement could
>>> do all that I had outlined earlier
>>>
>>> Is this possible to implement using RAID4 and MD already?
>>
>> Nearly.  RAID4 currently requires the chunk size to be a power of 2.
>> Rounding down the size of your drives to match that could waste nearly half
>> the space.  However it should work as a proof-of-concept.
>>
>> RAID0 supports non-power-of-2 chunk sizes.  Doing the same thing for
>> RAID4/5/6 would be quite possible.
>>
>>>   can the
>>> partitions be made to write to individual block devices such that
>>> parity updates don't require reading all devices?
>>
>> md/raid4 will currently tries to minimize total IO requests when performing
>> an update, but prefer spreading the IO over more devices if the total number
>> of requests is the same.
>>
>> So for a 4-drive RAID4, Updating a single block can be done by:
>>   read old data block, read parity, write data, write parity - 4 IO requests
>> or
>>   read other 2 data blocks, write data, write parity - 4 IO requests.
>>
>> In this case it will prefer the second, which is not what you want.
>> With 5-drive RAID4, the second option will require 5 IO requests, so the first
>> will be chosen.
>> It is quite trivial to flip this default for testing
>>
>> -       if (rmw < rcw && rmw > 0) {
>> +       if (rmw <= rcw && rmw > 0) {
>>
>>
>> If you had 5 drives, you could experiment with no code changes.
>> Make the chunk size the largest power of 2 that fits in the device, and then
>> partition to align the partitions on those boundaries.
>
> If the chunk size is almost the same as the device size, I assume the
> entire chunk is not invalidated for parity on writing to a single
> block? i.e. if only 1 block is updated only that blocks parity will be
> read and written and not for the whole chunk? If thats' the case, what
> purpose does a chunk serve in md raid ? If that's not the case, it
> wouldn't work because a single block updation would lead to parity
> being written for the entire chunk, which is the size of the device
>
> I do have more than 5 drives though they are in use currently. I will
> create a small testing partition on each device of the same size and
> run the test on that after ensuring that the drives do go to sleep.
>
>>
>> NeilBrown
>>

Wouldn't the metadata writes wake up all the disks in the cluster
anyway (defeating the purpose)? This idea would require the metadata
not to be written out to each device (is that even possible, or on
the cards?)

I am about to try out your suggestion with the chunk sizes anyway,
but it occurred to me that the metadata could be a major stumbling
block.

>
> Thanks,
> Anshuman
>>
>>>
>>> To illustrate:
>>> -----------------RAID - 4 ---------------------
>>> |
>>> Device 1       Device 2       Device 3       Parity
>>> A1                 B1                 C1                P1
>>> A2                 B2                 C2                P2
>>> A3                 B3                 C3                P3
>>>
>>> Each device gets written to independently (via a layer of block
>>> devices)...so Data on Device 1 is written as A1, A2, A3 contiguous
>>> blocks leading to updation of P1, P2 P3 (without causing any reads on
>>> devices 2 and 3 using XOR for the parity).
>>>
>>> In RAID4, IIUC data gets striped and all devices become a single block device.
>>>
>>>
>>> >
>>> >
>>> >>
>>> >> On 29 October 2014 14:55, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>>> >> > Right on most counts but please see comments below.
>>> >> >
>>> >> > On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>>> >> >> Just to be sure I understand, you would have N + X devices.  Each of the N
>>> >> >> devices contains an independent filesystem and could be accessed directly if
>>> >> >> needed.  Each of the X devices contains some codes so that if at most X
>>> >> >> devices in total died, you would still be able to recover all of the data.
>>> >> >> If more than X devices failed, you would still get complete data from the
>>> >> >> working devices.
>>> >> >>
>>> >> >> Every update would only write to the particular N device on which it is
>>> >> >> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
>>> >> >> than X for the spin-down to be really worth it.
>>> >> >>
>>> >> >> Am I right so far?
>>> >> >
>>> >> > Perfectly right so far. I typically have a N to X ratio of 4 (4
>>> >> > devices to 1 data) so spin down is totally worth it for data
>>> >> > protection but more on that below.
>>> >> >
>>> >> >>
>>> >> >> For some reason the writes to X are delayed...  I don't really understand
>>> >> >> that part.
>>> >> >
>>> >> > This delay is basically designed around archival devices which are
>>> >> > rarely read from and even more rarely written to. By delaying writes
>>> >> > on 2 criteria ( designated cache buffer filling up or preset time
>>> >> > duration from last write expiring) we can significantly reduce the
>>> >> > writes on the parity device. This assumes that we are ok to lose a
>>> >> > movie or two in case the parity disk is not totally up to date but are
>>> >> > more interested in device longevity.
>>> >> >
>>> >> >>
>>> >> >> Sounds like multi-parity RAID6 with no parity rotation and
>>> >> >>   chunksize == devicesize
>>> >> > RAID6 would present us with a joint device and currently only allows
>>> >> > writes to that directly, yes? Any writes will be striped.
>>> >
>>> > If the chunksize equals the device size, then you need a very large write for
>>> > it to be striped.
>>> >
>>> >> > In any case would md raid allow the underlying device to be written to
>>> >> > directly? Also how would it know that the device has been written to
>>> >> > and hence parity has to be updated? What about the superblock which
>>> >> > the FS would not know about?
>>> >
>>> > No, you wouldn't write to the underlying device.  You would carefully
>>> > partition the RAID5 so each partition aligns exactly with an underlying
>>> > device.  Then write to the partition.
>>> >
>>> >> >
>>> >> > Also except for the delayed checksum writing part which would be
>>> >> > significant if one of the objectives is to reduce the amount of
>>> >> > writes. Can we delay that in the code currently for RAID6? I
>>> >> > understand the objective of RAID6 is to ensure data recovery and we
>>> >> > are looking at a compromise in this case.
>>> >
>>> > "simple matter of programming"
>>> > Of course there would be a limit to how much data can be buffered in memory
>>> > before it has to be flushed out.
>>> > If you are mostly storing movies, then they are probably too large to
>>> > buffer.  Why not just write them out straight away?
>>> >
>>> > NeilBrown
>>> >
>>> >
>>> >
>>> >> >
>>> >> > If feasible, this can be an enhancement to MD RAID as well where N
>>> >> > devices are presented instead of a single joint device in case of
>>> >> > raid6 (maybe the multi part device can be individual disks?)
>>> >> >
>>> >> > It will certainly solve my problem of where to store the metadata. I
>>> >> > was currently hoping to just store it as a configuration file to be
>>> >> > read by the initramfs since in this case worst case scenario the
>>> >> > checksum goes out of sync and is rebuilt from scratch.
>>> >> >
>>> >> >>
>>> >> >> I wouldn't use device-mapper myself, but you are unlikely to get an entirely
>>> >> >> impartial opinion from me on that topic.
>>> >> >
>>> >> > I haven't hacked around the kernel internals much so far so will have
>>> >> > to dig out that history. I will welcome any particular links/mail
>>> >> > threads I should look at for guidance (with both yours and opposing
>>> >> > points of view)
>>> >> --
>>> >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>> >> the body of a message to majordomo@vger.kernel.org
>>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-24 22:50           ` NeilBrown
@ 2014-11-26  6:24             ` Anshuman Aggarwal
  2014-12-01 16:00               ` Anshuman Aggarwal
  0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-11-26  6:24 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mdadm

On 25 November 2014 at 04:20, NeilBrown <neilb@suse.de> wrote:
> On Mon, 24 Nov 2014 12:59:47 +0530 Anshuman Aggarwal
> <anshuman.aggarwal@gmail.com> wrote:
>
>> On 3 November 2014 at 11:22, NeilBrown <neilb@suse.de> wrote:
>> > On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
>> > <anshuman.aggarwal@gmail.com> wrote:
>> >
>> >> Would chunksize==disksize work? Wouldn't that lead to the entire
>> >> parity be invalidated for any write to any of the disks (assuming md
>> >> operates at a chunk level)...also please see my reply below
>> >
>> > Operating at a chunk level would be a very poor design choice.  md/raid5
>> > operates in units of 1 page (4K).
>>
>> It appears that my requirement may be met by a partitionable md raid 4
>> array where the partitions are all on individual underlying block
>> devices not striped across the block devices. Is that currently
>> possible with md raid? I dont' see how but such an enhancement could
>> do all that I had outlined earlier
>>
>> Is this possible to implement using RAID4 and MD already?
>
> Nearly.  RAID4 currently requires the chunk size to be a power of 2.
> Rounding down the size of your drives to match that could waste nearly half
> the space.  However it should work as a proof-of-concept.
>
> RAID0 supports non-power-of-2 chunk sizes.  Doing the same thing for
> RAID4/5/6 would be quite possible.
>
>>   can the
>> partitions be made to write to individual block devices such that
>> parity updates don't require reading all devices?
>
> md/raid4 will currently tries to minimize total IO requests when performing
> an update, but prefer spreading the IO over more devices if the total number
> of requests is the same.
>
> So for a 4-drive RAID4, Updating a single block can be done by:
>   read old data block, read parity, write data, write parity - 4 IO requests
> or
>   read other 2 data blocks, write data, write parity - 4 IO requests.
>
> In this case it will prefer the second, which is not what you want.
> With 5-drive RAID4, the second option will require 5 IO requests, so the first
> will be chosen.
> It is quite trivial to flip this default for testing
>
> -       if (rmw < rcw && rmw > 0) {
> +       if (rmw <= rcw && rmw > 0) {
>
>
> If you had 5 drives, you could experiment with no code changes.
> Make the chunk size the largest power of 2 that fits in the device, and then
> partition to align the partitions on those boundaries.

If the chunk size is almost the same as the device size, I assume the
entire chunk's parity is not invalidated by a write to a single block?
I.e. if only one block is updated, only that block's parity will be
read and written, not the parity for the whole chunk? If that's the
case, what purpose does a chunk serve in md raid? If that's not the
case, this wouldn't work, because a single-block update would lead to
parity being written for the entire chunk, which is the size of the
device.

I do have more than 5 drives, though they are currently in use. I
will create a small test partition of the same size on each device
and run the test on that, after ensuring that the drives do go to
sleep.

>
> NeilBrown
>

Thanks,
Anshuman
>
>>
>> To illustrate:
>> -----------------RAID - 4 ---------------------
>> |
>> Device 1       Device 2       Device 3       Parity
>> A1                 B1                 C1                P1
>> A2                 B2                 C2                P2
>> A3                 B3                 C3                P3
>>
>> Each device gets written to independently (via a layer of block
>> devices)...so Data on Device 1 is written as A1, A2, A3 contiguous
>> blocks leading to updation of P1, P2 P3 (without causing any reads on
>> devices 2 and 3 using XOR for the parity).
>>
>> In RAID4, IIUC data gets striped and all devices become a single block device.
>>
>>
>> >
>> >
>> >>
>> >> On 29 October 2014 14:55, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>> >> > Right on most counts but please see comments below.
>> >> >
>> >> > On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>> >> >> Just to be sure I understand, you would have N + X devices.  Each of the N
>> >> >> devices contains an independent filesystem and could be accessed directly if
>> >> >> needed.  Each of the X devices contains some codes so that if at most X
>> >> >> devices in total died, you would still be able to recover all of the data.
>> >> >> If more than X devices failed, you would still get complete data from the
>> >> >> working devices.
>> >> >>
>> >> >> Every update would only write to the particular N device on which it is
>> >> >> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
>> >> >> than X for the spin-down to be really worth it.
>> >> >>
>> >> >> Am I right so far?
>> >> >
>> >> > Perfectly right so far. I typically have a N to X ratio of 4 (4
>> >> > devices to 1 data) so spin down is totally worth it for data
>> >> > protection but more on that below.
>> >> >
>> >> >>
>> >> >> For some reason the writes to X are delayed...  I don't really understand
>> >> >> that part.
>> >> >
>> >> > This delay is basically designed around archival devices which are
>> >> > rarely read from and even more rarely written to. By delaying writes
>> >> > on 2 criteria ( designated cache buffer filling up or preset time
>> >> > duration from last write expiring) we can significantly reduce the
>> >> > writes on the parity device. This assumes that we are ok to lose a
>> >> > movie or two in case the parity disk is not totally up to date but are
>> >> > more interested in device longevity.
>> >> >
>> >> >>
>> >> >> Sounds like multi-parity RAID6 with no parity rotation and
>> >> >>   chunksize == devicesize
>> >> > RAID6 would present us with a joint device and currently only allows
>> >> > writes to that directly, yes? Any writes will be striped.
>> >
>> > If the chunksize equals the device size, then you need a very large write for
>> > it to be striped.
>> >
>> >> > In any case would md raid allow the underlying device to be written to
>> >> > directly? Also how would it know that the device has been written to
>> >> > and hence parity has to be updated? What about the superblock which
>> >> > the FS would not know about?
>> >
>> > No, you wouldn't write to the underlying device.  You would carefully
>> > partition the RAID5 so each partition aligns exactly with an underlying
>> > device.  Then write to the partition.
>> >
>> >> >
>> >> > Also except for the delayed checksum writing part which would be
>> >> > significant if one of the objectives is to reduce the amount of
>> >> > writes. Can we delay that in the code currently for RAID6? I
>> >> > understand the objective of RAID6 is to ensure data recovery and we
>> >> > are looking at a compromise in this case.
>> >
>> > "simple matter of programming"
>> > Of course there would be a limit to how much data can be buffered in memory
>> > before it has to be flushed out.
>> > If you are mostly storing movies, then they are probably too large to
>> > buffer.  Why not just write them out straight away?
>> >
>> > NeilBrown
>> >
>> >
>> >
>> >> >
>> >> > If feasible, this can be an enhancement to MD RAID as well where N
>> >> > devices are presented instead of a single joint device in case of
>> >> > raid6 (maybe the multi part device can be individual disks?)
>> >> >
>> >> > It will certainly solve my problem of where to store the metadata. I
>> >> > was currently hoping to just store it as a configuration file to be
>> >> > read by the initramfs since in this case worst case scenario the
>> >> > checksum goes out of sync and is rebuilt from scratch.
>> >> >
>> >> >>
>> >> >> I wouldn't use device-mapper myself, but you are unlikely to get an entirely
>> >> >> impartial opinion from me on that topic.
>> >> >
>> >> > I haven't hacked around the kernel internals much so far so will have
>> >> > to dig out that history. I will welcome any particular links/mail
>> >> > threads I should look at for guidance (with both yours and opposing
>> >> > points of view)
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> >> the body of a message to majordomo@vger.kernel.org
>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-24  7:29         ` Anshuman Aggarwal
@ 2014-11-24 22:50           ` NeilBrown
  2014-11-26  6:24             ` Anshuman Aggarwal
  0 siblings, 1 reply; 44+ messages in thread
From: NeilBrown @ 2014-11-24 22:50 UTC (permalink / raw)
  To: Anshuman Aggarwal; +Cc: Mdadm

[-- Attachment #1: Type: text/plain, Size: 7687 bytes --]

On Mon, 24 Nov 2014 12:59:47 +0530 Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:

> On 3 November 2014 at 11:22, NeilBrown <neilb@suse.de> wrote:
> > On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
> > <anshuman.aggarwal@gmail.com> wrote:
> >
> >> Would chunksize==disksize work? Wouldn't that lead to the entire
> >> parity be invalidated for any write to any of the disks (assuming md
> >> operates at a chunk level)...also please see my reply below
> >
> > Operating at a chunk level would be a very poor design choice.  md/raid5
> > operates in units of 1 page (4K).
> 
> It appears that my requirement may be met by a partitionable md raid 4
> array where the partitions are all on individual underlying block
> devices not striped across the block devices. Is that currently
> possible with md raid? I dont' see how but such an enhancement could
> do all that I had outlined earlier
> 
> Is this possible to implement using RAID4 and MD already?

Nearly.  RAID4 currently requires the chunk size to be a power of 2.
Rounding down the size of your drives to match that could waste nearly half
the space.  However it should work as a proof-of-concept.

RAID0 supports non-power-of-2 chunk sizes.  Doing the same thing for
RAID4/5/6 would be quite possible.

>   can the
> partitions be made to write to individual block devices such that
> parity updates don't require reading all devices?

md/raid4 currently tries to minimize total IO requests when performing
an update, but prefers spreading the IO over more devices if the total
number of requests is the same.

So for a 4-drive RAID4, updating a single block can be done by:
  read old data block, read parity, write data, write parity - 4 IO requests
or
  read other 2 data blocks, write data, write parity - 4 IO requests.

In this case it will prefer the second, which is not what you want.
With a 5-drive RAID4, the second option will require 5 IO requests, so
the first will be chosen.
It is quite trivial to flip this default for testing:

-	if (rmw < rcw && rmw > 0) {
+	if (rmw <= rcw && rmw > 0) {


If you had 5 drives, you could experiment with no code changes.
Make the chunk size the largest power of 2 that fits in the device, and then
partition to align the partitions on those boundaries.
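
As a rough illustration of the accounting above, here is a toy
user-space C sketch (not the md code itself) that simply counts the
rmw and rcw IO requests for a single-block update and applies the same
tie-breaking rule; the patched comparison from the one-liner above is
noted in a comment:

#include <stdio.h>

int main(void)
{
    for (int disks = 4; disks <= 6; disks++) {
        /* rmw: read old data, read parity, write data, write parity */
        int rmw = 4;
        /* rcw: read the other data blocks, write data, write parity */
        int rcw = (disks - 2) + 2;

        /* stock md prefers rcw on a tie; the patched test
         * "rmw <= rcw" would prefer rmw instead */
        const char *choice = (rmw < rcw) ? "rmw" : "rcw";

        printf("%d drives: rmw=%d IOs, rcw=%d IOs -> md picks %s\n",
               disks, rmw, rcw, choice);
    }
    return 0;
}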

NeilBrown


> 
> To illustrate:
> -----------------RAID - 4 ---------------------
> |
> Device 1       Device 2       Device 3       Parity
> A1                 B1                 C1                P1
> A2                 B2                 C2                P2
> A3                 B3                 C3                P3
> 
> Each device gets written to independently (via a layer of block
> devices)...so Data on Device 1 is written as A1, A2, A3 contiguous
> blocks leading to updation of P1, P2 P3 (without causing any reads on
> devices 2 and 3 using XOR for the parity).
> 
> In RAID4, IIUC data gets striped and all devices become a single block device.
> 
> 
> >
> >
> >>
> >> On 29 October 2014 14:55, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
> >> > Right on most counts but please see comments below.
> >> >
> >> > On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
> >> >> Just to be sure I understand, you would have N + X devices.  Each of the N
> >> >> devices contains an independent filesystem and could be accessed directly if
> >> >> needed.  Each of the X devices contains some codes so that if at most X
> >> >> devices in total died, you would still be able to recover all of the data.
> >> >> If more than X devices failed, you would still get complete data from the
> >> >> working devices.
> >> >>
> >> >> Every update would only write to the particular N device on which it is
> >> >> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
> >> >> than X for the spin-down to be really worth it.
> >> >>
> >> >> Am I right so far?
> >> >
> >> > Perfectly right so far. I typically have a N to X ratio of 4 (4
> >> > devices to 1 data) so spin down is totally worth it for data
> >> > protection but more on that below.
> >> >
> >> >>
> >> >> For some reason the writes to X are delayed...  I don't really understand
> >> >> that part.
> >> >
> >> > This delay is basically designed around archival devices which are
> >> > rarely read from and even more rarely written to. By delaying writes
> >> > on 2 criteria ( designated cache buffer filling up or preset time
> >> > duration from last write expiring) we can significantly reduce the
> >> > writes on the parity device. This assumes that we are ok to lose a
> >> > movie or two in case the parity disk is not totally up to date but are
> >> > more interested in device longevity.
> >> >
> >> >>
> >> >> Sounds like multi-parity RAID6 with no parity rotation and
> >> >>   chunksize == devicesize
> >> > RAID6 would present us with a joint device and currently only allows
> >> > writes to that directly, yes? Any writes will be striped.
> >
> > If the chunksize equals the device size, then you need a very large write for
> > it to be striped.
> >
> >> > In any case would md raid allow the underlying device to be written to
> >> > directly? Also how would it know that the device has been written to
> >> > and hence parity has to be updated? What about the superblock which
> >> > the FS would not know about?
> >
> > No, you wouldn't write to the underlying device.  You would carefully
> > partition the RAID5 so each partition aligns exactly with an underlying
> > device.  Then write to the partition.
> >
> >> >
> >> > Also except for the delayed checksum writing part which would be
> >> > significant if one of the objectives is to reduce the amount of
> >> > writes. Can we delay that in the code currently for RAID6? I
> >> > understand the objective of RAID6 is to ensure data recovery and we
> >> > are looking at a compromise in this case.
> >
> > "simple matter of programming"
> > Of course there would be a limit to how much data can be buffered in memory
> > before it has to be flushed out.
> > If you are mostly storing movies, then they are probably too large to
> > buffer.  Why not just write them out straight away?
> >
> > NeilBrown
> >
> >
> >
> >> >
> >> > If feasible, this can be an enhancement to MD RAID as well where N
> >> > devices are presented instead of a single joint device in case of
> >> > raid6 (maybe the multi part device can be individual disks?)
> >> >
> >> > It will certainly solve my problem of where to store the metadata. I
> >> > was currently hoping to just store it as a configuration file to be
> >> > read by the initramfs since in this case worst case scenario the
> >> > checksum goes out of sync and is rebuilt from scratch.
> >> >
> >> >>
> >> >> I wouldn't use device-mapper myself, but you are unlikely to get an entirely
> >> >> impartial opinion from me on that topic.
> >> >
> >> > I haven't hacked around the kernel internals much so far so will have
> >> > to dig out that history. I will welcome any particular links/mail
> >> > threads I should look at for guidance (with both yours and opposing
> >> > points of view)
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-03  5:52       ` NeilBrown
  2014-11-03 18:04         ` Piergiorgio Sartor
  2014-11-06  2:24         ` Anshuman Aggarwal
@ 2014-11-24  7:29         ` Anshuman Aggarwal
  2014-11-24 22:50           ` NeilBrown
  2 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-11-24  7:29 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mdadm

On 3 November 2014 at 11:22, NeilBrown <neilb@suse.de> wrote:
> On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
> <anshuman.aggarwal@gmail.com> wrote:
>
>> Would chunksize==disksize work? Wouldn't that lead to the entire
>> parity be invalidated for any write to any of the disks (assuming md
>> operates at a chunk level)...also please see my reply below
>
> Operating at a chunk level would be a very poor design choice.  md/raid5
> operates in units of 1 page (4K).

It appears that my requirement may be met by a partitionable md RAID4
array where the partitions each sit on an individual underlying block
device rather than being striped across the block devices. Is that
currently possible with md raid? I don't see how, but such an
enhancement could do all that I had outlined earlier.

Is this possible to implement using RAID4 and MD already? Can the
partitions be made to write to individual block devices such that
parity updates don't require reading all devices?

To illustrate:
-----------------RAID - 4 ---------------------
|
Device 1       Device 2       Device 3       Parity
A1                 B1                 C1                P1
A2                 B2                 C2                P2
A3                 B3                 C3                P3

Each device gets written to independently (via a layer of block
devices), so data on Device 1 is written as contiguous blocks A1, A2,
A3, leading to updates of P1, P2, P3 (without causing any reads on
devices 2 and 3, using XOR for the parity).

In RAID4, IIUC data gets striped and all devices become a single block device.
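
A minimal user-space sketch of the incremental parity update implied
by the illustration (assuming plain XOR parity and a toy 8-byte block
size): the parity block is refreshed from the old data, the new data
and the old parity alone, without reading devices 2 and 3.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK 8  /* toy block size; md would work in 4K pages */

static void xor_block(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    for (int i = 0; i < BLOCK; i++)
        dst[i] = a[i] ^ b[i];
}

int main(void)
{
    uint8_t d1[BLOCK] = "AAAAAAA", d2[BLOCK] = "BBBBBBB", d3[BLOCK] = "CCCCCCC";
    uint8_t d1_new[BLOCK] = "ZZZZZZZ";
    uint8_t parity[BLOCK], tmp[BLOCK], check[BLOCK];

    /* initial parity P = D1 ^ D2 ^ D3 */
    xor_block(tmp, d1, d2);
    xor_block(parity, tmp, d3);

    /* update one block on device 1: P_new = P_old ^ D1_old ^ D1_new */
    xor_block(tmp, parity, d1);
    xor_block(parity, tmp, d1_new);
    memcpy(d1, d1_new, BLOCK);

    /* verify against a full recompute that would have read d2 and d3 */
    xor_block(tmp, d1, d2);
    xor_block(check, tmp, d3);
    printf("incremental parity %s full recompute\n",
           memcmp(parity, check, BLOCK) == 0 ? "matches" : "differs from");
    return 0;
}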


>
>
>>
>> On 29 October 2014 14:55, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>> > Right on most counts but please see comments below.
>> >
>> > On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>> >> Just to be sure I understand, you would have N + X devices.  Each of the N
>> >> devices contains an independent filesystem and could be accessed directly if
>> >> needed.  Each of the X devices contains some codes so that if at most X
>> >> devices in total died, you would still be able to recover all of the data.
>> >> If more than X devices failed, you would still get complete data from the
>> >> working devices.
>> >>
>> >> Every update would only write to the particular N device on which it is
>> >> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
>> >> than X for the spin-down to be really worth it.
>> >>
>> >> Am I right so far?
>> >
>> > Perfectly right so far. I typically have a N to X ratio of 4 (4
>> > devices to 1 data) so spin down is totally worth it for data
>> > protection but more on that below.
>> >
>> >>
>> >> For some reason the writes to X are delayed...  I don't really understand
>> >> that part.
>> >
>> > This delay is basically designed around archival devices which are
>> > rarely read from and even more rarely written to. By delaying writes
>> > on 2 criteria ( designated cache buffer filling up or preset time
>> > duration from last write expiring) we can significantly reduce the
>> > writes on the parity device. This assumes that we are ok to lose a
>> > movie or two in case the parity disk is not totally up to date but are
>> > more interested in device longevity.
>> >
>> >>
>> >> Sounds like multi-parity RAID6 with no parity rotation and
>> >>   chunksize == devicesize
>> > RAID6 would present us with a joint device and currently only allows
>> > writes to that directly, yes? Any writes will be striped.
>
> If the chunksize equals the device size, then you need a very large write for
> it to be striped.
>
>> > In any case would md raid allow the underlying device to be written to
>> > directly? Also how would it know that the device has been written to
>> > and hence parity has to be updated? What about the superblock which
>> > the FS would not know about?
>
> No, you wouldn't write to the underlying device.  You would carefully
> partition the RAID5 so each partition aligns exactly with an underlying
> device.  Then write to the partition.
>
>> >
>> > Also except for the delayed checksum writing part which would be
>> > significant if one of the objectives is to reduce the amount of
>> > writes. Can we delay that in the code currently for RAID6? I
>> > understand the objective of RAID6 is to ensure data recovery and we
>> > are looking at a compromise in this case.
>
> "simple matter of programming"
> Of course there would be a limit to how much data can be buffered in memory
> before it has to be flushed out.
> If you are mostly storing movies, then they are probably too large to
> buffer.  Why not just write them out straight away?
>
> NeilBrown
>
>
>
>> >
>> > If feasible, this can be an enhancement to MD RAID as well where N
>> > devices are presented instead of a single joint device in case of
>> > raid6 (maybe the multi part device can be individual disks?)
>> >
>> > It will certainly solve my problem of where to store the metadata. I
>> > was currently hoping to just store it as a configuration file to be
>> > read by the initramfs since in this case worst case scenario the
>> > checksum goes out of sync and is rebuilt from scratch.
>> >
>> >>
>> >> I wouldn't use device-mapper myself, but you are unlikely to get an entirely
>> >> impartial opinion from me on that topic.
>> >
>> > I haven't hacked around the kernel internals much so far so will have
>> > to dig out that history. I will welcome any particular links/mail
>> > threads I should look at for guidance (with both yours and opposing
>> > points of view)
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-01 12:55             ` Piergiorgio Sartor
@ 2014-11-06  2:29               ` Anshuman Aggarwal
  0 siblings, 0 replies; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-11-06  2:29 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: Ethan Wilson, Mdadm

On 1 November 2014 18:25, Piergiorgio Sartor
<piergiorgio.sartor@nexgo.de> wrote:
> On Fri, Oct 31, 2014 at 04:35:11PM +0530, Anshuman Aggarwal wrote:
>> Hi pg,
>> With MD raid striping all the writes not only does it keep ALL disks
>> spinning to read/write the current content, it also leads to
>> catastrophic data loss in case the rebuild/disk failure exceeds the
>> number of parity disks.
>
> Hi Anshuman,
>
> yes but do you have hard evidence that
> this is a common RAID-6 problem?
> Considering that we have now bad block list,
> write intent bitmap and proactive replacement,
> it does not seem to me really the main issue,
> having a triple fail in RAID-6.
> Considering that there are available libraries
> for more that 2 parities, I think the multiple
> failure case is quite a rarity.
> Furthermore, I suspect there are other type
> of catastrophic situation (lighting, for example)
> that can destroy an array completely.

I have most definitely lost data when a drive fails and then, during
reconstruction, another drive fails (remember the array has been
chugging away with all drives active for 2-3 years). At that point I'm
dead scared of losing another one and suffering a catastrophic loss.
If I don't go out and buy a replacement right away I'm on borrowed
time for my whole array. For home use this is not fun.

>
>> But more importantly, I find myself setting up multiple RAID levels
>> (at least RAID6 and now thinking of more) just to make sure that MD
>> raid will recover my data and not lose the whole cluster if an
>> additional disk fails above the number of parity!!! The biggest
>> advantage of the scheme that I have outlined is that with a single
>> check sum I am mostly assure of a failed disk restoration and worst
>> case only the media (movies/music) on the failing disk are lost not on
>> the whole cluster.
>
> Each disk will have its own filesystem?
> If this is not the case, you cannot say
> if a single disk failure will lose only
> some files.

Indeed, each device will be an independent block device and
filesystem, joined together by some union FS if the user so requires,
but that's not in scope for this discussion.

>
>> Also in my experience about disks and usage, while what you are saying
>> was true a while ago when storage capacity had not hit multiple TBs.
>> Now if I am buying 3-4 TB disks they are likely to last a while
>> especially since the incremental % growth in sizes seem to be slowing
>> down.
>
> As wrote above, you can safely replace
> disks before they fail, without compromising
> the array.

Same point as above. For home use, I might be away or not have time
to give the array the TLC (tender loving care ;) it needs, which is
the only real shortcoming of MD: it's hard on the disks and has the
potential of compromising the whole array (while giving super fast
R/W performance in return, for sure).

>
> bye,
>
> pg
>
>> Regards,
>> Anshuman
>>
>> On 30 October 2014 22:55, Piergiorgio Sartor
>> <piergiorgio.sartor@nexgo.de> wrote:
>> > On Thu, Oct 30, 2014 at 08:27:27PM +0530, Anshuman Aggarwal wrote:
>> >>  What you are suggesting will work for delaying writing the checksum
>> >> (but still making 2 disks work non stop and lead to failure, cost
>> >> etc).
>> >
>> > Hi Anshuman,
>> >
>> > I'm a bit missing the point here.
>> >
>> > In my experience, with my storage systems, I change
>> > disks because they're too small, way long before they
>> > are too old (way long before they fail).
>> > That's why I end up with a collection of small HDDs.
>> > which, in turn, I recycled in some custom storage
>> > system (using disks of different size, like explained
>> > in one of the links posted before).
>> >
>> > Honestly, the only reason to spin down the disks, still
>> > in my experience, is for reducing power consumption.
>> > And this can be done with a RAID-6 without problems
>> > and in a extremely flexible way.
>> >
>> > So, the bottom line, still in my experience, is that
>> > this you're describing seems quite a nice situation.
>> >
>> > Or, I did not understood what you're proposing.
>> >
>> > Thanks,
>> >
>> > bye,
>> >
>> > pg
>> >
>> >> I am proposing N independent disks which are rarely accessed. When
>> >> parity has to be written to the remaining 1,2 ...X disks ...it is
>> >> batched up (bcache is feasible) and written out once in a while
>> >> depending on how much write is happening. N-1 disks stay spun down and
>> >> only X disks wake up periodically to get checksum written to (this
>> >> would be tweaked by the user based on how up to date he needs the
>> >> parity to be (tolerance of rebuilding parity in case of crash) and vs
>> >> disk access for each parity write)
>> >>
>> >> It can't be done using any RAID6 because RAID5/6 will stripe all the
>> >> data across the devices making any read access wake up all the
>> >> devices. Ditto for writing to parity on every write to a single disk.
>> >>
>> >> The architecture being proposed is a lazy write to manage parity for
>> >> individual disks which won't suffer from RAID catastrophic data loss
>> >> and concurrent disk.
>> >>
>> >>
>> >>
>> >>
>> >> On 30 October 2014 00:57, Ethan Wilson <ethan.wilson@shiftmail.org> wrote:
>> >> > On 29/10/2014 10:25, Anshuman Aggarwal wrote:
>> >> >>
>> >> >> Right on most counts but please see comments below.
>> >> >>
>> >> >> On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>> >> >>>
>> >> >>> Just to be sure I understand, you would have N + X devices.  Each of the
>> >> >>> N
>> >> >>> devices contains an independent filesystem and could be accessed directly
>> >> >>> if
>> >> >>> needed.  Each of the X devices contains some codes so that if at most X
>> >> >>> devices in total died, you would still be able to recover all of the
>> >> >>> data.
>> >> >>> If more than X devices failed, you would still get complete data from the
>> >> >>> working devices.
>> >> >>>
>> >> >>> Every update would only write to the particular N device on which it is
>> >> >>> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
>> >> >>> than X for the spin-down to be really worth it.
>> >> >>>
>> >> >>> Am I right so far?
>> >> >>
>> >> >> Perfectly right so far. I typically have a N to X ratio of 4 (4
>> >> >> devices to 1 data) so spin down is totally worth it for data
>> >> >> protection but more on that below.
>> >> >>
>> >> >>> For some reason the writes to X are delayed...  I don't really understand
>> >> >>> that part.
>> >> >>
>> >> >> This delay is basically designed around archival devices which are
>> >> >> rarely read from and even more rarely written to. By delaying writes
>> >> >> on 2 criteria ( designated cache buffer filling up or preset time
>> >> >> duration from last write expiring) we can significantly reduce the
>> >> >> writes on the parity device. This assumes that we are ok to lose a
>> >> >> movie or two in case the parity disk is not totally up to date but are
>> >> >> more interested in device longevity.
>> >> >>
>> >> >>> Sounds like multi-parity RAID6 with no parity rotation and
>> >> >>>    chunksize == devicesize
>> >> >>
>> >> >> RAID6 would present us with a joint device and currently only allows
>> >> >> writes to that directly, yes? Any writes will be striped.
>> >> >
>> >> >
>> >> > I am not totally sure I understand your design, but it seems to me that the
>> >> > following solution could work for you:
>> >> >
>> >> > MD raid-6, maybe multi-parity (multi-parity not implemented yet in MD yet,
>> >> > but just do a periodic scrub and 2 parities can be fine. Wake-up is not so
>> >> > expensive that you can't scrub)
>> >> >
>> >> > Over that you put a raid1 of 2 x 4TB disks as a bcache cache device (those
>> >> > two will never spin-down) in writeback mode with writeback_running=off .
>> >> > This will prevent writes to backend and leave the backend array spun down.
>> >> > When bcache is almost full (poll dirty_data), switch to writeback_running=on
>> >> > and writethrough: it will wake up the backend raid6 array and flush all
>> >> > dirty data. You can then then revert to writeback and writeback_running=off.
>> >> > After this you can spin-down the backend array again.
>> >> >
>> >> > You also get read caching for free, which helps the backend array to stay
>> >> > spun down as much as possible.
>> >> >
>> >> > Maybe you can modify bcache slightly so to implement an automatic switching
>> >> > between the modes as described above, instead of polling the state from
>> >> > outside.
>> >> >
>> >> > Would that work, or you are asking something different?
>> >> >
>> >> > EW
>> >> >
>> >> > --
>> >> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> >> > the body of a message to majordomo@vger.kernel.org
>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> >> the body of a message to majordomo@vger.kernel.org
>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >
>> > --
>> >
>> > piergiorgio
>
> --
>
> piergiorgio

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-03  5:52       ` NeilBrown
  2014-11-03 18:04         ` Piergiorgio Sartor
@ 2014-11-06  2:24         ` Anshuman Aggarwal
  2014-11-24  7:29         ` Anshuman Aggarwal
  2 siblings, 0 replies; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-11-06  2:24 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mdadm

Please see below.

On 3 November 2014 11:22, NeilBrown <neilb@suse.de> wrote:
> On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
> <anshuman.aggarwal@gmail.com> wrote:
>
>> Would chunksize==disksize work? Wouldn't that lead to the entire
>> parity be invalidated for any write to any of the disks (assuming md
>> operates at a chunk level)...also please see my reply below
>
> Operating at a chunk level would be a very poor design choice.  md/raid5
> operates in units of 1 page (4K).
>
>
>>
>> On 29 October 2014 14:55, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>> > Right on most counts but please see comments below.
>> >
>> > On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>> >> Just to be sure I understand, you would have N + X devices.  Each of the N
>> >> devices contains an independent filesystem and could be accessed directly if
>> >> needed.  Each of the X devices contains some codes so that if at most X
>> >> devices in total died, you would still be able to recover all of the data.
>> >> If more than X devices failed, you would still get complete data from the
>> >> working devices.
>> >>
>> >> Every update would only write to the particular N device on which it is
>> >> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
>> >> than X for the spin-down to be really worth it.
>> >>
>> >> Am I right so far?
>> >
>> > Perfectly right so far. I typically have a N to X ratio of 4 (4
>> > devices to 1 data) so spin down is totally worth it for data
>> > protection but more on that below.
>> >
>> >>
>> >> For some reason the writes to X are delayed...  I don't really understand
>> >> that part.
>> >
>> > This delay is basically designed around archival devices which are
>> > rarely read from and even more rarely written to. By delaying writes
>> > on 2 criteria ( designated cache buffer filling up or preset time
>> > duration from last write expiring) we can significantly reduce the
>> > writes on the parity device. This assumes that we are ok to lose a
>> > movie or two in case the parity disk is not totally up to date but are
>> > more interested in device longevity.
>> >
>> >>
>> >> Sounds like multi-parity RAID6 with no parity rotation and
>> >>   chunksize == devicesize
>> > RAID6 would present us with a joint device and currently only allows
>> > writes to that directly, yes? Any writes will be striped.
>
> If the chunksize equals the device size, then you need a very large write for
> it to be striped.
>
>> > In any case would md raid allow the underlying device to be written to
>> > directly? Also how would it know that the device has been written to
>> > and hence parity has to be updated? What about the superblock which
>> > the FS would not know about?
>
> No, you wouldn't write to the underlying device.  You would carefully
> partition the RAID5 so each partition aligns exactly with an underlying
> device.  Then write to the partition.

This is what I'm unclear about. Even with non-rotating parity on RAID
5/6, is it possible to create md partitions such that the writes are
effectively not striped (within each partition) and each partition on
the md device ends up writing only to that one underlying device? How
is this managed? My understanding is that RAID5/6 will stripe any data
blocks across all the devices, making all of them spin up for each
read and write.




>
>> >
>> > Also except for the delayed checksum writing part which would be
>> > significant if one of the objectives is to reduce the amount of
>> > writes. Can we delay that in the code currently for RAID6? I
>> > understand the objective of RAID6 is to ensure data recovery and we
>> > are looking at a compromise in this case.
>
> "simple matter of programming"
> Of course there would be a limit to how much data can be buffered in memory
> before it has to be flushed out.
> If you are mostly storing movies, then they are probably too large to
> buffer.  Why not just write them out straight away?

Well, yeah, if the buffer gets filled (such as by a movie) the parity
will get written pretty much right away (the main data drive gets
written to immediately anyway). The delay is to prevent parity-drive
spin-ups due to small updates on any one of the drives in the array,
say a small temp file created by some piece of software.
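
A minimal user-space sketch of that flush policy, assuming the two
triggers discussed earlier in the thread (a designated buffer filling
up, or a preset time since the last write expiring); the thresholds
and the flush_to_parity_disk() helper are made-up placeholders, not an
existing interface:

#include <stdio.h>
#include <time.h>

#define BUFFER_LIMIT  64          /* pending parity blocks before a forced flush */
#define MAX_DELAY_SEC (15 * 60)   /* flush at most 15 minutes after the last write */

static int    pending;            /* buffered parity updates not yet on disk */
static time_t last_write;         /* time of the most recent buffered update */

static void flush_to_parity_disk(void)
{
    printf("spinning up parity disk, writing %d buffered updates\n", pending);
    pending = 0;
}

/* called for every data-disk write that dirties a parity block */
static void note_parity_update(void)
{
    pending++;
    last_write = time(NULL);
    if (pending >= BUFFER_LIMIT)
        flush_to_parity_disk();
}

/* called periodically, e.g. from a timer */
static void maybe_flush(void)
{
    if (pending > 0 && time(NULL) - last_write >= MAX_DELAY_SEC)
        flush_to_parity_disk();
}

int main(void)
{
    for (int i = 0; i < 200; i++)   /* simulate a burst of small writes */
        note_parity_update();
    maybe_flush();
    return 0;
}
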
>
> NeilBrown
>
>
>
>> >
>> > If feasible, this can be an enhancement to MD RAID as well where N
>> > devices are presented instead of a single joint device in case of
>> > raid6 (maybe the multi part device can be individual disks?)
>> >
>> > It will certainly solve my problem of where to store the metadata. I
>> > was currently hoping to just store it as a configuration file to be
>> > read by the initramfs since in this case worst case scenario the
>> > checksum goes out of sync and is rebuilt from scratch.
>> >
>> >>
>> >> I wouldn't use device-mapper myself, but you are unlikely to get an entirely
>> >> impartial opinion from me on that topic.
>> >
>> > I haven't hacked around the kernel internals much so far so will have
>> > to dig out that history. I will welcome any particular links/mail
>> > threads I should look at for guidance (with both yours and opposing
>> > points of view)
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-03  5:52       ` NeilBrown
@ 2014-11-03 18:04         ` Piergiorgio Sartor
  2014-11-06  2:24         ` Anshuman Aggarwal
  2014-11-24  7:29         ` Anshuman Aggarwal
  2 siblings, 0 replies; 44+ messages in thread
From: Piergiorgio Sartor @ 2014-11-03 18:04 UTC (permalink / raw)
  To: NeilBrown; +Cc: Anshuman Aggarwal, linux-raid

On Mon, Nov 03, 2014 at 04:52:17PM +1100, NeilBrown wrote:
[...]
> "simple matter of programming"
> Of course there would be a limit to how much data can be buffered in memory
> before it has to be flushed out.
> If you are mostly storing movies, then they are probably too large to
> buffer.  Why not just write them out straight away?

One scenario I can envision is the following.

You have a bunch of HDDs in RAID-5/6, which are
almost always in standby (spun down).
Alongside these, you have 2 SSDs in RAID-10.

All the write (and read, if possible) operations
are done against the SSDs.
When the SSD RAID is X% full, the RAID-5/6 is
activated and the data *moved* (maybe copied, with
a proper cache policy) there.
In the case of reading (a large file), the RAID-5/6
is activated, the file is copied to the SSD RAID,
and, when finished, the HDDs are put back in standby.

Of course, this is *not* a block device protocol,
it is a filesystem one.
It is the FS that must handle the caching, because
only the FS can know the file size, for example.
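
A minimal sketch of the policy loop this scenario implies, assuming a
filesystem-level agent; ssd_usage_percent(), wake_hdd_array(),
migrate_to_hdd_array() and standby_hdd_array() are hypothetical
helpers, not an existing API:

#include <stdio.h>

#define MIGRATE_THRESHOLD 80   /* the "X% full" trigger described above */

static int  ssd_usage_percent(void)    { return 85; /* pretend the cache is 85% full */ }
static void wake_hdd_array(void)       { printf("spinning up the RAID-5/6 backend\n"); }
static void migrate_to_hdd_array(void) { printf("moving cold files SSD -> HDD\n"); }
static void standby_hdd_array(void)    { printf("spinning the HDDs back down\n"); }

int main(void)
{
    if (ssd_usage_percent() >= MIGRATE_THRESHOLD) {
        wake_hdd_array();
        migrate_to_hdd_array();
        standby_hdd_array();
    }
    return 0;
}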

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-30 15:00     ` Anshuman Aggarwal
@ 2014-11-03  5:52       ` NeilBrown
  2014-11-03 18:04         ` Piergiorgio Sartor
                           ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: NeilBrown @ 2014-11-03  5:52 UTC (permalink / raw)
  To: Anshuman Aggarwal; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 4572 bytes --]

On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:

> Would chunksize==disksize work? Wouldn't that lead to the entire
> parity be invalidated for any write to any of the disks (assuming md
> operates at a chunk level)...also please see my reply below

Operating at a chunk level would be a very poor design choice.  md/raid5
operates in units of 1 page (4K).


> 
> On 29 October 2014 14:55, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
> > Right on most counts but please see comments below.
> >
> > On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
> >> Just to be sure I understand, you would have N + X devices.  Each of the N
> >> devices contains an independent filesystem and could be accessed directly if
> >> needed.  Each of the X devices contains some codes so that if at most X
> >> devices in total died, you would still be able to recover all of the data.
> >> If more than X devices failed, you would still get complete data from the
> >> working devices.
> >>
> >> Every update would only write to the particular N device on which it is
> >> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
> >> than X for the spin-down to be really worth it.
> >>
> >> Am I right so far?
> >
> > Perfectly right so far. I typically have a N to X ratio of 4 (4
> > devices to 1 data) so spin down is totally worth it for data
> > protection but more on that below.
> >
> >>
> >> For some reason the writes to X are delayed...  I don't really understand
> >> that part.
> >
> > This delay is basically designed around archival devices which are
> > rarely read from and even more rarely written to. By delaying writes
> > on 2 criteria ( designated cache buffer filling up or preset time
> > duration from last write expiring) we can significantly reduce the
> > writes on the parity device. This assumes that we are ok to lose a
> > movie or two in case the parity disk is not totally up to date but are
> > more interested in device longevity.
> >
> >>
> >> Sounds like multi-parity RAID6 with no parity rotation and
> >>   chunksize == devicesize
> > RAID6 would present us with a joint device and currently only allows
> > writes to that directly, yes? Any writes will be striped.

If the chunksize equals the device size, then you need a very large write for
it to be striped.

> > In any case would md raid allow the underlying device to be written to
> > directly? Also how would it know that the device has been written to
> > and hence parity has to be updated? What about the superblock which
> > the FS would not know about?

No, you wouldn't write to the underlying device.  You would carefully
partition the RAID5 so each partition aligns exactly with an underlying
device.  Then write to the partition.

> >
> > Also except for the delayed checksum writing part which would be
> > significant if one of the objectives is to reduce the amount of
> > writes. Can we delay that in the code currently for RAID6? I
> > understand the objective of RAID6 is to ensure data recovery and we
> > are looking at a compromise in this case.

"simple matter of programming"
Of course there would be a limit to how much data can be buffered in memory
before it has to be flushed out.
If you are mostly storing movies, then they are probably too large to
buffer.  Why not just write them out straight away?

NeilBrown



> >
> > If feasible, this can be an enhancement to MD RAID as well where N
> > devices are presented instead of a single joint device in case of
> > raid6 (maybe the multi part device can be individual disks?)
> >
> > It will certainly solve my problem of where to store the metadata. I
> > was currently hoping to just store it as a configuration file to be
> > read by the initramfs since in this case worst case scenario the
> > checksum goes out of sync and is rebuilt from scratch.
> >
> >>
> >> I wouldn't use device-mapper myself, but you are unlikely to get an entirely
> >> impartial opinion from me on that topic.
> >
> > I haven't hacked around the kernel internals much so far so will have
> > to dig out that history. I will welcome any particular links/mail
> > threads I should look at for guidance (with both yours and opposing
> > points of view)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-31 11:05           ` Anshuman Aggarwal
  2014-10-31 14:25             ` Matt Garman
@ 2014-11-01 12:55             ` Piergiorgio Sartor
  2014-11-06  2:29               ` Anshuman Aggarwal
  1 sibling, 1 reply; 44+ messages in thread
From: Piergiorgio Sartor @ 2014-11-01 12:55 UTC (permalink / raw)
  To: Anshuman Aggarwal; +Cc: Piergiorgio Sartor, Ethan Wilson, linux-raid

On Fri, Oct 31, 2014 at 04:35:11PM +0530, Anshuman Aggarwal wrote:
> Hi pg,
> With MD raid striping all the writes not only does it keep ALL disks
> spinning to read/write the current content, it also leads to
> catastrophic data loss in case the rebuild/disk failure exceeds the
> number of parity disks.

Hi Anshuman,

yes, but do you have hard evidence that
this is a common RAID-6 problem?
Considering that we now have the bad block
list, the write-intent bitmap and proactive
replacement, a triple failure in RAID-6 does
not seem to me to be the main issue.
Considering that libraries are available
for more than 2 parities, I think the multiple
failure case is quite a rarity.
Furthermore, I suspect there are other types
of catastrophic situation (lightning, for
example) that can destroy an array completely.
 
> But more importantly, I find myself setting up multiple RAID levels
> (at least RAID6 and now thinking of more) just to make sure that MD
> raid will recover my data and not lose the whole cluster if an
> additional disk fails above the number of parity!!! The biggest
> advantage of the scheme that I have outlined is that with a single
> check sum I am mostly assure of a failed disk restoration and worst
> case only the media (movies/music) on the failing disk are lost not on
> the whole cluster.

Will each disk have its own filesystem?
If this is not the case, you cannot say
that a single disk failure will lose only
some files.

> Also in my experience about disks and usage, while what you are saying
> was true a while ago when storage capacity had not hit multiple TBs.
> Now if I am buying 3-4 TB disks they are likely to last a while
> especially since the incremental % growth in sizes seem to be slowing
> down.

As written above, you can safely replace
disks before they fail, without compromising
the array.

bye,

pg
 
> Regards,
> Anshuman
> 
> On 30 October 2014 22:55, Piergiorgio Sartor
> <piergiorgio.sartor@nexgo.de> wrote:
> > On Thu, Oct 30, 2014 at 08:27:27PM +0530, Anshuman Aggarwal wrote:
> >>  What you are suggesting will work for delaying writing the checksum
> >> (but still making 2 disks work non stop and lead to failure, cost
> >> etc).
> >
> > Hi Anshuman,
> >
> > I'm a bit missing the point here.
> >
> > In my experience, with my storage systems, I change
> > disks because they're too small, way long before they
> > are too old (way long before they fail).
> > That's why I end up with a collection of small HDDs.
> > which, in turn, I recycled in some custom storage
> > system (using disks of different size, like explained
> > in one of the links posted before).
> >
> > Honestly, the only reason to spin down the disks, still
> > in my experience, is for reducing power consumption.
> > And this can be done with a RAID-6 without problems
> > and in a extremely flexible way.
> >
> > So, the bottom line, still in my experience, is that
> > this you're describing seems quite a nice situation.
> >
> > Or, I did not understood what you're proposing.
> >
> > Thanks,
> >
> > bye,
> >
> > pg
> >
> >> I am proposing N independent disks which are rarely accessed. When
> >> parity has to be written to the remaining 1,2 ...X disks ...it is
> >> batched up (bcache is feasible) and written out once in a while
> >> depending on how much write is happening. N-1 disks stay spun down and
> >> only X disks wake up periodically to get checksum written to (this
> >> would be tweaked by the user based on how up to date he needs the
> >> parity to be (tolerance of rebuilding parity in case of crash) and vs
> >> disk access for each parity write)
> >>
> >> It can't be done using any RAID6 because RAID5/6 will stripe all the
> >> data across the devices making any read access wake up all the
> >> devices. Ditto for writing to parity on every write to a single disk.
> >>
> >> The architecture being proposed is a lazy write to manage parity for
> >> individual disks which won't suffer from RAID catastrophic data loss
> >> and concurrent disk.
> >>
> >>
> >>
> >>
> >> On 30 October 2014 00:57, Ethan Wilson <ethan.wilson@shiftmail.org> wrote:
> >> > On 29/10/2014 10:25, Anshuman Aggarwal wrote:
> >> >>
> >> >> Right on most counts but please see comments below.
> >> >>
> >> >> On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
> >> >>>
> >> >>> Just to be sure I understand, you would have N + X devices.  Each of the
> >> >>> N
> >> >>> devices contains an independent filesystem and could be accessed directly
> >> >>> if
> >> >>> needed.  Each of the X devices contains some codes so that if at most X
> >> >>> devices in total died, you would still be able to recover all of the
> >> >>> data.
> >> >>> If more than X devices failed, you would still get complete data from the
> >> >>> working devices.
> >> >>>
> >> >>> Every update would only write to the particular N device on which it is
> >> >>> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
> >> >>> than X for the spin-down to be really worth it.
> >> >>>
> >> >>> Am I right so far?
> >> >>
> >> >> Perfectly right so far. I typically have a N to X ratio of 4 (4
> >> >> devices to 1 data) so spin down is totally worth it for data
> >> >> protection but more on that below.
> >> >>
> >> >>> For some reason the writes to X are delayed...  I don't really understand
> >> >>> that part.
> >> >>
> >> >> This delay is basically designed around archival devices which are
> >> >> rarely read from and even more rarely written to. By delaying writes
> >> >> on 2 criteria ( designated cache buffer filling up or preset time
> >> >> duration from last write expiring) we can significantly reduce the
> >> >> writes on the parity device. This assumes that we are ok to lose a
> >> >> movie or two in case the parity disk is not totally up to date but are
> >> >> more interested in device longevity.
> >> >>
> >> >>> Sounds like multi-parity RAID6 with no parity rotation and
> >> >>>    chunksize == devicesize
> >> >>
> >> >> RAID6 would present us with a joint device and currently only allows
> >> >> writes to that directly, yes? Any writes will be striped.
> >> >
> >> >
> >> > I am not totally sure I understand your design, but it seems to me that the
> >> > following solution could work for you:
> >> >
> >> > MD raid-6, maybe multi-parity (multi-parity not implemented yet in MD yet,
> >> > but just do a periodic scrub and 2 parities can be fine. Wake-up is not so
> >> > expensive that you can't scrub)
> >> >
> >> > Over that you put a raid1 of 2 x 4TB disks as a bcache cache device (those
> >> > two will never spin-down) in writeback mode with writeback_running=off .
> >> > This will prevent writes to backend and leave the backend array spun down.
> >> > When bcache is almost full (poll dirty_data), switch to writeback_running=on
> >> > and writethrough: it will wake up the backend raid6 array and flush all
> >> > dirty data. You can then then revert to writeback and writeback_running=off.
> >> > After this you can spin-down the backend array again.
> >> >
> >> > You also get read caching for free, which helps the backend array to stay
> >> > spun down as much as possible.
> >> >
> >> > Maybe you can modify bcache slightly so to implement an automatic switching
> >> > between the modes as described above, instead of polling the state from
> >> > outside.
> >> >
> >> > Would that work, or you are asking something different?
> >> >
> >> > EW
> >> >
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> >> > the body of a message to majordomo@vger.kernel.org
> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> > --
> >
> > piergiorgio

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
       [not found] ` <CAJvUf-BktH_E6jb5d94VuMVEBf_Be4i_8u_kBYU52Df1cu0gmg@mail.gmail.com>
@ 2014-11-01  5:36   ` Anshuman Aggarwal
  0 siblings, 0 replies; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-11-01  5:36 UTC (permalink / raw)
  To: Matt Garman; +Cc: Mdadm

On 31 October 2014 19:53, Matt Garman <matthew.garman@gmail.com> wrote:
> In a later post, you said you had a 4-to-1 scheme, but it wasn't clear to me
> if that was 1 drive worth of data, and 4 drives worth of checksum/backup, or
> the other way around.

I was wondering if anybody would catch that slip. I meant 4 data to 1
parity, which seems about the right mix to me so far based on my read
and feel of the probability of drive failure.

>
> In your proposed scheme, I assume you want your actual data drives to be
> spinning all the time?  Otherwise, when you go to read data (play
> music/videos), you have the multi-second spinup delay... or is that OK with
> you?

Well, actually, in my experience with 6-8 2-4TB drives there is a lot
of music/video content that I don't end up playing that often. Those
drives can easily be spun down (maybe for days on end, and at least all
night), and a small initial (one-time) delay before playing a file whose
drive hasn't been accessed seems like an easy trade-off (both for power
and drive life).

>
> Some other considerations: modern 5400 RPM drives generally consume less
> than five watts in idle state[1].  Actual AC draw will be higher due to
> power supply inefficiency, so we'll err on the conservative side and say
> each drive requires 10 AC watts of power.  My electrical rates in Chicago
> are about average for the USA (11 or 12 cents/kWH), and conveniently it
> roughly works out such that one always-on watt costs about $1/year.  So,
> each always-running hard drive will cost about $10/year to run, less with a
> more efficient power supply.  I know electricity is substantially more
> expensive in many parts of the world; or maybe you're running off-the-grid
> (e.g. solar) and have a very small power budget?

Besides the cost, there is an environmental aspect. If something has
superior efficiency and increases the life of the product, isn't it a
good thing wherever we live on the planet? BTW, great calculation, but
I moved back (to India) from San Francisco some time ago :) and the
electricity cost is quite high (and availability of supply is not 100%
yet). I'd like to maximize my backups, and keeping disks spinning for
hours on end when they are not being used sounds bad.

Just to add, internet is metered per GB in many places (and sadly in
mine) for high-speed access (meaning 4-8 MBps), so I have to store
content locally (before cloud suggestions are thrown around).

>
> On Wed, Oct 29, 2014 at 2:15 AM, Anshuman Aggarwal


> <anshuman.aggarwal@gmail.com> wrote:
>>
>> - SnapRAID (http://snapraid.sourceforge.net/) which is a snapshot
>> based scheme (Its advantages are that its in user space and has cross
>> platform support but has the huge disadvantage of every checksum being
>> done from scratch slowing the system, causing immense wear and tear on
>> every snapshot and also losing any information updates upto the
>> snapshot point etc)
>
>
> Last time I looked at SnapRAID, it seemed like yours was its target use
> case.  The "huge disadvantage of every checksum being done from scratch"
> sounds like a SnapRAID feature enhancement that might be
> simpler/easier/faster-to-get done than a major enhancement to the Linux
> kernel (just speculating though).

SnapRAID can't be enhanced without involving the kernel, because the
delta checksum requires knowing which blocks were written to, and only
a kernel-level driver can know that. This is a hard reality, no way
around it, and that was my reason to propose this.
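To make the block-tracking idea concrete, here is a minimal userspace
sketch (purely illustrative, not the proposed kernel driver or any
existing API) of the dirty-block bitmap such a driver could maintain,
so that a delta parity sync only touches blocks that were actually
written instead of re-reading every disk end to end:

# Illustrative sketch only: the kind of dirty-block bitmap a kernel-level
# Split RAID driver could keep per data disk, so parity only has to be
# refreshed for blocks that were actually written.
class DirtyBlockTracker:
    def __init__(self, device_size, block_size=64 * 1024):
        self.block_size = block_size
        nblocks = (device_size + block_size - 1) // block_size
        self.dirty = bytearray((nblocks + 7) // 8)   # one bit per block

    def mark_written(self, offset, length):
        """Record that [offset, offset+length) was written on the data disk."""
        first = offset // self.block_size
        last = (offset + length - 1) // self.block_size
        for b in range(first, last + 1):
            self.dirty[b // 8] |= 1 << (b % 8)

    def dirty_blocks(self):
        """Block numbers whose parity needs refreshing at the next sync."""
        for b in range(len(self.dirty) * 8):
            if self.dirty[b // 8] & (1 << (b % 8)):
                yield b

    def clear(self):
        """Call after the parity disk(s) have been brought up to date."""
        for i in range(len(self.dirty)):
            self.dirty[i] = 0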


>
> But, on the other hand, by your use case description, writes are very
> infrequent, and you're willing to buffer checksum updates for quite a
> while... so what if you had a *monthly* cron job to do parity syncs?
> Schedule it for a time when the system is unlikely to be used to offset the
> increased load.  That's only 12 "hard" tasks for the drive per year.  I'm
> not an expert, but that doesn't "feel" like a lot of wear and tear.

Well, again, anything from infrequent updates down to weekly or monthly
crons sounds like a bad compromise either way, when a better
incremental update could store the checksums in a buffer and write them
out eventually (2-3 times a day). Almost always the buffer will get
written out, giving us an updated parity with little to no "extra"
wear and tear.

>
> On the issue of wear and tear, I've mostly given up trying to understand
> what's best for my drives.  One school of thought says many spinup-spindown
> cycles are actually harder on the drive than running 24/7.  But maybe
> consumer drives actually aren't designed for 24/7 operation, so they're
> better off being cycled up and down.  Or consumer drives can't handle the
> vibrations of being in a case with other 24/7 drives.  But failure
> to"exercise" the entire drive regularly enough might result in a situation
> where an error has developed but you don't know until it's too late or your
> warranty period has expired.

You are right about consumer drives, where spin-downs are good; a
spin-down timeout of an hour or so should reduce unnecessary spin
up/down cycles. Once spun down, most drives may stay that way for days,
which is better for all of us (energy, wastage of drives, etc.).
Spin-down technology is progressing faster than block failure (also
because block density is going up, making media failure rather than
head failure the primary cause of drive outage).

The drives can be tested periodically (with a non-destructive bad
blocks pass, etc.) as a pure testing exercise to find errors as they
develop. There is no need to needlessly stress the drives by
reading/writing all parts continuously. Also, RAID speeds are often no
longer required given the higher R/W throughput coming from the drives.


Thanks for reading and writing such a thorough reply.

Neil, would you be willing to assist or guide with the design, or with
the best approach to the same? I would like to avoid the obvious
pitfalls that any new kernel block-level device writer is bound to
face.

Regards,
Anshuman

>
>
> [1] http://www.silentpcreview.com/article29-page2.html
>
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-31 11:05           ` Anshuman Aggarwal
@ 2014-10-31 14:25             ` Matt Garman
  2014-11-01 12:55             ` Piergiorgio Sartor
  1 sibling, 0 replies; 44+ messages in thread
From: Matt Garman @ 2014-10-31 14:25 UTC (permalink / raw)
  To: Anshuman Aggarwal; +Cc: Piergiorgio Sartor, Ethan Wilson, Mdadm

(Re-posting as I forgot to change to plaintext mode for the mailing
list, sorry for any dups.)

In a later post, you said you had a 4-to-1 scheme, but it wasn't clear
to me if that was 1 drive worth of data, and 4 drives worth of
checksum/backup, or the other way around.

In your proposed scheme, I assume you want your actual data drives to
be spinning all the time?  Otherwise, when you go to read data (play
music/videos), you have the multi-second spinup delay... or is that OK
with you?

Some other considerations: modern 5400 RPM drives generally consume
less than five watts in idle state[1].  Actual AC draw will be higher
due to power supply inefficiency, so we'll err on the conservative
side and say each drive requires 10 AC watts of power.  My electrical
rates in Chicago are about average for the USA (11 or 12 cents/kWH),
and conveniently it roughly works out such that one always-on watt
costs about $1/year.  So, each always-running hard drive will cost
about $10/year to run, less with a more efficient power supply.  I
know electricity is substantially more expensive in many parts of the
world; or maybe you're running off-the-grid (e.g. solar) and have a
very small power budget?
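For reference, the arithmetic behind that ~$10/year figure is just
watts x hours per year x electricity rate; a quick sketch of the
calculation (the 10 W and $0.115/kWh inputs are the rough numbers from
above):

# Rough annual running cost of one always-on drive, measured at the wall.
def yearly_cost(watts_at_wall, price_per_kwh):
    kwh_per_year = watts_at_wall * 24 * 365 / 1000.0
    return kwh_per_year * price_per_kwh

print(round(yearly_cost(10, 0.115), 2))   # -> 10.07, i.e. ~$10/year per drive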

On Wed, Oct 29, 2014 at 2:15 AM, Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:
>
> - SnapRAID (http://snapraid.sourceforge.net/) which is a snapshot
> based scheme (Its advantages are that its in user space and has cross
> platform support but has the huge disadvantage of every checksum being
> done from scratch slowing the system, causing immense wear and tear on
> every snapshot and also losing any information updates upto the
> snapshot point etc)


Last time I looked at SnapRAID, it seemed like yours was its target
use case.  The "huge disadvantage of every checksum being done from
scratch" sounds like a SnapRAID feature enhancement that might be
simpler/easier/faster-to-get done than a major enhancement to the
Linux kernel (just speculating though).

But, on the other hand, by your use case description, writes are very
infrequent, and you're willing to buffer checksum updates for quite a
while... so what if you had a *monthly* cron job to do parity syncs?
Schedule it for a time when the system is unlikely to be used to
offset the increased load.  That's only 12 "hard" tasks for the drive
per year.  I'm not an expert, but that doesn't "feel" like a lot of
wear and tear.

On the issue of wear and tear, I've mostly given up trying to
understand what's best for my drives.  One school of thought says many
spinup-spindown cycles are actually harder on the drive than running
24/7.  But maybe consumer drives actually aren't designed for 24/7
operation, so they're better off being cycled up and down.  Or
consumer drives can't handle the vibrations of being in a case with
other 24/7 drives.  But failure to "exercise" the entire drive
regularly enough might result in a situation where an error has
developed but you don't know until it's too late or your warranty
period has expired.


[1] http://www.silentpcreview.com/article29-page2.html


On Fri, Oct 31, 2014 at 6:05 AM, Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:
> Hi pg,
> With MD raid striping all the writes not only does it keep ALL disks
> spinning to read/write the current content, it also leads to
> catastrophic data loss in case the rebuild/disk failure exceeds the
> number of parity disks.
>
> But more importantly, I find myself setting up multiple RAID levels
> (at least RAID6 and now thinking of more) just to make sure that MD
> raid will recover my data and not lose the whole cluster if an
> additional disk fails above the number of parity!!! The biggest
> advantage of the scheme that I have outlined is that with a single
> check sum I am mostly assure of a failed disk restoration and worst
> case only the media (movies/music) on the failing disk are lost not on
> the whole cluster.
>
> Also in my experience about disks and usage, while what you are saying
> was true a while ago when storage capacity had not hit multiple TBs.
> Now if I am buying 3-4 TB disks they are likely to last a while
> especially since the incremental % growth in sizes seem to be slowing
> down.
>
> Regards,
> Anshuman
>
> On 30 October 2014 22:55, Piergiorgio Sartor
> <piergiorgio.sartor@nexgo.de> wrote:
>> On Thu, Oct 30, 2014 at 08:27:27PM +0530, Anshuman Aggarwal wrote:
>>>  What you are suggesting will work for delaying writing the checksum
>>> (but still making 2 disks work non stop and lead to failure, cost
>>> etc).
>>
>> Hi Anshuman,
>>
>> I'm a bit missing the point here.
>>
>> In my experience, with my storage systems, I change
>> disks because they're too small, way long before they
>> are too old (way long before they fail).
>> That's why I end up with a collection of small HDDs.
>> which, in turn, I recycled in some custom storage
>> system (using disks of different size, like explained
>> in one of the links posted before).
>>
>> Honestly, the only reason to spin down the disks, still
>> in my experience, is for reducing power consumption.
>> And this can be done with a RAID-6 without problems
>> and in a extremely flexible way.
>>
>> So, the bottom line, still in my experience, is that
>> this you're describing seems quite a nice situation.
>>
>> Or, I did not understood what you're proposing.
>>
>> Thanks,
>>
>> bye,
>>
>> pg
>>
>>> I am proposing N independent disks which are rarely accessed. When
>>> parity has to be written to the remaining 1,2 ...X disks ...it is
>>> batched up (bcache is feasible) and written out once in a while
>>> depending on how much write is happening. N-1 disks stay spun down and
>>> only X disks wake up periodically to get checksum written to (this
>>> would be tweaked by the user based on how up to date he needs the
>>> parity to be (tolerance of rebuilding parity in case of crash) and vs
>>> disk access for each parity write)
>>>
>>> It can't be done using any RAID6 because RAID5/6 will stripe all the
>>> data across the devices making any read access wake up all the
>>> devices. Ditto for writing to parity on every write to a single disk.
>>>
>>> The architecture being proposed is a lazy write to manage parity for
>>> individual disks which won't suffer from RAID catastrophic data loss
>>> and concurrent disk.
>>>
>>>
>>>
>>>
>>> On 30 October 2014 00:57, Ethan Wilson <ethan.wilson@shiftmail.org> wrote:
>>> > On 29/10/2014 10:25, Anshuman Aggarwal wrote:
>>> >>
>>> >> Right on most counts but please see comments below.
>>> >>
>>> >> On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>>> >>>
>>> >>> Just to be sure I understand, you would have N + X devices.  Each of the
>>> >>> N
>>> >>> devices contains an independent filesystem and could be accessed directly
>>> >>> if
>>> >>> needed.  Each of the X devices contains some codes so that if at most X
>>> >>> devices in total died, you would still be able to recover all of the
>>> >>> data.
>>> >>> If more than X devices failed, you would still get complete data from the
>>> >>> working devices.
>>> >>>
>>> >>> Every update would only write to the particular N device on which it is
>>> >>> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
>>> >>> than X for the spin-down to be really worth it.
>>> >>>
>>> >>> Am I right so far?
>>> >>
>>> >> Perfectly right so far. I typically have a N to X ratio of 4 (4
>>> >> devices to 1 data) so spin down is totally worth it for data
>>> >> protection but more on that below.
>>> >>
>>> >>> For some reason the writes to X are delayed...  I don't really understand
>>> >>> that part.
>>> >>
>>> >> This delay is basically designed around archival devices which are
>>> >> rarely read from and even more rarely written to. By delaying writes
>>> >> on 2 criteria ( designated cache buffer filling up or preset time
>>> >> duration from last write expiring) we can significantly reduce the
>>> >> writes on the parity device. This assumes that we are ok to lose a
>>> >> movie or two in case the parity disk is not totally up to date but are
>>> >> more interested in device longevity.
>>> >>
>>> >>> Sounds like multi-parity RAID6 with no parity rotation and
>>> >>>    chunksize == devicesize
>>> >>
>>> >> RAID6 would present us with a joint device and currently only allows
>>> >> writes to that directly, yes? Any writes will be striped.
>>> >
>>> >
>>> > I am not totally sure I understand your design, but it seems to me that the
>>> > following solution could work for you:
>>> >
>>> > MD raid-6, maybe multi-parity (multi-parity not implemented yet in MD yet,
>>> > but just do a periodic scrub and 2 parities can be fine. Wake-up is not so
>>> > expensive that you can't scrub)
>>> >
>>> > Over that you put a raid1 of 2 x 4TB disks as a bcache cache device (those
>>> > two will never spin-down) in writeback mode with writeback_running=off .
>>> > This will prevent writes to backend and leave the backend array spun down.
>>> > When bcache is almost full (poll dirty_data), switch to writeback_running=on
>>> > and writethrough: it will wake up the backend raid6 array and flush all
>>> > dirty data. You can then then revert to writeback and writeback_running=off.
>>> > After this you can spin-down the backend array again.
>>> >
>>> > You also get read caching for free, which helps the backend array to stay
>>> > spun down as much as possible.
>>> >
>>> > Maybe you can modify bcache slightly so to implement an automatic switching
>>> > between the modes as described above, instead of polling the state from
>>> > outside.
>>> >
>>> > Would that work, or you are asking something different?
>>> >
>>> > EW
>>> >
>>> > --
>>> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>> > the body of a message to majordomo@vger.kernel.org
>>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>> --
>>
>> piergiorgio
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-30 17:25         ` Piergiorgio Sartor
@ 2014-10-31 11:05           ` Anshuman Aggarwal
  2014-10-31 14:25             ` Matt Garman
  2014-11-01 12:55             ` Piergiorgio Sartor
  0 siblings, 2 replies; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-10-31 11:05 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: Ethan Wilson, linux-raid

Hi pg,
With MD RAID striping all the writes, not only does it keep ALL disks
spinning to read/write the current content, it also leads to
catastrophic data loss in case the number of failed disks (during a
rebuild or otherwise) exceeds the number of parity disks.

But more importantly, I find myself setting up multiple RAID levels
(at least RAID6, and now thinking of more) just to make sure that MD
RAID will recover my data and not lose the whole cluster if one more
disk fails than there are parities!!! The biggest advantage of the
scheme that I have outlined is that with a single checksum I am mostly
assured of restoring a failed disk, and worst case only the media
(movies/music) on the failing disk is lost, not the whole cluster.
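As a concrete illustration of that recovery property, here is a minimal
sketch assuming a single XOR parity disk over N independent data disks
(the block contents are made up for the example):

# Single XOR parity over N independent data disks: any one failed
# disk's block is rebuilt from the survivors plus the parity block,
# and disks that fail beyond the parity count lose only their own
# contents -- the other filesystems stay readable as-is.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

data = [b"disk1 block", b"disk2 block", b"disk3 block", b"disk4 block"]
parity = xor_blocks(data)

# Disk 3 (index 2) dies: rebuild its block from the other disks plus parity.
survivors = [blk for i, blk in enumerate(data) if i != 2]
assert xor_blocks(survivors + [parity]) == data[2]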

Also, in my experience with disks and usage, what you are saying was
true a while ago, when storage capacity had not hit multiple TBs. Now,
if I am buying 3-4 TB disks, they are likely to last a while,
especially since the incremental % growth in sizes seems to be slowing
down.

Regards,
Anshuman

On 30 October 2014 22:55, Piergiorgio Sartor
<piergiorgio.sartor@nexgo.de> wrote:
> On Thu, Oct 30, 2014 at 08:27:27PM +0530, Anshuman Aggarwal wrote:
>>  What you are suggesting will work for delaying writing the checksum
>> (but still making 2 disks work non stop and lead to failure, cost
>> etc).
>
> Hi Anshuman,
>
> I'm a bit missing the point here.
>
> In my experience, with my storage systems, I change
> disks because they're too small, way long before they
> are too old (way long before they fail).
> That's why I end up with a collection of small HDDs.
> which, in turn, I recycled in some custom storage
> system (using disks of different size, like explained
> in one of the links posted before).
>
> Honestly, the only reason to spin down the disks, still
> in my experience, is for reducing power consumption.
> And this can be done with a RAID-6 without problems
> and in a extremely flexible way.
>
> So, the bottom line, still in my experience, is that
> this you're describing seems quite a nice situation.
>
> Or, I did not understood what you're proposing.
>
> Thanks,
>
> bye,
>
> pg
>
>> I am proposing N independent disks which are rarely accessed. When
>> parity has to be written to the remaining 1,2 ...X disks ...it is
>> batched up (bcache is feasible) and written out once in a while
>> depending on how much write is happening. N-1 disks stay spun down and
>> only X disks wake up periodically to get checksum written to (this
>> would be tweaked by the user based on how up to date he needs the
>> parity to be (tolerance of rebuilding parity in case of crash) and vs
>> disk access for each parity write)
>>
>> It can't be done using any RAID6 because RAID5/6 will stripe all the
>> data across the devices making any read access wake up all the
>> devices. Ditto for writing to parity on every write to a single disk.
>>
>> The architecture being proposed is a lazy write to manage parity for
>> individual disks which won't suffer from RAID catastrophic data loss
>> and concurrent disk.
>>
>>
>>
>>
>> On 30 October 2014 00:57, Ethan Wilson <ethan.wilson@shiftmail.org> wrote:
>> > On 29/10/2014 10:25, Anshuman Aggarwal wrote:
>> >>
>> >> Right on most counts but please see comments below.
>> >>
>> >> On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>> >>>
>> >>> Just to be sure I understand, you would have N + X devices.  Each of the
>> >>> N
>> >>> devices contains an independent filesystem and could be accessed directly
>> >>> if
>> >>> needed.  Each of the X devices contains some codes so that if at most X
>> >>> devices in total died, you would still be able to recover all of the
>> >>> data.
>> >>> If more than X devices failed, you would still get complete data from the
>> >>> working devices.
>> >>>
>> >>> Every update would only write to the particular N device on which it is
>> >>> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
>> >>> than X for the spin-down to be really worth it.
>> >>>
>> >>> Am I right so far?
>> >>
>> >> Perfectly right so far. I typically have a N to X ratio of 4 (4
>> >> devices to 1 data) so spin down is totally worth it for data
>> >> protection but more on that below.
>> >>
>> >>> For some reason the writes to X are delayed...  I don't really understand
>> >>> that part.
>> >>
>> >> This delay is basically designed around archival devices which are
>> >> rarely read from and even more rarely written to. By delaying writes
>> >> on 2 criteria ( designated cache buffer filling up or preset time
>> >> duration from last write expiring) we can significantly reduce the
>> >> writes on the parity device. This assumes that we are ok to lose a
>> >> movie or two in case the parity disk is not totally up to date but are
>> >> more interested in device longevity.
>> >>
>> >>> Sounds like multi-parity RAID6 with no parity rotation and
>> >>>    chunksize == devicesize
>> >>
>> >> RAID6 would present us with a joint device and currently only allows
>> >> writes to that directly, yes? Any writes will be striped.
>> >
>> >
>> > I am not totally sure I understand your design, but it seems to me that the
>> > following solution could work for you:
>> >
>> > MD raid-6, maybe multi-parity (multi-parity not implemented yet in MD yet,
>> > but just do a periodic scrub and 2 parities can be fine. Wake-up is not so
>> > expensive that you can't scrub)
>> >
>> > Over that you put a raid1 of 2 x 4TB disks as a bcache cache device (those
>> > two will never spin-down) in writeback mode with writeback_running=off .
>> > This will prevent writes to backend and leave the backend array spun down.
>> > When bcache is almost full (poll dirty_data), switch to writeback_running=on
>> > and writethrough: it will wake up the backend raid6 array and flush all
>> > dirty data. You can then then revert to writeback and writeback_running=off.
>> > After this you can spin-down the backend array again.
>> >
>> > You also get read caching for free, which helps the backend array to stay
>> > spun down as much as possible.
>> >
>> > Maybe you can modify bcache slightly so to implement an automatic switching
>> > between the modes as described above, instead of polling the state from
>> > outside.
>> >
>> > Would that work, or you are asking something different?
>> >
>> > EW
>> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> --
>
> piergiorgio

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-30 14:57       ` Anshuman Aggarwal
@ 2014-10-30 17:25         ` Piergiorgio Sartor
  2014-10-31 11:05           ` Anshuman Aggarwal
  0 siblings, 1 reply; 44+ messages in thread
From: Piergiorgio Sartor @ 2014-10-30 17:25 UTC (permalink / raw)
  To: Anshuman Aggarwal; +Cc: Ethan Wilson, linux-raid

On Thu, Oct 30, 2014 at 08:27:27PM +0530, Anshuman Aggarwal wrote:
>  What you are suggesting will work for delaying writing the checksum
> (but still making 2 disks work non stop and lead to failure, cost
> etc).

Hi Anshuman,

I'm a bit missing the point here.

In my experience, with my storage systems, I change
disks because they're too small, long before they
are too old (long before they fail).
That's why I end up with a collection of small HDDs,
which, in turn, I recycle in some custom storage
system (using disks of different sizes, as explained
in one of the links posted before).

Honestly, the only reason to spin down the disks, still
in my experience, is to reduce power consumption.
And this can be done with a RAID-6 without problems
and in an extremely flexible way.

So, the bottom line, still in my experience, is that
what you're describing already seems quite a nice situation.

Or I did not understand what you're proposing.

Thanks,

bye,

pg

> I am proposing N independent disks which are rarely accessed. When
> parity has to be written to the remaining 1,2 ...X disks ...it is
> batched up (bcache is feasible) and written out once in a while
> depending on how much write is happening. N-1 disks stay spun down and
> only X disks wake up periodically to get checksum written to (this
> would be tweaked by the user based on how up to date he needs the
> parity to be (tolerance of rebuilding parity in case of crash) and vs
> disk access for each parity write)
> 
> It can't be done using any RAID6 because RAID5/6 will stripe all the
> data across the devices making any read access wake up all the
> devices. Ditto for writing to parity on every write to a single disk.
> 
> The architecture being proposed is a lazy write to manage parity for
> individual disks which won't suffer from RAID catastrophic data loss
> and concurrent disk.
> 
> 
> 
> 
> On 30 October 2014 00:57, Ethan Wilson <ethan.wilson@shiftmail.org> wrote:
> > On 29/10/2014 10:25, Anshuman Aggarwal wrote:
> >>
> >> Right on most counts but please see comments below.
> >>
> >> On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
> >>>
> >>> Just to be sure I understand, you would have N + X devices.  Each of the
> >>> N
> >>> devices contains an independent filesystem and could be accessed directly
> >>> if
> >>> needed.  Each of the X devices contains some codes so that if at most X
> >>> devices in total died, you would still be able to recover all of the
> >>> data.
> >>> If more than X devices failed, you would still get complete data from the
> >>> working devices.
> >>>
> >>> Every update would only write to the particular N device on which it is
> >>> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
> >>> than X for the spin-down to be really worth it.
> >>>
> >>> Am I right so far?
> >>
> >> Perfectly right so far. I typically have a N to X ratio of 4 (4
> >> devices to 1 data) so spin down is totally worth it for data
> >> protection but more on that below.
> >>
> >>> For some reason the writes to X are delayed...  I don't really understand
> >>> that part.
> >>
> >> This delay is basically designed around archival devices which are
> >> rarely read from and even more rarely written to. By delaying writes
> >> on 2 criteria ( designated cache buffer filling up or preset time
> >> duration from last write expiring) we can significantly reduce the
> >> writes on the parity device. This assumes that we are ok to lose a
> >> movie or two in case the parity disk is not totally up to date but are
> >> more interested in device longevity.
> >>
> >>> Sounds like multi-parity RAID6 with no parity rotation and
> >>>    chunksize == devicesize
> >>
> >> RAID6 would present us with a joint device and currently only allows
> >> writes to that directly, yes? Any writes will be striped.
> >
> >
> > I am not totally sure I understand your design, but it seems to me that the
> > following solution could work for you:
> >
> > MD raid-6, maybe multi-parity (multi-parity not implemented yet in MD yet,
> > but just do a periodic scrub and 2 parities can be fine. Wake-up is not so
> > expensive that you can't scrub)
> >
> > Over that you put a raid1 of 2 x 4TB disks as a bcache cache device (those
> > two will never spin-down) in writeback mode with writeback_running=off .
> > This will prevent writes to backend and leave the backend array spun down.
> > When bcache is almost full (poll dirty_data), switch to writeback_running=on
> > and writethrough: it will wake up the backend raid6 array and flush all
> > dirty data. You can then then revert to writeback and writeback_running=off.
> > After this you can spin-down the backend array again.
> >
> > You also get read caching for free, which helps the backend array to stay
> > spun down as much as possible.
> >
> > Maybe you can modify bcache slightly so to implement an automatic switching
> > between the modes as described above, instead of polling the state from
> > outside.
> >
> > Would that work, or you are asking something different?
> >
> > EW
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-29  9:25   ` Anshuman Aggarwal
  2014-10-29 19:27     ` Ethan Wilson
@ 2014-10-30 15:00     ` Anshuman Aggarwal
  2014-11-03  5:52       ` NeilBrown
  1 sibling, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-10-30 15:00 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Would chunksize==disksize work? Wouldn't that lead to the entire
parity being invalidated by any write to any of the disks (assuming md
operates at a chunk level)? Also, please see my reply below.
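(For what it's worth, even with chunksize == devicesize the parity need
not be invalidated wholesale: with XOR-style parity only the parity
bytes at the written offset change, via the usual read-modify-write. A
minimal sketch of that identity, with made-up block contents:)

# Incremental parity update: P_new = P_old XOR D_old XOR D_new, so a
# write to one offset on one data disk only dirties the parity bytes at
# that same offset, not the entire parity device.
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

d1_old, d2, d3 = b"AAAA", b"BBBB", b"CCCC"
parity_old = xor(xor(d1_old, d2), d3)

d1_new = b"AZZA"                                   # overwrite part of disk 1
parity_new = xor(xor(parity_old, d1_old), d1_new)  # touch only parity at that offset

# Same result as recomputing the parity from all the data disks:
assert parity_new == xor(xor(d1_new, d2), d3)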

On 29 October 2014 14:55, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
> Right on most counts but please see comments below.
>
> On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>> Just to be sure I understand, you would have N + X devices.  Each of the N
>> devices contains an independent filesystem and could be accessed directly if
>> needed.  Each of the X devices contains some codes so that if at most X
>> devices in total died, you would still be able to recover all of the data.
>> If more than X devices failed, you would still get complete data from the
>> working devices.
>>
>> Every update would only write to the particular N device on which it is
>> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
>> than X for the spin-down to be really worth it.
>>
>> Am I right so far?
>
> Perfectly right so far. I typically have a N to X ratio of 4 (4
> devices to 1 data) so spin down is totally worth it for data
> protection but more on that below.
>
>>
>> For some reason the writes to X are delayed...  I don't really understand
>> that part.
>
> This delay is basically designed around archival devices which are
> rarely read from and even more rarely written to. By delaying writes
> on 2 criteria ( designated cache buffer filling up or preset time
> duration from last write expiring) we can significantly reduce the
> writes on the parity device. This assumes that we are ok to lose a
> movie or two in case the parity disk is not totally up to date but are
> more interested in device longevity.
>
>>
>> Sounds like multi-parity RAID6 with no parity rotation and
>>   chunksize == devicesize
> RAID6 would present us with a joint device and currently only allows
> writes to that directly, yes? Any writes will be striped.
> In any case would md raid allow the underlying device to be written to
> directly? Also how would it know that the device has been written to
> and hence parity has to be updated? What about the superblock which
> the FS would not know about?
>
> Also except for the delayed checksum writing part which would be
> significant if one of the objectives is to reduce the amount of
> writes. Can we delay that in the code currently for RAID6? I
> understand the objective of RAID6 is to ensure data recovery and we
> are looking at a compromise in this case.
>
> If feasible, this can be an enhancement to MD RAID as well where N
> devices are presented instead of a single joint device in case of
> raid6 (maybe the multi part device can be individual disks?)
>
> It will certainly solve my problem of where to store the metadata. I
> was currently hoping to just store it as a configuration file to be
> read by the initramfs since in this case worst case scenario the
> checksum goes out of sync and is rebuilt from scratch.
>
>>
>> I wouldn't use device-mapper myself, but you are unlikely to get an entirely
>> impartial opinion from me on that topic.
>
> I haven't hacked around the kernel internals much so far so will have
> to dig out that history. I will welcome any particular links/mail
> threads I should look at for guidance (with both yours and opposing
> points of view)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-29 19:27     ` Ethan Wilson
@ 2014-10-30 14:57       ` Anshuman Aggarwal
  2014-10-30 17:25         ` Piergiorgio Sartor
  0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-10-30 14:57 UTC (permalink / raw)
  To: Ethan Wilson; +Cc: linux-raid

What you are suggesting will work for delaying the checksum writes
(but it still makes 2 disks work non-stop, leading to failure, cost,
etc.).
I am proposing N independent disks which are rarely accessed. When
parity has to be written to the remaining 1, 2 ... X disks, it is
batched up (bcache is feasible) and written out once in a while,
depending on how much writing is happening. N-1 disks stay spun down and
only the X disks wake up periodically to have the checksum written to
them (this would be tweaked by the user based on how up to date he
needs the parity to be, trading tolerance for rebuilding parity after a
crash against disk access for each parity write).

It can't be done using any RAID6 because RAID5/6 will stripe all the
data across the devices making any read access wake up all the
devices. Ditto for writing to parity on every write to a single disk.

The architecture being proposed is a lazy write to manage parity for
individual disks, which won't suffer from RAID's catastrophic data loss
or from constant concurrent disk activity.
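A minimal sketch of that lazy flush policy (the buffer size and timer
thresholds, and all the names here, are illustrative assumptions, not
an existing interface):

import time

# Pending parity deltas accumulate in memory and are flushed to the X
# parity disk(s) only when the buffer fills up or enough time has
# passed since the last flush -- so the parity disks spin up rarely.
class LazyParityBuffer:
    def __init__(self, flush_fn, max_bytes=64 * 1024 * 1024, max_age_s=4 * 3600):
        self.flush_fn = flush_fn       # actually writes to the parity disk(s)
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self.pending = {}              # block number -> parity delta bytes
        self.pending_bytes = 0
        self.last_flush = time.monotonic()

    def add(self, block_no, delta):
        old = self.pending.get(block_no)
        if old is not None:
            self.pending_bytes -= len(old)
        self.pending[block_no] = delta
        self.pending_bytes += len(delta)
        if (self.pending_bytes >= self.max_bytes
                or time.monotonic() - self.last_flush >= self.max_age_s):
            self.flush()

    def flush(self):
        if self.pending:
            self.flush_fn(self.pending)    # parity disks wake up only here
        self.pending.clear()
        self.pending_bytes = 0
        self.last_flush = time.monotonic()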




On 30 October 2014 00:57, Ethan Wilson <ethan.wilson@shiftmail.org> wrote:
> On 29/10/2014 10:25, Anshuman Aggarwal wrote:
>>
>> Right on most counts but please see comments below.
>>
>> On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>>>
>>> Just to be sure I understand, you would have N + X devices.  Each of the
>>> N
>>> devices contains an independent filesystem and could be accessed directly
>>> if
>>> needed.  Each of the X devices contains some codes so that if at most X
>>> devices in total died, you would still be able to recover all of the
>>> data.
>>> If more than X devices failed, you would still get complete data from the
>>> working devices.
>>>
>>> Every update would only write to the particular N device on which it is
>>> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
>>> than X for the spin-down to be really worth it.
>>>
>>> Am I right so far?
>>
>> Perfectly right so far. I typically have a N to X ratio of 4 (4
>> devices to 1 data) so spin down is totally worth it for data
>> protection but more on that below.
>>
>>> For some reason the writes to X are delayed...  I don't really understand
>>> that part.
>>
>> This delay is basically designed around archival devices which are
>> rarely read from and even more rarely written to. By delaying writes
>> on 2 criteria ( designated cache buffer filling up or preset time
>> duration from last write expiring) we can significantly reduce the
>> writes on the parity device. This assumes that we are ok to lose a
>> movie or two in case the parity disk is not totally up to date but are
>> more interested in device longevity.
>>
>>> Sounds like multi-parity RAID6 with no parity rotation and
>>>    chunksize == devicesize
>>
>> RAID6 would present us with a joint device and currently only allows
>> writes to that directly, yes? Any writes will be striped.
>
>
> I am not totally sure I understand your design, but it seems to me that the
> following solution could work for you:
>
> MD raid-6, maybe multi-parity (multi-parity not implemented yet in MD yet,
> but just do a periodic scrub and 2 parities can be fine. Wake-up is not so
> expensive that you can't scrub)
>
> Over that you put a raid1 of 2 x 4TB disks as a bcache cache device (those
> two will never spin-down) in writeback mode with writeback_running=off .
> This will prevent writes to backend and leave the backend array spun down.
> When bcache is almost full (poll dirty_data), switch to writeback_running=on
> and writethrough: it will wake up the backend raid6 array and flush all
> dirty data. You can then then revert to writeback and writeback_running=off.
> After this you can spin-down the backend array again.
>
> You also get read caching for free, which helps the backend array to stay
> spun down as much as possible.
>
> Maybe you can modify bcache slightly so to implement an automatic switching
> between the modes as described above, instead of polling the state from
> outside.
>
> Would that work, or you are asking something different?
>
> EW
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-29  9:25   ` Anshuman Aggarwal
@ 2014-10-29 19:27     ` Ethan Wilson
  2014-10-30 14:57       ` Anshuman Aggarwal
  2014-10-30 15:00     ` Anshuman Aggarwal
  1 sibling, 1 reply; 44+ messages in thread
From: Ethan Wilson @ 2014-10-29 19:27 UTC (permalink / raw)
  To: linux-raid

On 29/10/2014 10:25, Anshuman Aggarwal wrote:
> Right on most counts but please see comments below.
>
> On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>> Just to be sure I understand, you would have N + X devices.  Each of the N
>> devices contains an independent filesystem and could be accessed directly if
>> needed.  Each of the X devices contains some codes so that if at most X
>> devices in total died, you would still be able to recover all of the data.
>> If more than X devices failed, you would still get complete data from the
>> working devices.
>>
>> Every update would only write to the particular N device on which it is
>> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
>> than X for the spin-down to be really worth it.
>>
>> Am I right so far?
> Perfectly right so far. I typically have a N to X ratio of 4 (4
> devices to 1 data) so spin down is totally worth it for data
> protection but more on that below.
>
>> For some reason the writes to X are delayed...  I don't really understand
>> that part.
> This delay is basically designed around archival devices which are
> rarely read from and even more rarely written to. By delaying writes
> on 2 criteria ( designated cache buffer filling up or preset time
> duration from last write expiring) we can significantly reduce the
> writes on the parity device. This assumes that we are ok to lose a
> movie or two in case the parity disk is not totally up to date but are
> more interested in device longevity.
>
>> Sounds like multi-parity RAID6 with no parity rotation and
>>    chunksize == devicesize
> RAID6 would present us with a joint device and currently only allows
> writes to that directly, yes? Any writes will be striped.

I am not totally sure I understand your design, but it seems to me that 
the following solution could work for you:

MD RAID-6, maybe multi-parity (multi-parity is not implemented in MD
yet, but just do a periodic scrub and 2 parities can be fine. Wake-up is
not so expensive that you can't scrub).

Over that you put a raid1 of 2 x 4TB disks as a bcache cache device
(those two will never spin down) in writeback mode with
writeback_running=off. This will prevent writes to the backend and
leave the backend array spun down.
When bcache is almost full (poll dirty_data), switch to
writeback_running=on and writethrough: it will wake up the backend
raid6 array and flush all dirty data. You can then revert to writeback
and writeback_running=off. After this you can spin down the backend
array again.

You also get read caching for free, which helps the backend array to 
stay spun down as much as possible.
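A rough userspace sketch of that polling/switching loop (assuming the
stock bcache sysfs attributes dirty_data, writeback_running and
cache_mode under /sys/block/bcache0/bcache/; the thresholds and sleep
intervals are arbitrary):

import time

BCACHE = "/sys/block/bcache0/bcache"
FLUSH_THRESHOLD = 3 * 1024**4            # start flushing at ~3 TiB dirty

def read_attr(name):
    with open(f"{BCACHE}/{name}") as f:
        return f.read().strip()

def write_attr(name, value):
    with open(f"{BCACHE}/{name}", "w") as f:
        f.write(str(value))

def dirty_bytes():
    # dirty_data is human-readable, e.g. "488.2M" or "1.2T"
    text = read_attr("dirty_data")
    units = {"k": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}
    if text and text[-1] in units:
        return float(text[:-1]) * units[text[-1]]
    return float(text)

while True:
    if dirty_bytes() >= FLUSH_THRESHOLD:
        write_attr("writeback_running", 1)      # wake the backend, flush dirty data
        write_attr("cache_mode", "writethrough")
        while dirty_bytes() > 0:
            time.sleep(60)
        write_attr("cache_mode", "writeback")   # go back to accumulating writes
        write_attr("writeback_running", 0)
        # the backend array can be spun down again at this point
    time.sleep(300)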

Maybe you can modify bcache slightly so as to implement this automatic
switching between the modes as described above, instead of polling the
state from outside.

Would that work, or you are asking something different?

EW


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-29  9:05 ` NeilBrown
@ 2014-10-29  9:25   ` Anshuman Aggarwal
  2014-10-29 19:27     ` Ethan Wilson
  2014-10-30 15:00     ` Anshuman Aggarwal
  0 siblings, 2 replies; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-10-29  9:25 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Right on most counts but please see comments below.

On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
> Just to be sure I understand, you would have N + X devices.  Each of the N
> devices contains an independent filesystem and could be accessed directly if
> needed.  Each of the X devices contains some codes so that if at most X
> devices in total died, you would still be able to recover all of the data.
> If more than X devices failed, you would still get complete data from the
> working devices.
>
> Every update would only write to the particular N device on which it is
> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
> than X for the spin-down to be really worth it.
>
> Am I right so far?

Perfectly right so far. I typically have a N to X ratio of 4 (4
devices to 1 data) so spin down is totally worth it for data
protection but more on that below.

>
> For some reason the writes to X are delayed...  I don't really understand
> that part.

This delay is basically designed around archival devices which are
rarely read from and even more rarely written to. By delaying writes
on 2 criteria (a designated cache buffer filling up, or a preset time
duration from the last write expiring) we can significantly reduce the
writes on the parity device. This assumes that we are OK with losing a
movie or two in case the parity disk is not totally up to date, but are
more interested in device longevity.

>
> Sounds like multi-parity RAID6 with no parity rotation and
>   chunksize == devicesize
RAID6 would present us with a joint device and currently only allows
writes to that directly, yes? Any writes will be striped.
In any case, would md raid allow the underlying device to be written to
directly? Also, how would it know that the device has been written to
and hence that parity has to be updated? What about the superblock,
which the FS would not know about?

There is also the delayed checksum writing part, which would be
significant if one of the objectives is to reduce the amount of
writes. Can we currently delay that in the RAID6 code? I understand
the objective of RAID6 is to ensure data recovery and we are looking
at a compromise in this case.

If feasible, this could also be an enhancement to MD RAID, where the N
devices are presented individually instead of as a single joint device
in the case of RAID6 (maybe the multi-part device could be the
individual disks?).

It would certainly solve my problem of where to store the metadata. I
was currently hoping to just store it as a configuration file to be
read by the initramfs, since in that case the worst-case scenario is
that the checksum goes out of sync and is rebuilt from scratch.
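Purely as an illustration of that configuration-file idea (the file
name, keys and layout here are hypothetical, not an existing format),
the initramfs-side tool could read something as simple as:

import configparser

# Hypothetical config (would live in e.g. /etc/splitraid.conf and be
# read by the initramfs before assembling the Split RAID set):
EXAMPLE = """
[splitraid0]
data = /dev/sdb /dev/sdc /dev/sdd /dev/sde
parity = /dev/sdf
clean_shutdown_flag = /var/lib/splitraid/clean
"""

cfg = configparser.ConfigParser()
cfg.read_string(EXAMPLE)
array = cfg["splitraid0"]
data_devices = array["data"].split()
parity_devices = array["parity"].split()
print(data_devices, parity_devices)
# If the clean-shutdown flag file is missing at boot, fall back to a
# full parity rebuild instead of trusting the stored checksum.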

>
> I wouldn't use device-mapper myself, but you are unlikely to get an entirely
> impartial opinion from me on that topic.

I haven't hacked around the kernel internals much so far, so I will
have to dig out that history. I would welcome any particular links/mail
threads I should look at for guidance (with both yours and opposing
points of view).

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-29  7:15 Anshuman Aggarwal
  2014-10-29  7:32 ` Roman Mamedov
@ 2014-10-29  9:05 ` NeilBrown
  2014-10-29  9:25   ` Anshuman Aggarwal
       [not found] ` <CAJvUf-BktH_E6jb5d94VuMVEBf_Be4i_8u_kBYU52Df1cu0gmg@mail.gmail.com>
  2 siblings, 1 reply; 44+ messages in thread
From: NeilBrown @ 2014-10-29  9:05 UTC (permalink / raw)
  To: Anshuman Aggarwal; +Cc: linux-raid


On Wed, 29 Oct 2014 12:45:34 +0530 Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:

> I'm outlining below a proposal for a RAID device mapper virtual block
> device for the kernel which adds "split raid" functionality on an
> incremental batch basis for a home media server/archived content which
> is rarely accessed.
> 
> Given a set of N+X block devices (of the same size but smallest common
> size wins)
> 
> the SplitRAID device mapper device generates virtual devices which are
> passthrough for N devices and write a Batched/Delayed checksum into
> the X devices so as to allow offline recovery of block on the N
> devices in case of a single disk failure.
> 
> Advantages over conventional RAID:
> 
> - Disks can be spun down reducing wear and tear over MD RAID Levels
> (such as 1, 10, 5,6) in the case of rarely accessed archival content
> 
> - Prevent catastrophic data loss for multiple device failure since
> each block device is independent and hence unlike MD RAID will only
> lose data incrementally.
> 
> - Performance degradation for writes can be achieved by keeping the
> checksum update asynchronous and delaying the fsync to the checksum
> block device.
> 
> In the event of improper shutdown the checksum may not have all the
> updated data but will be mostly up to date which is often acceptable
> for home media server requirements. A flag can be set in case the
> checksum block device was shutdown properly indicating that  a full
> checksum rebuild is not required.
> 
> Existing solutions considered:
> 
> - SnapRAID (http://snapraid.sourceforge.net/) which is a snapshot
> based scheme (Its advantages are that its in user space and has cross
> platform support but has the huge disadvantage of every checksum being
> done from scratch slowing the system, causing immense wear and tear on
> every snapshot and also losing any information updates upto the
> snapshot point etc)
> 
> I'd like to get opinions on the pros and cons of this proposal from
> more experienced people on the list to redirect suitably on the
> following questions:
> 
> - Maybe this can already be done using the block devices available in
> the kernel?
> 
> - If not, Device mapper the right API to use? (I think so)
> 
> - What would be the best block devices code to look at to implement?
> 
> Neil, would appreciate your weighing in on this.

Just to be sure I understand, you would have N + X devices.  Each of the N
devices contains an independent filesystem and could be accessed directly if
needed.  Each of the X devices contains some codes so that if at most X
devices in total died, you would still be able to recover all of the data.
If more than X devices failed, you would still get complete data from the
working devices.

Every update would only write to the particular N device on which it is
relevant, and  all of the X devices.  So N needs to be quite a bit bigger
than X for the spin-down to be really worth it.

Am I right so far?

For some reason the writes to X are delayed...  I don't really understand
that part.

Sounds like multi-parity RAID6 with no parity rotation and 
  chunksize == devicesize

I wouldn't use device-mapper myself, but you are unlikely to get an entirely
impartial opinion from me on that topic.

NeilBrown



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-29  7:32 ` Roman Mamedov
@ 2014-10-29  8:31   ` Anshuman Aggarwal
  0 siblings, 0 replies; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-10-29  8:31 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-raid

 Actually I already use a combination of these solutions (MD raid,
multiple devices + LVM2 to join). Unfortunately, none of these
solutions address the following:

- Full data loss in case disk failures exceed the RAID level (2 disks
in RAID5, 3 disks in RAID6). This proposal confines any loss to the
data on the individual failed disk(s).
- Continuous read/write to all disks, causing wear and tear, reducing
life and increasing end-user cost.

mhddfs (or something like it) will probably be used on top of the N
devices in this proposal to join them, but that is up to the
requirements of the user.


On 29 October 2014 13:02, Roman Mamedov <rm@romanrm.net> wrote:
> On Wed, 29 Oct 2014 12:45:34 +0530
> Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>
>> I'm outlining below a proposal for a RAID device mapper virtual block
>> device for the kernel which adds "split raid" functionality on an
>> incremental batch basis for a home media server/archived content which
>> is rarely accessed.
>
>> Existing solutions considered:
>
> Some of the already-available "home media server" setup schemes you did not
> mention:
>
> http://linuxconfig.org/prouhd-raid-for-the-end-user
> a smart way of managing MD RAID given multiple devices of various sizes;
>
> http://louwrentius.com/building-a-raid-6-array-of-mixed-drives.html
> what to do with a set of mixed-size drives, in simpler terms;
>
> https://romanrm.net/mhddfs
> File-level "concatenation" of disks, with smart distribution of new files;
>
> --
> With respect,
> Roman

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-29  7:15 Anshuman Aggarwal
@ 2014-10-29  7:32 ` Roman Mamedov
  2014-10-29  8:31   ` Anshuman Aggarwal
  2014-10-29  9:05 ` NeilBrown
       [not found] ` <CAJvUf-BktH_E6jb5d94VuMVEBf_Be4i_8u_kBYU52Df1cu0gmg@mail.gmail.com>
  2 siblings, 1 reply; 44+ messages in thread
From: Roman Mamedov @ 2014-10-29  7:32 UTC (permalink / raw)
  To: Anshuman Aggarwal; +Cc: linux-raid

On Wed, 29 Oct 2014 12:45:34 +0530
Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:

> I'm outlining below a proposal for a RAID device mapper virtual block
> device for the kernel which adds "split raid" functionality on an
> incremental batch basis for a home media server/archived content which
> is rarely accessed.

> Existing solutions considered:

Some of the already-available "home media server" setup schemes you did not
mention:

http://linuxconfig.org/prouhd-raid-for-the-end-user
a smart way of managing MD RAID given multiple devices of various sizes;

http://louwrentius.com/building-a-raid-6-array-of-mixed-drives.html
what to do with a set of mixed-size drives, in simpler terms;

https://romanrm.net/mhddfs
File-level "concatenation" of disks, with smart distribution of new files;

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Split RAID: Proposal for archival RAID using incremental batch checksum
@ 2014-10-29  7:15 Anshuman Aggarwal
  2014-10-29  7:32 ` Roman Mamedov
                   ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-10-29  7:15 UTC (permalink / raw)
  To: linux-raid

I'm outlining below a proposal for a RAID device mapper virtual block
device for the kernel which adds "split raid" functionality on an
incremental batch basis for a home media server/archived content which
is rarely accessed.

Given a set of N+X block devices (of the same size but smallest common
size wins)

the SplitRAID device mapper device generates virtual devices which are
passthrough for N devices and write a Batched/Delayed checksum into
the X devices so as to allow offline recovery of block on the N
devices in case of a single disk failure.

Advantages over conventional RAID:

- Disks can be spun down reducing wear and tear over MD RAID Levels
(such as 1, 10, 5,6) in the case of rarely accessed archival content

- Prevents catastrophic data loss on multiple device failure, since
each block device is independent and hence, unlike MD RAID, data is
only lost incrementally.

- Performance degradation for writes can be minimized by keeping the
checksum update asynchronous and delaying the fsync to the checksum
block device.

In the event of an improper shutdown the checksum may not have all the
updated data, but it will be mostly up to date, which is often acceptable
for home media server requirements. A flag can be set when the checksum
block device was shut down properly, indicating that a full checksum
rebuild is not required.
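
As a rough illustration of the batched/delayed checksum update and the
clean-shutdown flag described above, a minimal userspace sketch follows
(again assuming a single XOR parity device; the flush policy, the flag
handling and all names are illustrative assumptions, not a proposed
implementation):

import os
import threading

class DelayedParity:
    def __init__(self, parity_dev, flush_after=128):
        self.parity_dev = parity_dev    # file-like handle for the checksum device
        self.flush_after = flush_after  # flush once this many stripes are dirty
        self.dirty = {}                 # stripe number -> pending parity block
        self.lock = threading.Lock()

    def note_write(self, stripe_no, old_data, new_data, old_parity):
        # Record a write to a data device; the parity is updated in memory
        # only (new_parity = old_parity XOR old_data XOR new_data).
        new_parity = bytes(p ^ o ^ n
                           for p, o, n in zip(old_parity, old_data, new_data))
        with self.lock:
            self.dirty[stripe_no] = new_parity
            if len(self.dirty) >= self.flush_after:
                self._flush_locked()

    def _flush_locked(self):
        # Write out all pending parity blocks, then one delayed fsync.
        for stripe_no, parity in sorted(self.dirty.items()):
            self.parity_dev.seek(stripe_no * len(parity))
            self.parity_dev.write(parity)
        self.parity_dev.flush()
        os.fsync(self.parity_dev.fileno())
        self.dirty.clear()

    def clean_shutdown(self):
        # Flush everything; a real implementation would also clear the
        # persistent "parity may be stale" flag so no full rebuild is needed.
        with self.lock:
            self._flush_locked()

The only point of the sketch is that parity writes can be coalesced and
synced lazily; the cost is a window after a crash during which the
checksum device lags the data devices, which is exactly what the dirty
flag and the occasional full rebuild are meant to cover.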

Existing solutions considered:

- SnapRAID (http://snapraid.sourceforge.net/), which is a snapshot
based scheme. (Its advantages are that it's in user space and has cross
platform support, but it has the huge disadvantage of every checksum
being done from scratch, slowing the system, causing immense wear and
tear on every snapshot, and also losing any information updates up to
the snapshot point, etc.)

I'd like to get opinions on the pros and cons of this proposal from
more experienced people on the list, and to be pointed in the right
direction on the following questions:

- Can this already be done using the block devices available in the
kernel?

- If not, is device mapper the right API to use? (I think so)

- What would be the best block device code to look at when implementing this?

Neil, would appreciate your weighing in on this.

Regards,

Anshuman Aggarwal

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2015-01-06 11:40 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-21 10:15 Split RAID: Proposal for archival RAID using incremental batch checksum Anshuman Aggarwal
2014-11-21 11:41 ` Greg Freemyer
2014-11-21 18:48   ` Anshuman Aggarwal
2014-11-22 13:17     ` Greg Freemyer
2014-11-22 13:22       ` Anshuman Aggarwal
2014-11-22 14:03         ` Greg Freemyer
2014-11-22 14:43           ` Anshuman Aggarwal
2014-11-22 14:54             ` Greg Freemyer
2014-11-24  5:36               ` SandeepKsinha
2014-11-24  6:48                 ` Anshuman Aggarwal
2014-11-24 13:19                   ` Greg Freemyer
2014-11-24 17:28                     ` Anshuman Aggarwal
2014-11-24 18:10                       ` Valdis.Kletnieks at vt.edu
2014-11-25  4:56                       ` Greg Freemyer
2014-11-27 17:50                         ` Anshuman Aggarwal
2014-11-27 18:31                           ` Greg Freemyer
  -- strict thread matches above, loose matches on Subject: below --
2014-10-29  7:15 Anshuman Aggarwal
2014-10-29  7:32 ` Roman Mamedov
2014-10-29  8:31   ` Anshuman Aggarwal
2014-10-29  9:05 ` NeilBrown
2014-10-29  9:25   ` Anshuman Aggarwal
2014-10-29 19:27     ` Ethan Wilson
2014-10-30 14:57       ` Anshuman Aggarwal
2014-10-30 17:25         ` Piergiorgio Sartor
2014-10-31 11:05           ` Anshuman Aggarwal
2014-10-31 14:25             ` Matt Garman
2014-11-01 12:55             ` Piergiorgio Sartor
2014-11-06  2:29               ` Anshuman Aggarwal
2014-10-30 15:00     ` Anshuman Aggarwal
2014-11-03  5:52       ` NeilBrown
2014-11-03 18:04         ` Piergiorgio Sartor
2014-11-06  2:24         ` Anshuman Aggarwal
2014-11-24  7:29         ` Anshuman Aggarwal
2014-11-24 22:50           ` NeilBrown
2014-11-26  6:24             ` Anshuman Aggarwal
2014-12-01 16:00               ` Anshuman Aggarwal
2014-12-01 16:34                 ` Anshuman Aggarwal
2014-12-01 21:46                   ` NeilBrown
2014-12-02 11:56                     ` Anshuman Aggarwal
2014-12-16 16:25                       ` Anshuman Aggarwal
2014-12-16 21:49                         ` NeilBrown
2014-12-17  6:40                           ` Anshuman Aggarwal
2015-01-06 11:40                             ` Anshuman Aggarwal
     [not found] ` <CAJvUf-BktH_E6jb5d94VuMVEBf_Be4i_8u_kBYU52Df1cu0gmg@mail.gmail.com>
2014-11-01  5:36   ` Anshuman Aggarwal
