* clustered MD - beyond RAID1
@ 2015-12-18 15:29 Scott Sinno
  2015-12-20 23:25 ` NeilBrown
  0 siblings, 1 reply; 17+ messages in thread
From: Scott Sinno @ 2015-12-18 15:29 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]

Neil (or anyone well informed on mdadm development roadmaps),

	Aaron and I are engineers at NASA Goddard with a strong interest in
MDADM.  We currently host 6 PB (raw) of live JBOD storage, leveraging MDADM
exclusively for RAID functionality.

	We're very interested in clustered MDADM to improve data availability
in the environment, but note that only RAID1 is currently supported.
Are there plans in the nearish term (say, over the next year) to extend
clustered bitmap functionality to RAID5/6, or anything else you can
divulge on that front?  Thanks in advance for any guidance.




* Re: clustered MD - beyond RAID1
  2015-12-18 15:29 clustered MD - beyond RAID1 Scott Sinno
@ 2015-12-20 23:25 ` NeilBrown
  2015-12-21 19:19   ` Tejas Rao
  0 siblings, 1 reply; 17+ messages in thread
From: NeilBrown @ 2015-12-20 23:25 UTC (permalink / raw)
  To: Scott Sinno, linux-raid
  Cc: Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]


On Sat, Dec 19 2015, Scott Sinno wrote:

> Neil(or anyone well informed in mdadm development roadmaps),
> 	
> 	Aaron and myself are engineers at NASA Goddard with strong interest in
> MDADM.  We currently host 6PB(raw) of live JBOD storage leveraging MDADM
> exclusively for RAID functionality.
>
> 	We're very interested in Clustered MDADM to improve data-availability
> in the environment, but note that only RAID1 is currently supported.
> Are there plans in the nearish-term(say over the next year) to expound
> clustered bitmap functionality to RAID5/6, or anything else you can
> divulge on that front?  Thanks in advance for any guidance.

We don't talk about plans that are not backed by code - you can't trust
them.

However, I cannot imagine how you could make RAID5 work efficiently in a
cluster.
RAID1 works because we assume that the file system will have its own
locking to ensure that only one node writes to a given block at a given
time.  So while node A is writing to a block, RAID1 knows that no other
node is writing there, so it can update all copies and be sure no race
will result in the copies becoming inconsistent.

For this to work with RAID5 we would need to assume the filesystem will
ensure only one node is writing to a given stripe at a time, and that is
not realistic.

So to make it work we would need the md layer to lock each stripe during
an update.  I have trouble imagining that running with much speed.  Hard
to know without testing, of course.
I know of no one with plans to do that testing.

NeilBrown
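
To make the cost concrete, here is a minimal sketch of what per-stripe
cluster locking around a small RAID5 write might look like.  The
cluster_lock_stripe()/cluster_unlock_stripe() and block I/O helpers are
hypothetical stand-ins, not the md or DLM interfaces; the point is that
every small write would pay a cluster round trip before touching data or
parity.

/* Sketch of per-stripe cluster locking for a clustered RAID5 write path.
 * cluster_lock_stripe()/cluster_unlock_stripe() and the I/O helpers are
 * hypothetical stand-ins, not the real md or DLM interfaces. */
#include <stdio.h>
#include <stdint.h>

#define CHUNK_SECTORS 256          /* 128 KiB chunks, in 512-byte sectors */

static void cluster_lock_stripe(uint64_t s)   { printf("lock stripe %llu\n",   (unsigned long long)s); }
static void cluster_unlock_stripe(uint64_t s) { printf("unlock stripe %llu\n", (unsigned long long)s); }
static uint8_t read_block(int disk, uint64_t s)             { (void)disk; (void)s; return 0; }
static void    write_block(int disk, uint64_t s, uint8_t v) { (void)disk; (void)s; (void)v; }

/* One small write: read-modify-write of data + parity under a cluster lock. */
static void raid5_cluster_write(int data_disk, int parity_disk,
                                uint64_t sector, uint8_t new_data)
{
    uint64_t stripe = sector / CHUNK_SECTORS;

    cluster_lock_stripe(stripe);                 /* cluster round trip on every write */
    uint8_t old_data   = read_block(data_disk,   stripe);
    uint8_t old_parity = read_block(parity_disk, stripe);
    uint8_t new_parity = old_parity ^ old_data ^ new_data;   /* P' = P ^ D ^ D' */
    write_block(data_disk,   stripe, new_data);
    write_block(parity_disk, stripe, new_parity);
    cluster_unlock_stripe(stripe);
}

int main(void)
{
    raid5_cluster_write(0, 3, 1000, 0xab);
    return 0;
}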


* Re: clustered MD - beyond RAID1
  2015-12-20 23:25 ` NeilBrown
@ 2015-12-21 19:19   ` Tejas Rao
  2015-12-21 20:47     ` NeilBrown
  0 siblings, 1 reply; 17+ messages in thread
From: Tejas Rao @ 2015-12-21 19:19 UTC (permalink / raw)
  To: NeilBrown, Scott Sinno, linux-raid
  Cc: Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]

What if the application is doing the locking and making sure that only
one node writes to an md device at a time? Will this work? How are
rebuilds handled? This would be helpful with distributed filesystems
like GPFS/Lustre etc.

Tejas.

On 12/20/2015 18:25, NeilBrown wrote:
> On Sat, Dec 19 2015, Scott Sinno wrote:
>
>> Neil(or anyone well informed in mdadm development roadmaps),
>> 	
>> 	Aaron and myself are engineers at NASA Goddard with strong interest in
>> MDADM.  We currently host 6PB(raw) of live JBOD storage leveraging MDADM
>> exclusively for RAID functionality.
>>
>> 	We're very interested in Clustered MDADM to improve data-availability
>> in the environment, but note that only RAID1 is currently supported.
>> Are there plans in the nearish-term(say over the next year) to expound
>> clustered bitmap functionality to RAID5/6, or anything else you can
>> divulge on that front?  Thanks in advance for any guidance.
> We don't talk about plans that are not backed by code - you can't trust
> them.
>
> However I cannot imagine how you could make RAID5 work efficiently in a
> cluster.
> RAID1 works because we assume that the file system will have its own
> locking to ensure that only one node writes to a given block at a given
> time.  So while node-A is writing to a block, RAID1 knows that no other
> node is writing there so it can update all copies and be sure no race
> will result in the copies being inconsistent.
>
> For this to work with RAID5 we would need to assume the filesystem will
> ensure only one node is writing to a given stripe at a time, and that is
> not realistic.
>
> So to make it work we would need the md layer to lock each stripe during
> an update.  I have trouble imagining that running with much speed.  Hard
> to know without testing of course.
> I know of no-one with plans to do that testing.
>
> NeilBrown



* Re: clustered MD - beyond RAID1
  2015-12-21 19:19   ` Tejas Rao
@ 2015-12-21 20:47     ` NeilBrown
  2015-12-21 21:27       ` Tejas Rao
  0 siblings, 1 reply; 17+ messages in thread
From: NeilBrown @ 2015-12-21 20:47 UTC (permalink / raw)
  To: Tejas Rao, Scott Sinno, linux-raid
  Cc: Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]


On Tue, Dec 22 2015, Tejas Rao wrote:

> What if the application is doing the locking and making sure that only 1 
> node writes to a md device at a time? Will this work? How are rebuilds 
> handled? This would be helpful with distributed filesystems like 
> GPFS/lustre etc.
>

You would also need to make sure that the filesystem only wrote from a
single node at a time (or that you access the block device directly).  I
doubt GPFS/Lustre make any promise like that, but I'm happy to be educated.

Rebuilds are handled by using a cluster-wide lock to block all writes to
a range of addresses while those stripes are repaired.

NeilBrown
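
The range-locked rebuild can be pictured as a sliding window: take a
cluster-wide lock over a range of sectors, repair the stripes in it,
release it, move on.  A rough sketch under that assumption; the
cluster_lock_range()/repair_stripes() helpers are illustrative
placeholders, not the actual md-cluster code.

/* Sketch of a resync loop that blocks cluster writes to one address
 * range at a time while the stripes in that range are repaired. */
#include <stdio.h>
#include <stdint.h>

#define RESYNC_WINDOW (32ULL * 1024 * 2)   /* 32 MiB window, in 512-byte sectors */

static void cluster_lock_range(uint64_t start, uint64_t len)
{ printf("block writes to sectors [%llu, %llu)\n",
         (unsigned long long)start, (unsigned long long)(start + len)); }
static void cluster_unlock_range(uint64_t start, uint64_t len)
{ (void)len; printf("resume writes from sector %llu\n", (unsigned long long)start); }
static void repair_stripes(uint64_t start, uint64_t len) { (void)start; (void)len; }

static void resync(uint64_t array_sectors)
{
    for (uint64_t pos = 0; pos < array_sectors; pos += RESYNC_WINDOW) {
        uint64_t len = array_sectors - pos;
        if (len > RESYNC_WINDOW)
            len = RESYNC_WINDOW;
        cluster_lock_range(pos, len);   /* other nodes must not write here */
        repair_stripes(pos, len);       /* recompute mirrors/parity in this range */
        cluster_unlock_range(pos, len);
    }
}

int main(void) { resync(1024 * 1024); return 0; }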


* Re: clustered MD - beyond RAID1
  2015-12-21 20:47     ` NeilBrown
@ 2015-12-21 21:27       ` Tejas Rao
  2015-12-21 22:03         ` NeilBrown
  0 siblings, 1 reply; 17+ messages in thread
From: Tejas Rao @ 2015-12-21 21:27 UTC (permalink / raw)
  To: NeilBrown, Scott Sinno, linux-raid
  Cc: Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]

GPFS guarantees that only one node will write to a Linux block device,
using disk leases. Only a node with a disk lease has the right to submit
I/O, and disk leases expire every 30 seconds and need to be renewed.
Lustre and other distributed file systems have other ways of handling this.

Using md devices in a shared/clustered environment is not supported by
Red Hat on RHEL6 or RHEL7 kernels, so this is something we would not try
in our production environments.

Tejas.
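
As a generic illustration of that lease model (not GPFS internals; the
helper names are made up, and only the 30-second lifetime is taken from
the description above), an I/O path gated on a lease looks roughly like
this:

/* Generic sketch of lease-gated I/O: a node only submits writes while it
 * holds an unexpired lease.  Illustrative only -- not GPFS code. */
#include <stdio.h>
#include <stdbool.h>
#include <time.h>

#define LEASE_SECONDS 30          /* lease lifetime from the description above */

static time_t lease_expiry;       /* when this node's lease runs out */

static void renew_lease(void)     /* normally done periodically by a daemon */
{
    lease_expiry = time(NULL) + LEASE_SECONDS;
}

static bool lease_valid(void)
{
    return time(NULL) < lease_expiry;
}

static int submit_write(const void *buf, size_t len)
{
    (void)buf;
    if (!lease_valid()) {
        fprintf(stderr, "lease expired: refusing to submit I/O\n");
        return -1;                /* node must be expelled / recovered first */
    }
    printf("writing %zu bytes\n", len);
    return 0;
}

int main(void)
{
    char data[4096] = {0};
    renew_lease();
    return submit_write(data, sizeof data);
}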

On 12/21/2015 15:47, NeilBrown wrote:
> On Tue, Dec 22 2015, Tejas Rao wrote:
>
>> What if the application is doing the locking and making sure that only 1
>> node writes to a md device at a time? Will this work? How are rebuilds
>> handled? This would be helpful with distributed filesystems like
>> GPFS/lustre etc.
>>
> You would also need to make sure that the filesystem only wrote from a
> single node at a time (or access the block device directly).  I doubt
> GPFS/lustre make any promise like that, but I'm happy to be educated.
>
> rebuilds are handled by using a cluster-wide lock to block all writes to
> a range of addresses while those stripes are repaired.
>
> NeilBrown



* Re: clustered MD - beyond RAID1
  2015-12-21 21:27       ` Tejas Rao
@ 2015-12-21 22:03         ` NeilBrown
  2015-12-21 22:29           ` Adam Goryachev
                             ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: NeilBrown @ 2015-12-21 22:03 UTC (permalink / raw)
  To: Tejas Rao, Scott Sinno, linux-raid
  Cc: Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]


On Tue, Dec 22 2015, Tejas Rao wrote:

> GPFS guarantees that only one node will write to a linux block device 
> using disk leases.

Do you have a reference to documentation explaining that?
A few moments searching the internet suggests that a "disk lease" is
much like a heart-beat.  A node uses it to say "I'm still alive, please
don't ignore me".  I could find no evidence that only one node could
hold a disk lease at any time.

NeilBrown


>                    Only a node with a disk lease has the right to submit 
> I/O and disk leases expire every 30 secs and needs to be renewed. Lustre 
> and other distributed file systems have other ways of handing this.
>
> Using md devices in a shared/clustered environment is something not 
> supported by Redhat on RHEL6 or RHEL7 kernels, so this is something we 
> would not try in our production environments.
>
> Tejas.
>
> On 12/21/2015 15:47, NeilBrown wrote:
>> On Tue, Dec 22 2015, Tejas Rao wrote:
>>
>>> What if the application is doing the locking and making sure that only 1
>>> node writes to a md device at a time? Will this work? How are rebuilds
>>> handled? This would be helpful with distributed filesystems like
>>> GPFS/lustre etc.
>>>
>> You would also need to make sure that the filesystem only wrote from a
>> single node at a time (or access the block device directly).  I doubt
>> GPFS/lustre make any promise like that, but I'm happy to be educated.
>>
>> rebuilds are handled by using a cluster-wide lock to block all writes to
>> a range of addresses while those stripes are repaired.
>>
>> NeilBrown
>


* Re: clustered MD - beyond RAID1
  2015-12-21 22:03         ` NeilBrown
@ 2015-12-21 22:29           ` Adam Goryachev
  2015-12-21 23:09             ` NeilBrown
  2015-12-22  1:36           ` Tejas Rao
       [not found]           ` <5678A2B9.6070008@bnl.gov>
  2 siblings, 1 reply; 17+ messages in thread
From: Adam Goryachev @ 2015-12-21 22:29 UTC (permalink / raw)
  To: NeilBrown, Tejas Rao, Scott Sinno, linux-raid
  Cc: Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]

On 22/12/15 09:03, NeilBrown wrote:
> On Tue, Dec 22 2015, Tejas Rao wrote:
>
>> On 12/21/2015 15:47, NeilBrown wrote:
>>> On Tue, Dec 22 2015, Tejas Rao wrote:
>>>
>>>> What if the application is doing the locking and making sure that only 1
>>>> node writes to a md device at a time? Will this work? How are rebuilds
>>>> handled? This would be helpful with distributed filesystems like
>>>> GPFS/lustre etc.
>>>>
>>> You would also need to make sure that the filesystem only wrote from a
>>> single node at a time (or access the block device directly).  I doubt
>>> GPFS/lustre make any promise like that, but I'm happy to be educated.
>>>
>>> rebuilds are handled by using a cluster-wide lock to block all writes to
>>> a range of addresses while those stripes are repaired.
>>>
>>> NeilBrown

My understanding of MD-level cross-host RAID was that it would not
magically create cluster-aware filesystems out of non-cluster-aware
filesystems, i.e. you wouldn't be able to use the same multi-host RAID
device on multiple hosts concurrently with ext3.

IMHO, if it were able to behave similarly to DRBD, then that would be
perfect (i.e. enforce that only a single node can write at a time, unless
you specifically set it for multi-node writes). The benefit should be that
you can lose a node without losing your data. After you lose that node,
you can then "do something" to use the remaining node to access the data
(e.g. mount it, export it with iSCSI/NFS, etc.).

Currently, this is what I use DRBD for; previously, I've used NBD + MD
RAID1 to do the same thing. One question, though, is what advantage
multi-host MD RAID might have over the existing in-kernel DRBD. Are
there plans which show why this is going to be better, or have better
performance, features, etc.?

Regards,
Adam

-- 
Adam Goryachev Website Managers www.websitemanagers.com.au


* Re: clustered MD - beyond RAID1
  2015-12-21 22:29           ` Adam Goryachev
@ 2015-12-21 23:09             ` NeilBrown
  0 siblings, 0 replies; 17+ messages in thread
From: NeilBrown @ 2015-12-21 23:09 UTC (permalink / raw)
  To: Adam Goryachev, Tejas Rao, Scott Sinno, linux-raid
  Cc: Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]


On Tue, Dec 22 2015, Adam Goryachev wrote:

> On 22/12/15 09:03, NeilBrown wrote:
>> On Tue, Dec 22 2015, Tejas Rao wrote:
>>
>>> On 12/21/2015 15:47, NeilBrown wrote:
>>>> On Tue, Dec 22 2015, Tejas Rao wrote:
>>>>
>>>>> What if the application is doing the locking and making sure that only 1
>>>>> node writes to a md device at a time? Will this work? How are rebuilds
>>>>> handled? This would be helpful with distributed filesystems like
>>>>> GPFS/lustre etc.
>>>>>
>>>> You would also need to make sure that the filesystem only wrote from a
>>>> single node at a time (or access the block device directly).  I doubt
>>>> GPFS/lustre make any promise like that, but I'm happy to be educated.
>>>>
>>>> rebuilds are handled by using a cluster-wide lock to block all writes to
>>>> a range of addresses while those stripes are repaired.
>>>>
>>>> NeilBrown
>
> My understanding of MD level cross host RAID was that it would not 
> magically create cluster aware filesystems out of non-cluster aware 
> filesystems. ie, you wouldn't be able to use the same multi-host RAID 
> device on multiple hosts concurrently with ext3.

This is correct.  The expectation is that clustered md/raid1 would be
used with a cluster-aware filesystem such as ocfs2 or gpfs.  Certainly
not with ext3 or similar.

>
> IMHO, if it was able to behave similar to DRBD, then that would be 
> perfect (ie, enforce only a single node can write at a time (unless you 
> specifically set it for multi-node write)). The benefit should be that 
> you can lose a node without losing your data. After you lose that node, 
> you can then "do something" to use the remaining node to access the data 
> (eg, mount it, export with iscsi/nfs, etc).

There is a lot of similarity between DRBD and clustered md/raid1.
I don't know the current state of DRBD but it initially assumed each
storage device was local to a single node and so sent data over the
network (i.e. over IP) to "remote" devices.

clustered md/raid1 assumes that all storage is equally accessible to all
nodes (over a 'storage area network', which may still be IP).

So yes: if you lose a node you should not lose functionality.

>
> Currently, this is what I use DRBD for, previously, I've used NBD + MD 
> RAID1 to do the same thing. One question though is what advantage 
> multi-host MD RAID might have over the existing in-kernel DRBD ? Are 
> there plans which show why this is going to be better, have better 
> performance, features, etc?

I'm not the driving force behind clustered md/raid1, so I am not
completely familiar with the motivation, but I believe DRBD doesn't, or
didn't, make the best possible use of the storage network when every
storage device is connected to every compute node.  It is expected that
clustered md/raid1 will.

I *think* DRBD is primarily for a pair of nodes (though there is some
multi-node support).  Clustered md/raid1 is designed to work with
multiple nodes - however big your cluster is.
(DRBD 9.0 appears to support multi-node configurations.  I haven't
researched the details.)

NeilBrown


* Re: clustered MD - beyond RAID1
  2015-12-21 22:03         ` NeilBrown
  2015-12-21 22:29           ` Adam Goryachev
@ 2015-12-22  1:36           ` Tejas Rao
  2015-12-22  2:29             ` Alireza Haghdoost
  2015-12-22  4:13             ` NeilBrown
       [not found]           ` <5678A2B9.6070008@bnl.gov>
  2 siblings, 2 replies; 17+ messages in thread
From: Tejas Rao @ 2015-12-22  1:36 UTC (permalink / raw)
  To: NeilBrown
  Cc: Scott Sinno, linux-raid, Knister,
	Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]

Each GPFS disk (block device) has a list of servers associated with it.
When the first storage server fails (expired disk lease), the storage
node is expelled and a different server which also sees the shared
storage will do the I/O.

There is a "leaseRecoveryWait" parameter which tells the filesystem
manager to wait for a few seconds to allow the expelled node to complete
any I/O in flight to the shared storage device, to avoid any out-of-order
I/O. After this wait time, the filesystem manager completes recovery on
the failed node, replaying journal logs, freeing up shared tokens/locks
etc. After the recovery is complete, a different storage node will do the
I/O. There is a concept of primary/secondary servers for a given block
device. The secondary server will only do I/O once the primary server
has failed and this has been confirmed.

See "servers=ServerList" in the man page for mmcrnsd. (I don't think I am
allowed to send web links.)

We currently have tens of petabytes in production using Linux md raid.
We are currently not sharing md devices; only hardware RAID block
devices are shared. In our experience hardware RAID controllers are
expensive. Linux raid has worked well over the years and performance is
very good, as GPFS coalesces I/O into large filesystem-blocksize blocks
(8 MB) which, if aligned properly, eliminate RMW (by doing full-stripe
writes) and the need for NVRAM (unless someone is doing POSIX fsync).

In the future, we would prefer to use Linux raid (RAID6) in a shared
environment, shielding us against server failures. Unfortunately we can
only do this after Red Hat supports such an environment with Linux raid.
Currently they do not support this even in an active/passive environment
(only one server can have an md device assembled and active, regardless).

Tejas.
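
The RMW point is easy to check arithmetically: with, for example, 8 data
disks and a 1 MiB chunk, the data stripe is 8 MiB wide, so an 8 MiB
filesystem block written at a stripe-aligned offset covers whole stripes
and the parity can be computed from the new data alone.  A small sketch
of that check (the geometry is an assumed example, not a description of
these particular arrays):

/* Check whether a write is a full-stripe write, i.e. needs no
 * read-modify-write.  Example geometry only. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define CHUNK      (1024 * 1024ULL)         /* 1 MiB chunk size          */
#define DATA_DISKS 8                        /* e.g. RAID6 as 8+2         */
#define STRIPE     (CHUNK * DATA_DISKS)     /* 8 MiB of data per stripe  */

static bool full_stripe_write(uint64_t offset, uint64_t len)
{
    /* Both the start offset and the length must be stripe-aligned. */
    return (offset % STRIPE) == 0 && (len % STRIPE) == 0 && len > 0;
}

int main(void)
{
    /* Aligned 8 MiB write: parity comes from the new data alone. */
    printf("8 MiB @ 0       -> %s\n", full_stripe_write(0, STRIPE) ? "full stripe" : "RMW");
    /* 4 KiB write: old data and parity must be read back first.   */
    printf("4 KiB @ 123 MiB -> %s\n", full_stripe_write(123 * CHUNK, 4096) ? "full stripe" : "RMW");
    return 0;
}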

On 12/21/2015 17:03, NeilBrown wrote:
> On Tue, Dec 22 2015, Tejas Rao wrote:
>
>> GPFS guarantees that only one node will write to a linux block device
>> using disk leases.
>
> Do you have a reference to documentation explaining that?
> A few moments searching the internet suggests that a "disk lease" is
> much like a heart-beat.  A node uses it to say "I'm still alive, please
> don't ignore me".  I could find no evidence that only one node could
> hold a disk lease at any time.
>
> NeilBrown
>
>
>>                     Only a node with a disk lease has the right to submit
>> I/O and disk leases expire every 30 secs and needs to be renewed. Lustre
>> and other distributed file systems have other ways of handing this.
>>
>> Using md devices in a shared/clustered environment is something not
>> supported by Redhat on RHEL6 or RHEL7 kernels, so this is something we
>> would not try in our production environments.
>>
>> Tejas.
>>
>> On 12/21/2015 15:47, NeilBrown wrote:
>>> On Tue, Dec 22 2015, Tejas Rao wrote:
>>>
>>>> What if the application is doing the locking and making sure that only 1
>>>> node writes to a md device at a time? Will this work? How are rebuilds
>>>> handled? This would be helpful with distributed filesystems like
>>>> GPFS/lustre etc.
>>>>
>>> You would also need to make sure that the filesystem only wrote from a
>>> single node at a time (or access the block device directly).  I doubt
>>> GPFS/lustre make any promise like that, but I'm happy to be educated.
>>>
>>> rebuilds are handled by using a cluster-wide lock to block all writes to
>>> a range of addresses while those stripes are repaired.
>>>
>>> NeilBrown
>>



* Re: clustered MD - beyond RAID1
       [not found]           ` <5678A2B9.6070008@bnl.gov>
@ 2015-12-22  1:50             ` Aaron Knister
  2015-12-22  2:33               ` Tejas Rao
  0 siblings, 1 reply; 17+ messages in thread
From: Aaron Knister @ 2015-12-22  1:50 UTC (permalink / raw)
  To: Tejas Rao, NeilBrown; +Cc: Scott Sinno, linux-raid


Hi Tejas et al,

I'm fairly confident in saying that GPFS can have many servers actively
writing to a given NSD (LUN) at any given time. In our production
environment the NSDs have 6 servers defined and clients more or less
write to whichever one their little hearts desire. Do you think it's
possible that the explicit primary/secondary concept is from an older
version of GPFS? I'm not sure what the locking granularity is for
NSDs/disks, but even if it's a single GPFS FS block and that block size
corresponds to the stripe width of the array, I'm pretty nervous about
relying on that assumption for data integrity :)

The use case here is creating effectively highly available block storage
from shared JBODs, for use by VMs on the servers as well as to be
exported to other nodes. The filesystem we're using for this is actually
GPFS. The intent was to use RAID6 in an active/active fashion on two
nodes sharing a common set of disks. The active/active setup was an
effort to simplify the configuration.

I'm curious now: Red Hat doesn't support SW RAID failover? I did some
googling and found this:

https://access.redhat.com/solutions/231643

While I can't read the solution, I have to figure that they're now
supporting that. I might actually explore that for this project.

-Aaron

On 12/21/15 8:09 PM, Tejas Rao wrote:
> Each GPFS disk (block device) has a list of servers associated with it.
> When the first storage server fails (expired disk lease), the storage
> node is expelled and a different server which also sees the shared
> storage will do I/O.
>
> There is a "leaseRecoveryWait" parameter which tells the filesystem
> manager to wait for few seconds to allow the expelled node to complete
> any I/O in flight to the shared storage device to avoid any out of order
> i/O. After this wait time, the filesystem manager completes recovery on
> the failed node, replaying journal logs, freeing up shared tokens/locks
> etc. After the recovery is complete a different storage node will do
> I/O. There is a concept of primary/secondary servers for a given block
> device. The secondary server will only do I/O when the primary server
> has failed and this has been confirmed.
>
> See "servers=ServerList" in man page for mmcrnsd. ( I don't think I am
> allowed to send web links)
>
> We currently have 10's of petabytes in production using linux md raid.
> We are currently not sharing md devices, only hardware raid block
> devices are shared. In our experience hardware raid controllers are
> expensive. Linux raid has worked well over the years and performance is
> very good as GPFS coalesces I/O in large filesystem blocksize blocks
> (8MB) and if aligned properly eliminate RMW (doing full stripe writes)
> and the need for NVRAM (unless someone is doing POSIX fsync).
>
> In the future ,we would prefer to use linux raid (RAID6) in a shared
> environment shielding us against server failures. Unfortunately we can
> only do this after Redhat supports such an environment with linux raid.
> Currently they do not support this even in an active/passive environment
> (only one server can have a md device assembled and active regardless).
>
> Tejas.
>
> On 12/21/2015 17:03, NeilBrown wrote:
>  > On Tue, Dec 22 2015, Tejas Rao wrote:
>  >
>  >> GPFS guarantees that only one node will write to a linux block device
>  >> using disk leases.
>  >
>  > Do you have a reference to documentation explaining that?
>  > A few moments searching the internet suggests that a "disk lease" is
>  > much like a heart-beat.  A node uses it to say "I'm still alive, please
>  > don't ignore me".  I could find no evidence that only one node could
>  > hold a disk lease at any time.
>  >
>  > NeilBrown
>

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776



* Re: clustered MD - beyond RAID1
  2015-12-22  1:36           ` Tejas Rao
@ 2015-12-22  2:29             ` Alireza Haghdoost
  2015-12-22  4:13             ` NeilBrown
  1 sibling, 0 replies; 17+ messages in thread
From: Alireza Haghdoost @ 2015-12-22  2:29 UTC (permalink / raw)
  To: Tejas Rao
  Cc: NeilBrown, Scott Sinno, Linux RAID, Knister,
	Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]

On Mon, Dec 21, 2015 at 7:36 PM, Tejas Rao <raot@bnl.gov> wrote:
>
> We currently have 10's of petabytes in production using linux md raid. We
> are currently not sharing md devices, only hardware raid block devices are
> shared. In our experience hardware raid controllers are expensive. Linux
> raid has worked well over the years and performance is very good as GPFS
> coalesces I/O in large filesystem blocksize blocks (8MB) and if aligned
> properly eliminate RMW (doing full stripe writes) and the need for NVRAM
> (unless someone is doing POSIX fsync).
>

A full-stripe write does not eliminate the need for NVRAM. Data stored
under MD RAID6 is still vulnerable to the write-hole data corruption
issue. Yes, hardware RAID controllers are expensive, but that is because
they provide more data reliability and high availability, which might
not be needed for this use case.

--Alireza
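
The write hole can be shown with plain XOR parity: if a node crashes
after a data chunk is updated but before the matching parity write
reaches disk, a later reconstruction of a different (failed) chunk from
that stale parity returns garbage.  A minimal demonstration, using toy
one-byte "chunks":

/* Minimal demonstration of the RAID5/6-style write hole with XOR parity. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t d0 = 0x11, d1 = 0x22, d2 = 0x33;      /* three data chunks */
    uint8_t p  = d0 ^ d1 ^ d2;                    /* consistent parity */

    /* Node updates d0 ... and crashes before rewriting the parity. */
    d0 = 0x99;                                    /* new data hits disk */
    /* p is now stale: it still reflects the old d0. */

    /* Later, the disk holding d1 fails and d1 is rebuilt from parity. */
    uint8_t rebuilt_d1 = p ^ d0 ^ d2;

    printf("real d1 = 0x%02x, rebuilt d1 = 0x%02x (%s)\n",
           d1, rebuilt_d1,
           d1 == rebuilt_d1 ? "ok" : "corrupted by the write hole");
    return 0;
}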


* Re: clustered MD - beyond RAID1
  2015-12-22  1:50             ` Aaron Knister
@ 2015-12-22  2:33               ` Tejas Rao
       [not found]                 ` <5678B693.40907-IGkKxAqZmp0@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Tejas Rao @ 2015-12-22  2:33 UTC (permalink / raw)
  To: Aaron Knister; +Cc: NeilBrown, Scott Sinno, linux-raid

On 12/21/2015 20:50, Aaron Knister wrote:
> Hi Tejas et al,
>
> I'm fairly confident in saying that GPFS can have many servers actively
> writing to a given NSD (LUN) at any given time. In our production
> environment the NSDs have 6 servers defined and clients more or less
> write to whichever one their little hearts desire. Do you think it's
> possible that the explicit primary/secondary concept is from an older
> version of GPFS? I'm not sure what the locking granularity is for
> NSDs/disks, but even if it's a single GPFS FS block and that block size
> corresponds to the stripe width of the array I'm pretty nervous relying
> on that assumption for data integrity :)
>
> The use case here is creating effectively highly available block storage
> from shared JBODs for use by VMs on the servers as well as to be
> exported to other nodes. The filesystem we're using for this is actually
> GPFS. The intent was to use RAID6 in an active/active fashion on two
> nodes sharing a common set of disks. The active/active was in an effort
> to simplify the configuration.

You are probably not defining the NSD parameter "servers=ServerList". If
this parameter is not defined, GPFS assumes that the disks are
SAN-attached to all the NSD nodes, and in that case there is no
primary/secondary server. Of course, there is no risk to data integrity
even if the "servers" parameter is not defined.
>
> I'm curious now, Redhat doesn't support SW raid failover? I did some
> googling and found this:
>
> https://access.redhat.com/solutions/231643
>
> While I can't read the solution I have to figure that they're now
> supporting that. I might actually explore that for this project.
https://access.redhat.com/solutions/410203
This article states that md raid is not supported in RHEL6/7 under any 
circumstances, including active/passive modes.
>
> -Aaron
>
> On 12/21/15 8:09 PM, Tejas Rao wrote:
>> Each GPFS disk (block device) has a list of servers associated with it.
>> When the first storage server fails (expired disk lease), the storage
>> node is expelled and a different server which also sees the shared
>> storage will do I/O.
>>
>> There is a "leaseRecoveryWait" parameter which tells the filesystem
>> manager to wait for few seconds to allow the expelled node to complete
>> any I/O in flight to the shared storage device to avoid any out of order
>> i/O. After this wait time, the filesystem manager completes recovery on
>> the failed node, replaying journal logs, freeing up shared tokens/locks
>> etc. After the recovery is complete a different storage node will do
>> I/O. There is a concept of primary/secondary servers for a given block
>> device. The secondary server will only do I/O when the primary server
>> has failed and this has been confirmed.
>>
>> See "servers=ServerList" in man page for mmcrnsd. ( I don't think I am
>> allowed to send web links)
>>
>> We currently have 10's of petabytes in production using linux md raid.
>> We are currently not sharing md devices, only hardware raid block
>> devices are shared. In our experience hardware raid controllers are
>> expensive. Linux raid has worked well over the years and performance is
>> very good as GPFS coalesces I/O in large filesystem blocksize blocks
>> (8MB) and if aligned properly eliminate RMW (doing full stripe writes)
>> and the need for NVRAM (unless someone is doing POSIX fsync).
>>
>> In the future ,we would prefer to use linux raid (RAID6) in a shared
>> environment shielding us against server failures. Unfortunately we can
>> only do this after Redhat supports such an environment with linux raid.
>> Currently they do not support this even in an active/passive environment
>> (only one server can have a md device assembled and active regardless).
>>
>> Tejas.
>>
>> On 12/21/2015 17:03, NeilBrown wrote:
>> > On Tue, Dec 22 2015, Tejas Rao wrote:
>> >
>> >> GPFS guarantees that only one node will write to a linux block device
>> >> using disk leases.
>> >
>> > Do you have a reference to documentation explaining that?
>> > A few moments searching the internet suggests that a "disk lease" is
>> > much like a heart-beat. A node uses it to say "I'm still alive, please
>> > don't ignore me". I could find no evidence that only one node could
>> > hold a disk lease at any time.
>> >
>> > NeilBrown
>>
>



* Re: clustered MD - beyond RAID1
  2015-12-22  1:36           ` Tejas Rao
  2015-12-22  2:29             ` Alireza Haghdoost
@ 2015-12-22  4:13             ` NeilBrown
       [not found]               ` <CAB9NSeXhoHd3_BDRrWAsBrW0Dj2=NucyUFt8pSP0zB5K=RkUOg@mail.gmail.com>
  1 sibling, 1 reply; 17+ messages in thread
From: NeilBrown @ 2015-12-22  4:13 UTC (permalink / raw)
  To: Tejas Rao
  Cc: Scott Sinno, linux-raid, Knister,
	Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]


On Tue, Dec 22 2015, Tejas Rao wrote:

> Each GPFS disk (block device) has a list of servers associated with it. 
> When the first storage server fails (expired disk lease), the storage 
> node is expelled and a different server which also sees the shared 
> storage will do I/O.

In that case something probably could be made to work with md/raid5
using much of the cluster support developed for md/raid1.

The raid5 module would take a cluster lock that covered some region of
the array and would not need to release it until a fail-over happened.
So there would be little performance penalty.

The simplest approach would be to lock the whole array.  This would
preclude the possibility of different partitions being accessed from
different nodes.  Maybe that is not a problem.  If it were, a solution
could probably be found but there would be little point searching for a
solution before a clear need was presented.

>
> In the future ,we would prefer to use linux raid (RAID6) in a shared 
> environment shielding us against server failures. Unfortunately we can 
> only do this after Redhat supports such an environment with linux raid. 
> Currently they do not support this even in an active/passive environment 
> (only one server can have a md device assembled and active regardless).

Obviously that is something you would need to discuss with Red Hat.

Thanks,
NeilBrown
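
That coarse-grained scheme keeps lock traffic off the I/O path entirely:
the lock is taken once when a node activates the array and only changes
hands on failover.  Roughly, and with acquire_array_lock() /
release_array_lock() as hypothetical placeholders rather than the
md-cluster API:

/* Sketch of the coarse whole-array lock: acquired once at activation,
 * held for the lifetime of normal I/O, released only on failover. */
#include <stdio.h>
#include <stdbool.h>

static bool we_own_array;

static void acquire_array_lock(void) { we_own_array = true;  printf("array lock acquired\n"); }
static void release_array_lock(void) { we_own_array = false; printf("array lock released\n"); }

static void activate_array(void)
{
    acquire_array_lock();       /* one cluster round trip, at assembly time */
}

static int submit_io(void)
{
    if (!we_own_array)          /* passive node: must not touch the array */
        return -1;
    /* ... normal raid5/6 I/O, no per-stripe cluster locking needed ... */
    return 0;
}

static void fail_over(void)     /* run on the node taking over */
{
    /* The failed node's lock must first be recovered/expired by the cluster. */
    acquire_array_lock();
}

int main(void)
{
    activate_array();
    submit_io();
    release_array_lock();       /* clean hand-off */
    fail_over();
    return 0;
}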


* Re: clustered MD - beyond RAID1
       [not found]                 ` <5678B693.40907-IGkKxAqZmp0@public.gmane.org>
@ 2015-12-25  8:47                   ` roger zhou
  0 siblings, 0 replies; 17+ messages in thread
From: roger zhou @ 2015-12-25  8:47 UTC (permalink / raw)
  To: Tejas Rao, Aaron Knister
  Cc: NeilBrown, linux-raid-u79uwXL29TY76Z2rM5mHXA, Scott Sinno,
	users-iJOQKD3IPJ/s2ve5o0e2fw




On 12/22/2015 10:33 AM, Tejas Rao wrote:
> On 12/21/2015 20:50, Aaron Knister wrote:
>
[...]
>>
>> I'm curious now, Redhat doesn't support SW raid failover? I did some
>> googling and found this:
>>
>> https://access.redhat.com/solutions/231643
>>
>> While I can't read the solution I have to figure that they're now
>> supporting that. I might actually explore that for this project.
> https://access.redhat.com/solutions/410203
> This article states that md raid is not supported in RHEL6/7 under any 
> circumstances, including active/passive modes.

OCFS2 or GFS2 (and likewise GPFS, as the shared filesystem) over shared
storage is a typical cluster configuration for Linux high availability.
There, clustered LVM (cLVM) is supported by both SUSE and Red Hat to do
mirroring to protect the data. However, the performance loss is very big
and makes people unhappy with this clustered mirror solution. That is
where the motivation for the clustered MD solution comes from.

With clustered md, this new solution can provide nearly the same
performance as native raid1. You may be interested in validating this in
your lab with your configuration ;)

Cheers,
Roger





* Re: clustered MD - beyond RAID1
       [not found]               ` <CAB9NSeXhoHd3_BDRrWAsBrW0Dj2=NucyUFt8pSP0zB5K=RkUOg@mail.gmail.com>
@ 2016-12-05  1:46                 ` Aaron Knister
  0 siblings, 0 replies; 17+ messages in thread
From: Aaron Knister @ 2016-12-05  1:46 UTC (permalink / raw)
  To: Robert Woodworth, NeilBrown, linux-raid; +Cc: Tejas Rao, Scott Sinno

Hi Robert,

I don't know the answer to the question, but I had to refresh my memory 
on the issue and here are my thoughts on it:

I dug it up, and the issue with clustered RAID5/6 is one of stripe
locking (http://www.spinics.net/lists/raid/msg51020.html). If two
different nodes each write, say, 4k to two different locations within a
128k stripe, there's a race condition: best case you'll lose one of the
writes, worst case you'd entirely corrupt the stripe. As was pointed out
on the thread I linked to, the performance implications of locking a
stripe could be dire. If you'd like to do 1 GiB/s to an array (not
unreasonable for even a modest drive count) with a chunk size of 128 KiB,
that's 8,192 locks per second.

The approach I was going to take (but I'm no longer working on the
project that required it) was to use Pacemaker to coordinate failover
of the md devices between nodes. Pacemaker would coordinate the
"serving" of the LUNs via SCST and on the passive node would present a
pair of "dummy" devices. ALUA would be used to inform the clients of
which was the active path. On the clients, multipath would be required to
pull it all together. I had started work on the Pacemaker failover of
the SCST LUNs and have some patches to the available OCF resource agents
from the progress I did make. I can send those to you if you'd like. It's
rather precarious, and without extensive testing I don't trust it not to
eat all your data.

The only thing I know of that could, out of the box, perhaps do what
you're asking is Raidix (http://www.raidix.com/), and that's based purely
on advertised specifications.

I agree with you, though: if clustered md RAID6/60 could be made to
work and be performant, I think it would be an asset to the Linux
community.

-Aaron
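
The 8,192 figure is just the throughput divided by the chunk size, and
the same arithmetic shows how a coarser locking granularity brings the
rate down (the 8-disk stripe width below is only an assumed example):

/* Lock-rate arithmetic from the example above: locks per second needed
 * if every chunk-sized write to the array takes one cluster lock. */
#include <stdio.h>

int main(void)
{
    const double throughput = 1024.0 * 1024 * 1024;   /* 1 GiB/s            */
    const double chunk      = 128.0 * 1024;           /* 128 KiB chunk      */
    const int    data_disks = 8;                      /* example width only */

    printf("per-chunk locking : %.0f locks/s\n", throughput / chunk);
    printf("per-stripe locking: %.0f locks/s\n", throughput / (chunk * data_disks));
    return 0;
}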

On 12/2/16 1:03 PM, Robert Woodworth wrote:
> Excuse me for being late to the party on this subject, but is the idea
> of clustered RAID5/6 alive or dead?
>
> I have a need for such a feature. I'm in development on SAS JBODs with
> large drive counts, 60 and 90 drives per JBOD. We would like to support
> multi-host connectivity in an active/active fashion with MD RAID60.
> This clustered MD RAID can and should be a nice alternative to HW RAID
> solutions like LSI/Avago "Syncro" MegaRAID.
>
> I currently have the hardware and time to help develop and test the
> clustered RAID5/6.
> I just finished up building a test cluster of 2 nodes with the
> cluster-md RAID1.  Worked fine with gfs2 on top.
>
>
> My current real job is firmware on these SAS JBODS. I have many years of
> Linux experience and have developed (years ago) some kernel modules for
> a custom FPGA based PCIe cards.
>
>
> On Mon, Dec 21, 2015 at 9:13 PM, NeilBrown <neilb@suse.de
> <mailto:neilb@suse.de>> wrote:
>
>     On Tue, Dec 22 2015, Tejas Rao wrote:
>
>     > Each GPFS disk (block device) has a list of servers associated with it.
>     > When the first storage server fails (expired disk lease), the storage
>     > node is expelled and a different server which also sees the shared
>     > storage will do I/O.
>
>     In that case something probably could be made to work with md/raid5
>     using much of the cluster support developed for md/raid1.
>
>     The raid5 module would take a cluster lock that covered some region of
>     the array and would not need to release it until a fail-over happened.
>     So there would be little performance penalty.
>
>     The simplest approach would be to lock the whole array.  This would
>     preclude the possibility of different partitions being accessed from
>     different nodes.  Maybe that is not a problem.  If it were, a solution
>     could probably be found but there would be little point searching for a
>     solution before a clear need was presented.
>
>     >
>     > In the future ,we would prefer to use linux raid (RAID6) in a shared
>     > environment shielding us against server failures. Unfortunately we can
>     > only do this after Redhat supports such an environment with linux raid.
>     > Currently they do not support this even in an active/passive environment
>     > (only one server can have a md device assembled and active regardless).
>
>     Obviously that is something you would need to discuss with Redhat.
>
>     Thanks,
>     NeilBrown
>
>

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776


* Re: clustered MD - beyond RAID1
  2016-12-02 18:12 Robert Woodworth
@ 2016-12-02 20:02 ` Shaohua Li
  0 siblings, 0 replies; 17+ messages in thread
From: Shaohua Li @ 2016-12-02 20:02 UTC (permalink / raw)
  To: Robert Woodworth; +Cc: linux-raid

On Fri, Dec 02, 2016 at 11:12:52AM -0700, Robert Woodworth wrote:
> Excuse me for being late to the party on this subject, but is the idea of
> clustered RAID5/6 alive or dead?
> 
> I have a need for such a feature. I'm in development on SAS JBODs with
> large drive counts, 60 and 90 drives per JBOD. We would like to support
> multi-host connectivity in an active/active fashion with MD RAID60.  This
> clustered MD RAID can and should be a nice alternative to HW RAID solutions
> like LSI/Avago "Syncro" MegaRAID.
> 
> I currently have the hardware and time to help develop and test the
> clustered RAID5/6.
> I just finished up building a test cluster of 2 nodes with the cluster-md
> RAID1.  Worked fine with gfs2 on top.
> 
> 
> My current real job is firmware on these SAS JBODS. I have many years of
> Linux experience and have developed (years ago) some kernel modules for a
> custom FPGA based PCIe cards.

It makes a lot of sense to me, and there is no reason we shouldn't
support it, especially since you have real usage for it. If anybody wants
to implement it, I'm very glad to help and review patches.

Thanks,
Shaohua


* clustered MD - beyond RAID1
@ 2016-12-02 18:12 Robert Woodworth
  2016-12-02 20:02 ` Shaohua Li
  0 siblings, 1 reply; 17+ messages in thread
From: Robert Woodworth @ 2016-12-02 18:12 UTC (permalink / raw)
  To: linux-raid

Excuse me for being late to the party on this subject, but is the idea of
clustered RAID5/6 alive or dead?

I have a need for such a feature. I'm in development on SAS JBODs with
large drive counts, 60 and 90 drives per JBOD. We would like to support
multi-host connectivity in an active/active fashion with MD RAID60.  This
clustered MD RAID can and should be a nice alternative to HW RAID solutions
like LSI/Avago "Syncro" MegaRAID.

I currently have the hardware and time to help develop and test
clustered RAID5/6.
I just finished building a test cluster of 2 nodes with cluster-md
RAID1.  It worked fine with gfs2 on top.


My current real job is firmware on these SAS JBODs. I have many years of
Linux experience and have developed (years ago) some kernel modules for
custom FPGA-based PCIe cards.

