linux-btrfs.vger.kernel.org archive mirror
* [LSF/MM TOPIC] Software RAID Support for NV-DIMM
@ 2019-02-15  9:57 Johannes Thumshirn
  2019-02-15 16:34 ` Dan Williams
  2019-02-16  5:31 ` Dave Chinner
  0 siblings, 2 replies; 10+ messages in thread
From: Johannes Thumshirn @ 2019-02-15  9:57 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-block, linux-fsdevel, linux-nvdimm, linux-btrfs, hare

(This is a joint proposal with Hannes Reinecke)

Servers with NV-DIMMs are slowly emerging in data centers, but one key
reliability feature of these systems hasn't been addressed so far: data
redundancy.

While it would be best to solve this issue in the memory controller of the CPU
itself, I don't see this coming in the next few years. This puts the burden on
us, the OS, to create the redundant copies of data for the users.

If we leave DAX support aside, Linux's software RAID implementations (MD,
device-mapper and BTRFS RAID) already work on top of pmem devices, but they
are incompatible with DAX.

In this session Hannes and I would like to discuss possible ways in which we,
as an operating system, can mitigate these issues for our users.

Byte,
	Johannes

-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* Re: [LSF/MM TOPIC] Software RAID Support for NV-DIMM
  2019-02-15  9:57 [LSF/MM TOPIC] Software RAID Support for NV-DIMM Johannes Thumshirn
@ 2019-02-15 16:34 ` Dan Williams
  2019-02-16  5:31 ` Dave Chinner
  1 sibling, 0 replies; 10+ messages in thread
From: Dan Williams @ 2019-02-15 16:34 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: lsf-pc, linux-block, linux-fsdevel, Hannes Reinecke, linux-btrfs,
	linux-nvdimm

On Fri, Feb 15, 2019 at 1:57 AM Johannes Thumshirn <jthumshirn@suse.de> wrote:
>
> (This is a joint proposal with Hannes Reinecke)
>
> Servers with NV-DIMM are slowly emerging in data centers but one key feature
> for reliability of these systems hasn't been addressed up to now, data
> redundancy.
>
> While it would be best to solve this issue in the memory controller of the CPU
> itself, I don't see this coming in the next few years. This puts us as the OS
> in the burden to create the redundant copies of data for the users.
>
> If we leave of the DAX support Linux' software RAID implementations (MD,
> device-mapper and BTRFS RAID) do already work on top of pmem devices, but they
> are incompatible with DAX.
>
> In this session Hannes and I would like to discuss eventual ways how we as an
> operating system can mitigate these issues for our users.

One feature request I have heard in this space is to at least have
filesystem metadata redundancy for DAX. For applications that can
handle their own data-replication the single-point-of-failure FS
metadata becomes a larger liability.


* Re: [LSF/MM TOPIC] Software RAID Support for NV-DIMM
  2019-02-15  9:57 [LSF/MM TOPIC] Software RAID Support for NV-DIMM Johannes Thumshirn
  2019-02-15 16:34 ` Dan Williams
@ 2019-02-16  5:31 ` Dave Chinner
  2019-02-16  5:39   ` Dave Chinner
  1 sibling, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2019-02-16  5:31 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: lsf-pc, linux-block, linux-fsdevel, linux-nvdimm, linux-btrfs, hare

On Fri, Feb 15, 2019 at 10:57:12AM +0100, Johannes Thumshirn wrote:
> (This is a joint proposal with Hannes Reinecke)
> 
> Servers with NV-DIMM are slowly emerging in data centers but one key feature
> for reliability of these systems hasn't been addressed up to now, data
> redundancy.
> 
> While it would be best to solve this issue in the memory controller of the CPU
> itself, I don't see this coming in the next few years. This puts us as the OS
> in the burden to create the redundant copies of data for the users.
> 
> If we leave of the DAX support Linux' software RAID implementations (MD,
> device-mapper and BTRFS RAID) do already work on top of pmem devices, but they
> are incompatible with DAX.
> 
> In this session Hannes and I would like to discuss eventual ways how we as an
> operating system can mitigate these issues for our users.

We've supported this since mid 2018 and commit ba23cba9b3bd ("fs:
allow per-device dax status checking for filesystems"). That is,
we can have DAX on the XFS RT device independently of the data device.

That is, you set up pmem in three segments - two small identical
segments that get mirrored with RAID1 as the data device, and
the remainder as a DAX-capable block device set up as the
XFS realtime device. Set the RTINHERIT bit on the root directory at
mkfs time ("-d rtinherit=1") and then all the data goes to the
DAX-capable realtime device, and all the metadata goes to the
software-RAIDed pmem block devices that aren't DAX capable.

Problem already solved, yes?
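
For illustration, the three-segment layout described above could be
assembled roughly as follows. This is only a sketch under stated
assumptions: the device names (/dev/pmem0 and /dev/pmem1 for the two small
segments, /dev/pmem2 for the remainder), the /dev/md0 array name and the
mount point are hypothetical, and the script only prints the commands
unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: mirrored XFS metadata on md RAID1, file data on a DAX-capable
# realtime device. Device names and mount point are assumptions.
META_A=${META_A:-/dev/pmem0}   # first small segment (assumed name)
META_B=${META_B:-/dev/pmem1}   # second small segment (assumed name)
RTDEV=${RTDEV:-/dev/pmem2}     # large remainder, used as XFS rt device
MD=${MD:-/dev/md0}             # RAID1 array holding the XFS data device
MNT=${MNT:-/mnt/scratch}
DRY_RUN=${DRY_RUN:-1}          # default: print commands, do not run them

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

setup_layout() {
    # Mirror the two small segments; this becomes the (non-DAX) data device.
    run mdadm --create "$MD" --level=1 --raid-devices=2 "$META_A" "$META_B"
    # Put file data on the rt device and make the root dir inherit the rt bit.
    run mkfs.xfs -f -r "rtdev=$RTDEV,extsize=2m" -d rtinherit=1 "$MD"
    run mount -o "dax,rtdev=$RTDEV" "$MD" "$MNT"
}

setup_layout
```

With the defaults this prints the three commands without touching any
device, which makes it easy to review the plan before running it for real.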

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [LSF/MM TOPIC] Software RAID Support for NV-DIMM
  2019-02-16  5:31 ` Dave Chinner
@ 2019-02-16  5:39   ` Dave Chinner
  2019-02-16  8:16     ` Bob Liu
                       ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Dave Chinner @ 2019-02-16  5:39 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: lsf-pc, linux-block, linux-fsdevel, linux-nvdimm, linux-btrfs, hare

On Sat, Feb 16, 2019 at 04:31:33PM +1100, Dave Chinner wrote:
> On Fri, Feb 15, 2019 at 10:57:12AM +0100, Johannes Thumshirn wrote:
> > (This is a joint proposal with Hannes Reinecke)
> > 
> > Servers with NV-DIMM are slowly emerging in data centers but one key feature
> > for reliability of these systems hasn't been addressed up to now, data
> > redundancy.
> > 
> > While it would be best to solve this issue in the memory controller of the CPU
> > itself, I don't see this coming in the next few years. This puts us as the OS
> > in the burden to create the redundant copies of data for the users.
> > 
> > If we leave of the DAX support Linux' software RAID implementations (MD,
> > device-mapper and BTRFS RAID) do already work on top of pmem devices, but they
> > are incompatible with DAX.
> > 
> > In this session Hannes and I would like to discuss eventual ways how we as an
> > operating system can mitigate these issues for our users.
> 
> We've supported this since mid 2018 and commit ba23cba9b3bd ("fs:
> allow per-device dax status checking for filesystems"). That is,
> we can have DAX on the XFS RT device indepently of the data device.
> 
> That is, you set up pmem in three segments - two small identical
> segments start get mirrored with RAID1 as the data device, and
> the remainder as a block device that is dax capable set up as the
> XFS realtime device. Set the RTINHERIT bit on the root directory at
> mkfs time ("-d rtinherit=1") and then all the data goes to the DAX
> capable realtime device, and all the metadata goes to the software
> raided pmem block devices that aren't DAX capable.
> 
> Problem already solved, yes?

Sorry, this was meant to be a reply to Dan's email commenting about
some people needing mirrored metadata, not the parent that was
talking about whole device RAID...

i.e. mirrored metadata w/ FS-DAX for data should already be a solved
problem...

Cheers,

Dave.
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

-- 
Dave Chinner
david@fromorbit.com


* Re: [LSF/MM TOPIC] Software RAID Support for NV-DIMM
  2019-02-16  5:39   ` Dave Chinner
@ 2019-02-16  8:16     ` Bob Liu
  2019-02-16 17:05     ` Dan Williams
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Bob Liu @ 2019-02-16  8:16 UTC (permalink / raw)
  To: Dave Chinner, Johannes Thumshirn
  Cc: lsf-pc, linux-block, linux-fsdevel, linux-nvdimm, linux-btrfs, hare

On 2/16/19 1:39 PM, Dave Chinner wrote:
> On Sat, Feb 16, 2019 at 04:31:33PM +1100, Dave Chinner wrote:
>> On Fri, Feb 15, 2019 at 10:57:12AM +0100, Johannes Thumshirn wrote:
>>> (This is a joint proposal with Hannes Reinecke)
>>>
>>> Servers with NV-DIMM are slowly emerging in data centers but one key feature
>>> for reliability of these systems hasn't been addressed up to now, data
>>> redundancy.
>>>
>>> While it would be best to solve this issue in the memory controller of the CPU
>>> itself, I don't see this coming in the next few years. This puts us as the OS
>>> in the burden to create the redundant copies of data for the users.
>>>
>>> If we leave of the DAX support Linux' software RAID implementations (MD,
>>> device-mapper and BTRFS RAID) do already work on top of pmem devices, but they
>>> are incompatible with DAX.
>>>
>>> In this session Hannes and I would like to discuss eventual ways how we as an
>>> operating system can mitigate these issues for our users.
>>
>> We've supported this since mid 2018 and commit ba23cba9b3bd ("fs:
>> allow per-device dax status checking for filesystems"). That is,
>> we can have DAX on the XFS RT device indepently of the data device.
>>
>> That is, you set up pmem in three segments - two small identical
>> segments start get mirrored with RAID1 as the data device, and
>> the remainder as a block device that is dax capable set up as the
>> XFS realtime device. Set the RTINHERIT bit on the root directory at
>> mkfs time ("-d rtinherit=1") and then all the data goes to the DAX
>> capable realtime device, and all the metadata goes to the software
>> raided pmem block devices that aren't DAX capable.
>>
>> Problem already solved, yes?
> 
> Sorry, this was meant to be a reply to Dan's email commenting about
> some people needing mirrored metadata, not the parent that was
> talking about whole device RAID...
> 
> i.e. mirrored metadata w/ FS-DAX for data should already be a solved
> problem...
> 

Indeed, here is v2 of the mirrored metadata retry work:
https://marc.info/?l=linux-block&m=155005161104512&w=2
Any reviews are appreciated, thank you!

- Bob



* Re: [LSF/MM TOPIC] Software RAID Support for NV-DIMM
  2019-02-16  5:39   ` Dave Chinner
  2019-02-16  8:16     ` Bob Liu
@ 2019-02-16 17:05     ` Dan Williams
  2019-02-16 23:00       ` Dave Chinner
  2019-02-18 10:50     ` Johannes Thumshirn
       [not found]     ` <d7037b76-8bbe-412d-387a-4e27db26b005@oracle.com>
  3 siblings, 1 reply; 10+ messages in thread
From: Dan Williams @ 2019-02-16 17:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Johannes Thumshirn, linux-nvdimm, linux-block, Hannes Reinecke,
	linux-fsdevel, lsf-pc, linux-btrfs

On Fri, Feb 15, 2019 at 9:40 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sat, Feb 16, 2019 at 04:31:33PM +1100, Dave Chinner wrote:
> > On Fri, Feb 15, 2019 at 10:57:12AM +0100, Johannes Thumshirn wrote:
> > > (This is a joint proposal with Hannes Reinecke)
> > >
> > > Servers with NV-DIMM are slowly emerging in data centers but one key feature
> > > for reliability of these systems hasn't been addressed up to now, data
> > > redundancy.
> > >
> > > While it would be best to solve this issue in the memory controller of the CPU
> > > itself, I don't see this coming in the next few years. This puts us as the OS
> > > in the burden to create the redundant copies of data for the users.
> > >
> > > If we leave of the DAX support Linux' software RAID implementations (MD,
> > > device-mapper and BTRFS RAID) do already work on top of pmem devices, but they
> > > are incompatible with DAX.
> > >
> > > In this session Hannes and I would like to discuss eventual ways how we as an
> > > operating system can mitigate these issues for our users.
> >
> > We've supported this since mid 2018 and commit ba23cba9b3bd ("fs:
> > allow per-device dax status checking for filesystems"). That is,
> > we can have DAX on the XFS RT device indepently of the data device.
> >
> > That is, you set up pmem in three segments - two small identical
> > segments start get mirrored with RAID1 as the data device, and
> > the remainder as a block device that is dax capable set up as the
> > XFS realtime device. Set the RTINHERIT bit on the root directory at
> > mkfs time ("-d rtinherit=1") and then all the data goes to the DAX
> > capable realtime device, and all the metadata goes to the software
> > raided pmem block devices that aren't DAX capable.
> >
> > Problem already solved, yes?
>
> Sorry, this was meant to be a reply to Dan's email commenting about
> some people needing mirrored metadata, not the parent that was
> talking about whole device RAID...
>
> i.e. mirrored metadata w/ FS-DAX for data should already be a solved
> problem...

Ah true, thanks for the clarification. I'll give it a try. The last
time I looked, RT configurations failed with DAX, but perhaps that's
been fixed and I can drop it from my list of broken DAX items.


* Re: [LSF/MM TOPIC] Software RAID Support for NV-DIMM
  2019-02-16 17:05     ` Dan Williams
@ 2019-02-16 23:00       ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2019-02-16 23:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: Johannes Thumshirn, linux-nvdimm, linux-block, Hannes Reinecke,
	linux-fsdevel, lsf-pc, linux-btrfs

On Sat, Feb 16, 2019 at 09:05:31AM -0800, Dan Williams wrote:
> On Fri, Feb 15, 2019 at 9:40 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Sat, Feb 16, 2019 at 04:31:33PM +1100, Dave Chinner wrote:
> > > On Fri, Feb 15, 2019 at 10:57:12AM +0100, Johannes Thumshirn wrote:
> > > > (This is a joint proposal with Hannes Reinecke)
> > > >
> > > > Servers with NV-DIMM are slowly emerging in data centers but one key feature
> > > > for reliability of these systems hasn't been addressed up to now, data
> > > > redundancy.
> > > >
> > > > While it would be best to solve this issue in the memory controller of the CPU
> > > > itself, I don't see this coming in the next few years. This puts us as the OS
> > > > in the burden to create the redundant copies of data for the users.
> > > >
> > > > If we leave of the DAX support Linux' software RAID implementations (MD,
> > > > device-mapper and BTRFS RAID) do already work on top of pmem devices, but they
> > > > are incompatible with DAX.
> > > >
> > > > In this session Hannes and I would like to discuss eventual ways how we as an
> > > > operating system can mitigate these issues for our users.
> > >
> > > We've supported this since mid 2018 and commit ba23cba9b3bd ("fs:
> > > allow per-device dax status checking for filesystems"). That is,
> > > we can have DAX on the XFS RT device indepently of the data device.
> > >
> > > That is, you set up pmem in three segments - two small identical
> > > segments start get mirrored with RAID1 as the data device, and
> > > the remainder as a block device that is dax capable set up as the
> > > XFS realtime device. Set the RTINHERIT bit on the root directory at
> > > mkfs time ("-d rtinherit=1") and then all the data goes to the DAX
> > > capable realtime device, and all the metadata goes to the software
> > > raided pmem block devices that aren't DAX capable.
> > >
> > > Problem already solved, yes?
> >
> > Sorry, this was meant to be a reply to Dan's email commenting about
> > some people needing mirrored metadata, not the parent that was
> > talking about whole device RAID...
> >
> > i.e. mirrored metadata w/ FS-DAX for data should already be a solved
> > problem...
> 
> Ah true, thanks for the clarification. I'll give it a try, the last
> time I looked RT configurations failed with DAX, but perhaps that's
> been fixed and I can drop if from my list of broken DAX items.

It should work. The whole reason for DAX on rt devices is that we
can guarantee PMD sized and aligned allocations for all user data
with the RT device (i.e. using "-r extsize=<PMD_SIZE>" mkfs option)
so it's nearly equivalent in capability compared to using device dax
directly. We can't guarantee such alignment with the data device as
extent size hints are, well, just hints and it will fall back to
smaller allocations if it's too difficult to find PMD aligned free
space...

$ sudo mkfs.xfs -f -r rtdev=/dev/pmem1,extsize=2m -d rtinherit=1 /dev/pmem0
meta-data=/dev/pmem0             isize=512    agcount=4, agsize=524288 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=2097152, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =/dev/pmem1             extsz=2097152 blocks=2097152, rtextents=4096
$ sudo mount -o dax,rtdev=/dev/pmem1 /dev/pmem0 /mnt/scratch
$ sudo dmesg |tail -3
XFS (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
XFS (pmem0): Mounting V5 Filesystem
XFS (pmem0): Ending clean mount
$

Yup, DAX is enabled on the filesystem.

$ sudo xfs_io -c stat /mnt/scratch
....
fsxattr.xflags = 0x100 [-------t--------]
....
$

The root dir is configured to put all new files on the rt device.
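
As a side note on reading those xflags values: the following sketch decodes
the bits relevant to this thread. The constants are the FS_XFLAG_* values
from <linux/fs.h>; the helper name decode_xflags is hypothetical and only
covers three of the defined flags.

```shell
#!/bin/sh
# Decode a subset of fsxattr.xflags bits (values from <linux/fs.h>).
FS_XFLAG_REALTIME=$((0x1))      # 'r' in xfs_io: data lives on the rt device
FS_XFLAG_RTINHERIT=$((0x100))   # 't' in xfs_io: new files inherit the rt bit
FS_XFLAG_DAX=$((0x8000))        # per-inode DAX flag

# decode_xflags is a hypothetical helper, not part of xfsprogs.
decode_xflags() {
    flags=$(($1))
    out=""
    if [ $((flags & FS_XFLAG_REALTIME)) -ne 0 ]; then out="$out realtime"; fi
    if [ $((flags & FS_XFLAG_RTINHERIT)) -ne 0 ]; then out="$out rtinherit"; fi
    if [ $((flags & FS_XFLAG_DAX)) -ne 0 ]; then out="$out dax"; fi
    echo "${out# }"
}

decode_xflags 0x100   # the root directory in this thread: rtinherit
decode_xflags 0x1     # a file created under it: realtime
```

The per-inode DAX bit is not expected to be set here, since DAX is enabled
through the "dax" mount option rather than per file.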

$ sudo xfs_io -f -c "pwrite 0 1m" -c stat -c "bmap -vp" /mnt/scratch/foo
wrote 1048576/1048576 bytes at offset 0
1 MiB, 256 ops; 0.0029 sec (338.983 MiB/sec and 86779.6610 ops/sec)
fd.path = "/mnt/scratch/foo"
fd.flags = non-sync,non-direct,read-write
stat.ino = 131
stat.type = regular file
stat.size = 1048576
stat.blocks = 4096
fsxattr.xflags = 0x1 [r---------------]
fsxattr.projid = 0
fsxattr.extsize = 0
fsxattr.cowextsize = 0
fsxattr.nextents = 1
fsxattr.naextents = 0
dioattr.mem = 0x200
dioattr.miniosz = 512
dioattr.maxiosz = 2147483136
/mnt/scratch/foo:
 EXT: FILE-OFFSET      RT-BLOCK-RANGE     TOTAL FLAGS
   0: [0..4095]:       0..4095             4096 000000
$

Yup, /mnt/scratch/foo is on the rt device, it's got a 2MB sized and
aligned extent allocated to it, and DAX is enabled.
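
The numbers in the output above can be cross-checked with a little
arithmetic. This is a sketch that assumes the usual units: bmap ranges and
stat.blocks are in 512-byte blocks, and the PMD size is 2 MiB as on x86-64.

```shell
#!/bin/sh
# Cross-check the mkfs.xfs and "bmap -vp" numbers quoted in this message.
FS_BLOCK=4096        # bsize reported by mkfs.xfs
RT_BLOCKS=2097152    # realtime "blocks=" reported by mkfs.xfs
RT_EXTSZ=2097152     # realtime "extsz=" in bytes (extsize=2m)
BASIC_BLOCK=512      # unit of bmap ranges and stat.blocks (assumed)

# Number of rt extents: rt device size divided by the rt extent size.
rtextents=$((RT_BLOCKS * FS_BLOCK / RT_EXTSZ))

# bmap shows RT-BLOCK-RANGE 0..4095, i.e. 4096 basic blocks.
extent_bytes=$((4096 * BASIC_BLOCK))

echo "rtextents=$rtextents"
echo "extent_bytes=$extent_bytes"
if [ $((extent_bytes % RT_EXTSZ)) -eq 0 ]; then
    echo "extent is a whole number of 2 MiB rt extents"
fi
```

So the 1 MiB write was backed by one full 2 MiB rt extent starting at rt
block 0, which is why stat.blocks reports 4096 rather than 2048.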

So it looks to me like this all just works fine.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [LSF/MM TOPIC] Software RAID Support for NV-DIMM
  2019-02-16  5:39   ` Dave Chinner
  2019-02-16  8:16     ` Bob Liu
  2019-02-16 17:05     ` Dan Williams
@ 2019-02-18 10:50     ` Johannes Thumshirn
  2019-02-18 18:27       ` Dan Williams
       [not found]     ` <d7037b76-8bbe-412d-387a-4e27db26b005@oracle.com>
  3 siblings, 1 reply; 10+ messages in thread
From: Johannes Thumshirn @ 2019-02-18 10:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: lsf-pc, linux-block, linux-fsdevel, linux-nvdimm, linux-btrfs, hare

On 16/02/2019 06:39, Dave Chinner wrote:
[..]

>> We've supported this since mid 2018 and commit ba23cba9b3bd ("fs:
>> allow per-device dax status checking for filesystems"). That is,
>> we can have DAX on the XFS RT device indepently of the data device.
>>
>> That is, you set up pmem in three segments - two small identical
>> segments start get mirrored with RAID1 as the data device, and
>> the remainder as a block device that is dax capable set up as the
>> XFS realtime device. Set the RTINHERIT bit on the root directory at
>> mkfs time ("-d rtinherit=1") and then all the data goes to the DAX
>> capable realtime device, and all the metadata goes to the software
>> raided pmem block devices that aren't DAX capable.
>>
>> Problem already solved, yes?
> 
> Sorry, this was meant to be a reply to Dan's email commenting about
> some people needing mirrored metadata, not the parent that was
> talking about whole device RAID...
> 
> i.e. mirrored metadata w/ FS-DAX for data should already be a solved
> problem...

Trying to answer you both.

But deferring the data redundancy to the application sounds like a no-go
to me, sorry. We don't do that for "traditional" block storage (SCSI,
NVMe, you name it). Some applications might already be able to handle it
but definitely not all. I don't see your random DBMS like MariaDB or
Postgres already doing data duplication over interleave sets of NV-DIMMs.

And if you carve out a bit of your pmem space into its own namespace for
the metadata (did I understand you right here?) you still have the
problem that all data written to the DIMMs is interleaved across an
interleave set, if I understand it correctly.

So if one DIMM in your interleave set goes bad, you're lost anyways.

Byte,
	Johannes
-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* Re: [LSF/MM TOPIC] Software RAID Support for NV-DIMM
  2019-02-18 10:50     ` Johannes Thumshirn
@ 2019-02-18 18:27       ` Dan Williams
  0 siblings, 0 replies; 10+ messages in thread
From: Dan Williams @ 2019-02-18 18:27 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Dave Chinner, linux-nvdimm, linux-block, Hannes Reinecke,
	linux-fsdevel, lsf-pc, linux-btrfs

On Mon, Feb 18, 2019 at 2:50 AM Johannes Thumshirn <jthumshirn@suse.de> wrote:
>
> On 16/02/2019 06:39, Dave Chinner wrote:
> [..]
>
> >> We've supported this since mid 2018 and commit ba23cba9b3bd ("fs:
> >> allow per-device dax status checking for filesystems"). That is,
> >> we can have DAX on the XFS RT device indepently of the data device.
> >>
> >> That is, you set up pmem in three segments - two small identical
> >> segments start get mirrored with RAID1 as the data device, and
> >> the remainder as a block device that is dax capable set up as the
> >> XFS realtime device. Set the RTINHERIT bit on the root directory at
> >> mkfs time ("-d rtinherit=1") and then all the data goes to the DAX
> >> capable realtime device, and all the metadata goes to the software
> >> raided pmem block devices that aren't DAX capable.
> >>
> >> Problem already solved, yes?
> >
> > Sorry, this was meant to be a reply to Dan's email commenting about
> > some people needing mirrored metadata, not the parent that was
> > talking about whole device RAID...
> >
> > i.e. mirrored metadata w/ FS-DAX for data should already be a solved
> > problem...
>
> Trying to answer you both.
>
> But deferring the data redundancy to the application sounds like a no-go
> to me, sorry. We don't do that for "traditional" block storage (SCSI,
> NVMe, you name it). Some applications might already be able to handle it
> but definitively not all. I don't see your random DBMS like MariaDB or
> Postgres already doing data duplication over interleave sets of NV-DIMMs.

Oh, definitely agreed. I was just saying for the subset of
applications that *do* perform application level redundancy the lack
of metadata redundancy was a liability.

> And if you carve out a bit of your pmem space into an own namespace for
> the metadata (did I understand you right here?) you still have the
> problem that all data written to the DIMMs is interleaved in an
> interleave set, if I understand it correctly.
>
> So if one DIMM in your interleave set goes bad, you're lost anyways.

Yes, if you want to be able to survive the loss of a single DIMM then
you need to disable interleaving and RAID across the DIMMs. However,
once you do that, DAX for data can't work by definition, but RAID for
metadata would work.
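
To make the interleave concern concrete, here is a toy model of how
consecutive chunks of an interleaved region rotate across DIMMs. It is only
a sketch: NDIMMS and GRANULE are hypothetical values, and real platforms
use their own interleave granularity and address decoding.

```shell
#!/bin/sh
# Toy model of an interleave set; not how any specific platform decodes
# addresses. NDIMMS and GRANULE are hypothetical.
NDIMMS=${NDIMMS:-4}
GRANULE=${GRANULE:-4096}   # bytes per interleave chunk (assumed)

# Which DIMM backs a given byte offset?
dimm_for_offset() {
    echo $(( ($1 / GRANULE) % NDIMMS ))
}

# Which DIMMs back the byte range [start, start+len)?
dimms_for_range() {
    start=$1
    len=$2
    first=$((start / GRANULE))
    last=$(((start + len - 1) / GRANULE))
    seen=""
    chunk=$first
    # After NDIMMS consecutive chunks every DIMM has been visited once.
    while [ "$chunk" -le "$last" ] && [ $((chunk - first)) -lt "$NDIMMS" ]; do
        seen="$seen $((chunk % NDIMMS))"
        chunk=$((chunk + 1))
    done
    echo "${seen# }"
}

dimms_for_range 0 2097152   # one 2 MiB extent touches DIMMs 0 1 2 3
```

In this model any extent larger than a few chunks is spread over every
DIMM in the set, so losing one DIMM damages essentially every file, which
is the situation described above.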


* Re: [LSF/MM TOPIC] Software RAID Support for NV-DIMM
       [not found]     ` <d7037b76-8bbe-412d-387a-4e27db26b005@oracle.com>
@ 2019-02-19  3:59       ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2019-02-19  3:59 UTC (permalink / raw)
  To: Jane Chu
  Cc: Johannes Thumshirn, linux-nvdimm, linux-block, hare,
	linux-fsdevel, lsf-pc, linux-btrfs

On Mon, Feb 18, 2019 at 06:15:34PM -0800, Jane Chu wrote:
> On 2/15/2019 9:39 PM, Dave Chinner wrote:
> 
> >On Sat, Feb 16, 2019 at 04:31:33PM +1100, Dave Chinner wrote:
> >>On Fri, Feb 15, 2019 at 10:57:12AM +0100, Johannes Thumshirn wrote:
> >>>(This is a joint proposal with Hannes Reinecke)
> >>>
> >>>Servers with NV-DIMM are slowly emerging in data centers but one key feature
> >>>for reliability of these systems hasn't been addressed up to now, data
> >>>redundancy.
> >>>
> >>>While it would be best to solve this issue in the memory controller of the CPU
> >>>itself, I don't see this coming in the next few years. This puts us as the OS
> >>>in the burden to create the redundant copies of data for the users.
> >>>
> >>>If we leave of the DAX support Linux' software RAID implementations (MD,
> >>>device-mapper and BTRFS RAID) do already work on top of pmem devices, but they
> >>>are incompatible with DAX.
> >>>
> >>>In this session Hannes and I would like to discuss eventual ways how we as an
> >>>operating system can mitigate these issues for our users.
> >>We've supported this since mid 2018 and commit ba23cba9b3bd ("fs:
> >>allow per-device dax status checking for filesystems"). That is,
> >>we can have DAX on the XFS RT device indepently of the data device.
> >>
> >>That is, you set up pmem in three segments - two small identical
> >>segments start get mirrored with RAID1 as the data device, and
> >>the remainder as a block device that is dax capable set up as the
> >>XFS realtime device. Set the RTINHERIT bit on the root directory at
> >>mkfs time ("-d rtinherit=1") and then all the data goes to the DAX
> >>capable realtime device, and all the metadata goes to the software
> >>raided pmem block devices that aren't DAX capable.
> >>
> >>Problem already solved, yes?
> >Sorry, this was meant to be a reply to Dan's email commenting about
> >some people needing mirrored metadata, not the parent that was
> >talking about whole device RAID...
> >
> >i.e. mirrored metadata w/ FS-DAX for data should already be a solved
> >problem...
> 
> What about DAX Ext4 filesystem? What could be done for the EXT4
> filesystem metadata?

AFAIK, this can't be supported with ext4 right now. You'll have to
either use hardware DIMM mirroring or wait to see if full device
mirroring can be done in software and still preserve DAX (which is
what the original proposal is about).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


end of thread, other threads:[~2019-02-19  3:59 UTC | newest]

Thread overview: 10+ messages
2019-02-15  9:57 [LSF/MM TOPIC] Software RAID Support for NV-DIMM Johannes Thumshirn
2019-02-15 16:34 ` Dan Williams
2019-02-16  5:31 ` Dave Chinner
2019-02-16  5:39   ` Dave Chinner
2019-02-16  8:16     ` Bob Liu
2019-02-16 17:05     ` Dan Williams
2019-02-16 23:00       ` Dave Chinner
2019-02-18 10:50     ` Johannes Thumshirn
2019-02-18 18:27       ` Dan Williams
     [not found]     ` <d7037b76-8bbe-412d-387a-4e27db26b005@oracle.com>
2019-02-19  3:59       ` Dave Chinner
