* experimental raid5/6 code in git
@ 2013-02-02 16:02 Chris Mason
  2013-02-03 17:33 ` Hendrik Friedel
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Chris Mason @ 2013-02-02 16:02 UTC (permalink / raw)
  To: linux-btrfs

Hi everyone,

I've uploaded an experimental release of the raid5/6 support to git, in
branches named raid56-experimental.  This is based on David Woodhouse's
initial implementation (thanks Dave!).

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git raid56-experimental
git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git raid56-experimental

These are working well for me, but I'm sure I've missed at least one or
two problems.  Most importantly, the kernel side of things can have
inconsistent parity if you crash or lose power.  I'm adding new code to
fix that right now; it's the big missing piece.

But, I wanted to give everyone the chance to test what I have while I'm
finishing off the last few details.  Also missing:

* Support for scrub repairing bad blocks.  This is not difficult, we
just need to make a way for scrub to lock stripes and rewrite the
whole stripe with proper parity.

* Support for discard.  The discard code needs to discard entire
stripes.

* Progs support for parity rebuild.  Missing drives upset the progs
today, but the kernel does rebuild parity properly.

* Planned support for N-way mirroring (triple mirror raid1) isn't
included yet.

With all those warnings out of the way, how does it work?  The
original plan was to do read/modify/write cycles at higher levels in the
filesystem, so that we always sent full stripe writes down to the raid56
layer.  But this had a few problems, especially when you start thinking
about converting from one stripe size to another.  It doesn't fit with
the delayed allocation model, where we pick physical extents for a given
operation as late as we possibly can.

Instead I'm doing read/modify/write when we map bios down to the
individual drives.  This allows blocks from multiple files to share a
stripe, and it allows us to have metadata blocks smaller than a full
stripe.  That's important if you don't want to spin every disk for each
metadata read.

This does sound quite a lot like MD raid, and that's because it is.  By
doing the raid inside of Btrfs, we're able to use different raid levels
for metadata vs data, and we're able to force parity rebuilds when crcs
don't match.  Also, management operations such as restriping and
adding/removing drives are able to hook into the filesystem
transactions.  Longer term we'll be able to skip reads on blocks that
aren't allocated and make other connections between raid56 and the FS
metadata.

I've spent a long time running different performance numbers, but there
are many benchmarks left to run.  The matrix of different configurations
is fairly large, with btrfs-raid56 vs MD-raid56 vs Btrfs-on-MD-raid56,
and then comparing all the basic workloads.  Before I dive into numbers,
I want to describe a few moving pieces.

Stripe cache -- This avoids read/modify/write cycles with an LRU of
recently written stripes.  Picture a database that does adjacent
synchronous 4K writes (say a log record and a commit block).  We want to
make sure we don't repeat read/modify/writes for the commit block after
writing the log block.

In btrfs the role of the stripe cache changes because we're doing COW.
Hopefully we are able to collect writes from multiple processes into a
full stripe and do fewer read/modify/write cycles.  But we still need
the cache.
The cache in btrfs defaults to 1024 stripes and can't (yet) be tuned.
In MD it can be tuned up to 32768 stripes.
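For reference, the MD knob lives in sysfs (a sketch, assuming the array
is md0; the default there is 256 stripes):

# raise the MD stripe cache to its 32768-stripe maximum
echo 32768 > /sys/block/md0/md/stripe_cache_size
cat /sys/block/md0/md/stripe_cache_size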

In the btrfs code, the stripe cache is the director in a state machine
that pulls stripes from initial submission to completion.  It
coordinates merging stripes, parity rebuild and handing off the stripe
lock to the next bio.

Plugging -- The on-stack plugging code has a slick way for anyone in the
IO stack to participate in plugging.  Btrfs is using this to collect
partial stripe writes in hopes of merging them into full stripes.  When
the kernel code unplugs, we sort, merge and fire off the IOs.  MD has a
plugging callback as well.

Parity calculations --  For full stripes, Btrfs does P/Q calculations
at IO submission time without handing off to helper threads.  The code
uses the synchronous xor/memcpy/raid6 lib APIs.  For sub-stripe writes,
Btrfs kicks the work off to its own helper threads and uses the same
synchronous APIs.  I'm definitely open to trying out the IOAT code, but
so far I don't see the P/Q math as a real bottleneck.

Everyone who made it this far gets to see benchmarks!  I've run these on
two different systems.

1) A large HP DL380 with two sockets and 4TB of flash.  The
flash is spread over 4 drives and in a raid0 run it can do 5GB/s
streaming writes.  This machine has the IOAT async raid engine.

2) A smaller single socket box with 4 spindles and 2 fusionio drives.
No raid offload here.  This box can do 2.5GB/s streaming writes.

These are all on 3.7.0 with MD created with -c 64 and --assume-clean.
I upped the MD stripe cache to 32768, but didn't include Shaohua's
patches to parallelize the MD parity calculations.  I'll do those runs
after I have the next round of btrfs changes done.
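For the record, the MD arrays were built roughly like this (a sketch;
the device names and member count are placeholders for whichever box is
being tested):

# 64K chunk, skip the initial resync so it doesn't pollute the runs
mdadm --create /dev/md0 --level=5 --raid-devices=8 -c 64 \
      --assume-clean /dev/sd[b-i]
# bump the stripe cache as described above
echo 32768 > /sys/block/md0/md/stripe_cache_size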

Let's start with an easy benchmark:

Machine #2, with the flash broken up into 8 logical volumes and raid5
created on top (64K stripe size).  A single dd doing streaming full
stripe writes:

dd if=/dev/zero of=/mnt/oo bs=1344K oflag=direct count=4096

Btrfs -- 604MB/s
MD    -- 162MB/s

My guess is the performance difference here is coming from latencies
related to handing off parity to helpers.  Btrfs is doing everything
inline and MD is handing off.

fs/direct-io.c is sending down partial stripes (one IO per 64 pages),
but our plugging callbacks let us collect them.  Neither MD nor Btrfs is
doing any reads here.
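That's expected, because the dd block size is an exact multiple of the
full stripe.  Just arithmetic, assuming the 8-member raid5 above (7 data
+ 1 parity at a 64K chunk):

echo $((7 * 64))      # 448  -> KB of data in one full stripe
echo $((1344 / 448))  # 3    -> each 1344K dd write covers 3 full stripes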

Now for something a little bigger:

Machine #1 with all 4 drives configured in raid6.  This one is using fio
to do a streaming aio/dio write of large full stripes.  The numbers
below are from blktrace.  Since we're doing raid6 over 4 drives, half
our IO was for parity, so the actual throughput seen by fio is 1/2 of
this.
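The fio job was along these lines (a sketch; the block size, queue
depth, file size and path here are made up for illustration, the real
job sized its writes to full stripes):

fio --name=fullstripe --ioengine=libaio --direct=1 --rw=write \
    --bs=1m --iodepth=64 --size=32g --filename=/mnt/btrfs/fio-test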

The MD runs are going directly to MD, no filesystem involved.

MD -- 800MB/s very little system time
http://masoncoding.com/mason/benchmark/btrfs-raid6/md-raid6-full-stripe-tput.png
http://masoncoding.com/mason/benchmark/btrfs-raid6/md-raid6-full-stripe-sys.png

Btrfs -- 3.8GB/s one CPU mostly pegged
http://masoncoding.com/mason/benchmark/btrfs-raid6/btrfs-full-stripe-tput.png
http://masoncoding.com/mason/benchmark/btrfs-raid6/btrfs-full-stripe-sys.png

That one CPU is handling interrupts for the flash.

I spent some time trying to figure out why MD was doing reads in this
run, but I wasn't able to nail it down.

Long story short, I spent a long time tuning for streaming writes on
flash.  MD isn't CPU bound in these runs, and latencytop shows it is
waiting for room in its stripe cache.

Ok, but what about read/modify/write?
Machine #2 with fio doing 32K writes onto raid5

Btrfs -- 380MB/s seen by fio
MD    -- 174MB/s seen by fio

http://masoncoding.com/mason/benchmark/btrfs-raid6/btrfs-32K-write-raid5-full.png
http://masoncoding.com/mason/benchmark/btrfs-raid6/md-raid5-32K.png

For the Btrfs run, I filled the disk with 8 files and then deleted one
of them.  The end result made it impossible for btrfs to ever allocate a
full stripe, even when it was doing COW.  So every 32K write triggered a
read/modify/write cycle.  MD was doing rmw on every IO as well.
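Roughly how that aged filesystem was set up (a sketch; the file names
are made up, and each dd simply runs until the filesystem is full):

# fill the filesystem with 8 files written in parallel, then free one
for i in $(seq 1 8); do
    dd if=/dev/zero of=/mnt/btrfs/fill-$i bs=1M oflag=direct &
done
wait
rm /mnt/btrfs/fill-4
sync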

It's interesting that MD is doing a 1:1 read/write mix while btrfs is
doing more reads than writes.  Some of that is metadata required for the
IO.

How does Btrfs do at 32K sub stripe writes when the FS is empty?

http://masoncoding.com/mason/benchmark/btrfs-raid6/btrfs-32K-write-raid5-empty.png

COW lets us collect 32K writes from multiple procs into a full stripe,
so we can avoid the rmw cycle some of the time.  It's faster, but only
lasts while the space is free.

Metadata intensive workloads hit the read/modify/write code much harder,
and are even more latency sensitive than O_DIRECT.  To test this, I used
fs_mark, both on spindles and on flash.
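The fs_mark invocation looked roughly like this (a sketch; the thread
count, directory layout and per-thread file count are illustrative, the
real runs created 12 million files):

# zero-length files keep the workload purely metadata
fs_mark -d /mnt/btrfs/fsmark -D 256 -t 16 -n 750000 -s 0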

The interesting thing is that on flash, MD was within 15% of the Btrfs
number.  The fs_mark run was actually CPU bound creating new files in
Btrfs, so once we used flash the storage wasn't the bottleneck any more.

Spindles looked a little different.  For these runs I tested btrfs on
top of MD vs btrfs raid5.

http://masoncoding.com/mason/benchmark/btrfs-raid5/btrfs-fsmark-md-raid5-spindle.png
http://masoncoding.com/mason/benchmark/btrfs-raid5/btrfs-fsmark-raid5-spindle.png

Creating 12 million files on Btrfs raid5 took 226 seconds, vs 485
seconds on MD.  In general MD is doing more reads for the same
workload.  I don't have a great explanation for this yet but the
Btrfs stripe cache may have a bigger window for merging concurrent IOs
into the same stripe.

OK, that's enough for now.  Happy testing, everyone.

-chris


* Re: experimental raid5/6 code in git
  2013-02-02 16:02 experimental raid5/6 code in git Chris Mason
@ 2013-02-03 17:33 ` Hendrik Friedel
  2013-02-04 21:42 ` H. Peter Anvin
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Hendrik Friedel @ 2013-02-03 17:33 UTC (permalink / raw)
  To: Chris Mason, linux-btrfs

Hi Chris,

I've been keen for raid5/6 in btrfs ever since I first heard of it.

I cannot give you any feedback, but I'd like to take the opportunity to
thank you, and all contributors (thinking of David for the raid code),
for your work.

Regards,
Hendrik


* Re: experimental raid5/6 code in git
  2013-02-02 16:02 experimental raid5/6 code in git Chris Mason
  2013-02-03 17:33 ` Hendrik Friedel
@ 2013-02-04 21:42 ` H. Peter Anvin
  2013-02-05  1:23   ` Chris Mason
  2013-02-05 14:22 ` Tomasz Torcz
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: H. Peter Anvin @ 2013-02-04 21:42 UTC (permalink / raw)
  To: Chris Mason, linux-btrfs

@@ -1389,6 +1392,14 @@ int btrfs_rm_device(struct btrfs_root *root, char
*device_path)
 	}
 	btrfs_dev_replace_unlock(&root->fs_info->dev_replace);

+	if ((all_avail & (BTRFS_BLOCK_GROUP_RAID5 |
+			  BTRFS_BLOCK_GROUP_RAID6) && num_devices <= 3)) {
+		printk(KERN_ERR "btrfs: unable to go below three devices "
+		       "on raid5 or raid6\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
 	if ((all_avail & BTRFS_BLOCK_GROUP_RAID10) && num_devices <= 4) {
 		printk(KERN_ERR "btrfs: unable to go below four devices "
 		       "on raid10\n");
@@ -1403,6 +1414,21 @@ int btrfs_rm_device(struct btrfs_root *root, char
*device_path)
 		goto out;
 	}

+	if ((all_avail & BTRFS_BLOCK_GROUP_RAID5) &&
+	    root->fs_info->fs_devices->rw_devices <= 2) {
+		printk(KERN_ERR "btrfs: unable to go below two "
+		       "devices on raid5\n");
+		ret = -EINVAL;
+		goto out;
+	}
+	if ((all_avail & BTRFS_BLOCK_GROUP_RAID6) &&
+	    root->fs_info->fs_devices->rw_devices <= 3) {
+		printk(KERN_ERR "btrfs: unable to go below three "
+		       "devices on raid6\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
 	if (strcmp(device_path, "missing") == 0) {
 		struct list_head *devices;
 		struct btrfs_device *tmp;


This seems inconsistent?

	-hpa



* Re: experimental raid5/6 code in git
  2013-02-04 21:42 ` H. Peter Anvin
@ 2013-02-05  1:23   ` Chris Mason
  2013-02-05  2:26     ` H. Peter Anvin
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Mason @ 2013-02-05  1:23 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Chris Mason, linux-btrfs

On Mon, Feb 04, 2013 at 02:42:24PM -0700, H. Peter Anvin wrote:
> @@ -1389,6 +1392,14 @@ int btrfs_rm_device(struct btrfs_root *root, char
> *device_path)
>  	}
>  	btrfs_dev_replace_unlock(&root->fs_info->dev_replace);
> 
> +	if ((all_avail & (BTRFS_BLOCK_GROUP_RAID5 |
> +			  BTRFS_BLOCK_GROUP_RAID6) && num_devices <= 3)) {
> +		printk(KERN_ERR "btrfs: unable to go below three devices "
> +		       "on raid5 or raid6\n");
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
>  	if ((all_avail & BTRFS_BLOCK_GROUP_RAID10) && num_devices <= 4) {
>  		printk(KERN_ERR "btrfs: unable to go below four devices "
>  		       "on raid10\n");
> @@ -1403,6 +1414,21 @@ int btrfs_rm_device(struct btrfs_root *root, char
> *device_path)
>  		goto out;
>  	}
> 
> +	if ((all_avail & BTRFS_BLOCK_GROUP_RAID5) &&
> +	    root->fs_info->fs_devices->rw_devices <= 2) {
> +		printk(KERN_ERR "btrfs: unable to go below two "
> +		       "devices on raid5\n");
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +	if ((all_avail & BTRFS_BLOCK_GROUP_RAID6) &&
> +	    root->fs_info->fs_devices->rw_devices <= 3) {
> +		printk(KERN_ERR "btrfs: unable to go below three "
> +		       "devices on raid6\n");
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
>  	if (strcmp(device_path, "missing") == 0) {
>  		struct list_head *devices;
>  		struct btrfs_device *tmp;
> 
> 
> This seems inconsistent?

Whoops, missed that one.  Thanks!

-chris



* Re: experimental raid5/6 code in git
  2013-02-05  1:23   ` Chris Mason
@ 2013-02-05  2:26     ` H. Peter Anvin
  2013-02-05  2:59       ` Gareth Pye
  0 siblings, 1 reply; 13+ messages in thread
From: H. Peter Anvin @ 2013-02-05  2:26 UTC (permalink / raw)
  To: Chris Mason; +Cc: Chris Mason, linux-btrfs

Also, a 2-member raid5 or a 3-member raid6 is effectively a raid1 and can be treated as such.

Chris Mason <chris.mason@fusionio.com> wrote:

>On Mon, Feb 04, 2013 at 02:42:24PM -0700, H. Peter Anvin wrote:
>> @@ -1389,6 +1392,14 @@ int btrfs_rm_device(struct btrfs_root *root,
>char
>> *device_path)
>>  	}
>>  	btrfs_dev_replace_unlock(&root->fs_info->dev_replace);
>> 
>> +	if ((all_avail & (BTRFS_BLOCK_GROUP_RAID5 |
>> +			  BTRFS_BLOCK_GROUP_RAID6) && num_devices <= 3)) {
>> +		printk(KERN_ERR "btrfs: unable to go below three devices "
>> +		       "on raid5 or raid6\n");
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>> +
>>  	if ((all_avail & BTRFS_BLOCK_GROUP_RAID10) && num_devices <= 4) {
>>  		printk(KERN_ERR "btrfs: unable to go below four devices "
>>  		       "on raid10\n");
>> @@ -1403,6 +1414,21 @@ int btrfs_rm_device(struct btrfs_root *root,
>char
>> *device_path)
>>  		goto out;
>>  	}
>> 
>> +	if ((all_avail & BTRFS_BLOCK_GROUP_RAID5) &&
>> +	    root->fs_info->fs_devices->rw_devices <= 2) {
>> +		printk(KERN_ERR "btrfs: unable to go below two "
>> +		       "devices on raid5\n");
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>> +	if ((all_avail & BTRFS_BLOCK_GROUP_RAID6) &&
>> +	    root->fs_info->fs_devices->rw_devices <= 3) {
>> +		printk(KERN_ERR "btrfs: unable to go below three "
>> +		       "devices on raid6\n");
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>> +
>>  	if (strcmp(device_path, "missing") == 0) {
>>  		struct list_head *devices;
>>  		struct btrfs_device *tmp;
>> 
>> 
>> This seems inconsistent?
>
>Whoops, missed that one.  Thanks!
>
>-chris

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.


* Re: experimental raid5/6 code in git
  2013-02-05  2:26     ` H. Peter Anvin
@ 2013-02-05  2:59       ` Gareth Pye
  2013-02-05  5:29         ` Chester
  0 siblings, 1 reply; 13+ messages in thread
From: Gareth Pye @ 2013-02-05  2:59 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Chris Mason, Chris Mason, linux-btrfs

I felt like having a small play with this stuff, as I've been wanting
it for so long :)

But apparently I've made some incredibly newb error.

I used the following two lines to check out the code:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git
raid56-experimental
git clone git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git
raid56-experimental-progs

Then I did not do very much to compile both of them (installed lots and
lots of packages that various places told me would be needed so they'd
both compile), finishing up with a "sudo make install" for both the
kernel and the tools.
Rebooting, it miraculously came up with the new kernel, and uname -a
assures me that I have a new kernel running:
btrfs@ubuntu:/kernel/raid56-experimental$ uname -a
Linux ubuntu 3.6.0+ #1 SMP Tue Feb 5 12:26:03 EST 2013 x86_64 x86_64
x86_64 GNU/Linux
3.6.0 sounds rather low, but it is newer than Ubuntu 12.10's 3.5, so I
believe I am running the kernel I just compiled.

Where things fail is that I can't figure out how to make a raid5 btrfs.
I'm certain I'm using the mkfs.btrfs that I just compiled (by
explicitly calling it in the directory where I built it), but it won't
recognise what I assume the parameter should be:
btrfs@ubuntu:/kernel/raid56-experimental-progs$ ./mkfs.btrfs -m raid5
-d raid5 /dev/sd[bcdef]
Unknown profile raid5

Which flavour of newb am I today?

PS: I use newb in a very friendly way, I feel no shame over that term :)

On Tue, Feb 5, 2013 at 1:26 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> Also, a 2-member raid5 or 3-member raid6 are a raid1 and can be treated as such.
>
> Chris Mason <chris.mason@fusionio.com> wrote:
>
>>On Mon, Feb 04, 2013 at 02:42:24PM -0700, H. Peter Anvin wrote:
>>> @@ -1389,6 +1392,14 @@ int btrfs_rm_device(struct btrfs_root *root,
>>char
>>> *device_path)
>>>      }
>>>      btrfs_dev_replace_unlock(&root->fs_info->dev_replace);
>>>
>>> +    if ((all_avail & (BTRFS_BLOCK_GROUP_RAID5 |
>>> +                      BTRFS_BLOCK_GROUP_RAID6) && num_devices <= 3)) {
>>> +            printk(KERN_ERR "btrfs: unable to go below three devices "
>>> +                   "on raid5 or raid6\n");
>>> +            ret = -EINVAL;
>>> +            goto out;
>>> +    }
>>> +
>>>      if ((all_avail & BTRFS_BLOCK_GROUP_RAID10) && num_devices <= 4) {
>>>              printk(KERN_ERR "btrfs: unable to go below four devices "
>>>                     "on raid10\n");
>>> @@ -1403,6 +1414,21 @@ int btrfs_rm_device(struct btrfs_root *root,
>>char
>>> *device_path)
>>>              goto out;
>>>      }
>>>
>>> +    if ((all_avail & BTRFS_BLOCK_GROUP_RAID5) &&
>>> +        root->fs_info->fs_devices->rw_devices <= 2) {
>>> +            printk(KERN_ERR "btrfs: unable to go below two "
>>> +                   "devices on raid5\n");
>>> +            ret = -EINVAL;
>>> +            goto out;
>>> +    }
>>> +    if ((all_avail & BTRFS_BLOCK_GROUP_RAID6) &&
>>> +        root->fs_info->fs_devices->rw_devices <= 3) {
>>> +            printk(KERN_ERR "btrfs: unable to go below three "
>>> +                   "devices on raid6\n");
>>> +            ret = -EINVAL;
>>> +            goto out;
>>> +    }
>>> +
>>>      if (strcmp(device_path, "missing") == 0) {
>>>              struct list_head *devices;
>>>              struct btrfs_device *tmp;
>>>
>>>
>>> This seems inconsistent?
>>
>>Whoops, missed that one.  Thanks!
>>
>>-chris
>
> --
> Sent from my mobile phone. Please excuse brevity and lack of formatting.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Gareth Pye
Level 2 Judge, Melbourne, Australia
Australian MTG Forum: mtgau.com
gareth@cerberos.id.au - www.rockpaperdynamite.wordpress.com
"Dear God, I would like to file a bug report"


* Re: experimental raid5/6 code in git
  2013-02-05  2:59       ` Gareth Pye
@ 2013-02-05  5:29         ` Chester
  2013-02-05  6:10           ` Gareth Pye
  0 siblings, 1 reply; 13+ messages in thread
From: Chester @ 2013-02-05  5:29 UTC (permalink / raw)
  To: Gareth Pye; +Cc: H. Peter Anvin, Chris Mason, Chris Mason, linux-btrfs

The last argument should be the directory you want to clone into. Use
'-b <branch>' to specify the branch you want to clone. I'm pretty sure
you've compiled just the master branch of both linux-btrfs and
btrfs-progs.
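Something along these lines should get the right code (a sketch; the
directory names at the end are just examples):

git clone -b raid56-experimental \
    git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git raid56-experimental
git clone -b raid56-experimental \
    git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git raid56-experimental-progs

Then rebuild the kernel and mkfs.btrfs from those trees and the raid5
profile should be recognised.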

On Mon, Feb 4, 2013 at 8:59 PM, Gareth Pye <gareth@cerberos.id.au> wrote:
> I felt like having a small play with this stuff, as I've been wanting
> it for so long :)
>
> But apparently I've made some incredibly newb error.
>
> I used the following two lines to check out the code:
> git clone git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git
> raid56-experimental
> git clone git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git
> raid56-experimental-progs
>
> Then I did not very much to compile both of them (installed lots and
> lots of packages that various places told me would be needed so they'd
> both compile) finishing up with a "sudo make install" for both the
> kernel and the tools.
> Rebooting miracuously it came up with the new kernel and uname -a
> assures me that I have a new kernel running:
> btrfs@ubuntu:/kernel/raid56-experimental$ uname -a
> Linux ubuntu 3.6.0+ #1 SMP Tue Feb 5 12:26:03 EST 2013 x86_64 x86_64
> x86_64 GNU/Linux
> but 3.6.0 sounds rather low, but it is newer than Ubuntu 12.10's 3.5
> so I believe I am running the kernel I just compiled
>
> Where things fail is that I can figure out how to make a raid5 btrfs,
> I'm certain I'm using the mkfs.btrfs that I just compiled (by
> explicitly calling it in the make folder) but it wont recognise what I
> assume the parameter to be:
> btrfs@ubuntu:/kernel/raid56-experimental-progs$ ./mkfs.btrfs -m raid5
> -d raid5 /dev/sd[bcdef]
> Unknown profile raid5
>
> Which flavour of newb am I today?
>
> PS: I use newb in a very friendly way, I feel no shame over that term :)
>
> On Tue, Feb 5, 2013 at 1:26 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> Also, a 2-member raid5 or 3-member raid6 are a raid1 and can be treated as such.
>>
>> Chris Mason <chris.mason@fusionio.com> wrote:
>>
>>>On Mon, Feb 04, 2013 at 02:42:24PM -0700, H. Peter Anvin wrote:
>>>> @@ -1389,6 +1392,14 @@ int btrfs_rm_device(struct btrfs_root *root,
>>>char
>>>> *device_path)
>>>>      }
>>>>      btrfs_dev_replace_unlock(&root->fs_info->dev_replace);
>>>>
>>>> +    if ((all_avail & (BTRFS_BLOCK_GROUP_RAID5 |
>>>> +                      BTRFS_BLOCK_GROUP_RAID6) && num_devices <= 3)) {
>>>> +            printk(KERN_ERR "btrfs: unable to go below three devices "
>>>> +                   "on raid5 or raid6\n");
>>>> +            ret = -EINVAL;
>>>> +            goto out;
>>>> +    }
>>>> +
>>>>      if ((all_avail & BTRFS_BLOCK_GROUP_RAID10) && num_devices <= 4) {
>>>>              printk(KERN_ERR "btrfs: unable to go below four devices "
>>>>                     "on raid10\n");
>>>> @@ -1403,6 +1414,21 @@ int btrfs_rm_device(struct btrfs_root *root,
>>>char
>>>> *device_path)
>>>>              goto out;
>>>>      }
>>>>
>>>> +    if ((all_avail & BTRFS_BLOCK_GROUP_RAID5) &&
>>>> +        root->fs_info->fs_devices->rw_devices <= 2) {
>>>> +            printk(KERN_ERR "btrfs: unable to go below two "
>>>> +                   "devices on raid5\n");
>>>> +            ret = -EINVAL;
>>>> +            goto out;
>>>> +    }
>>>> +    if ((all_avail & BTRFS_BLOCK_GROUP_RAID6) &&
>>>> +        root->fs_info->fs_devices->rw_devices <= 3) {
>>>> +            printk(KERN_ERR "btrfs: unable to go below three "
>>>> +                   "devices on raid6\n");
>>>> +            ret = -EINVAL;
>>>> +            goto out;
>>>> +    }
>>>> +
>>>>      if (strcmp(device_path, "missing") == 0) {
>>>>              struct list_head *devices;
>>>>              struct btrfs_device *tmp;
>>>>
>>>>
>>>> This seems inconsistent?
>>>
>>>Whoops, missed that one.  Thanks!
>>>
>>>-chris
>>
>> --
>> Sent from my mobile phone. Please excuse brevity and lack of formatting.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Gareth Pye
> Level 2 Judge, Melbourne, Australia
> Australian MTG Forum: mtgau.com
> gareth@cerberos.id.au - www.rockpaperdynamite.wordpress.com
> "Dear God, I would like to file a bug report"
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: experimental raid5/6 code in git
  2013-02-05  5:29         ` Chester
@ 2013-02-05  6:10           ` Gareth Pye
  0 siblings, 0 replies; 13+ messages in thread
From: Gareth Pye @ 2013-02-05  6:10 UTC (permalink / raw)
  To: Chester; +Cc: H. Peter Anvin, Chris Mason, Chris Mason, linux-btrfs

Thank you, that makes a lot of sense :)

It's been a good day, I've learnt something :)

On Tue, Feb 5, 2013 at 4:29 PM, Chester <somethingsome2000@gmail.com> wrote:
> The last argument should be the directory you want to clone into. Use
> '-b <branch>' to specify the branch you want to clone. I'm pretty sure
> you've compiled just the master branch of both linux-btrfs and
> btrfs-progs.
>
> On Mon, Feb 4, 2013 at 8:59 PM, Gareth Pye <gareth@cerberos.id.au> wrote:
>> I felt like having a small play with this stuff, as I've been wanting
>> it for so long :)
>>
>> But apparently I've made some incredibly newb error.
>>
>> I used the following two lines to check out the code:
>> git clone git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git
>> raid56-experimental
>> git clone git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git
>> raid56-experimental-progs
>>
>> Then I did not very much to compile both of them (installed lots and
>> lots of packages that various places told me would be needed so they'd
>> both compile) finishing up with a "sudo make install" for both the
>> kernel and the tools.
>> Rebooting miracuously it came up with the new kernel and uname -a
>> assures me that I have a new kernel running:
>> btrfs@ubuntu:/kernel/raid56-experimental$ uname -a
>> Linux ubuntu 3.6.0+ #1 SMP Tue Feb 5 12:26:03 EST 2013 x86_64 x86_64
>> x86_64 GNU/Linux
>> but 3.6.0 sounds rather low, but it is newer than Ubuntu 12.10's 3.5
>> so I believe I am running the kernel I just compiled
>>
>> Where things fail is that I can figure out how to make a raid5 btrfs,
>> I'm certain I'm using the mkfs.btrfs that I just compiled (by
>> explicitly calling it in the make folder) but it wont recognise what I
>> assume the parameter to be:
>> btrfs@ubuntu:/kernel/raid56-experimental-progs$ ./mkfs.btrfs -m raid5
>> -d raid5 /dev/sd[bcdef]
>> Unknown profile raid5
>>
>> Which flavour of newb am I today?
>>
>> PS: I use newb in a very friendly way, I feel no shame over that term :)
>>
>> On Tue, Feb 5, 2013 at 1:26 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>> Also, a 2-member raid5 or 3-member raid6 are a raid1 and can be treated as such.
>>>
>>> Chris Mason <chris.mason@fusionio.com> wrote:
>>>
>>>>On Mon, Feb 04, 2013 at 02:42:24PM -0700, H. Peter Anvin wrote:
>>>>> @@ -1389,6 +1392,14 @@ int btrfs_rm_device(struct btrfs_root *root,
>>>>char
>>>>> *device_path)
>>>>>      }
>>>>>      btrfs_dev_replace_unlock(&root->fs_info->dev_replace);
>>>>>
>>>>> +    if ((all_avail & (BTRFS_BLOCK_GROUP_RAID5 |
>>>>> +                      BTRFS_BLOCK_GROUP_RAID6) && num_devices <= 3)) {
>>>>> +            printk(KERN_ERR "btrfs: unable to go below three devices "
>>>>> +                   "on raid5 or raid6\n");
>>>>> +            ret = -EINVAL;
>>>>> +            goto out;
>>>>> +    }
>>>>> +
>>>>>      if ((all_avail & BTRFS_BLOCK_GROUP_RAID10) && num_devices <= 4) {
>>>>>              printk(KERN_ERR "btrfs: unable to go below four devices "
>>>>>                     "on raid10\n");
>>>>> @@ -1403,6 +1414,21 @@ int btrfs_rm_device(struct btrfs_root *root,
>>>>char
>>>>> *device_path)
>>>>>              goto out;
>>>>>      }
>>>>>
>>>>> +    if ((all_avail & BTRFS_BLOCK_GROUP_RAID5) &&
>>>>> +        root->fs_info->fs_devices->rw_devices <= 2) {
>>>>> +            printk(KERN_ERR "btrfs: unable to go below two "
>>>>> +                   "devices on raid5\n");
>>>>> +            ret = -EINVAL;
>>>>> +            goto out;
>>>>> +    }
>>>>> +    if ((all_avail & BTRFS_BLOCK_GROUP_RAID6) &&
>>>>> +        root->fs_info->fs_devices->rw_devices <= 3) {
>>>>> +            printk(KERN_ERR "btrfs: unable to go below three "
>>>>> +                   "devices on raid6\n");
>>>>> +            ret = -EINVAL;
>>>>> +            goto out;
>>>>> +    }
>>>>> +
>>>>>      if (strcmp(device_path, "missing") == 0) {
>>>>>              struct list_head *devices;
>>>>>              struct btrfs_device *tmp;
>>>>>
>>>>>
>>>>> This seems inconsistent?
>>>>
>>>>Whoops, missed that one.  Thanks!
>>>>
>>>>-chris
>>>
>>> --
>>> Sent from my mobile phone. Please excuse brevity and lack of formatting.
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> Gareth Pye
>> Level 2 Judge, Melbourne, Australia
>> Australian MTG Forum: mtgau.com
>> gareth@cerberos.id.au - www.rockpaperdynamite.wordpress.com
>> "Dear God, I would like to file a bug report"
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
Gareth Pye
Level 2 Judge, Melbourne, Australia
Australian MTG Forum: mtgau.com
gareth@cerberos.id.au - www.rockpaperdynamite.wordpress.com
"Dear God, I would like to file a bug report"



* Re: experimental raid5/6 code in git
  2013-02-02 16:02 experimental raid5/6 code in git Chris Mason
  2013-02-03 17:33 ` Hendrik Friedel
  2013-02-04 21:42 ` H. Peter Anvin
@ 2013-02-05 14:22 ` Tomasz Torcz
  2013-02-05 18:32   ` Chris Mason
       [not found] ` <CAM_ZMPCAcKtOH4UapR8HbSN6Woi7Lh7mqy+e14TDY+8tM386iQ@mail.gmail.com>
  2013-02-12 15:16 ` Kaspar Schleiser
  4 siblings, 1 reply; 13+ messages in thread
From: Tomasz Torcz @ 2013-02-05 14:22 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

Hi,

  I believe XOR_BLOCKS must be selected, otherwise build fails with:
ERROR: "xor_blocks" [fs/btrfs/btrfs.ko] undefined!
 

diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index 4f5dc93..5f583c8 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -7,6 +7,7 @@ config BTRFS_FS
 	select LZO_COMPRESS
 	select LZO_DECOMPRESS
 	select RAID6_PQ
+	select XOR_BLOCKS
 
 	help
 	  Btrfs is a new filesystem with extents, writable snapshotting,

-- 
Tomasz   .. oo o.   oo o. .o   .o o. o. oo o.   ..
Torcz    .. .o .o   .o .o oo   oo .o .. .. oo   oo
o.o.o.   .o .. o.   o. o. o.   o. o. oo .. ..   o.



* Re: experimental raid5/6 code in git
  2013-02-05 14:22 ` Tomasz Torcz
@ 2013-02-05 18:32   ` Chris Mason
  0 siblings, 0 replies; 13+ messages in thread
From: Chris Mason @ 2013-02-05 18:32 UTC (permalink / raw)
  To: Tomasz Torcz; +Cc: Chris Mason, linux-btrfs

On Tue, Feb 05, 2013 at 07:22:36AM -0700, Tomasz Torcz wrote:
> Hi,
> 
>   I believe XOR_BLOCKS must be selected, otherwise build fails with:
> ERROR: "xor_blocks" [fs/btrfs/btrfs.ko] undefined!
>  
> 
> diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
> index 4f5dc93..5f583c8 100644
> --- a/fs/btrfs/Kconfig
> +++ b/fs/btrfs/Kconfig
> @@ -7,6 +7,7 @@ config BTRFS_FS
>  	select LZO_COMPRESS
>  	select LZO_DECOMPRESS
>  	select RAID6_PQ
> +	select XOR_BLOCKS
>  
>  	help
>  	  Btrfs is a new filesystem with extents, writable snapshotting,

Thanks, I've put this in.

-chris


* Re: experimental raid5/6 code in git
       [not found] ` <CAM_ZMPCAcKtOH4UapR8HbSN6Woi7Lh7mqy+e14TDY+8tM386iQ@mail.gmail.com>
@ 2013-02-11 15:13   ` Chris Mason
  0 siblings, 0 replies; 13+ messages in thread
From: Chris Mason @ 2013-02-11 15:13 UTC (permalink / raw)
  To: Gordon Manning; +Cc: Chris Mason, linux-btrfs

On Sun, Feb 10, 2013 at 03:35:05PM -0700, Gordon Manning wrote:
>    Hi,
>    Is the BTRFS raid code susceptible to RAID-5 write holes?  I think with
>    the original plan, the problem was avoided by always giving full stripe
>    writes to the raid layers.  Does the current plan deal with the hole in a
>    different manner?

The current code in my git tree does not deal with the raid-5 write
hole.  That's the part I'm finishing off now.

-chris


* Re: experimental raid5/6 code in git
  2013-02-02 16:02 experimental raid5/6 code in git Chris Mason
                   ` (3 preceding siblings ...)
       [not found] ` <CAM_ZMPCAcKtOH4UapR8HbSN6Woi7Lh7mqy+e14TDY+8tM386iQ@mail.gmail.com>
@ 2013-02-12 15:16 ` Kaspar Schleiser
  2013-02-12 15:30   ` Chris Mason
  4 siblings, 1 reply; 13+ messages in thread
From: Kaspar Schleiser @ 2013-02-12 15:16 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Chris Mason

Hey Chris,

On 02/02/2013 05:02 PM, Chris Mason wrote:
> Btrfs -- 604MB/s
> MD    -- 162MB/s
> 
> 
> MD -- 800MB/s very little system time
> Btrfs -- 3.8GB/s one CPU mostly pegged

> Btrfs -- 380MB/s seen by fio
> MD    -- 174MB/s seen by fio

> Creating 12 million files on Btrfs raid5 took 226 seconds, vs 485
> seconds on MD.

Do I read these numbers incorrectly, or does even this first iteration
of btrfs' raid5/6 code run circles around MD?

Thanks for all the work!

Kaspar


* Re: experimental raid5/6 code in git
  2013-02-12 15:16 ` Kaspar Schleiser
@ 2013-02-12 15:30   ` Chris Mason
  0 siblings, 0 replies; 13+ messages in thread
From: Chris Mason @ 2013-02-12 15:30 UTC (permalink / raw)
  To: Kaspar Schleiser; +Cc: linux-btrfs, Chris Mason

On Tue, Feb 12, 2013 at 08:16:49AM -0700, Kaspar Schleiser wrote:
> Hey Chris,
> 
> On 02/02/2013 05:02 PM, Chris Mason wrote:
> > Btrfs -- 604MB/s
> > MD    -- 162MB/s
> > 
> > 
> > MD -- 800MB/s very little system time
> > Btrfs -- 3.8GB/s one CPU mostly pegged
> 
> > Btrfs -- 380MB/s seen by fio
> > MD    -- 174MB/s seen by fio
> 
> > Creating 12 million files on Btrfs raid5 took 226 seconds, vs 485
> > seconds on MD.
> 
> Do I read these numbers incorrectly, or does even this first iteration
> of btrfs' raid5/6 code run circles around MD?

Yes and no.  Most of the differences were on flash, and really it just
looks like MD needs tuning for IO latency and concurrency.  There are
some recent MD patches that add more threads for parity calculations,
and these solve some of the throughput problems.

But one thing that we've proven with btrfs is that helper threads mean
higher IO latencies.  So the MD code probably needs some shortcuts to do
the parity inline as well.

-chris

