* [RFC] Process requests instead of bios to use a scheduler
@ 2014-05-28 13:04 Sebastian Parschauer
  2014-06-01 23:32 ` NeilBrown
  0 siblings, 1 reply; 14+ messages in thread
From: Sebastian Parschauer @ 2014-05-28 13:04 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linux RAID, Florian-Ewald Müller

Hi Neil,

at ProfitBricks we use the raid0 driver stacked on top of raid1 to form
a RAID-10. Above there is LVM and SCST/ib_srpt.

We've extended the md driver for our 3.4-based kernels to do full bio
accounting (by adding ticks and in-flights). Then we've extended it to
support a request-by-request mode, using blk_init_queue() and an
md_request_function() selectable by a module parameter, and extended
mdadm accordingly. This way the block layer provides the accounting and
the possibility to select a scheduler.
With the ticks we maintain a latency statistic, which lets us compare
the two modes.
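The bucketing behind that statistic can be modeled in plain C (a userspace sketch, with bucket boundaries chosen to match the histograms below; the names are ours, not the in-kernel ones):

```c
#include <assert.h>

/* Userspace model of the latency histogram bucketing: 15 power-of-two
 * buckets, where 0 => "< 8 ms", 1 => "< 16 ms", ..., 13 => "< 65536 ms"
 * and 14 => ">= 65536 ms". */
static int latency_bucket(unsigned long ms)
{
	int idx = 0;

	while (idx < 14 && ms >= (8UL << idx))
		idx++;
	return idx;
}
```

For example, a 660 ms completion lands in the "< 1024 ms" bucket and the 9920 ms maximum below in "< 16384 ms".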

My colleague Florian is in CC as he has been the main developer for this.

We did some fio 2.1.7 tests with iodepth 64, posixaio, 10 LVs with 1M
chunks sequential I/O and 10 LVs with 4K chunks sequential as well as
random I/O - one fio call per device. After 60s all fio processes are
killed.
Test systems have four 1 TB Seagate Constellation HDDs in RAID-10. LVs
are 20G in size each.

The biggest issue in our cloud is unfairness leading to high latency,
SRP timeouts and reconnects. Hence we need a scheduler for our raid0
device.
The difference is tremendous when comparing the results of 4K random
writes fighting against 1M sequential writes. With a scheduler the
maximum write latency dropped from 10s to 1.6s. The values below are
the number of bios for scheduler 'none' and the number of requests for
the other schedulers; the first column is reads, the second writes.

Scheduler: none
<      8 ms: 0 2139
<     16 ms: 0 9451
<     32 ms: 0 10277
<     64 ms: 0 3586
<    128 ms: 0 5169
<    256 ms: 2 31688
<    512 ms: 3 115360
<   1024 ms: 2 283681
<   2048 ms: 0 420918
<   4096 ms: 0 10625
<   8192 ms: 0 220
<  16384 ms: 0 4
<  32768 ms: 0 0
<  65536 ms: 0 0
>= 65536 ms: 0 0
 maximum ms: 660 9920

Scheduler: deadline
<      8 ms: 2 435
<     16 ms: 1 997
<     32 ms: 0 1560
<     64 ms: 0 4345
<    128 ms: 1 11933
<    256 ms: 2 46366
<    512 ms: 0 182166
<   1024 ms: 1 75903
<   2048 ms: 0 146
<   4096 ms: 0 0
<   8192 ms: 0 0
<  16384 ms: 0 0
<  32768 ms: 0 0
<  65536 ms: 0 0
>= 65536 ms: 0 0
 maximum ms: 640 1640

We clone the bios from the request and put them into a bio list. The
request is marked as in-flight and afterwards the bios are processed
one-by-one the same way as with the other mode.
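The completion side of that scheme can be modeled in userspace like this (hypothetical names; in the kernel, the last reference drop would call blk_end_request_all() on the original request):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model: the in-flight request holds one reference per cloned
 * bio plus one for the submitter; the request is completed only when the
 * last reference is dropped, reporting any error seen on the way. */
struct req_clone {
	atomic_int cnt;
	bool ended;
	int err;
};

static void clone_release(struct req_clone *rcl)
{
	/* last reference ends the request (kernel: blk_end_request_all) */
	if (atomic_fetch_sub(&rcl->cnt, 1) == 1)
		rcl->ended = true;
}

static void clone_bio_endio(struct req_clone *rcl, int err)
{
	if (err < 0)
		rcl->err = err;
	clone_release(rcl);
}
```

The submitter's extra reference prevents the request from completing while clones are still being queued.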

Is it safe to do it like this with a scheduler?

Any concerns regarding the write-intent bitmap?

Do you have any other concerns?

We can provide you with the full test results, the test scripts and also
some code parts if you wish.

Cheers,
Sebastian

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] Process requests instead of bios to use a scheduler
  2014-05-28 13:04 [RFC] Process requests instead of bios to use a scheduler Sebastian Parschauer
@ 2014-06-01 23:32 ` NeilBrown
  2014-06-02  9:51   ` Sebastian Parschauer
  0 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2014-06-01 23:32 UTC (permalink / raw)
  To: Sebastian Parschauer; +Cc: Linux RAID, Florian-Ewald Müller

[-- Attachment #1: Type: text/plain, Size: 3762 bytes --]

On Wed, 28 May 2014 15:04:14 +0200 Sebastian Parschauer
<sebastian.riemer@profitbricks.com> wrote:

> Hi Neil,
> 
> at ProfitBricks we use the raid0 driver stacked on top of raid1 to form
> a RAID-10. Above there is LVM and SCST/ib_srpt.

Any particular reason you don't use the raid10 driver?


> 
> We've extended the md driver for our 3.4-based kernels to do full bio
> accounting (by adding ticks and in-flights). Then we've extended it to
> support a request-by-request mode, using blk_init_queue() and an
> md_request_function() selectable by a module parameter, and extended
> mdadm accordingly. This way the block layer provides the accounting
> and the possibility to select a scheduler.
> With the ticks we maintain a latency statistic, which lets us compare
> the two modes.
> 
> My colleague Florian is in CC as he has been the main developer for this.
> 
> We did some fio 2.1.7 tests with iodepth 64, posixaio, 10 LVs with 1M
> chunks sequential I/O and 10 LVs with 4K chunks sequential as well as
> random I/O - one fio call per device. After 60s all fio processes are
> killed.
> Test systems have four 1 TB Seagate Constellation HDDs in RAID-10. LVs
> are 20G in size each.
> 
> The biggest issue in our cloud is unfairness leading to high latency,
> SRP timeouts and reconnects. Hence we need a scheduler for our
> raid0 device.

Having a scheduler for RAID0 doesn't make any sense to me.
RAID0 simply passes each request down to the appropriate underlying device.
That device then does its own scheduling.

Adding a scheduler may well make sense for RAID1 (the current "scheduler"
only does some read balancing and is rather simplistic) and for RAID4/5/6/10.

But not for RAID0 .... was that a typo?


> The difference is tremendous when comparing the results of 4K random
> writes fighting against 1M sequential writes. With a scheduler the
> maximum write latency dropped from 10s to 1.6s. The values below are
> the number of bios for scheduler 'none' and the number of requests for
> the other schedulers; the first column is reads, the second writes.
> 
> Scheduler: none
> <      8 ms: 0 2139
> <     16 ms: 0 9451
> <     32 ms: 0 10277
> <     64 ms: 0 3586
> <    128 ms: 0 5169
> <    256 ms: 2 31688
> <    512 ms: 3 115360
> <   1024 ms: 2 283681
> <   2048 ms: 0 420918
> <   4096 ms: 0 10625
> <   8192 ms: 0 220
> <  16384 ms: 0 4
> <  32768 ms: 0 0
> <  65536 ms: 0 0
> >= 65536 ms: 0 0
>  maximum ms: 660 9920
> 
> Scheduler: deadline
> <      8 ms: 2 435
> <     16 ms: 1 997
> <     32 ms: 0 1560
> <     64 ms: 0 4345
> <    128 ms: 1 11933
> <    256 ms: 2 46366
> <    512 ms: 0 182166
> <   1024 ms: 1 75903
> <   2048 ms: 0 146
> <   4096 ms: 0 0
> <   8192 ms: 0 0
> <  16384 ms: 0 0
> <  32768 ms: 0 0
> <  65536 ms: 0 0
> >= 65536 ms: 0 0
>  maximum ms: 640 1640

Could you do a graph?  I like graphs :-)
I can certainly see something has changed here...

> 
> We clone the bios from the request and put them into a bio list. The
> request is marked as in-flight and afterwards the bios are processed
> one-by-one the same way as with the other mode.
> 
> Is it safe to do it like this with a scheduler?

I see nothing inherently wrong with the theory.  The details of the code are
much more important.

> 
> Any concerns regarding the write-intent bitmap?

Only that it has to keep working.

> 
> Do you have any other concerns?
> 
> We can provide you with the full test results, the test scripts and also
> some code parts if you wish.

I'm not against improving the scheduling in various md raid levels, though
not RAID0 as I mentioned above.

Show me the code and I might be able to provide a more detailed opinion.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]


* Re: [RFC] Process requests instead of bios to use a scheduler
  2014-06-01 23:32 ` NeilBrown
@ 2014-06-02  9:51   ` Sebastian Parschauer
  2014-06-02 10:20     ` NeilBrown
  0 siblings, 1 reply; 14+ messages in thread
From: Sebastian Parschauer @ 2014-06-02  9:51 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linux RAID, Florian-Ewald Müller

[-- Attachment #1: Type: text/plain, Size: 4236 bytes --]

Hi Neil,

first of all thank you very much for your response!

On 02.06.2014 01:32, NeilBrown wrote:
>> at ProfitBricks we use the raid0 driver stacked on top of raid1 to form
>> a RAID-10. Above there is LVM and SCST/ib_srpt.
> 
> Any particular reason you don't use the raid10 driver?

Yes, scaling tests (putting more and more disks into the RAID-10) have
shown that this setup performs better than the raid10 driver. In
particular, with our 24 HDDs the raid10 driver strangely delivered only
half the performance. It simply didn't scale properly in our tests,
while raid1+raid0 did.
We also have HW RAID systems where 8 * 2 disks in RAID-10 is the
maximum. So we only do RAID-1 in HW and use MD RAID-0 on top.

>> [...]
>>
>> We did some fio 2.1.7 tests with iodepth 64, posixaio, 10 LVs with 1M
>> chunks sequential I/O and 10 LVs with 4K chunks sequential as well as
>> random I/O - one fio call per device. After 60s all fio processes are
>> killed.
>> Test systems have four 1 TB Seagate Constellation HDDs in RAID-10. LVs
>> are 20G in size each.
>>
>> The biggest issue in our cloud is unfairness leading to high latency,
>> SRP timeouts and reconnects. Hence we need a scheduler for our
>> raid0 device.
> 
> Having a scheduler for RAID0 doesn't make any sense to me.
> RAID0 simply passes each request down to the appropriate underlying device.
> That device then does its own scheduling.
> 
> Adding a scheduler may well make sense for RAID1 (the current "scheduler"
> only does some read balancing and is rather simplistic) and for RAID4/5/6/10.
> 
> But not for RAID0 .... was that a typo?

Nope, we have our RAID-1+0. So it is more or less a RAID-10 and putting
the scheduler into this RAID-0 layer makes sense for us.

>> The difference is tremendous when comparing the results of 4K random
>> writes fighting against 1M sequential writes. With a scheduler the
>> maximum write latency dropped from 10s to 1.6s. The values below are
>> the number of bios for scheduler 'none' and the number of requests
>> for the other schedulers; the first column is reads, the second
>> writes.
>>
>> Scheduler: none
>> <      8 ms: 0 2139
>> <     16 ms: 0 9451
>> <     32 ms: 0 10277
>> <     64 ms: 0 3586
>> <    128 ms: 0 5169
>> <    256 ms: 2 31688
>> <    512 ms: 3 115360
>> <   1024 ms: 2 283681
>> <   2048 ms: 0 420918
>> <   4096 ms: 0 10625
>> <   8192 ms: 0 220
>> <  16384 ms: 0 4
>> <  32768 ms: 0 0
>> <  65536 ms: 0 0
>> >= 65536 ms: 0 0
>>  maximum ms: 660 9920
>>
>> Scheduler: deadline
>> <      8 ms: 2 435
>> <     16 ms: 1 997
>> <     32 ms: 0 1560
>> <     64 ms: 0 4345
>> <    128 ms: 1 11933
>> <    256 ms: 2 46366
>> <    512 ms: 0 182166
>> <   1024 ms: 1 75903
>> <   2048 ms: 0 146
>> <   4096 ms: 0 0
>> <   8192 ms: 0 0
>> <  16384 ms: 0 0
>> <  32768 ms: 0 0
>> <  65536 ms: 0 0
>> >= 65536 ms: 0 0
>>  maximum ms: 640 1640
> 
> Could you do a graph?  I like graphs :-)
> I can certainly see something has changed here...

Sure, please find the graphs attached. I've converted the values into
percentages so that the number of bios can be compared to the number of
requests.

You can see that the scheduler yields fewer IOs below 32 ms, but most
of its IOs complete below 512 ms, whereas without a scheduler most only
complete below 2048 ms.

>> We clone the bios from the request and put them into a bio list. The
>> request is marked as in-flight and afterwards the bios are processed
>> one-by-one the same way as with the other mode.
>>
>> Is it safe to do it like this with a scheduler?
> 
> I see nothing inherently wrong with the theory.  The details of the code are
> much more important.
> 
>>
>> Any concerns regarding the write-intent bitmap?
> 
> Only that it has to keep working.
> 
>> Do you have any other concerns?
>>
>> We can provide you with the full test results, the test scripts and also
>> some code parts if you wish.
> 
> I'm not against improving the scheduling in various md raid levels, though
> not RAID0 as I mentioned above.
> 
> Show me the code and I might be able to provide a more detailed opinion.

I would say let the user decide whether an MD device should be equipped
with a scheduler or not. We can port our code to latest kernel + latest
mdadm and send you a patch set for testing. Just give me some time to do it.

Cheers,
Sebastian

[-- Attachment #2: md_latency_stat_sched.png --]
[-- Type: image/png, Size: 38833 bytes --]

[-- Attachment #3: md_max_latency_sched.png --]
[-- Type: image/png, Size: 15823 bytes --]


* Re: [RFC] Process requests instead of bios to use a scheduler
  2014-06-02  9:51   ` Sebastian Parschauer
@ 2014-06-02 10:20     ` NeilBrown
  2014-06-02 11:12       ` Sebastian Parschauer
  2014-06-04 17:09       ` [RFC PATCH 0/4] md/mdadm: introduce request function mode support Sebastian Parschauer
  0 siblings, 2 replies; 14+ messages in thread
From: NeilBrown @ 2014-06-02 10:20 UTC (permalink / raw)
  To: Sebastian Parschauer; +Cc: Linux RAID, Florian-Ewald Müller

[-- Attachment #1: Type: text/plain, Size: 1743 bytes --]

On Mon, 02 Jun 2014 11:51:52 +0200 Sebastian Parschauer
<sebastian.riemer@profitbricks.com> wrote:


> > Having a scheduler for RAID0 doesn't make any sense to me.
> > RAID0 simply passes each request down to the appropriate underlying device.
> > That device then does its own scheduling.
> > 
> > Adding a scheduler may well make sense for RAID1 (the current "scheduler"
> > only does some read balancing and is rather simplistic) and for RAID4/5/6/10.
> > 
> > But not for RAID0 .... was that a typo?
> 
> Nope, we have our RAID-1+0. So it is more or less a RAID-10 and putting
> the scheduler into this RAID-0 layer makes sense for us.

I still cannot imagine how this would work.  RAID-0 has no decisions to
make, so there is nowhere for a scheduler to fit.
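To illustrate that point: the RAID-0 mapping is pure arithmetic. A sketch, assuming a uniform chunk size and ignoring md's multi-zone layout:

```c
#include <assert.h>

/* Map an array sector to (member device, sector on that device) for a
 * simple RAID-0 with uniform chunk size. There is no choice to make:
 * the target device is fully determined by the sector. */
struct r0_map {
	unsigned int dev;
	unsigned long long dev_sector;
};

static struct r0_map raid0_map(unsigned long long sector,
			       unsigned int chunk_sectors,
			       unsigned int nr_devs)
{
	unsigned long long chunk = sector / chunk_sectors;
	struct r0_map m;

	m.dev = chunk % nr_devs;
	m.dev_sector = (chunk / nr_devs) * chunk_sectors
		       + sector % chunk_sectors;
	return m;
}
```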

Just to clarify: is this md/raid0 over md/raid1 or md/raid0 over
hardware/raid1?


> > Could you do a graph?  I like graphs :-)
> > I can certainly seem something has changed here...
> 
> Sure, please find the graphs attached. I've converted the values into
> percentages so that the number of bios can be compared to the number
> of requests.

Thanks.

> > 
> > Show me the code and I might be able to provide a more detailed opinion.
> 
> I would say let the user decide whether an MD device should be equipped
> with a scheduler or not. We can port our code to latest kernel + latest
> mdadm and send you a patch set for testing. Just give me some time to do it.

In the first instance, I just want to get a concrete idea of what you have
done because what you have said doesn't make sense to me.  I'm happy to look
at code against a not-quite-current kernel to get that idea.  But I'm also
happy for it to be against the latest, whatever suits you.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]


* Re: [RFC] Process requests instead of bios to use a scheduler
  2014-06-02 10:20     ` NeilBrown
@ 2014-06-02 11:12       ` Sebastian Parschauer
  2014-06-04 17:09       ` [RFC PATCH 0/4] md/mdadm: introduce request function mode support Sebastian Parschauer
  1 sibling, 0 replies; 14+ messages in thread
From: Sebastian Parschauer @ 2014-06-02 11:12 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linux RAID, Florian-Ewald Müller

On 02.06.2014 12:20, NeilBrown wrote:
> On Mon, 02 Jun 2014 11:51:52 +0200 Sebastian Parschauer 
> <sebastian.riemer@profitbricks.com> wrote:
>> 
>> Nope, we have our RAID-1+0. So it is more or less a RAID-10 and
>> putting the scheduler into this RAID-0 layer makes sense for us.
> 
> I still cannot imagine how this would work.  RAID-0 has no
> decisions to make, so there is nowhere for a scheduler to fit.
> 
> Just to clarify: is this md/raid0 over md/raid1 or md/raid0 over 
> hardware/raid1?

We have both variants but tested the scheduler with servers without HW
RAID and only 4 HDDs. On the production servers there are 24 HDDs,
every 2 HDDs are in md/raid1 or HW RAID-1 and these 12 RAID-1 devices
form an md/raid0 device.

This is the only PV for LVM and LVs are the customer volumes exported
via SCST/SRP. These are used by virtual machines on other servers. So
the scheduler has to bring some fairness to the customer volumes so
that a streaming customer can't block a database customer completely
with his big sequential IOs as these would have priority. But our goal
with a scheduler is to reduce latency for everyone.
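With a request-based queue (as the rq_mode patches provide), the fairness policy can then be picked per array through the standard block-layer sysfs knob. A sketch, with a hypothetical device name:

```shell
# list the available schedulers; the active one is shown in brackets
cat /sys/block/md0/queue/scheduler
# switch this md device to the deadline scheduler
echo deadline > /sys/block/md0/queue/scheduler
```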

I hope this is clear now. Just tell me if not. ;-)


* [RFC PATCH 0/4] md/mdadm: introduce request function mode support
  2014-06-02 10:20     ` NeilBrown
  2014-06-02 11:12       ` Sebastian Parschauer
@ 2014-06-04 17:09       ` Sebastian Parschauer
  2014-06-04 17:09         ` [RFC PATCH 1/4] md: complete bio accounting and add io_latency extension Sebastian Parschauer
                           ` (4 more replies)
  1 sibling, 5 replies; 14+ messages in thread
From: Sebastian Parschauer @ 2014-06-04 17:09 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid


Patches apply to linux v3.15-rc8 and mdadm master commit 8a3544f895.

Florian-Ewald Mueller (3):
      md: complete bio accounting and add io_latency extension
      md: introduce request function mode support
      md: handle IO latency accounting in rqfn mode

Sebastian Parschauer (1):
      mdadm: introduce '--use-requestfn' create/assembly option


* [RFC PATCH 1/4] md: complete bio accounting and add io_latency extension
  2014-06-04 17:09       ` [RFC PATCH 0/4] md/mdadm: introduce request function mode support Sebastian Parschauer
@ 2014-06-04 17:09         ` Sebastian Parschauer
  2014-06-04 17:10         ` [RFC PATCH 2/4] md: introduce request function mode support Sebastian Parschauer
                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Sebastian Parschauer @ 2014-06-04 17:09 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, Florian-Ewald Mueller, Sebastian Parschauer

From: Florian-Ewald Mueller <florian-ewald.mueller@profitbricks.com>

The md layer only accounts for the number of I/Os and sectors per bio.
So account in-flight and ticks as well. Also maintain an I/O latency
statistic by counting I/Os in power-of-two latency buckets starting at
< 8 ms and ending at >= 65536 ms. Determine the maximum latency as
well. This I/O latency statistic can be read, and reset to 0, via the
md sysfs file 'io_latency'.
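Usage might look like this (array name hypothetical; the attribute lives alongside the other md sysfs files):

```shell
# dump the histogram: one line per bucket, reads then writes,
# followed by the per-direction maxima in ms
cat /sys/block/md0/md/io_latency
# any write resets the counters and maxima
echo 0 > /sys/block/md0/md/io_latency
```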

Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@profitbricks.com>
[spars: added a description, replaced gcc atomics with atomic64_t,
 merged commits, fixed checkpatch warnings]
Signed-off-by: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
---
 drivers/md/md.c |  175 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/md.h |   18 ++++++
 2 files changed, 193 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 237b7e0..8c653f9 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -54,6 +54,32 @@
 #include "md.h"
 #include "bitmap.h"
 
+#ifdef BIO_ACCOUNTING_EXTENSION
+
+#include <linux/ratelimit.h>
+
+struct md_bio_private {
+	void		(*orig_bio_endio)(struct bio *, int);
+	void		*orig_bio_private;
+	struct mddev	*mddev;
+	unsigned int	sectors;
+	unsigned long	ticks;
+};
+
+static struct kmem_cache *md_bio_private_cache __read_mostly;
+
+static DEFINE_RATELIMIT_STATE(md_ratelimit_state,
+			DEFAULT_RATELIMIT_INTERVAL,
+			DEFAULT_RATELIMIT_BURST);
+
+static inline int __must_check md_valid_ptr(const void *p)
+{
+	return !ZERO_OR_NULL_PTR(p) && !IS_ERR(p);
+}
+#define VALID_PTR(p)	md_valid_ptr(p)
+
+#endif	/* BIO_ACCOUNTING_EXTENSION */
+
 #ifndef MODULE
 static void autostart_arrays(int part);
 #endif
@@ -241,6 +267,64 @@ static DEFINE_SPINLOCK(all_mddevs_lock);
 		_tmp = _tmp->next;})					\
 		)
 
+#ifdef BIO_ACCOUNTING_EXTENSION
+
+static inline long atomic64_set_if_greater(atomic64_t *v, long val)
+{
+	long act, old;
+
+	old = atomic64_read(v);
+	for (;;) {
+		if (val <= old)
+			break;
+		act = atomic64_cmpxchg(v, old, val);
+		if (likely(act == old))
+			break;
+		old = act;
+	}
+	return old;
+}
+
+static void md_bio_endio(struct bio *bio, int err)
+{
+	struct md_bio_private *mbp = bio->bi_private;
+	struct mddev *mddev = mbp->mddev;
+	struct md_stats *sp = &mddev->stats;
+
+	unsigned int sectors = mbp->sectors;
+	int cpu, idx, rw = bio_data_dir(bio);
+	unsigned long ms, ticks;
+
+	BUILD_BUG_ON(ARRAY_SIZE(sp->latency_table[0]) != 2);
+	BUILD_BUG_ON(ARRAY_SIZE(sp->max_latency) != 2);
+
+	ticks = (long)jiffies - (long)mbp->ticks;
+
+	cpu = part_stat_lock();
+	part_stat_inc(cpu, &mddev->gendisk->part0, ios[rw]);
+	part_stat_add(cpu, &mddev->gendisk->part0, sectors[rw], sectors);
+	part_stat_add(cpu, &mddev->gendisk->part0, ticks[rw], ticks);
+	part_dec_in_flight(&mddev->gendisk->part0, rw);
+	part_round_stats(cpu, &mddev->gendisk->part0);
+	part_stat_unlock();
+
+	ms = jiffies_to_msecs(ticks);
+	if (likely(ticks > 0) && ms > 0) {
+		idx = ilog2(ms) - MD_LATENCY_LOGBASE + 1;
+		idx = clamp(idx, 0, (int)ARRAY_SIZE(sp->latency_table) - 1);
+	} else {
+		idx = 0;
+	}
+	atomic64_set_if_greater(&sp->max_latency[rw], ticks);
+	atomic64_inc(&sp->latency_table[idx][rw]);
+
+	bio->bi_private = mbp->orig_bio_private;
+	bio->bi_end_io = mbp->orig_bio_endio;
+	kmem_cache_free(md_bio_private_cache, mbp);
+	bio_endio_nodec(bio, err);  /* >= 3.14, bio_endio() otherwise */
+}
+
+#endif	/* BIO_ACCOUNTING_EXTENSION */
 
 /* Rather than calling directly into the personality make_request function,
  * IO requests come here first so that we can check if the device is
@@ -255,6 +339,9 @@ static void md_make_request(struct request_queue *q, struct bio *bio)
 	struct mddev *mddev = q->queuedata;
 	int cpu;
 	unsigned int sectors;
+#ifdef BIO_ACCOUNTING_EXTENSION
+	struct md_bio_private *mbp;
+#endif	/* BIO_ACCOUNTING_EXTENSION */
 
 	if (mddev == NULL || mddev->pers == NULL
 	    || !mddev->ready) {
@@ -288,12 +375,36 @@ static void md_make_request(struct request_queue *q, struct bio *bio)
 	 * go away inside make_request
 	 */
 	sectors = bio_sectors(bio);
+#ifdef BIO_ACCOUNTING_EXTENSION
+	mbp = kmem_cache_alloc(md_bio_private_cache, GFP_NOIO);
+	if (unlikely(!VALID_PTR(mbp))) {
+		if (__ratelimit(&md_ratelimit_state))
+			pr_warn("%s: [%s] kmem_cache_alloc failed\n",
+				__func__, mdname(mddev));
+		cpu = part_stat_lock();
+		part_stat_inc(cpu, &mddev->gendisk->part0, ios[rw]);
+		part_stat_add(cpu, &mddev->gendisk->part0, sectors[rw],
+			      sectors);
+		part_stat_unlock();
+	} else {
+		part_inc_in_flight(&mddev->gendisk->part0, rw);
+		mbp->orig_bio_private = bio->bi_private;
+		mbp->orig_bio_endio = bio->bi_end_io;
+		mbp->sectors = sectors;
+		mbp->ticks = jiffies;
+		mbp->mddev = mddev;
+		bio->bi_end_io = md_bio_endio;
+		bio->bi_private = mbp;
+	}
+#endif	/* BIO_ACCOUNTING_EXTENSION */
 	mddev->pers->make_request(mddev, bio);
 
+#ifndef BIO_ACCOUNTING_EXTENSION
 	cpu = part_stat_lock();
 	part_stat_inc(cpu, &mddev->gendisk->part0, ios[rw]);
 	part_stat_add(cpu, &mddev->gendisk->part0, sectors[rw], sectors);
 	part_stat_unlock();
+#endif	/* !BIO_ACCOUNTING_EXTENSION */
 
 	if (atomic_dec_and_test(&mddev->active_io) && mddev->suspended)
 		wake_up(&mddev->sb_wait);
@@ -4652,6 +4763,52 @@ static struct md_sysfs_entry md_array_size =
 __ATTR(array_size, S_IRUGO|S_IWUSR, array_size_show,
        array_size_store);
 
+#ifdef BIO_ACCOUNTING_EXTENSION
+
+static ssize_t
+md_io_latency_show(struct mddev *mddev, char *page)
+{
+	struct md_stats *sp = &mddev->stats;
+	ssize_t cnt;
+	int i;
+
+	for (cnt = i = 0; i < (ARRAY_SIZE(sp->latency_table) - 1); i++) {
+		cnt += scnprintf(page + cnt, PAGE_SIZE - cnt,
+			"<  %5d ms: %lu %lu\n",
+			(1 << (i + MD_LATENCY_LOGBASE)),
+			atomic64_read(&sp->latency_table[i][0]),
+			atomic64_read(&sp->latency_table[i][1]));
+	}
+	cnt += scnprintf(page + cnt, PAGE_SIZE - cnt, ">= %5d ms: %lu %lu\n",
+		(1 << ((i - 1) + MD_LATENCY_LOGBASE)),
+		atomic64_read(&sp->latency_table[i][0]),
+		atomic64_read(&sp->latency_table[i][1]));
+	cnt += scnprintf(page + cnt, PAGE_SIZE - cnt, " maximum ms: %u %u\n",
+		jiffies_to_msecs(atomic64_read(&sp->max_latency[0])),
+		jiffies_to_msecs(atomic64_read(&sp->max_latency[1])));
+	return cnt;
+}
+
+static ssize_t
+md_io_latency_store(struct mddev *mddev, const char *buf, size_t len)
+{
+	struct md_stats *sp = &mddev->stats;
+	int i, j;
+
+	for (i = 0; i < ARRAY_SIZE(sp->max_latency); i++)
+		atomic64_set(&sp->max_latency[i], 0);
+	for (i = 0; i < ARRAY_SIZE(sp->latency_table); i++) {
+		for (j = 0; j < ARRAY_SIZE(sp->latency_table[i]); j++)
+			atomic64_set(&sp->latency_table[i][j], 0);
+	}
+	return len;
+}
+
+static struct md_sysfs_entry md_io_latency =
+__ATTR(io_latency, S_IRUGO|S_IWUSR, md_io_latency_show, md_io_latency_store);
+
+#endif	/* BIO_ACCOUNTING_EXTENSION */
+
 static struct attribute *md_default_attrs[] = {
 	&md_level.attr,
 	&md_layout.attr,
@@ -4667,6 +4824,9 @@ static struct attribute *md_default_attrs[] = {
 	&md_reshape_direction.attr,
 	&md_array_size.attr,
 	&max_corr_read_errors.attr,
+#ifdef BIO_ACCOUNTING_EXTENSION
+	&md_io_latency.attr,
+#endif	/* BIO_ACCOUNTING_EXTENSION */
 	NULL,
 };
 
@@ -8551,6 +8711,14 @@ static int __init md_init(void)
 {
 	int ret = -ENOMEM;
 
+#ifdef BIO_ACCOUNTING_EXTENSION
+	md_bio_private_cache = KMEM_CACHE(md_bio_private, 0);
+	if (unlikely(!VALID_PTR(md_bio_private_cache))) {
+		pr_err("%s: KMEM_CACHE failed\n", __func__);
+		return -ENOMEM;
+	}
+#endif	/* BIO_ACCOUNTING_EXTENSION */
+
 	md_wq = alloc_workqueue("md", WQ_MEM_RECLAIM, 0);
 	if (!md_wq)
 		goto err_wq;
@@ -8687,6 +8855,13 @@ static __exit void md_exit(void)
 	}
 	destroy_workqueue(md_misc_wq);
 	destroy_workqueue(md_wq);
+
+#ifdef BIO_ACCOUNTING_EXTENSION
+	if (likely(VALID_PTR(md_bio_private_cache))) {
+		kmem_cache_destroy(md_bio_private_cache);
+		md_bio_private_cache = NULL;
+	}
+#endif	/* BIO_ACCOUNTING_EXTENSION */
 }
 
 subsys_initcall(md_init);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index a49d991..f0e9171 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -24,6 +24,10 @@
 #include <linux/wait.h>
 #include <linux/workqueue.h>
 
+#if 1
+#define BIO_ACCOUNTING_EXTENSION
+#endif
+
 #define MaxSector (~(sector_t)0)
 
 /* Bad block numbers are stored sorted in a single page.
@@ -202,6 +206,17 @@ extern int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 				int is_new);
 extern void md_ack_all_badblocks(struct badblocks *bb);
 
+#ifdef BIO_ACCOUNTING_EXTENSION
+
+#define MD_LATENCY_LOGBASE	3
+
+struct md_stats {
+	atomic64_t			latency_table[15][2];
+	atomic64_t			max_latency[2];
+};
+
+#endif /* BIO_ACCOUNTING_EXTENSION */
+
 struct mddev {
 	void				*private;
 	struct md_personality		*pers;
@@ -437,6 +452,9 @@ struct mddev {
 	struct work_struct flush_work;
 	struct work_struct event_work;	/* used by dm to report failure event */
 	void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev);
+#ifdef BIO_ACCOUNTING_EXTENSION
+	struct md_stats stats;
+#endif	/* BIO_ACCOUNTING_EXTENSION */
 };
 
 
-- 
1.7.9.5



* [RFC PATCH 2/4] md: introduce request function mode support
  2014-06-04 17:09       ` [RFC PATCH 0/4] md/mdadm: introduce request function mode support Sebastian Parschauer
  2014-06-04 17:09         ` [RFC PATCH 1/4] md: complete bio accounting and add io_latency extension Sebastian Parschauer
@ 2014-06-04 17:10         ` Sebastian Parschauer
  2014-06-04 17:10         ` [RFC PATCH 3/4] md: handle IO latency accounting in rqfn mode Sebastian Parschauer
                           ` (2 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Sebastian Parschauer @ 2014-06-04 17:10 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, Florian-Ewald Mueller, Sebastian Parschauer

From: Florian-Ewald Mueller <florian-ewald.mueller@profitbricks.com>

This introduces the writable module parameter 'rq_mode', which is
used to set the I/O mode for all subsequently created MD devices.
Set it to 0 for the default mode (the make request function mode)
in order to process I/O bio-by-bio or set it to 1 for the new
request function mode to process I/O request-by-request. Common
code is shared between both modes.

The advantage of the new mode is that a scheduler can be used and
the block layer takes care of I/O statistics.
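Under this design, switching the mode for subsequently created arrays might look like the following (the parameter path assumes md is built as md_mod; hypothetical):

```shell
# select request-function mode for arrays created from now on
echo 1 > /sys/module/md_mod/parameters/rq_mode
# back to the default bio-by-bio (make_request) mode
echo 0 > /sys/module/md_mod/parameters/rq_mode
```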

Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@profitbricks.com>
[spars: merged commits, changed description, fixed checkpatch warnings]
Signed-off-by: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
---
 drivers/md/md.c |  280 +++++++++++++++++++++++++++++++++++++++++++++++++------
 drivers/md/md.h |    7 ++
 2 files changed, 257 insertions(+), 30 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8c653f9..0e5c420 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -56,8 +56,6 @@
 
 #ifdef BIO_ACCOUNTING_EXTENSION
 
-#include <linux/ratelimit.h>
-
 struct md_bio_private {
 	void		(*orig_bio_endio)(struct bio *, int);
 	void		*orig_bio_private;
@@ -68,6 +66,30 @@ struct md_bio_private {
 
 static struct kmem_cache *md_bio_private_cache __read_mostly;
 
+#endif	/* BIO_ACCOUNTING_EXTENSION */
+
+#ifdef MD_REQUEST_FUNCTION
+
+struct md_request_clone {
+	struct work_struct work;
+	struct mddev	*mdp;
+	struct request	*req;
+	struct bio_list	bios;
+	atomic_t	cnt;
+	int		err;
+};
+
+#define MD_RQ_MODE_DEFAULT	0
+
+static unsigned int rq_mode __read_mostly = MD_RQ_MODE_DEFAULT;
+static struct kmem_cache *md_request_clone_cache __read_mostly;
+
+#endif	/* MD_REQUEST_FUNCTION */
+
+#if defined BIO_ACCOUNTING_EXTENSION || defined MD_REQUEST_FUNCTION
+
+#include <linux/ratelimit.h>
+
 static DEFINE_RATELIMIT_STATE(md_ratelimit_state,
 			DEFAULT_RATELIMIT_INTERVAL,
 			DEFAULT_RATELIMIT_BURST);
@@ -78,7 +100,7 @@ static inline int __must_check md_valid_ptr(const void *p)
 }
 #define VALID_PTR(p)	md_valid_ptr(p)
 
-#endif	/* BIO_ACCOUNTING_EXTENSION */
+#endif	/* BIO_ACCOUNTING_EXTENSION || MD_REQUEST_FUNCTION */
 
 #ifndef MODULE
 static void autostart_arrays(int part);
@@ -326,31 +348,17 @@ static void md_bio_endio(struct bio *bio, int err)
 
 #endif	/* BIO_ACCOUNTING_EXTENSION */
 
-/* Rather than calling directly into the personality make_request function,
- * IO requests come here first so that we can check if the device is
- * being suspended pending a reconfiguration.
- * We hold a refcount over the call to ->make_request.  By the time that
- * call has finished, the bio has been linked into some internal structure
- * and so is visible to ->quiesce(), so we don't need the refcount any more.
- */
-static void md_make_request(struct request_queue *q, struct bio *bio)
+static inline int md_make_request_head(struct mddev *mddev, struct bio *bio)
 {
 	const int rw = bio_data_dir(bio);
-	struct mddev *mddev = q->queuedata;
-	int cpu;
-	unsigned int sectors;
-#ifdef BIO_ACCOUNTING_EXTENSION
-	struct md_bio_private *mbp;
-#endif	/* BIO_ACCOUNTING_EXTENSION */
 
-	if (mddev == NULL || mddev->pers == NULL
-	    || !mddev->ready) {
+	if (mddev == NULL || mddev->pers == NULL || !mddev->ready) {
 		bio_io_error(bio);
-		return;
+		return 1;
 	}
 	if (mddev->ro == 1 && unlikely(rw == WRITE)) {
 		bio_endio(bio, bio_sectors(bio) == 0 ? 0 : -EROFS);
-		return;
+		return 1;
 	}
 	smp_rmb(); /* Ensure implications of  'active' are visible */
 	rcu_read_lock();
@@ -369,6 +377,39 @@ static void md_make_request(struct request_queue *q, struct bio *bio)
 	}
 	atomic_inc(&mddev->active_io);
 	rcu_read_unlock();
+	return 0;
+}
+
+static inline void md_make_request_body(struct mddev *mddev, struct bio *bio)
+{
+	mddev->pers->make_request(mddev, bio);
+}
+
+static inline void md_make_request_tail(struct mddev *mddev, struct bio *bio)
+{
+	if (atomic_dec_and_test(&mddev->active_io) && mddev->suspended)
+		wake_up(&mddev->sb_wait);
+}
+
+/* Rather than calling directly into the personality make_request function,
+ * IO requests come here first so that we can check if the device is
+ * being suspended pending a reconfiguration.
+ * We hold a refcount over the call to ->make_request.  By the time that
+ * call has finished, the bio has been linked into some internal structure
+ * and so is visible to ->quiesce(), so we don't need the refcount any more.
+ */
+static void md_make_request(struct request_queue *q, struct bio *bio)
+{
+	const int rw = bio_data_dir(bio);
+	struct mddev *mddev = q->queuedata;
+	int cpu;
+	unsigned int sectors;
+#ifdef BIO_ACCOUNTING_EXTENSION
+	struct md_bio_private *mbp;
+#endif	/* BIO_ACCOUNTING_EXTENSION */
+
+	if (unlikely(md_make_request_head(mddev, bio)))
+		return;
 
 	/*
 	 * save the sectors now since our bio can
@@ -397,7 +438,7 @@ static void md_make_request(struct request_queue *q, struct bio *bio)
 		bio->bi_private = mbp;
 	}
 #endif	/* BIO_ACCOUNTING_EXTENSION */
-	mddev->pers->make_request(mddev, bio);
+	md_make_request_body(mddev, bio);
 
 #ifndef BIO_ACCOUNTING_EXTENSION
 	cpu = part_stat_lock();
@@ -406,10 +447,131 @@ static void md_make_request(struct request_queue *q, struct bio *bio)
 	part_stat_unlock();
 #endif	/* !BIO_ACCOUNTING_EXTENSION */
 
-	if (atomic_dec_and_test(&mddev->active_io) && mddev->suspended)
-		wake_up(&mddev->sb_wait);
+	md_make_request_tail(mddev, bio);
+}
+
+#ifdef MD_REQUEST_FUNCTION
+
+static inline void md_make_request_bio(struct mddev *mddev, struct bio *bio)
+{
+	if (unlikely(md_make_request_head(mddev, bio)))
+		return;
+	md_make_request_body(mddev, bio);
+	md_make_request_tail(mddev, bio);
+}
+
+static inline void md_request_clone_release(struct md_request_clone *rcl)
+{
+	if (atomic_dec_and_test(&rcl->cnt)) {
+		blk_end_request_all(rcl->req, rcl->err);
+		kmem_cache_free(md_request_clone_cache, rcl);
+	}
+}
+
+static void md_request_bio_endio(struct bio *bio, int err)
+{
+	struct md_request_clone *rcl = bio->bi_private;
+
+	if (unlikely(err < 0))
+		rcl->err = err;
+
+	bio_put(bio);
+	md_request_clone_release(rcl);
+}
+
+static void md_request_clone_worker(struct work_struct *wkp)
+{
+	struct md_request_clone *rcl =
+		container_of(wkp, struct md_request_clone, work);
+	struct bio_list *blp = &rcl->bios;
+	struct mddev *mddev = rcl->mdp;
+	struct bio *bio;
+
+	bio = bio_list_pop(blp);
+	while (VALID_PTR(bio)) {
+		md_make_request_bio(mddev, bio);
+		bio = bio_list_pop(blp);
+	}
+	md_request_clone_release(rcl);
 }
 
+static inline int md_process_request(struct mddev *mddev, struct request *req)
+{
+	struct md_request_clone *rcl;
+
+	struct bio *bio, *clone;
+	int error;
+
+	rcl = kmem_cache_alloc(md_request_clone_cache, GFP_NOIO);
+	if (unlikely(!VALID_PTR(rcl))) {
+		if (__ratelimit(&md_ratelimit_state))
+			pr_warn("%s: [%s] kmem_cache_alloc failed\n",
+				__func__, mdname(mddev));
+		return -ENOMEM;
+	}
+	rcl->err = 0;
+	rcl->req = req;
+	rcl->mdp = mddev;
+	atomic_set(&rcl->cnt, 1);
+	bio_list_init(&rcl->bios);
+	bio = req->bio;
+	while (VALID_PTR(bio)) {
+		clone = bio_clone(bio, GFP_NOWAIT);
+		if (unlikely(!VALID_PTR(clone))) {
+			if (__ratelimit(&md_ratelimit_state))
+				pr_warn("%s: [%s] bio_clone failed\n",
+					__func__, mdname(mddev));
+			error = -ENOMEM;
+			goto error_out;
+		}
+		clone->bi_private = rcl;
+		clone->bi_end_io = md_request_bio_endio;
+		bio_list_add(&rcl->bios, clone);
+		atomic_inc(&rcl->cnt);
+		bio = bio->bi_next;
+	}
+	INIT_WORK(&rcl->work, md_request_clone_worker);
+	queue_work(mddev->request_wq, &rcl->work);
+	return 0;
+error_out:
+	bio = bio_list_pop(&rcl->bios);
+	while (VALID_PTR(bio)) {
+		bio_put(bio);
+		bio = bio_list_pop(&rcl->bios);
+	}
+	kmem_cache_free(md_request_clone_cache, rcl);
+	return error;
+}
+
+#ifndef blk_fs_request
+#define blk_fs_request(p)	((p)->cmd_type == REQ_TYPE_FS)
+#endif	/* !blk_fs_request */
+
+static void md_request_function(struct request_queue *rqp)
+{
+	struct mddev *mddev = rqp->queuedata;
+
+	struct request *req;
+	int rc;
+
+	while ((req = blk_fetch_request(rqp)) != NULL) {
+		if (unlikely(!blk_fs_request(req))) {
+			if (__ratelimit(&md_ratelimit_state))
+				pr_warn("%s: [%s] non-fs request\n",
+					__func__, mdname(mddev));
+			__blk_end_request_all(req, -ENOTSUPP);
+			continue;
+		}
+		spin_unlock_irq(rqp->queue_lock);
+		rc = md_process_request(mddev, req);
+		spin_lock_irq(rqp->queue_lock);
+		if (unlikely(rc < 0))
+			__blk_end_request_all(req, rc);
+	}
+}
+
+#endif	/* MD_REQUEST_FUNCTION */
+
 /* mddev_suspend makes sure no new requests are submitted
  * to the device, and that any requests that have been submitted
  * are completely handled.
@@ -567,8 +729,15 @@ static void mddev_put(struct mddev *mddev)
 			 */
 			INIT_WORK(&mddev->del_work, mddev_delayed_delete);
 			queue_work(md_misc_wq, &mddev->del_work);
-		} else
+		} else {
+#ifdef MD_REQUEST_FUNCTION
+			if (likely(VALID_PTR(mddev->request_wq))) {
+				destroy_workqueue(mddev->request_wq);
+				mddev->request_wq = NULL;
+			}
+#endif	/* MD_REQUEST_FUNCTION */
 			kfree(mddev);
+		}
 	}
 	spin_unlock(&all_mddevs_lock);
 	if (bs)
@@ -4923,6 +5092,13 @@ static void md_free(struct kobject *ko)
 	if (mddev->queue)
 		blk_cleanup_queue(mddev->queue);
 
+#ifdef MD_REQUEST_FUNCTION
+	if (likely(VALID_PTR(mddev->request_wq))) {
+		destroy_workqueue(mddev->request_wq);
+		mddev->request_wq = NULL;
+	}
+#endif	/* MD_REQUEST_FUNCTION */
+
 	kfree(mddev);
 }
 
@@ -4990,12 +5166,32 @@ static int md_alloc(dev_t dev, char *name)
 	}
 
 	error = -ENOMEM;
-	mddev->queue = blk_alloc_queue(GFP_KERNEL);
-	if (!mddev->queue)
-		goto abort;
+#ifdef MD_REQUEST_FUNCTION
+	if (!rq_mode) {
+#endif	/* MD_REQUEST_FUNCTION */
+		mddev->queue = blk_alloc_queue(GFP_KERNEL);
+		if (!mddev->queue)
+			goto abort;
+		blk_queue_make_request(mddev->queue, md_make_request);
+#ifdef MD_REQUEST_FUNCTION
+	} else {
+		mddev->request_wq =
+			create_singlethread_workqueue(mdname(mddev));
+		if (unlikely(!VALID_PTR(mddev->request_wq))) {
+			pr_warn("%s: create_singlethread_workqueue (%s) "
+				"failed\n", __func__, mdname(mddev));
+			goto abort;
+		}
+		mddev->queue = blk_init_queue(md_request_function, NULL);
+		if (!mddev->queue) {
+			destroy_workqueue(mddev->request_wq);
+			mddev->request_wq = NULL;
+			goto abort;
+		}
+	}
+#endif	/* MD_REQUEST_FUNCTION */
 	mddev->queue->queuedata = mddev;
 
-	blk_queue_make_request(mddev->queue, md_make_request);
 	blk_set_stacking_limits(&mddev->queue->limits);
 
 	disk = alloc_disk(1 << shift);
@@ -8714,11 +8910,23 @@ static int __init md_init(void)
 #ifdef BIO_ACCOUNTING_EXTENSION
 	md_bio_private_cache = KMEM_CACHE(md_bio_private, 0);
 	if (unlikely(!VALID_PTR(md_bio_private_cache))) {
-		pr_err("%s: KMEM_CACHE failed\n", __func__);
+		pr_err("%s: KMEM_CACHE (bio_priv) failed\n", __func__);
 		return -ENOMEM;
 	}
 #endif	/* BIO_ACCOUNTING_EXTENSION */
 
+#ifdef MD_REQUEST_FUNCTION
+	md_request_clone_cache = KMEM_CACHE(md_request_clone, 0);
+	if (unlikely(!VALID_PTR(md_request_clone_cache))) {
+		pr_err("%s: KMEM_CACHE (req_clone) failed\n", __func__);
+#ifdef BIO_ACCOUNTING_EXTENSION
+		kmem_cache_destroy(md_bio_private_cache);
+		md_bio_private_cache = NULL;
+#endif	/* BIO_ACCOUNTING_EXTENSION */
+		return -ENOMEM;
+	}
+#endif	/* MD_REQUEST_FUNCTION */
+
 	md_wq = alloc_workqueue("md", WQ_MEM_RECLAIM, 0);
 	if (!md_wq)
 		goto err_wq;
@@ -8856,6 +9064,13 @@ static __exit void md_exit(void)
 	destroy_workqueue(md_misc_wq);
 	destroy_workqueue(md_wq);
 
+#ifdef MD_REQUEST_FUNCTION
+	if (likely(VALID_PTR(md_request_clone_cache))) {
+		kmem_cache_destroy(md_request_clone_cache);
+		md_request_clone_cache = NULL;
+	}
+#endif	/* MD_REQUEST_FUNCTION */
+
 #ifdef BIO_ACCOUNTING_EXTENSION
 	if (likely(VALID_PTR(md_bio_private_cache))) {
 		kmem_cache_destroy(md_bio_private_cache);
@@ -8887,6 +9102,11 @@ module_param(start_dirty_degraded, int, S_IRUGO|S_IWUSR);
 
 module_param_call(new_array, add_named_array, NULL, NULL, S_IWUSR);
 
+#ifdef MD_REQUEST_FUNCTION
+module_param(rq_mode, int, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(rq_mode, " this module's io input mode (default: 0 [make request mode])");
+#endif	/* MD_REQUEST_FUNCTION */
+
 EXPORT_SYMBOL(register_md_personality);
 EXPORT_SYMBOL(unregister_md_personality);
 EXPORT_SYMBOL(md_error);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index f0e9171..8d639e0 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -25,6 +25,10 @@
 #include <linux/workqueue.h>
 
 #if 1
+#define MD_REQUEST_FUNCTION
+#endif
+
+#if 1
 #define BIO_ACCOUNTING_EXTENSION
 #endif
 
@@ -455,6 +459,9 @@ struct mddev {
 #ifdef BIO_ACCOUNTING_EXTENSION
 	struct md_stats stats;
 #endif	/* BIO_ACCOUNTING_EXTENSION */
+#ifdef MD_REQUEST_FUNCTION
+	struct workqueue_struct *request_wq;
+#endif	/* MD_REQUEST_FUNCTION */
 };
 
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH 3/4] md: handle IO latency accounting in rqfn mode
  2014-06-04 17:09       ` [RFC PATCH 0/4] md/mdadm: introduce request function mode support Sebastian Parschauer
  2014-06-04 17:09         ` [RFC PATCH 1/4] md: complete bio accounting and add io_latency extension Sebastian Parschauer
  2014-06-04 17:10         ` [RFC PATCH 2/4] md: introduce request function mode support Sebastian Parschauer
@ 2014-06-04 17:10         ` Sebastian Parschauer
  2014-06-04 17:10         ` [RFC PATCH 4/4] mdadm: introduce '--use-requestfn' create/assembly option Sebastian Parschauer
  2014-06-17 13:20         ` [RFC PATCH 0/4] md/mdadm: introduce request function mode support Sebastian Parschauer
  4 siblings, 0 replies; 14+ messages in thread
From: Sebastian Parschauer @ 2014-06-04 17:10 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, Florian-Ewald Mueller, Sebastian Parschauer

From: Florian-Ewald Mueller <florian-ewald.mueller@profitbricks.com>

IO latency accounting should also be done in request function mode. In
contrast to make request mode, it is done per request rather than per
bio.

Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@profitbricks.com>
[spars: changed description, merged commits, fixed checkpatch warnings]
Signed-off-by: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
---
 drivers/md/md.c |   48 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0e5c420..5858af0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -75,6 +75,9 @@ struct md_request_clone {
 	struct mddev	*mdp;
 	struct request	*req;
 	struct bio_list	bios;
+#ifdef BIO_ACCOUNTING_EXTENSION
+	unsigned long	ticks;
+#endif	/* BIO_ACCOUNTING_EXTENSION */
 	atomic_t	cnt;
 	int		err;
 };
@@ -307,18 +310,34 @@ static inline long atomic64_set_if_greater(atomic64_t *v, long val)
 	return old;
 }
 
+static inline void md_account_io_latency(struct mddev *mddev,
+					 unsigned long duration, int rw)
+{
+	struct md_stats *sp = &mddev->stats;
+	unsigned long msecs = jiffies_to_msecs(duration);
+	int idx;
+
+	BUILD_BUG_ON(ARRAY_SIZE(sp->latency_table[0]) != 2);
+	BUILD_BUG_ON(ARRAY_SIZE(sp->max_latency) != 2);
+
+	if (likely(duration > 0) && msecs > 0) {
+		idx = ilog2(msecs) - MD_LATENCY_LOGBASE + 1;
+		idx = clamp(idx, 0, (int)ARRAY_SIZE(sp->latency_table) - 1);
+	} else {
+		idx = 0;
+	}
+	atomic64_set_if_greater(&sp->max_latency[rw], duration);
+	atomic64_inc(&sp->latency_table[idx][rw]);
+}
+
 static void md_bio_endio(struct bio *bio, int err)
 {
 	struct md_bio_private *mbp = bio->bi_private;
 	struct mddev *mddev = mbp->mddev;
-	struct md_stats *sp = &mddev->stats;
 
 	unsigned int sectors = mbp->sectors;
-	int cpu, idx, rw = bio_data_dir(bio);
-	unsigned long ms, ticks;
-
-	BUILD_BUG_ON(ARRAY_SIZE(sp->latency_table[0]) != 2);
-	BUILD_BUG_ON(ARRAY_SIZE(sp->max_latency) != 2);
+	int cpu, rw = bio_data_dir(bio);
+	unsigned long ticks;
 
 	ticks = (long)jiffies - (long)mbp->ticks;
 
@@ -330,15 +349,7 @@ static void md_bio_endio(struct bio *bio, int err)
 	part_round_stats(cpu, &mddev->gendisk->part0);
 	part_stat_unlock();
 
-	ms = jiffies_to_msecs(ticks);
-	if (likely(ticks > 0) && ms > 0) {
-		idx = ilog2(ms) - MD_LATENCY_LOGBASE + 1;
-		idx = clamp(idx, 0, (int)ARRAY_SIZE(sp->latency_table) - 1);
-	} else {
-		idx = 0;
-	}
-	atomic64_set_if_greater(&sp->max_latency[rw], ticks);
-	atomic64_inc(&sp->latency_table[idx][rw]);
+	md_account_io_latency(mddev, ticks, rw);
 
 	bio->bi_private = mbp->orig_bio_private;
 	bio->bi_end_io = mbp->orig_bio_endio;
@@ -463,6 +474,10 @@ static inline void md_make_request_bio(struct mddev *mddev, struct bio *bio)
 static inline void md_request_clone_release(struct md_request_clone *rcl)
 {
 	if (atomic_dec_and_test(&rcl->cnt)) {
+#ifdef BIO_ACCOUNTING_EXTENSION
+		unsigned long diff = (long)jiffies - (long)rcl->ticks;
+		md_account_io_latency(rcl->mdp, diff, rq_data_dir(rcl->req));
+#endif	/* BIO_ACCOUNTING_EXTENSION */
 		blk_end_request_all(rcl->req, rcl->err);
 		kmem_cache_free(md_request_clone_cache, rcl);
 	}
@@ -512,6 +527,9 @@ static inline int md_process_request(struct mddev *mddev, struct request *req)
 	rcl->err = 0;
 	rcl->req = req;
 	rcl->mdp = mddev;
+#ifdef BIO_ACCOUNTING_EXTENSION
+	rcl->ticks = jiffies;
+#endif	/* BIO_ACCOUNTING_EXTENSION */
 	atomic_set(&rcl->cnt, 1);
 	bio_list_init(&rcl->bios);
 	bio = req->bio;
-- 
1.7.9.5



* [RFC PATCH 4/4] mdadm: introduce '--use-requestfn' create/assembly option
  2014-06-04 17:09       ` [RFC PATCH 0/4] md/mdadm: introduce request function mode support Sebastian Parschauer
                           ` (2 preceding siblings ...)
  2014-06-04 17:10         ` [RFC PATCH 3/4] md: handle IO latency accounting in rqfn mode Sebastian Parschauer
@ 2014-06-04 17:10         ` Sebastian Parschauer
  2014-06-17 13:20         ` [RFC PATCH 0/4] md/mdadm: introduce request function mode support Sebastian Parschauer
  4 siblings, 0 replies; 14+ messages in thread
From: Sebastian Parschauer @ 2014-06-04 17:10 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, Sebastian Parschauer

With this option it is possible to tell the kernel, via the writable
module parameter 'rq_mode', to use the request function mode
(request-by-request) instead of the default make request function
mode (bio-by-bio). The advantage is that a scheduler can be used and
the block layer takes care of the statistics.

Signed-off-by: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
---
 Assemble.c |    4 ++++
 Create.c   |    4 ++++
 ReadMe.c   |    3 +++
 mdadm.c    |    5 +++++
 mdadm.h    |    3 +++
 util.c     |   34 ++++++++++++++++++++++++++++++++++
 6 files changed, 53 insertions(+)

diff --git a/Assemble.c b/Assemble.c
index a57d384..d35d218 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -1275,6 +1275,10 @@ int Assemble(struct supertype *st, char *mddev,
 		       mddev ? mddev : "further assembly");
 		return 1;
 	}
+	if (set_rq_mode(c->use_requestfn) != 0) {
+		pr_err("Cannot set the request function mode - cannot assemble.\n");
+		return 1;
+	}
 
 	if (devlist == NULL)
 		devlist = conf_get_devs();
diff --git a/Create.c b/Create.c
index 330c5b4..dc576f1 100644
--- a/Create.c
+++ b/Create.c
@@ -138,6 +138,10 @@ int Create(struct supertype *st, char *mddev,
 		pr_err("This level does not support spare devices\n");
 		return 1;
 	}
+	if (set_rq_mode(c->use_requestfn) != 0) {
+		pr_err("Cannot set the request function mode.\n");
+		return 1;
+	}
 
 	if (subdevs == 1 && strcmp(devlist->devname, "missing") != 0) {
 		/* If given a single device, it might be a container, and we can
diff --git a/ReadMe.c b/ReadMe.c
index bd8c85e..ecb829f 100644
--- a/ReadMe.c
+++ b/ReadMe.c
@@ -150,6 +150,7 @@ struct option long_options[] = {
     {"force",	  0, 0, Force},
     {"update",	  1, 0, 'U'},
     {"freeze-reshape", 0, 0, FreezeReshape},
+    {"use-requestfn", 0, 0, UseRequestFn},
 
     /* Management */
     {"add",       0, 0, Add},
@@ -372,6 +373,7 @@ char Help_create[] =
 "  --name=       -N   : Textual name for array - max 32 characters\n"
 "  --bitmap-chunk=    : bitmap chunksize in Kilobytes.\n"
 "  --delay=      -d   : bitmap update delay in seconds.\n"
+"  --use-requestfn    : Tell the kernel to process requests instead of bios.\n"
 "\n"
 ;
 
@@ -456,6 +458,7 @@ char Help_assemble[] =
 "  --update=     -U   : Update superblock: try '-A --update=?' for option list.\n"
 "  --no-degraded      : Assemble but do not start degraded arrays.\n"
 "  --readonly    -o   : Mark the array as read-only. No resync will start.\n"
+"  --use-requestfn    : Tell the kernel to process requests instead of bios.\n"
 ;
 
 char Help_manage[] =
diff --git a/mdadm.c b/mdadm.c
index be990b8..35ff339 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -652,6 +652,11 @@ int main(int argc, char *argv[])
 		case O(INCREMENTAL, FreezeReshape):
 			c.freeze_reshape = 1;
 			continue;
+		case O(CREATE, UseRequestFn):
+		case O(ASSEMBLE, UseRequestFn):
+		case O(INCREMENTAL, UseRequestFn):
+			c.use_requestfn = 1;
+			continue;
 		case O(CREATE,'u'): /* uuid of array */
 		case O(ASSEMBLE,'u'): /* uuid of array */
 			if (ident.uuid_set) {
diff --git a/mdadm.h b/mdadm.h
index c0726bf..bb9f043 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -343,6 +343,7 @@ enum special_options {
 	Dump,
 	Restore,
 	Action,
+	UseRequestFn,
 };
 
 enum prefix_standard {
@@ -417,6 +418,7 @@ struct context {
 	char	*backup_file;
 	int	invalid_backup;
 	char	*action;
+	int	use_requestfn;
 };
 
 struct shape {
@@ -1355,6 +1357,7 @@ extern int remove_disk(int mdfd, struct supertype *st,
 		       struct mdinfo *sra, struct mdinfo *info);
 extern int set_array_info(int mdfd, struct supertype *st, struct mdinfo *info);
 unsigned long long min_recovery_start(struct mdinfo *array);
+extern int set_rq_mode(int use_requestfn);
 
 extern char *human_size(long long bytes);
 extern char *human_size_brief(long long bytes, int prefix);
diff --git a/util.c b/util.c
index 7937eb6..b9c6cc3 100644
--- a/util.c
+++ b/util.c
@@ -1968,3 +1968,37 @@ void reopen_mddev(int mdfd)
 	if (fd >= 0 && fd != mdfd)
 		dup2(fd, mdfd);
 }
+
+/*
+ * 0: make request fn mode (bio-by-bio)
+ * 1: request fn mode (request-by-request)
+ */
+#define BIO_MODE "0\n"
+#define REQ_MODE "1\n"
+
+int set_rq_mode(int use_requestfn)
+{
+	int fd, rv = 0;
+	ssize_t size, wbytes;
+	char *mode;
+
+	fd = open("/sys/module/md_mod/parameters/rq_mode", O_WRONLY);
+	if (fd < 0) {
+		if (errno != ENOENT || use_requestfn)
+			rv = 1;
+		goto out;
+	}
+	if (use_requestfn) {
+		mode = REQ_MODE;
+		size = sizeof(REQ_MODE);
+	} else {
+		mode = BIO_MODE;
+		size = sizeof(BIO_MODE);
+	}
+	wbytes = write(fd, mode, size);
+	if (wbytes != size)
+		rv = 1;
+	close(fd);
+out:
+	return rv;
+}
-- 
1.7.9.5



* Re: [RFC PATCH 0/4] md/mdadm: introduce request function mode support
  2014-06-04 17:09       ` [RFC PATCH 0/4] md/mdadm: introduce request function mode support Sebastian Parschauer
                           ` (3 preceding siblings ...)
  2014-06-04 17:10         ` [RFC PATCH 4/4] mdadm: introduce '--use-requestfn' create/assembly option Sebastian Parschauer
@ 2014-06-17 13:20         ` Sebastian Parschauer
       [not found]           ` <CAH3kUhEK26+4KryoReosMt654-vcrkkgkxaW5tKkFRDBqgX82w@mail.gmail.com>
  2014-06-24  7:09           ` NeilBrown
  4 siblings, 2 replies; 14+ messages in thread
From: Sebastian Parschauer @ 2014-06-17 13:20 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, Sebastian Parschauer

Hi Neil,

have you already had the time to look at these patches?

We already have this in production with the cfq scheduler on two storage
servers and customers have noticed a huge performance boost. Our
monitoring also shows higher and smoother IOPS numbers.

Cheers,
Sebastian


On 04.06.2014 19:09, Sebastian Parschauer wrote:
> Patches apply to linux v3.15-rc8 and mdadm master commit 8a3544f895.
> 
> Florian-Ewald Mueller (3):
>       md: complete bio accounting and add io_latency extension
>       md: introduce request function mode support
>       md: handle IO latency accounting in rqfn mode
> 
> Sebastian Parschauer (1):
>       mdadm: introduce '--use-requestfn' create/assembly option
> 



* Re: [RFC PATCH 0/4] md/mdadm: introduce request function mode support
       [not found]             ` <53A14513.20902@profitbricks.com>
@ 2014-06-18 13:57               ` Roberto Spadim
  2014-06-18 14:43                 ` Sebastian Parschauer
  0 siblings, 1 reply; 14+ messages in thread
From: Roberto Spadim @ 2014-06-18 13:57 UTC (permalink / raw)
  To: Sebastian Parschauer; +Cc: Neil Brown, Linux-RAID

Just a comment about the read balance:
I'm talking about a /sys/block/mdX/queue/read_balance file.
Today we have 'near head'; we could do some read balance algorithms
like FreeBSD does, and this is what I'm thinking about:
cat  /sys/block/mdX/queue/read_balance:
[nearhead] roundrobin timebased stripe

------------
NEARHEAD:
  today's read balance: each write/read marks the position of the disk
head. Sequential reads are done by the same disk; non-sequential reads
select the disk with the min(current position - read position) value.
For debugging, we could implement some sysfs files:

cat /sys/block/mdX/queue/nearhead_info:

/dev/sda1 (device 1) - current position: xxxx
/dev/sda2 (device 2) - current position: xxxx
/dev/sda3 (device 3) - current position: xxxx
...


------------
ROUNDROBIN:
  select the disk based on reads per disk, the current disk and the
current disk's read count. Here are some configuration files:

cat /sys/block/mdX/queue/roundrobin_info

/dev/sda1 (device 1) - reads count: xxxxx, max reads: yyyyy, current disk
/dev/sda2 (device 2) - max reads: yyyyy
/dev/sda3 (device 3) - max reads: yyyyy

MAX READ VARIABLE:
cat /sys/block/mdX/queue/roundrobin_maxreads_dev1
yyyyy
echo 1234 > /sys/block/mdX/queue/roundrobin_maxreads_dev1
cat /sys/block/mdX/queue/roundrobin_maxreads_dev1
1234

READ COUNT VARIABLE:
cat /sys/block/mdX/queue/roundrobin_readcount_dev1
xxxxx
echo 1234 > /sys/block/mdX/queue/roundrobin_readcount_dev1
cat /sys/block/mdX/queue/roundrobin_readcount_dev1
1234

CURRENT DISK VARIABLE
cat /sys/block/mdX/queue/roundrobin_currentdevice
xxxxx
echo 1 > /sys/block/mdX/queue/roundrobin_currentdevice
cat /sys/block/mdX/queue/roundrobin_readcount_dev1
1


----------------
STRIPE:
   it's something like raid0: each disk reads one part of the array

/sys/block/mdX/queue/stripe_array_shift
   this one selects how many bytes/sectors go to each disk; for example,
0-100 disk 1, 101-200 disk 2, 201-300 disk 3, 301-400 disk 1, 401-500
disk 2, etc. It's just a number of sectors/bytes per disk.


---------------
TIME BASED:
 this one is more specific per disk and lets us mix SSD and HDD. It's
just a standard model and can change, but it gives a 1% speedup with
SSD+HDD arrays.

the expected time to read is:
  (read_rate_sequencial * read_size) +
  (head_distance_rate * head_distance) +
  fixed_access_time_non_sequencial +
  fixed_access_time_sequencial +
  queue_expected_time

a example for hd:
read_rate_sequencial = 180mb/s  (must invert since we need s/mb)
head_distance_rate = 10ms/total_disk_size
fixed_access_time_non_sequencial = ~10ms (1 disk rotation; this can be
computed from the disk rpm => 7200rpm = 120hz, 1/120 = 0.008333 seconds)
fixed_access_time_sequencial = 0
queue_expected_time = (must check queue if we could get this information)

for ssd:
read_rate_sequencial = 270mb/s
head_distance_rate = 0
fixed_access_time_sequencial = 0.1ms (OCZ Vertex 2)
fixed_access_time_non_sequencial = 0.1ms (OCZ Vertex 2)


examples with 20MB, considering disk at current position:
hd:
  (0.0055555 * 20) +    (180mb/s = 0.00555s/mb)
  (0.000009765625 * 0) +    (considering 1tb disk => 10ms/1024gb,
10ms/1024000mb = 0.000009765625mb/ms)
  0 +
  0
  =0.11111 second, (just mb/s matters here)


ssd:
  (0.0037037037037037 * 20) +  (270mb/s = 0.0037037037037037s/mb)
  (0 * 0) +  (0ms / total ssd size = 0)
  0.0001  (0.1ms)
  = 0.0740 second (mb/s + access time of 0.0001 second)


For small reads, this model selects the HDD when the read is near the
head position; for bigger reads it selects the SSD. If we could
consider the queues of the SSD and HDD, we would have a better read
time prediction. It does a nice job (1% speedup) but has many
parameters per disk.

-------
there's an old implementation at raid1.c here:
http://www.spadim.com.br/raid1/raid1.c


* Re: [RFC PATCH 0/4] md/mdadm: introduce request function mode support
  2014-06-18 13:57               ` Roberto Spadim
@ 2014-06-18 14:43                 ` Sebastian Parschauer
  0 siblings, 0 replies; 14+ messages in thread
From: Sebastian Parschauer @ 2014-06-18 14:43 UTC (permalink / raw)
  To: Roberto Spadim; +Cc: Neil Brown, Linux-RAID

Sounds good, but it is also completely unrelated. You should really
start a new thread for this instead of using the request function mode
support topic.

On 18.06.2014 15:57, Roberto Spadim wrote:
> Just a comment about the read balance:
> [...]


* Re: [RFC PATCH 0/4] md/mdadm: introduce request function mode support
  2014-06-17 13:20         ` [RFC PATCH 0/4] md/mdadm: introduce request function mode support Sebastian Parschauer
       [not found]           ` <CAH3kUhEK26+4KryoReosMt654-vcrkkgkxaW5tKkFRDBqgX82w@mail.gmail.com>
@ 2014-06-24  7:09           ` NeilBrown
  1 sibling, 0 replies; 14+ messages in thread
From: NeilBrown @ 2014-06-24  7:09 UTC (permalink / raw)
  To: Sebastian Parschauer; +Cc: linux-raid

On Tue, 17 Jun 2014 15:20:53 +0200 Sebastian Parschauer
<sebastian.riemer@profitbricks.com> wrote:

> Hi Neil,
> 
> have you already had the time to look at these patches?

I have now looked through them at last.  Sorry for the delay - there was a
week of leave in there and various other distractions.

The improved accounting is probably a good idea.  The extra malloc for each
request is a bit unpleasant.
For raid1 and raid10 we could use the r1_bio they already allocate to store
the extra data.  For raid5/6 we could possibly do something similar though
the results might be a little less precise.

For raid0 and linear I'd rather not intervene at all.  The underlying devices
will report accurate statistics which can be combined to get correct
statistics for the overall devices.

The request function support is a bit of a mix.  In some ways it is quite
nice and localised.  However I'm not thrilled with the extra work queue, and
I don't really understand why md_process_request() allocates a new
md_request_clone and clones all the bios... I wonder if that is only needed
for RAID0, not for higher levels.


I still don't think any of this should  be applied to RAID0.  I asked about
that before and you didn't really provide an explanation.

RAID0 simply maps the bio to a different device and offset and sends it down.
So the queue management on the underlying device should be able to provide
all the fairness etc that you need.

So if you have RAID0 over RAID1, then I can understand the desire to use
requestfn rather than make_requestfn on the RAID1, but not on the RAID0.

Have you tried that configuration?  I.e. the RAID0 made without
--use-requestfn and the RAID1s made with it?  Does that work as well
as having both arrays use requestfn?

The graphs you posted subsequently certainly look impressive, but I really
want to make sure I understand what is really needed.

Thanks,
NeilBrown




> 
> We have this in production with cfq scheduler already on two storage
> servers and customers noticed a huge performance boost. Also our
> monitoring shows higher and smoother iop numbers.
> 
> Cheers,
> Sebastian
> 
> 
> On 04.06.2014 19:09, Sebastian Parschauer wrote:
> > Patches apply to linux v3.15-rc8 and mdadm master commit 8a3544f895.
> > 
> > Florian-Ewald Mueller (3):
> >       md: complete bio accounting and add io_latency extension
> >       md: introduce request function mode support
> >       md: handle IO latency accounting in rqfn mode
> > 
> > Sebastian Parschauer (1):
> >       mdadm: introduce '--use-requestfn' create/assembly option
> > 




end of thread, other threads:[~2014-06-24  7:09 UTC | newest]

Thread overview: 14+ messages
2014-05-28 13:04 [RFC] Process requests instead of bios to use a scheduler Sebastian Parschauer
2014-06-01 23:32 ` NeilBrown
2014-06-02  9:51   ` Sebastian Parschauer
2014-06-02 10:20     ` NeilBrown
2014-06-02 11:12       ` Sebastian Parschauer
2014-06-04 17:09       ` [RFC PATCH 0/4] md/mdadm: introduce request function mode support Sebastian Parschauer
2014-06-04 17:09         ` [RFC PATCH 1/4] md: complete bio accounting and add io_latency extension Sebastian Parschauer
2014-06-04 17:10         ` [RFC PATCH 2/4] md: introduce request function mode support Sebastian Parschauer
2014-06-04 17:10         ` [RFC PATCH 3/4] md: handle IO latency accounting in rqfn mode Sebastian Parschauer
2014-06-04 17:10         ` [RFC PATCH 4/4] mdadm: introduce '--use-requestfn' create/assembly option Sebastian Parschauer
2014-06-17 13:20         ` [RFC PATCH 0/4] md/mdadm: introduce request function mode support Sebastian Parschauer
     [not found]           ` <CAH3kUhEK26+4KryoReosMt654-vcrkkgkxaW5tKkFRDBqgX82w@mail.gmail.com>
     [not found]             ` <53A14513.20902@profitbricks.com>
2014-06-18 13:57               ` Roberto Spadim
2014-06-18 14:43                 ` Sebastian Parschauer
2014-06-24  7:09           ` NeilBrown
