* [RFC PATCH 1/4] md: complete bio accounting and add io_latency extension
2014-06-04 17:09 ` [RFC PATCH 0/4] md/mdadm: introduce request function mode support Sebastian Parschauer
@ 2014-06-04 17:09 ` Sebastian Parschauer
2014-06-04 17:10 ` [RFC PATCH 2/4] md: introduce request function mode support Sebastian Parschauer
` (3 subsequent siblings)
4 siblings, 0 replies; 14+ messages in thread
From: Sebastian Parschauer @ 2014-06-04 17:09 UTC (permalink / raw)
To: neilb; +Cc: linux-raid, Florian-Ewald Mueller, Sebastian Parschauer
From: Florian-Ewald Mueller <florian-ewald.mueller@profitbricks.com>
The md layer accounts only the number of I/Os and sectors per bio.
So account in-flight I/Os and ticks as well. Also maintain an I/O
latency statistic by counting I/Os in power-of-2 latency buckets
starting at < 8 ms and ending at >= 65536 ms, and determine the
maximum latency as well. This I/O latency statistic can be read, and
reset to 0, via the md sysfs file 'io_latency'.
Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@profitbricks.com>
[spars: added a description, replaced gcc atomics with atomic64_t,
merged commits, fixed checkpatch warnings]
Signed-off-by: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
---
drivers/md/md.c | 175 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
drivers/md/md.h | 18 ++++++
2 files changed, 193 insertions(+)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 237b7e0..8c653f9 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -54,6 +54,32 @@
#include "md.h"
#include "bitmap.h"
+#ifdef BIO_ACCOUNTING_EXTENSION
+
+#include <linux/ratelimit.h>
+
+struct md_bio_private {
+ void (*orig_bio_endio)(struct bio *, int);
+ void *orig_bio_private;
+ struct mddev *mddev;
+ unsigned int sectors;
+ unsigned long ticks;
+};
+
+static struct kmem_cache *md_bio_private_cache __read_mostly;
+
+static DEFINE_RATELIMIT_STATE(md_ratelimit_state,
+ DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+static inline int __must_check md_valid_ptr(const void *p)
+{
+ return !ZERO_OR_NULL_PTR(p) && !IS_ERR(p);
+}
+#define VALID_PTR(p) md_valid_ptr(p)
+
+#endif /* BIO_ACCOUNTING_EXTENSION */
+
#ifndef MODULE
static void autostart_arrays(int part);
#endif
@@ -241,6 +267,64 @@ static DEFINE_SPINLOCK(all_mddevs_lock);
_tmp = _tmp->next;}) \
)
+#ifdef BIO_ACCOUNTING_EXTENSION
+
+static inline long atomic64_set_if_greater(atomic64_t *v, long val)
+{
+ long act, old;
+
+ old = atomic64_read(v);
+ for (;;) {
+ if (val <= old)
+ break;
+ act = atomic64_cmpxchg(v, old, val);
+ if (likely(act == old))
+ break;
+ old = act;
+ }
+ return old;
+}
+
+static void md_bio_endio(struct bio *bio, int err)
+{
+ struct md_bio_private *mbp = bio->bi_private;
+ struct mddev *mddev = mbp->mddev;
+ struct md_stats *sp = &mddev->stats;
+
+ unsigned int sectors = mbp->sectors;
+ int cpu, idx, rw = bio_data_dir(bio);
+ unsigned long ms, ticks;
+
+ BUILD_BUG_ON(ARRAY_SIZE(sp->latency_table[0]) != 2);
+ BUILD_BUG_ON(ARRAY_SIZE(sp->max_latency) != 2);
+
+ ticks = (long)jiffies - (long)mbp->ticks;
+
+ cpu = part_stat_lock();
+ part_stat_inc(cpu, &mddev->gendisk->part0, ios[rw]);
+ part_stat_add(cpu, &mddev->gendisk->part0, sectors[rw], sectors);
+ part_stat_add(cpu, &mddev->gendisk->part0, ticks[rw], ticks);
+ part_dec_in_flight(&mddev->gendisk->part0, rw);
+ part_round_stats(cpu, &mddev->gendisk->part0);
+ part_stat_unlock();
+
+ ms = jiffies_to_msecs(ticks);
+ if (likely(ticks > 0) && ms > 0) {
+ idx = ilog2(ms) - MD_LATENCY_LOGBASE + 1;
+ idx = clamp(idx, 0, (int)ARRAY_SIZE(sp->latency_table) - 1);
+ } else {
+ idx = 0;
+ }
+ atomic64_set_if_greater(&sp->max_latency[rw], ticks);
+ atomic64_inc(&sp->latency_table[idx][rw]);
+
+ bio->bi_private = mbp->orig_bio_private;
+ bio->bi_end_io = mbp->orig_bio_endio;
+ kmem_cache_free(md_bio_private_cache, mbp);
+ bio_endio_nodec(bio, err); /* >= 3.14, bio_endio() otherwise */
+}
+
+#endif /* BIO_ACCOUNTING_EXTENSION */
/* Rather than calling directly into the personality make_request function,
* IO requests come here first so that we can check if the device is
@@ -255,6 +339,9 @@ static void md_make_request(struct request_queue *q, struct bio *bio)
struct mddev *mddev = q->queuedata;
int cpu;
unsigned int sectors;
+#ifdef BIO_ACCOUNTING_EXTENSION
+ struct md_bio_private *mbp;
+#endif /* BIO_ACCOUNTING_EXTENSION */
if (mddev == NULL || mddev->pers == NULL
|| !mddev->ready) {
@@ -288,12 +375,36 @@ static void md_make_request(struct request_queue *q, struct bio *bio)
* go away inside make_request
*/
sectors = bio_sectors(bio);
+#ifdef BIO_ACCOUNTING_EXTENSION
+ mbp = kmem_cache_alloc(md_bio_private_cache, GFP_NOIO);
+ if (unlikely(!VALID_PTR(mbp))) {
+ if (__ratelimit(&md_ratelimit_state))
+ pr_warn("%s: [%s] kmem_cache_alloc failed\n",
+ __func__, mdname(mddev));
+ cpu = part_stat_lock();
+ part_stat_inc(cpu, &mddev->gendisk->part0, ios[rw]);
+ part_stat_add(cpu, &mddev->gendisk->part0, sectors[rw],
+ sectors);
+ part_stat_unlock();
+ } else {
+ part_inc_in_flight(&mddev->gendisk->part0, rw);
+ mbp->orig_bio_private = bio->bi_private;
+ mbp->orig_bio_endio = bio->bi_end_io;
+ mbp->sectors = sectors;
+ mbp->ticks = jiffies;
+ mbp->mddev = mddev;
+ bio->bi_end_io = md_bio_endio;
+ bio->bi_private = mbp;
+ }
+#endif /* BIO_ACCOUNTING_EXTENSION */
mddev->pers->make_request(mddev, bio);
+#ifndef BIO_ACCOUNTING_EXTENSION
cpu = part_stat_lock();
part_stat_inc(cpu, &mddev->gendisk->part0, ios[rw]);
part_stat_add(cpu, &mddev->gendisk->part0, sectors[rw], sectors);
part_stat_unlock();
+#endif /* !BIO_ACCOUNTING_EXTENSION */
if (atomic_dec_and_test(&mddev->active_io) && mddev->suspended)
wake_up(&mddev->sb_wait);
@@ -4652,6 +4763,52 @@ static struct md_sysfs_entry md_array_size =
__ATTR(array_size, S_IRUGO|S_IWUSR, array_size_show,
array_size_store);
+#ifdef BIO_ACCOUNTING_EXTENSION
+
+static ssize_t
+md_io_latency_show(struct mddev *mddev, char *page)
+{
+ struct md_stats *sp = &mddev->stats;
+ ssize_t cnt;
+ int i;
+
+ for (cnt = i = 0; i < (ARRAY_SIZE(sp->latency_table) - 1); i++) {
+ cnt += scnprintf(page + cnt, PAGE_SIZE - cnt,
+ "< %5d ms: %lu %lu\n",
+ (1 << (i + MD_LATENCY_LOGBASE)),
+ atomic64_read(&sp->latency_table[i][0]),
+ atomic64_read(&sp->latency_table[i][1]));
+ }
+ cnt += scnprintf(page + cnt, PAGE_SIZE - cnt, ">= %5d ms: %lu %lu\n",
+ (1 << ((i - 1) + MD_LATENCY_LOGBASE)),
+ atomic64_read(&sp->latency_table[i][0]),
+ atomic64_read(&sp->latency_table[i][1]));
+ cnt += scnprintf(page + cnt, PAGE_SIZE - cnt, " maximum ms: %u %u\n",
+ jiffies_to_msecs(atomic64_read(&sp->max_latency[0])),
+ jiffies_to_msecs(atomic64_read(&sp->max_latency[1])));
+ return cnt;
+}
+
+static ssize_t
+md_io_latency_store(struct mddev *mddev, const char *buf, size_t len)
+{
+ struct md_stats *sp = &mddev->stats;
+ int i, j;
+
+ for (i = 0; i < ARRAY_SIZE(sp->max_latency); i++)
+ atomic64_set(&sp->max_latency[i], 0);
+ for (i = 0; i < ARRAY_SIZE(sp->latency_table); i++) {
+ for (j = 0; j < ARRAY_SIZE(sp->latency_table[i]); j++)
+ atomic64_set(&sp->latency_table[i][j], 0);
+ }
+ return len;
+}
+
+static struct md_sysfs_entry md_io_latency =
+__ATTR(io_latency, S_IRUGO|S_IWUSR, md_io_latency_show, md_io_latency_store);
+
+#endif /* BIO_ACCOUNTING_EXTENSION */
+
static struct attribute *md_default_attrs[] = {
&md_level.attr,
&md_layout.attr,
@@ -4667,6 +4824,9 @@ static struct attribute *md_default_attrs[] = {
&md_reshape_direction.attr,
&md_array_size.attr,
&max_corr_read_errors.attr,
+#ifdef BIO_ACCOUNTING_EXTENSION
+ &md_io_latency.attr,
+#endif /* BIO_ACCOUNTING_EXTENSION */
NULL,
};
@@ -8551,6 +8711,14 @@ static int __init md_init(void)
{
int ret = -ENOMEM;
+#ifdef BIO_ACCOUNTING_EXTENSION
+ md_bio_private_cache = KMEM_CACHE(md_bio_private, 0);
+ if (unlikely(!VALID_PTR(md_bio_private_cache))) {
+ pr_err("%s: KMEM_CACHE failed\n", __func__);
+ return -ENOMEM;
+ }
+#endif /* BIO_ACCOUNTING_EXTENSION */
+
md_wq = alloc_workqueue("md", WQ_MEM_RECLAIM, 0);
if (!md_wq)
goto err_wq;
@@ -8687,6 +8855,13 @@ static __exit void md_exit(void)
}
destroy_workqueue(md_misc_wq);
destroy_workqueue(md_wq);
+
+#ifdef BIO_ACCOUNTING_EXTENSION
+ if (likely(VALID_PTR(md_bio_private_cache))) {
+ kmem_cache_destroy(md_bio_private_cache);
+ md_bio_private_cache = NULL;
+ }
+#endif /* BIO_ACCOUNTING_EXTENSION */
}
subsys_initcall(md_init);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index a49d991..f0e9171 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -24,6 +24,10 @@
#include <linux/wait.h>
#include <linux/workqueue.h>
+#if 1
+#define BIO_ACCOUNTING_EXTENSION
+#endif
+
#define MaxSector (~(sector_t)0)
/* Bad block numbers are stored sorted in a single page.
@@ -202,6 +206,17 @@ extern int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
int is_new);
extern void md_ack_all_badblocks(struct badblocks *bb);
+#ifdef BIO_ACCOUNTING_EXTENSION
+
+#define MD_LATENCY_LOGBASE 3
+
+struct md_stats {
+ atomic64_t latency_table[15][2];
+ atomic64_t max_latency[2];
+};
+
+#endif /* BIO_ACCOUNTING_EXTENSION */
+
struct mddev {
void *private;
struct md_personality *pers;
@@ -437,6 +452,9 @@ struct mddev {
struct work_struct flush_work;
struct work_struct event_work; /* used by dm to report failure event */
void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev);
+#ifdef BIO_ACCOUNTING_EXTENSION
+ struct md_stats stats;
+#endif /* BIO_ACCOUNTING_EXTENSION */
};
--
1.7.9.5
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [RFC PATCH 2/4] md: introduce request function mode support
2014-06-04 17:09 ` [RFC PATCH 0/4] md/mdadm: introduce request function mode support Sebastian Parschauer
2014-06-04 17:09 ` [RFC PATCH 1/4] md: complete bio accounting and add io_latency extension Sebastian Parschauer
@ 2014-06-04 17:10 ` Sebastian Parschauer
2014-06-04 17:10 ` [RFC PATCH 3/4] md: handle IO latency accounting in rqfn mode Sebastian Parschauer
` (2 subsequent siblings)
4 siblings, 0 replies; 14+ messages in thread
From: Sebastian Parschauer @ 2014-06-04 17:10 UTC (permalink / raw)
To: neilb; +Cc: linux-raid, Florian-Ewald Mueller, Sebastian Parschauer
From: Florian-Ewald Mueller <florian-ewald.mueller@profitbricks.com>
This introduces the writable module parameter 'rq_mode', which sets
the I/O mode for all subsequently created MD devices. Set it to 0 for
the default mode (the make request function mode) to process I/O
bio-by-bio, or set it to 1 for the new request function mode to
process I/O request-by-request. Common code is shared between both
modes.
The advantage of the new mode is that a scheduler can be used and
the block layer takes care of the I/O statistics.
Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@profitbricks.com>
[spars: merged commits, changed description, fixed checkpatch warnings]
Signed-off-by: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
---
drivers/md/md.c | 280 +++++++++++++++++++++++++++++++++++++++++++++++++------
drivers/md/md.h | 7 ++
2 files changed, 257 insertions(+), 30 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8c653f9..0e5c420 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -56,8 +56,6 @@
#ifdef BIO_ACCOUNTING_EXTENSION
-#include <linux/ratelimit.h>
-
struct md_bio_private {
void (*orig_bio_endio)(struct bio *, int);
void *orig_bio_private;
@@ -68,6 +66,30 @@ struct md_bio_private {
static struct kmem_cache *md_bio_private_cache __read_mostly;
+#endif /* BIO_ACCOUNTING_EXTENSION */
+
+#ifdef MD_REQUEST_FUNCTION
+
+struct md_request_clone {
+ struct work_struct work;
+ struct mddev *mdp;
+ struct request *req;
+ struct bio_list bios;
+ atomic_t cnt;
+ int err;
+};
+
+#define MD_RQ_MODE_DEFAULT 0
+
+static unsigned int rq_mode __read_mostly = MD_RQ_MODE_DEFAULT;
+static struct kmem_cache *md_request_clone_cache __read_mostly;
+
+#endif /* MD_REQUEST_FUNCTION */
+
+#if defined BIO_ACCOUNTING_EXTENSION || defined MD_REQUEST_FUNCTION
+
+#include <linux/ratelimit.h>
+
static DEFINE_RATELIMIT_STATE(md_ratelimit_state,
DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
@@ -78,7 +100,7 @@ static inline int __must_check md_valid_ptr(const void *p)
}
#define VALID_PTR(p) md_valid_ptr(p)
-#endif /* BIO_ACCOUNTING_EXTENSION */
+#endif /* BIO_ACCOUNTING_EXTENSION || MD_REQUEST_FUNCTION */
#ifndef MODULE
static void autostart_arrays(int part);
@@ -326,31 +348,17 @@ static void md_bio_endio(struct bio *bio, int err)
#endif /* BIO_ACCOUNTING_EXTENSION */
-/* Rather than calling directly into the personality make_request function,
- * IO requests come here first so that we can check if the device is
- * being suspended pending a reconfiguration.
- * We hold a refcount over the call to ->make_request. By the time that
- * call has finished, the bio has been linked into some internal structure
- * and so is visible to ->quiesce(), so we don't need the refcount any more.
- */
-static void md_make_request(struct request_queue *q, struct bio *bio)
+static inline int md_make_request_head(struct mddev *mddev, struct bio *bio)
{
const int rw = bio_data_dir(bio);
- struct mddev *mddev = q->queuedata;
- int cpu;
- unsigned int sectors;
-#ifdef BIO_ACCOUNTING_EXTENSION
- struct md_bio_private *mbp;
-#endif /* BIO_ACCOUNTING_EXTENSION */
- if (mddev == NULL || mddev->pers == NULL
- || !mddev->ready) {
+ if (mddev == NULL || mddev->pers == NULL || !mddev->ready) {
bio_io_error(bio);
- return;
+ return 1;
}
if (mddev->ro == 1 && unlikely(rw == WRITE)) {
bio_endio(bio, bio_sectors(bio) == 0 ? 0 : -EROFS);
- return;
+ return 1;
}
smp_rmb(); /* Ensure implications of 'active' are visible */
rcu_read_lock();
@@ -369,6 +377,39 @@ static void md_make_request(struct request_queue *q, struct bio *bio)
}
atomic_inc(&mddev->active_io);
rcu_read_unlock();
+ return 0;
+}
+
+static inline void md_make_request_body(struct mddev *mddev, struct bio *bio)
+{
+ mddev->pers->make_request(mddev, bio);
+}
+
+static inline void md_make_request_tail(struct mddev *mddev, struct bio *bio)
+{
+ if (atomic_dec_and_test(&mddev->active_io) && mddev->suspended)
+ wake_up(&mddev->sb_wait);
+}
+
+/* Rather than calling directly into the personality make_request function,
+ * IO requests come here first so that we can check if the device is
+ * being suspended pending a reconfiguration.
+ * We hold a refcount over the call to ->make_request. By the time that
+ * call has finished, the bio has been linked into some internal structure
+ * and so is visible to ->quiesce(), so we don't need the refcount any more.
+ */
+static void md_make_request(struct request_queue *q, struct bio *bio)
+{
+ const int rw = bio_data_dir(bio);
+ struct mddev *mddev = q->queuedata;
+ int cpu;
+ unsigned int sectors;
+#ifdef BIO_ACCOUNTING_EXTENSION
+ struct md_bio_private *mbp;
+#endif /* BIO_ACCOUNTING_EXTENSION */
+
+ if (unlikely(md_make_request_head(mddev, bio)))
+ return;
/*
* save the sectors now since our bio can
@@ -397,7 +438,7 @@ static void md_make_request(struct request_queue *q, struct bio *bio)
bio->bi_private = mbp;
}
#endif /* BIO_ACCOUNTING_EXTENSION */
- mddev->pers->make_request(mddev, bio);
+ md_make_request_body(mddev, bio);
#ifndef BIO_ACCOUNTING_EXTENSION
cpu = part_stat_lock();
@@ -406,10 +447,131 @@ static void md_make_request(struct request_queue *q, struct bio *bio)
part_stat_unlock();
#endif /* !BIO_ACCOUNTING_EXTENSION */
- if (atomic_dec_and_test(&mddev->active_io) && mddev->suspended)
- wake_up(&mddev->sb_wait);
+ md_make_request_tail(mddev, bio);
+}
+
+#ifdef MD_REQUEST_FUNCTION
+
+static inline void md_make_request_bio(struct mddev *mddev, struct bio *bio)
+{
+ if (unlikely(md_make_request_head(mddev, bio)))
+ return;
+ md_make_request_body(mddev, bio);
+ md_make_request_tail(mddev, bio);
+}
+
+static inline void md_request_clone_release(struct md_request_clone *rcl)
+{
+ if (atomic_dec_and_test(&rcl->cnt)) {
+ blk_end_request_all(rcl->req, rcl->err);
+ kmem_cache_free(md_request_clone_cache, rcl);
+ }
+}
+
+static void md_request_bio_endio(struct bio *bio, int err)
+{
+ struct md_request_clone *rcl = bio->bi_private;
+
+ if (unlikely(err < 0))
+ rcl->err = err;
+
+ bio_put(bio);
+ md_request_clone_release(rcl);
+}
+
+static void md_request_clone_worker(struct work_struct *wkp)
+{
+ struct md_request_clone *rcl =
+ container_of(wkp, struct md_request_clone, work);
+ struct bio_list *blp = &rcl->bios;
+ struct mddev *mddev = rcl->mdp;
+ struct bio *bio;
+
+ bio = bio_list_pop(blp);
+ while (VALID_PTR(bio)) {
+ md_make_request_bio(mddev, bio);
+ bio = bio_list_pop(blp);
+ }
+ md_request_clone_release(rcl);
}
+static inline int md_process_request(struct mddev *mddev, struct request *req)
+{
+ struct md_request_clone *rcl;
+
+ struct bio *bio, *clone;
+ int error;
+
+ rcl = kmem_cache_alloc(md_request_clone_cache, GFP_NOIO);
+ if (unlikely(!VALID_PTR(rcl))) {
+ if (__ratelimit(&md_ratelimit_state))
+ pr_warn("%s: [%s] kmem_cache_alloc failed\n",
+ __func__, mdname(mddev));
+ return -ENOMEM;
+ }
+ rcl->err = 0;
+ rcl->req = req;
+ rcl->mdp = mddev;
+ atomic_set(&rcl->cnt, 1);
+ bio_list_init(&rcl->bios);
+ bio = req->bio;
+ while (VALID_PTR(bio)) {
+ clone = bio_clone(bio, GFP_NOWAIT);
+ if (unlikely(!VALID_PTR(clone))) {
+ if (__ratelimit(&md_ratelimit_state))
+ pr_warn("%s: [%s] bio_clone failed\n",
+ __func__, mdname(mddev));
+ error = -ENOMEM;
+ goto error_out;
+ }
+ clone->bi_private = rcl;
+ clone->bi_end_io = md_request_bio_endio;
+ bio_list_add(&rcl->bios, clone);
+ atomic_inc(&rcl->cnt);
+ bio = bio->bi_next;
+ }
+ INIT_WORK(&rcl->work, md_request_clone_worker);
+ queue_work(mddev->request_wq, &rcl->work);
+ return 0;
+error_out:
+ bio = bio_list_pop(&rcl->bios);
+ while (VALID_PTR(bio)) {
+ bio_put(bio);
+ bio = bio_list_pop(&rcl->bios);
+ }
+ kmem_cache_free(md_request_clone_cache, rcl);
+ return error;
+}
+
+#ifndef blk_fs_request
+#define blk_fs_request(p) ((p)->cmd_type == REQ_TYPE_FS)
+#endif /* !blk_fs_request */
+
+static void md_request_function(struct request_queue *rqp)
+{
+ struct mddev *mddev = rqp->queuedata;
+
+ struct request *req;
+ int rc;
+
+ while ((req = blk_fetch_request(rqp)) != NULL) {
+ if (unlikely(!blk_fs_request(req))) {
+ if (__ratelimit(&md_ratelimit_state))
+ pr_warn("%s: [%s] non-fs request\n",
+ __func__, mdname(mddev));
+ __blk_end_request_all(req, -ENOTSUPP);
+ continue;
+ }
+ spin_unlock_irq(rqp->queue_lock);
+ rc = md_process_request(mddev, req);
+ spin_lock_irq(rqp->queue_lock);
+ if (unlikely(rc < 0))
+ __blk_end_request_all(req, rc);
+ }
+}
+
+#endif /* MD_REQUEST_FUNCTION */
+
/* mddev_suspend makes sure no new requests are submitted
* to the device, and that any requests that have been submitted
* are completely handled.
@@ -567,8 +729,15 @@ static void mddev_put(struct mddev *mddev)
*/
INIT_WORK(&mddev->del_work, mddev_delayed_delete);
queue_work(md_misc_wq, &mddev->del_work);
- } else
+ } else {
+#ifdef MD_REQUEST_FUNCTION
+ if (likely(VALID_PTR(mddev->request_wq))) {
+ destroy_workqueue(mddev->request_wq);
+ mddev->request_wq = NULL;
+ }
+#endif /* MD_REQUEST_FUNCTION */
kfree(mddev);
+ }
}
spin_unlock(&all_mddevs_lock);
if (bs)
@@ -4923,6 +5092,13 @@ static void md_free(struct kobject *ko)
if (mddev->queue)
blk_cleanup_queue(mddev->queue);
+#ifdef MD_REQUEST_FUNCTION
+ if (likely(VALID_PTR(mddev->request_wq))) {
+ destroy_workqueue(mddev->request_wq);
+ mddev->request_wq = NULL;
+ }
+#endif /* MD_REQUEST_FUNCTION */
+
kfree(mddev);
}
@@ -4990,12 +5166,32 @@ static int md_alloc(dev_t dev, char *name)
}
error = -ENOMEM;
- mddev->queue = blk_alloc_queue(GFP_KERNEL);
- if (!mddev->queue)
- goto abort;
+#ifdef MD_REQUEST_FUNCTION
+ if (!rq_mode) {
+#endif /* MD_REQUEST_FUNCTION */
+ mddev->queue = blk_alloc_queue(GFP_KERNEL);
+ if (!mddev->queue)
+ goto abort;
+ blk_queue_make_request(mddev->queue, md_make_request);
+#ifdef MD_REQUEST_FUNCTION
+ } else {
+ mddev->request_wq =
+ create_singlethread_workqueue(mdname(mddev));
+ if (unlikely(!VALID_PTR(mddev->request_wq))) {
+ pr_warn("%s: create_singlethread_workqueue (%s) "
+ "failed\n", __func__, mdname(mddev));
+ goto abort;
+ }
+ mddev->queue = blk_init_queue(md_request_function, NULL);
+ if (!mddev->queue) {
+ destroy_workqueue(mddev->request_wq);
+ mddev->request_wq = NULL;
+ goto abort;
+ }
+ }
+#endif /* MD_REQUEST_FUNCTION */
mddev->queue->queuedata = mddev;
- blk_queue_make_request(mddev->queue, md_make_request);
blk_set_stacking_limits(&mddev->queue->limits);
disk = alloc_disk(1 << shift);
@@ -8714,11 +8910,23 @@ static int __init md_init(void)
#ifdef BIO_ACCOUNTING_EXTENSION
md_bio_private_cache = KMEM_CACHE(md_bio_private, 0);
if (unlikely(!VALID_PTR(md_bio_private_cache))) {
- pr_err("%s: KMEM_CACHE failed\n", __func__);
+ pr_err("%s: KMEM_CACHE (bio_priv) failed\n", __func__);
return -ENOMEM;
}
#endif /* BIO_ACCOUNTING_EXTENSION */
+#ifdef MD_REQUEST_FUNCTION
+ md_request_clone_cache = KMEM_CACHE(md_request_clone, 0);
+ if (unlikely(!VALID_PTR(md_request_clone_cache))) {
+ pr_err("%s: KMEM_CACHE (req_clone) failed\n", __func__);
+#ifdef BIO_ACCOUNTING_EXTENSION
+ kmem_cache_destroy(md_bio_private_cache);
+ md_bio_private_cache = NULL;
+#endif /* BIO_ACCOUNTING_EXTENSION */
+ return -ENOMEM;
+ }
+#endif /* MD_REQUEST_FUNCTION */
+
md_wq = alloc_workqueue("md", WQ_MEM_RECLAIM, 0);
if (!md_wq)
goto err_wq;
@@ -8856,6 +9064,13 @@ static __exit void md_exit(void)
destroy_workqueue(md_misc_wq);
destroy_workqueue(md_wq);
+#ifdef MD_REQUEST_FUNCTION
+ if (likely(VALID_PTR(md_request_clone_cache))) {
+ kmem_cache_destroy(md_request_clone_cache);
+ md_request_clone_cache = NULL;
+ }
+#endif /* MD_REQUEST_FUNCTION */
+
#ifdef BIO_ACCOUNTING_EXTENSION
if (likely(VALID_PTR(md_bio_private_cache))) {
kmem_cache_destroy(md_bio_private_cache);
@@ -8887,6 +9102,11 @@ module_param(start_dirty_degraded, int, S_IRUGO|S_IWUSR);
module_param_call(new_array, add_named_array, NULL, NULL, S_IWUSR);
+#ifdef MD_REQUEST_FUNCTION
+module_param(rq_mode, int, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(rq_mode, " this module's io input mode (default: 0 [make request mode])");
+#endif /* MD_REQUEST_FUNCTION */
+
EXPORT_SYMBOL(register_md_personality);
EXPORT_SYMBOL(unregister_md_personality);
EXPORT_SYMBOL(md_error);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index f0e9171..8d639e0 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -25,6 +25,10 @@
#include <linux/workqueue.h>
#if 1
+#define MD_REQUEST_FUNCTION
+#endif
+
+#if 1
#define BIO_ACCOUNTING_EXTENSION
#endif
@@ -455,6 +459,9 @@ struct mddev {
#ifdef BIO_ACCOUNTING_EXTENSION
struct md_stats stats;
#endif /* BIO_ACCOUNTING_EXTENSION */
+#ifdef MD_REQUEST_FUNCTION
+ struct workqueue_struct *request_wq;
+#endif /* MD_REQUEST_FUNCTION */
};
--
1.7.9.5
* [RFC PATCH 3/4] md: handle IO latency accounting in rqfn mode
2014-06-04 17:09 ` [RFC PATCH 0/4] md/mdadm: introduce request function mode support Sebastian Parschauer
2014-06-04 17:09 ` [RFC PATCH 1/4] md: complete bio accounting and add io_latency extension Sebastian Parschauer
2014-06-04 17:10 ` [RFC PATCH 2/4] md: introduce request function mode support Sebastian Parschauer
@ 2014-06-04 17:10 ` Sebastian Parschauer
2014-06-04 17:10 ` [RFC PATCH 4/4] mdadm: introduce '--use-requestfn' create/assembly option Sebastian Parschauer
2014-06-17 13:20 ` [RFC PATCH 0/4] md/mdadm: introduce request function mode support Sebastian Parschauer
4 siblings, 0 replies; 14+ messages in thread
From: Sebastian Parschauer @ 2014-06-04 17:10 UTC (permalink / raw)
To: neilb; +Cc: linux-raid, Florian-Ewald Mueller, Sebastian Parschauer
From: Florian-Ewald Mueller <florian-ewald.mueller@profitbricks.com>
Do the I/O latency accounting in request function mode as well. In
contrast to the make request mode, it is done per request rather
than per bio.
Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@profitbricks.com>
[spars: changed description, merged commits, fixed checkpatch warnings]
Signed-off-by: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
---
drivers/md/md.c | 48 +++++++++++++++++++++++++++++++++---------------
1 file changed, 33 insertions(+), 15 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0e5c420..5858af0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -75,6 +75,9 @@ struct md_request_clone {
struct mddev *mdp;
struct request *req;
struct bio_list bios;
+#ifdef BIO_ACCOUNTING_EXTENSION
+ unsigned long ticks;
+#endif /* BIO_ACCOUNTING_EXTENSION */
atomic_t cnt;
int err;
};
@@ -307,18 +310,34 @@ static inline long atomic64_set_if_greater(atomic64_t *v, long val)
return old;
}
+static inline void md_account_io_latency(struct mddev *mddev,
+ unsigned long duration, int rw)
+{
+ struct md_stats *sp = &mddev->stats;
+ unsigned long msecs = jiffies_to_msecs(duration);
+ int idx;
+
+ BUILD_BUG_ON(ARRAY_SIZE(sp->latency_table[0]) != 2);
+ BUILD_BUG_ON(ARRAY_SIZE(sp->max_latency) != 2);
+
+ if (likely(duration > 0) && msecs > 0) {
+ idx = ilog2(msecs) - MD_LATENCY_LOGBASE + 1;
+ idx = clamp(idx, 0, (int)ARRAY_SIZE(sp->latency_table) - 1);
+ } else {
+ idx = 0;
+ }
+ atomic64_set_if_greater(&sp->max_latency[rw], duration);
+ atomic64_inc(&sp->latency_table[idx][rw]);
+}
+
static void md_bio_endio(struct bio *bio, int err)
{
struct md_bio_private *mbp = bio->bi_private;
struct mddev *mddev = mbp->mddev;
- struct md_stats *sp = &mddev->stats;
unsigned int sectors = mbp->sectors;
- int cpu, idx, rw = bio_data_dir(bio);
- unsigned long ms, ticks;
-
- BUILD_BUG_ON(ARRAY_SIZE(sp->latency_table[0]) != 2);
- BUILD_BUG_ON(ARRAY_SIZE(sp->max_latency) != 2);
+ int cpu, rw = bio_data_dir(bio);
+ unsigned long ticks;
ticks = (long)jiffies - (long)mbp->ticks;
@@ -330,15 +349,7 @@ static void md_bio_endio(struct bio *bio, int err)
part_round_stats(cpu, &mddev->gendisk->part0);
part_stat_unlock();
- ms = jiffies_to_msecs(ticks);
- if (likely(ticks > 0) && ms > 0) {
- idx = ilog2(ms) - MD_LATENCY_LOGBASE + 1;
- idx = clamp(idx, 0, (int)ARRAY_SIZE(sp->latency_table) - 1);
- } else {
- idx = 0;
- }
- atomic64_set_if_greater(&sp->max_latency[rw], ticks);
- atomic64_inc(&sp->latency_table[idx][rw]);
+ md_account_io_latency(mddev, ticks, rw);
bio->bi_private = mbp->orig_bio_private;
bio->bi_end_io = mbp->orig_bio_endio;
@@ -463,6 +474,10 @@ static inline void md_make_request_bio(struct mddev *mddev, struct bio *bio)
static inline void md_request_clone_release(struct md_request_clone *rcl)
{
if (atomic_dec_and_test(&rcl->cnt)) {
+#ifdef BIO_ACCOUNTING_EXTENSION
+ unsigned long diff = (long)jiffies - (long)rcl->ticks;
+ md_account_io_latency(rcl->mdp, diff, rq_data_dir(rcl->req));
+#endif /* BIO_ACCOUNTING_EXTENSION */
blk_end_request_all(rcl->req, rcl->err);
kmem_cache_free(md_request_clone_cache, rcl);
}
@@ -512,6 +527,9 @@ static inline int md_process_request(struct mddev *mddev, struct request *req)
rcl->err = 0;
rcl->req = req;
rcl->mdp = mddev;
+#ifdef BIO_ACCOUNTING_EXTENSION
+ rcl->ticks = jiffies;
+#endif /* BIO_ACCOUNTING_EXTENSION */
atomic_set(&rcl->cnt, 1);
bio_list_init(&rcl->bios);
bio = req->bio;
--
1.7.9.5
* [RFC PATCH 4/4] mdadm: introduce '--use-requestfn' create/assembly option
2014-06-04 17:09 ` [RFC PATCH 0/4] md/mdadm: introduce request function mode support Sebastian Parschauer
` (2 preceding siblings ...)
2014-06-04 17:10 ` [RFC PATCH 3/4] md: handle IO latency accounting in rqfn mode Sebastian Parschauer
@ 2014-06-04 17:10 ` Sebastian Parschauer
2014-06-17 13:20 ` [RFC PATCH 0/4] md/mdadm: introduce request function mode support Sebastian Parschauer
4 siblings, 0 replies; 14+ messages in thread
From: Sebastian Parschauer @ 2014-06-04 17:10 UTC (permalink / raw)
To: neilb; +Cc: linux-raid, Sebastian Parschauer
With that option it is possible to tell the kernel, via the writable
module parameter 'rq_mode', to use the request function mode
(request-by-request) instead of the default make request function
mode (bio-by-bio). The advantage is that a scheduler can be used and
the block layer takes care of the I/O statistics.
Signed-off-by: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
---
Assemble.c | 4 ++++
Create.c | 4 ++++
ReadMe.c | 3 +++
mdadm.c | 5 +++++
mdadm.h | 3 +++
util.c | 34 ++++++++++++++++++++++++++++++++++
6 files changed, 53 insertions(+)
diff --git a/Assemble.c b/Assemble.c
index a57d384..d35d218 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -1275,6 +1275,10 @@ int Assemble(struct supertype *st, char *mddev,
mddev ? mddev : "further assembly");
return 1;
}
+ if (set_rq_mode(c->use_requestfn) != 0) {
+ pr_err("Cannot set the request function mode - cannot assemble.\n");
+ return 1;
+ }
if (devlist == NULL)
devlist = conf_get_devs();
diff --git a/Create.c b/Create.c
index 330c5b4..dc576f1 100644
--- a/Create.c
+++ b/Create.c
@@ -138,6 +138,10 @@ int Create(struct supertype *st, char *mddev,
pr_err("This level does not support spare devices\n");
return 1;
}
+ if (set_rq_mode(c->use_requestfn) != 0) {
+ pr_err("Cannot set the request function mode.\n");
+ return 1;
+ }
if (subdevs == 1 && strcmp(devlist->devname, "missing") != 0) {
/* If given a single device, it might be a container, and we can
diff --git a/ReadMe.c b/ReadMe.c
index bd8c85e..ecb829f 100644
--- a/ReadMe.c
+++ b/ReadMe.c
@@ -150,6 +150,7 @@ struct option long_options[] = {
{"force", 0, 0, Force},
{"update", 1, 0, 'U'},
{"freeze-reshape", 0, 0, FreezeReshape},
+ {"use-requestfn", 0, 0, UseRequestFn},
/* Management */
{"add", 0, 0, Add},
@@ -372,6 +373,7 @@ char Help_create[] =
" --name= -N : Textual name for array - max 32 characters\n"
" --bitmap-chunk= : bitmap chunksize in Kilobytes.\n"
" --delay= -d : bitmap update delay in seconds.\n"
+" --use-requestfn : Tell the kernel to process requests instead of bios.\n"
"\n"
;
@@ -456,6 +458,7 @@ char Help_assemble[] =
" --update= -U : Update superblock: try '-A --update=?' for option list.\n"
" --no-degraded : Assemble but do not start degraded arrays.\n"
" --readonly -o : Mark the array as read-only. No resync will start.\n"
+" --use-requestfn : Tell the kernel to process requests instead of bios.\n"
;
char Help_manage[] =
diff --git a/mdadm.c b/mdadm.c
index be990b8..35ff339 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -652,6 +652,11 @@ int main(int argc, char *argv[])
case O(INCREMENTAL, FreezeReshape):
c.freeze_reshape = 1;
continue;
+ case O(CREATE, UseRequestFn):
+ case O(ASSEMBLE, UseRequestFn):
+ case O(INCREMENTAL, UseRequestFn):
+ c.use_requestfn = 1;
+ continue;
case O(CREATE,'u'): /* uuid of array */
case O(ASSEMBLE,'u'): /* uuid of array */
if (ident.uuid_set) {
diff --git a/mdadm.h b/mdadm.h
index c0726bf..bb9f043 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -343,6 +343,7 @@ enum special_options {
Dump,
Restore,
Action,
+ UseRequestFn,
};
enum prefix_standard {
@@ -417,6 +418,7 @@ struct context {
char *backup_file;
int invalid_backup;
char *action;
+ int use_requestfn;
};
struct shape {
@@ -1355,6 +1357,7 @@ extern int remove_disk(int mdfd, struct supertype *st,
struct mdinfo *sra, struct mdinfo *info);
extern int set_array_info(int mdfd, struct supertype *st, struct mdinfo *info);
unsigned long long min_recovery_start(struct mdinfo *array);
+extern int set_rq_mode(int use_requestfn);
extern char *human_size(long long bytes);
extern char *human_size_brief(long long bytes, int prefix);
diff --git a/util.c b/util.c
index 7937eb6..b9c6cc3 100644
--- a/util.c
+++ b/util.c
@@ -1968,3 +1968,37 @@ void reopen_mddev(int mdfd)
if (fd >= 0 && fd != mdfd)
dup2(fd, mdfd);
}
+
+/*
+ * 0: make request fn mode (bio-by-bio)
+ * 1: request fn mode (request-by-request)
+ */
+#define BIO_MODE "0\n"
+#define REQ_MODE "1\n"
+
+int set_rq_mode(int use_requestfn)
+{
+ int fd, rv = 0;
+ ssize_t size, wbytes;
+ char *mode;
+
+ fd = open("/sys/module/md_mod/parameters/rq_mode", O_WRONLY);
+ if (fd < 0) {
+ if (errno != ENOENT || use_requestfn)
+ rv = 1;
+ goto out;
+ }
+ if (use_requestfn) {
+ mode = REQ_MODE;
+ size = sizeof(REQ_MODE);
+ } else {
+ mode = BIO_MODE;
+ size = sizeof(BIO_MODE);
+ }
+ wbytes = write(fd, mode, size);
+ if (wbytes != size)
+ rv = 1;
+ close(fd);
+out:
+ return rv;
+}
--
1.7.9.5
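The error-handling contract of set_rq_mode() above is worth spelling out: a missing parameter file (ENOENT) is only treated as an error when requestfn mode was explicitly requested, because an unpatched md_mod simply defaults to bio mode. A test sketch of that contract, not part of the patch — the `path` parameter is our addition so the logic can be exercised against an ordinary file instead of the real sysfs node:

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BIO_MODE "0\n"
#define REQ_MODE "1\n"

/* Same contract as the patch's set_rq_mode(): returns 0 on success,
 * 1 on error. ENOENT on open is tolerated unless requestfn mode was
 * asked for, since old kernels lack the rq_mode knob entirely. */
static int set_rq_mode_at(const char *path, int use_requestfn)
{
    const char *mode = use_requestfn ? REQ_MODE : BIO_MODE;
    ssize_t size = (ssize_t)strlen(mode);
    int fd, rv = 0;

    fd = open(path, O_WRONLY);
    if (fd < 0) {
        if (errno != ENOENT || use_requestfn)
            rv = 1;
        return rv;
    }
    if (write(fd, mode, size) != size)
        rv = 1;
    close(fd);
    return rv;
}
```
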
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 0/4] md/mdadm: introduce request function mode support
2014-06-04 17:09 ` [RFC PATCH 0/4] md/mdadm: introduce request function mode support Sebastian Parschauer
` (3 preceding siblings ...)
2014-06-04 17:10 ` [RFC PATCH 4/4] mdadm: introduce '--use-requestfn' create/assembly option Sebastian Parschauer
@ 2014-06-17 13:20 ` Sebastian Parschauer
[not found] ` <CAH3kUhEK26+4KryoReosMt654-vcrkkgkxaW5tKkFRDBqgX82w@mail.gmail.com>
2014-06-24 7:09 ` NeilBrown
4 siblings, 2 replies; 14+ messages in thread
From: Sebastian Parschauer @ 2014-06-17 13:20 UTC (permalink / raw)
To: neilb; +Cc: linux-raid, Sebastian Parschauer
Hi Neil,
have you had a chance to look at these patches yet?
We have this in production with the cfq scheduler already on two storage
servers and customers have noticed a huge performance boost. Our
monitoring also shows higher and smoother IOPS numbers.
Cheers,
Sebastian
On 04.06.2014 19:09, Sebastian Parschauer wrote:
> Patches apply to linux v3.15-rc8 and mdadm master commit 8a3544f895.
>
> Florian-Ewald Mueller (3):
> md: complete bio accounting and add io_latency extension
> md: introduce request function mode support
> md: handle IO latency accounting in rqfn mode
>
> Sebastian Parschauer (1):
> mdadm: introduce '--use-requestfn' create/assembly option
>
[parent not found: <CAH3kUhEK26+4KryoReosMt654-vcrkkgkxaW5tKkFRDBqgX82w@mail.gmail.com>]
* Re: [RFC PATCH 0/4] md/mdadm: introduce request function mode support
2014-06-17 13:20 ` [RFC PATCH 0/4] md/mdadm: introduce request function mode support Sebastian Parschauer
[not found] ` <CAH3kUhEK26+4KryoReosMt654-vcrkkgkxaW5tKkFRDBqgX82w@mail.gmail.com>
@ 2014-06-24 7:09 ` NeilBrown
1 sibling, 0 replies; 14+ messages in thread
From: NeilBrown @ 2014-06-24 7:09 UTC (permalink / raw)
To: Sebastian Parschauer; +Cc: linux-raid
[-- Attachment #1: Type: text/plain, Size: 2661 bytes --]
On Tue, 17 Jun 2014 15:20:53 +0200 Sebastian Parschauer
<sebastian.riemer@profitbricks.com> wrote:
> Hi Neil,
>
> have you had a chance to look at these patches yet?
I have now looked through them at last. Sorry for the delay - there was a
week of leave in there and various other distractions.
The improved accounting is probably a good idea. The extra malloc for each
request is a bit unpleasant.
For raid1 and raid10 we could use the r1_bio they already allocate to store
the extra data. For raid5/6 we could possibly do something similar though
the results might be a little less precise.
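(The improved accounting Neil refers to includes the io_latency histogram from patch 1, described in its changelog as power-of-2 buckets starting at < 8 ms and ending at >= 65536 ms. That classification reduces to a few lines — the bucket count and shift base below are inferred from that description, not taken from the patch source:)

```c
#define MD_LAT_BUCKETS 15 /* < 8 ms, < 16 ms, ..., < 65536 ms, >= 65536 ms */

/* Classify an I/O completion latency (in milliseconds) into one of the
 * power-of-2 histogram buckets described in patch 1's changelog. */
static int md_latency_bucket(unsigned long msecs)
{
    int i;

    /* First bucket is < 8 ms; each following bucket doubles the bound. */
    for (i = 0; i < MD_LAT_BUCKETS - 1; i++)
        if (msecs < (8UL << i))
            return i;
    return MD_LAT_BUCKETS - 1; /* catch-all: >= 65536 ms */
}
```
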
For raid0 and linear I'd rather not intervene at all. The underlying devices
will report accurate statistics which can be combined to get correct
statistics for the overall devices.
The request function support is a bit of a mix. In some ways it is quite
nice and localised. However I'm not thrilled with the extra work queue, and
I don't really understand why md_process_request() allocates a new
md_request_clone and clones all the bios... I wonder if that is only needed
for RAID0, not for higher levels.
I still don't think any of this should be applied to RAID0. I asked about
that before and you didn't really provide an explanation.
RAID0 simply maps the bio to a different device and offset and sends it down.
So the queue management on the underlying device should be able to provide
all the fairness etc that you need.
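Neil's point that RAID0 is a pure remapping can be made concrete. Ignoring zones for differently sized members (which real md raid0 does handle), the striping arithmetic is just — a simplified sketch, not md's actual raid0 code:

```c
struct raid0_target {
    int disk;          /* index of the member disk */
    long long sector;  /* sector offset on that disk */
};

/* Simplified single-zone RAID0 striping: consecutive chunks of the
 * array are laid out round-robin across the member disks. */
static struct raid0_target raid0_map(long long sector,
                                     long long chunk_sectors, int ndisks)
{
    long long chunk = sector / chunk_sectors; /* chunk index in the array */
    long long off = sector % chunk_sectors;   /* offset within the chunk  */
    struct raid0_target t;

    t.disk = (int)(chunk % ndisks);
    t.sector = (chunk / ndisks) * chunk_sectors + off;
    return t;
}
```

Because each bio resolves to a single (device, offset) pair, the member device's own request queue sees the I/O directly — which is the basis of the argument that RAID0 gains nothing from an extra request-function layer on top.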
So if you have RAID0 over RAID1, then I can understand the desire to use
requestfn rather than make_requestfn on the RAID1, but not on the RAID0.
Have you tried that configuration? I.e. the RAID0 made without
--use-requestfn and the RAID1s made with it? Does that work as well as
using requestfn on both arrays?
The graphs you posted subsequently certainly look impressive, but I really
want to make sure I understand what is really needed.
Thanks,
NeilBrown
>
> We have this in production with the cfq scheduler already on two storage
> servers and customers have noticed a huge performance boost. Our
> monitoring also shows higher and smoother IOPS numbers.
>
> Cheers,
> Sebastian
>
>
> On 04.06.2014 19:09, Sebastian Parschauer wrote:
> > Patches apply to linux v3.15-rc8 and mdadm master commit 8a3544f895.
> >
> > Florian-Ewald Mueller (3):
> > md: complete bio accounting and add io_latency extension
> > md: introduce request function mode support
> > md: handle IO latency accounting in rqfn mode
> >
> > Sebastian Parschauer (1):
> > mdadm: introduce '--use-requestfn' create/assembly option
> >
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]