* [PATCH 0/5] MD: don't abuse bi_phys_segments
@ 2017-02-07 16:55 Shaohua Li
  2017-02-07 16:55 ` [PATCH 1/5] MD: attach data to each bio Shaohua Li
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Shaohua Li @ 2017-02-07 16:55 UTC (permalink / raw)
  To: linux-raid; +Cc: khlebnikov, hch, neilb

This patchset removes the hacky overloading of bio->bi_phys_segments.
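
For reference, the hack being removed packs two 16-bit counters into the
32-bit bi_phys_segments field (raid5's usage; raid1 and raid10 use the
field as a plain reference count):

	/* raid5: bottom 16 bits count active stripes, top 16 bits processed */
	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
	int active = atomic_read(segments) & 0xffff;
	int processed = (atomic_read(segments) >> 16) & 0xffff;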

Shaohua Li (5):
  MD: attach data to each bio
  md/raid5: don't abuse bio->bi_phys_segments
  md/raid5: change disk limits
  md/raid1: don't abuse bio->bi_phys_segments
  md/raid10: don't abuse bio->bi_phys_segments

 drivers/md/md.c     | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/md.h     | 18 ++++++++++++++
 drivers/md/raid1.c  | 44 +++++++++++++++++++---------------
 drivers/md/raid10.c | 40 ++++++++++++++++++-------------
 drivers/md/raid5.c  | 21 ++++++++---------
 drivers/md/raid5.h  | 37 +++++++++++++++--------------
 6 files changed, 164 insertions(+), 64 deletions(-)

-- 
2.9.3



* [PATCH 1/5] MD: attach data to each bio
  2017-02-07 16:55 [PATCH 0/5] MD: don't abuse bi_phys_segments Shaohua Li
@ 2017-02-07 16:55 ` Shaohua Li
  2017-02-08  9:07   ` Guoqing Jiang
  2017-02-10  6:08   ` NeilBrown
  2017-02-07 16:55 ` [PATCH 2/5] md/raid5: don't abuse bio->bi_phys_segments Shaohua Li
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 16+ messages in thread
From: Shaohua Li @ 2017-02-07 16:55 UTC (permalink / raw)
  To: linux-raid; +Cc: khlebnikov, hch, neilb

Currently MD is rebusing some bio fields. To remove the hack, we attach
extra data to each bio. Each personablity can attach extra data to the
bios, so we don't need to rebuse bio fields.
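
A personality opts in by setting ->per_bio_data_size, after which it can
reach its data with md_get_per_bio_data(). A minimal sketch (the raid0
names below are hypothetical, purely for illustration):

	struct raid0_bio_data {
		atomic_t pending;
	};

	/* in the md_personality definition:
	 *	.per_bio_data_size = sizeof(struct raid0_bio_data),
	 */

	static void raid0_init_bio_data(struct bio *bio)
	{
		/* md_make_request() has already called md_bio_attach_data() */
		struct raid0_bio_data *d = md_get_per_bio_data(bio);

		atomic_set(&d->pending, 1);
	}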

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/md.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/md.h | 18 +++++++++++++++
 2 files changed, 86 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 67a1854..6f23964 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -248,6 +248,47 @@ static DEFINE_SPINLOCK(all_mddevs_lock);
 		_tmp = _tmp->next;})					\
 		)
 
+#define MD_MEMPOOL_SIZE (32)
+static mempool_t *md_alloc_per_bio_pool(struct md_personality *per)
+{
+	mempool_t *p;
+
+	if (!per->per_bio_data_size)
+		return NULL;
+	p = mempool_create_kmalloc_pool(MD_MEMPOOL_SIZE,
+				MD_PER_BIO_DATA_SIZE(per));
+	if (p)
+		return p;
+	return ERR_PTR(-ENOMEM);
+}
+
+static void md_bio_end_io(struct bio *bio)
+{
+	struct md_per_bio_data *data = bio->bi_private;
+
+	bio->bi_private = data->orig_private;
+	bio->bi_end_io = data->orig_endio;
+	bio_endio(bio);
+
+	mempool_free(data, data->mddev->per_bio_pool);
+}
+
+void md_bio_attach_data(struct mddev *mddev, struct bio *bio)
+{
+	struct md_per_bio_data *data;
+
+	if (!mddev->per_bio_pool)
+		return;
+	data = mempool_alloc(mddev->per_bio_pool, GFP_NOIO);
+	data->orig_endio = bio->bi_end_io;
+	data->orig_private = bio->bi_private;
+	data->mddev = mddev;
+
+	bio->bi_private = data;
+	bio->bi_end_io = md_bio_end_io;
+}
+EXPORT_SYMBOL(md_bio_attach_data);
+
 /* Rather than calling directly into the personality make_request function,
  * IO requests come here first so that we can check if the device is
  * being suspended pending a reconfiguration.
@@ -274,6 +315,8 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
 		bio_endio(bio);
 		return BLK_QC_T_NONE;
 	}
+	md_bio_attach_data(mddev, bio);
+
 	smp_rmb(); /* Ensure implications of  'active' are visible */
 	rcu_read_lock();
 	if (mddev->suspended) {
@@ -3513,6 +3556,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 	long level;
 	void *priv, *oldpriv;
 	struct md_rdev *rdev;
+	mempool_t *new_pool;
 
 	if (slen == 0 || slen >= sizeof(clevel))
 		return -EINVAL;
@@ -3580,7 +3624,15 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 		rv = len;
 		goto out_unlock;
 	}
+	new_pool = md_alloc_per_bio_pool(pers);
+	if (IS_ERR(new_pool)) {
+		module_put(pers->owner);
+		rv = -EINVAL;
+		goto out_unlock;
+	}
+
 	if (!pers->takeover) {
+		mempool_destroy(new_pool);
 		module_put(pers->owner);
 		pr_warn("md: %s: %s does not support personality takeover\n",
 			mdname(mddev), clevel);
@@ -3596,6 +3648,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 	 */
 	priv = pers->takeover(mddev);
 	if (IS_ERR(priv)) {
+		mempool_destroy(new_pool);
 		mddev->new_level = mddev->level;
 		mddev->new_layout = mddev->layout;
 		mddev->new_chunk_sectors = mddev->chunk_sectors;
@@ -3660,6 +3713,9 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 
 	module_put(oldpers->owner);
 
+	mempool_destroy(mddev->per_bio_pool);
+	mddev->per_bio_pool = new_pool;
+
 	rdev_for_each(rdev, mddev) {
 		if (rdev->raid_disk < 0)
 			continue;
@@ -5209,6 +5265,7 @@ int md_run(struct mddev *mddev)
 	int err;
 	struct md_rdev *rdev;
 	struct md_personality *pers;
+	mempool_t *new_pool;
 
 	if (list_empty(&mddev->disks))
 		/* cannot run an array with no devices.. */
@@ -5299,6 +5356,13 @@ int md_run(struct mddev *mddev)
 		return -EINVAL;
 	}
 
+	new_pool = md_alloc_per_bio_pool(pers);
+	if (IS_ERR(new_pool)) {
+		module_put(pers->owner);
+		return -EINVAL;
+	}
+	mddev->per_bio_pool = new_pool;
+
 	if (pers->sync_request) {
 		/* Warn if this is a potentially silly
 		 * configuration.
@@ -5364,6 +5428,8 @@ int md_run(struct mddev *mddev)
 
 	}
 	if (err) {
+		mempool_destroy(new_pool);
+		mddev->per_bio_pool = NULL;
 		mddev_detach(mddev);
 		if (mddev->private)
 			pers->free(mddev, mddev->private);
@@ -5619,6 +5685,8 @@ static void __md_stop(struct mddev *mddev)
 	mddev->pers = NULL;
 	spin_unlock(&mddev->lock);
 	pers->free(mddev, mddev->private);
+	mempool_destroy(mddev->per_bio_pool);
+	mddev->per_bio_pool = NULL;
 	mddev->private = NULL;
 	if (pers->sync_request && mddev->to_remove == NULL)
 		mddev->to_remove = &md_redundancy_group;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 968bbe7..3d2fe21 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -450,6 +450,7 @@ struct mddev {
 	void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev);
 	struct md_cluster_info		*cluster_info;
 	unsigned int			good_device_nr;	/* good device num within cluster raid */
+	mempool_t			*per_bio_pool;
 };
 
 enum recovery_flags {
@@ -540,8 +541,25 @@ struct md_personality
 	/* congested implements bdi.congested_fn().
 	 * Will not be called while array is 'suspended' */
 	int (*congested)(struct mddev *mddev, int bits);
+	size_t per_bio_data_size;
 };
 
+struct md_per_bio_data {
+	bio_end_io_t *orig_endio;
+	void *orig_private;
+	struct mddev *mddev;
+};
+
+#define MD_PER_BIO_DATA_SIZE(per) (sizeof(struct md_per_bio_data) + \
+	roundup(per->per_bio_data_size, __alignof__(struct md_per_bio_data)))
+
+static inline void *md_get_per_bio_data(struct bio *bio)
+{
+	return ((struct md_per_bio_data *)bio->bi_private) + 1;
+}
+
+extern void md_bio_attach_data(struct mddev *mddev, struct bio *bio);
+
 struct md_sysfs_entry {
 	struct attribute attr;
 	ssize_t (*show)(struct mddev *, char *);
-- 
2.9.3



* [PATCH 2/5] md/raid5: don't abuse bio->bi_phys_segments
  2017-02-07 16:55 [PATCH 0/5] MD: don't abuse bi_phys_segments Shaohua Li
  2017-02-07 16:55 ` [PATCH 1/5] MD: attach data to each bio Shaohua Li
@ 2017-02-07 16:55 ` Shaohua Li
  2017-02-07 16:56 ` [PATCH 3/5] md/raid5: change disk limits Shaohua Li
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Shaohua Li @ 2017-02-07 16:55 UTC (permalink / raw)
  To: linux-raid; +Cc: khlebnikov, hch, neilb

Allocate extra data for active_stripes and processed_stripes accounting,
so we no longer need to abuse bio->bi_phys_segments.
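
The counting protocol is unchanged, only the storage moves. A condensed,
hypothetical view of the lifecycle (the real calls are spread across
raid5_make_request() and the completion paths):

	raid5_set_bi_stripes(bi, 1);		/* bias of one while submitting */
	raid5_inc_bi_active_stripes(bi);	/* once per stripe_head joined */
	/* ... stripes complete, each dropping one reference ... */
	if (raid5_dec_bi_active_stripes(bi) == 0)
		bio_endio(bi);			/* last reference is gone */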

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5.c | 12 ++++++++----
 drivers/md/raid5.h | 37 +++++++++++++++++++------------------
 2 files changed, 27 insertions(+), 22 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b0d1345..fd3e5ce 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3103,7 +3103,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 		(unsigned long long)sh->sector);
 
 	/*
-	 * If several bio share a stripe. The bio bi_phys_segments acts as a
+	 * If several bios share a stripe, the raid5_bio_data has a
 	 * reference count to avoid race. The reference count should already be
 	 * increased before this function is called (for example, in
 	 * raid5_make_request()), so other bio sharing this stripe will not free the
@@ -5146,6 +5146,7 @@ static struct bio *chunk_aligned_read(struct mddev *mddev, struct bio *raid_bio)
 		if (sectors < bio_sectors(raid_bio)) {
 			split = bio_split(raid_bio, sectors, GFP_NOIO, fs_bio_set);
 			bio_chain(split, raid_bio);
+			md_bio_attach_data(mddev, split);
 		} else
 			split = raid_bio;
 
@@ -5335,7 +5336,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 	last_sector = bi->bi_iter.bi_sector + (bi->bi_iter.bi_size>>9);
 
 	bi->bi_next = NULL;
-	bi->bi_phys_segments = 1; /* over-loaded to count active stripes */
+	raid5_set_bi_stripes(bi, 1);
 
 	stripe_sectors = conf->chunk_sectors *
 		(conf->raid_disks - conf->max_degraded);
@@ -5463,7 +5464,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	logical_sector = bi->bi_iter.bi_sector & ~((sector_t)STRIPE_SECTORS-1);
 	last_sector = bio_end_sector(bi);
 	bi->bi_next = NULL;
-	bi->bi_phys_segments = 1;	/* over-loaded to count active stripes */
+	raid5_set_bi_stripes(bi, 1);
 
 	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
@@ -5963,7 +5964,7 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 	 * We cannot pre-allocate enough stripe_heads as we may need
 	 * more than exist in the cache (if we allow ever large chunks).
 	 * So we do one stripe head at a time and record in
-	 * ->bi_hw_segments how many have been done.
+	 * ->raid5_bio_data.processed_stripes how many have been done.
 	 *
 	 * We *know* that this entire raid_bio is in one chunk, so
 	 * it will be only one 'dd_idx' and only need one call to raid5_compute_sector.
@@ -8188,6 +8189,7 @@ static struct md_personality raid6_personality =
 	.quiesce	= raid5_quiesce,
 	.takeover	= raid6_takeover,
 	.congested	= raid5_congested,
+	.per_bio_data_size = sizeof(struct raid5_bio_data),
 };
 static struct md_personality raid5_personality =
 {
@@ -8211,6 +8213,7 @@ static struct md_personality raid5_personality =
 	.quiesce	= raid5_quiesce,
 	.takeover	= raid5_takeover,
 	.congested	= raid5_congested,
+	.per_bio_data_size = sizeof(struct raid5_bio_data),
 };
 
 static struct md_personality raid4_personality =
@@ -8235,6 +8238,7 @@ static struct md_personality raid4_personality =
 	.quiesce	= raid5_quiesce,
 	.takeover	= raid4_takeover,
 	.congested	= raid5_congested,
+	.per_bio_data_size = sizeof(struct raid5_bio_data),
 };
 
 static int __init raid5_init(void)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index c0687df..9232463 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -481,48 +481,49 @@ static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector)
 		return NULL;
 }
 
-/*
- * We maintain a biased count of active stripes in the bottom 16 bits of
- * bi_phys_segments, and a count of processed stripes in the upper 16 bits
- */
+struct raid5_bio_data {
+	atomic_t active_stripes;
+	atomic_t processed_stripes;
+};
+
+#define raid5_get_bio_data(bio) \
+	((struct raid5_bio_data *)md_get_per_bio_data(bio))
+
 static inline int raid5_bi_processed_stripes(struct bio *bio)
 {
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	struct raid5_bio_data *data = raid5_get_bio_data(bio);
 
-	return (atomic_read(segments) >> 16) & 0xffff;
+	return atomic_read(&data->processed_stripes);
 }
 
 static inline int raid5_dec_bi_active_stripes(struct bio *bio)
 {
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	struct raid5_bio_data *data = raid5_get_bio_data(bio);
 
-	return atomic_sub_return(1, segments) & 0xffff;
+	return atomic_sub_return(1, &data->active_stripes);
 }
 
 static inline void raid5_inc_bi_active_stripes(struct bio *bio)
 {
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	struct raid5_bio_data *data = raid5_get_bio_data(bio);
 
-	atomic_inc(segments);
+	atomic_inc(&data->active_stripes);
 }
 
 static inline void raid5_set_bi_processed_stripes(struct bio *bio,
 	unsigned int cnt)
 {
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	int old, new;
+	struct raid5_bio_data *data = raid5_get_bio_data(bio);
 
-	do {
-		old = atomic_read(segments);
-		new = (old & 0xffff) | (cnt << 16);
-	} while (atomic_cmpxchg(segments, old, new) != old);
+	atomic_set(&data->processed_stripes, cnt);
 }
 
 static inline void raid5_set_bi_stripes(struct bio *bio, unsigned int cnt)
 {
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	struct raid5_bio_data *data = raid5_get_bio_data(bio);
 
-	atomic_set(segments, cnt);
+	atomic_set(&data->active_stripes, cnt);
+	atomic_set(&data->processed_stripes, 0);
 }
 
 /* NOTE NR_STRIPE_HASH_LOCKS must remain below 64.
-- 
2.9.3



* [PATCH 3/5] md/raid5: change disk limits
  2017-02-07 16:55 [PATCH 0/5] MD: don't abuse bi_phys_segments Shaohua Li
  2017-02-07 16:55 ` [PATCH 1/5] MD: attach data to each bio Shaohua Li
  2017-02-07 16:55 ` [PATCH 2/5] md/raid5: don't abuse bio->bi_phys_segments Shaohua Li
@ 2017-02-07 16:56 ` Shaohua Li
  2017-02-07 16:56 ` [PATCH 4/5] md/raid1: don't abuse bio->bi_phys_segments Shaohua Li
  2017-02-07 16:56 ` [PATCH 5/5] md/raid10: " Shaohua Li
  4 siblings, 0 replies; 16+ messages in thread
From: Shaohua Li @ 2017-02-07 16:56 UTC (permalink / raw)
  To: linux-raid; +Cc: khlebnikov, hch, neilb

Now that we are using an atomic_t to track stripes for a bio, the original
limit (imposed by the 16-bit counters used for the tracking) no longer
applies.
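
For reference: with 4KiB pages STRIPE_SECTORS is 8, so the old cap worked
out to 0xfffe * 8 = 524272 sectors, i.e. just under 256MiB per request.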

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index fd3e5ce..467ad4f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7255,13 +7255,8 @@ static int raid5_run(struct mddev *mddev)
 		mddev->queue->limits.discard_alignment = stripe;
 		mddev->queue->limits.discard_granularity = stripe;
 
-		/*
-		 * We use 16-bit counter of active stripes in bi_phys_segments
-		 * (minus one for over-loaded initialization)
-		 */
-		blk_queue_max_hw_sectors(mddev->queue, 0xfffe * STRIPE_SECTORS);
-		blk_queue_max_discard_sectors(mddev->queue,
-					      0xfffe * STRIPE_SECTORS);
+		blk_queue_max_hw_sectors(mddev->queue, BLK_DEF_MAX_SECTORS);
+		blk_queue_max_discard_sectors(mddev->queue, UINT_MAX);
 
 		/*
 		 * unaligned part of discard request will be ignored, so can't
-- 
2.9.3



* [PATCH 4/5] md/raid1: don't abuse bio->bi_phys_segments
  2017-02-07 16:55 [PATCH 0/5] MD: don't abuse bi_phys_segments Shaohua Li
                   ` (2 preceding siblings ...)
  2017-02-07 16:56 ` [PATCH 3/5] md/raid5: change disk limits Shaohua Li
@ 2017-02-07 16:56 ` Shaohua Li
  2017-02-07 16:56 ` [PATCH 5/5] md/raid10: " Shaohua Li
  4 siblings, 0 replies; 16+ messages in thread
From: Shaohua Li @ 2017-02-07 16:56 UTC (permalink / raw)
  To: linux-raid; +Cc: khlebnikov, hch, neilb

Allocate extra data (an 'unsigned int') for bio reference accounting,
so we no longer need to abuse bio->bi_phys_segments.
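
The counter keeps the old protocol: 0 means a single r1_bio, so completion
needs no locking; on the first split it jumps straight to 2 because the
original, previously implicit r1_bio must now be counted as well.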

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid1.c | 44 +++++++++++++++++++++++++-------------------
 1 file changed, 25 insertions(+), 19 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7b0f647..daa4038 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -102,6 +102,11 @@ static void r1bio_pool_free(void *r1_bio, void *data)
 #define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9)
 #define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)
 
+static unsigned int *raid1_bio_ref_ptr(struct bio *bio)
+{
+	return md_get_per_bio_data(bio);
+}
+
 static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
 {
 	struct pool_info *pi = data;
@@ -246,15 +251,15 @@ static void call_bio_endio(struct r1bio *r1_bio)
 	sector_t start_next_window = r1_bio->start_next_window;
 	sector_t bi_sector = bio->bi_iter.bi_sector;
 
-	if (bio->bi_phys_segments) {
+	if (*raid1_bio_ref_ptr(bio)) {
 		unsigned long flags;
 		spin_lock_irqsave(&conf->device_lock, flags);
-		bio->bi_phys_segments--;
-		done = (bio->bi_phys_segments == 0);
+		(*raid1_bio_ref_ptr(bio))--;
+		done = (*raid1_bio_ref_ptr(bio) == 0);
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 		/*
 		 * make_request() might be waiting for
-		 * bi_phys_segments to decrease
+		 * bio reference count to decrease
 		 */
 		wake_up(&conf->wait_barrier);
 	} else
@@ -1138,10 +1143,10 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
 				   - bio->bi_iter.bi_sector);
 		r1_bio->sectors = max_sectors;
 		spin_lock_irq(&conf->device_lock);
-		if (bio->bi_phys_segments == 0)
-			bio->bi_phys_segments = 2;
+		if (*raid1_bio_ref_ptr(bio) == 0)
+			*raid1_bio_ref_ptr(bio) = 2;
 		else
-			bio->bi_phys_segments++;
+			(*raid1_bio_ref_ptr(bio))++;
 		spin_unlock_irq(&conf->device_lock);
 
 		/*
@@ -1316,13 +1321,13 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 		start_next_window = wait_barrier(conf, bio);
 		/*
 		 * We must make sure the multi r1bios of bio have
-		 * the same value of bi_phys_segments
+		 * the same value of reference count
 		 */
-		if (bio->bi_phys_segments && old &&
+		if (*raid1_bio_ref_ptr(bio) && old &&
 		    old != start_next_window)
 			/* Wait for the former r1bio(s) to complete */
 			wait_event(conf->wait_barrier,
-				   bio->bi_phys_segments == 1);
+				   *raid1_bio_ref_ptr(bio) == 1);
 		goto retry_write;
 	}
 
@@ -1332,10 +1337,10 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 		 */
 		r1_bio->sectors = max_sectors;
 		spin_lock_irq(&conf->device_lock);
-		if (bio->bi_phys_segments == 0)
-			bio->bi_phys_segments = 2;
+		if (*raid1_bio_ref_ptr(bio) == 0)
+			*raid1_bio_ref_ptr(bio) = 2;
 		else
-			bio->bi_phys_segments++;
+			(*raid1_bio_ref_ptr(bio))++;
 		spin_unlock_irq(&conf->device_lock);
 	}
 	sectors_handled = r1_bio->sector + max_sectors - bio->bi_iter.bi_sector;
@@ -1428,7 +1433,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	if (sectors_handled < bio_sectors(bio)) {
 		r1_bio_write_done(r1_bio);
 		/* We need another r1_bio.  It has already been counted
-		 * in bio->bi_phys_segments
+		 * in bio reference count
 		 */
 		r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
 		r1_bio->master_bio = bio;
@@ -1466,11 +1471,11 @@ static void raid1_make_request(struct mddev *mddev, struct bio *bio)
 	/*
 	 * We might need to issue multiple reads to different devices if there
 	 * are bad blocks around, so we keep track of the number of reads in
-	 * bio->bi_phys_segments.  If this is 0, there is only one r1_bio and
+	 * bio reference count.  If this is 0, there is only one r1_bio and
 	 * no locking will be needed when requests complete.  If it is
 	 * non-zero, then it is the number of not-completed requests.
 	 */
-	bio->bi_phys_segments = 0;
+	*raid1_bio_ref_ptr(bio) = 0;
 	bio_clear_flag(bio, BIO_SEG_VALID);
 
 	if (bio_data_dir(bio) == READ)
@@ -2438,10 +2443,10 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 					       - mbio->bi_iter.bi_sector);
 			r1_bio->sectors = max_sectors;
 			spin_lock_irq(&conf->device_lock);
-			if (mbio->bi_phys_segments == 0)
-				mbio->bi_phys_segments = 2;
+			if (*raid1_bio_ref_ptr(mbio) == 0)
+				*raid1_bio_ref_ptr(mbio) = 2;
 			else
-				mbio->bi_phys_segments++;
+				(*raid1_bio_ref_ptr(mbio))++;
 			spin_unlock_irq(&conf->device_lock);
 			trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
 					      bio, bio_dev, bio_sector);
@@ -3289,6 +3294,7 @@ static struct md_personality raid1_personality =
 	.quiesce	= raid1_quiesce,
 	.takeover	= raid1_takeover,
 	.congested	= raid1_congested,
+	.per_bio_data_size = sizeof(unsigned int),
 };
 
 static int __init raid_init(void)
-- 
2.9.3



* [PATCH 5/5] md/raid10: don't abuse bio->bi_phys_segments
  2017-02-07 16:55 [PATCH 0/5] MD: don't abuse bi_phys_segments Shaohua Li
                   ` (3 preceding siblings ...)
  2017-02-07 16:56 ` [PATCH 4/5] md/raid1: don't abuse bio->bi_phys_segments Shaohua Li
@ 2017-02-07 16:56 ` Shaohua Li
  4 siblings, 0 replies; 16+ messages in thread
From: Shaohua Li @ 2017-02-07 16:56 UTC (permalink / raw)
  To: linux-raid; +Cc: khlebnikov, hch, neilb

Allocate extra data (an 'unsigned int') for bio reference accounting,
so we no longer need to abuse bio->bi_phys_segments. The counter follows
the same protocol as in the raid1 patch.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid10.c | 40 ++++++++++++++++++++++++----------------
 1 file changed, 24 insertions(+), 16 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 1920756..77623ff 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -133,6 +133,12 @@ static void r10bio_pool_free(void *r10_bio, void *data)
 /* maximum number of concurrent requests, memory permitting */
 #define RESYNC_DEPTH (32*1024*1024/RESYNC_BLOCK_SIZE)
 
+/* The bio has a counter attached */
+static unsigned int *raid10_bio_ref_ptr(struct bio *bio)
+{
+	return md_get_per_bio_data(bio);
+}
+
 /*
  * When performing a resync, we need to read and compare, so
  * we need as many pages are there are copies.
@@ -304,11 +310,11 @@ static void raid_end_bio_io(struct r10bio *r10_bio)
 	int done;
 	struct r10conf *conf = r10_bio->mddev->private;
 
-	if (bio->bi_phys_segments) {
+	if (*raid10_bio_ref_ptr(bio)) {
 		unsigned long flags;
 		spin_lock_irqsave(&conf->device_lock, flags);
-		bio->bi_phys_segments--;
-		done = (bio->bi_phys_segments == 0);
+		(*raid10_bio_ref_ptr(bio))--;
+		done = (*raid10_bio_ref_ptr(bio) == 0);
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 	} else
 		done = 1;
@@ -1162,10 +1168,10 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
 				   - bio->bi_iter.bi_sector);
 		r10_bio->sectors = max_sectors;
 		spin_lock_irq(&conf->device_lock);
-		if (bio->bi_phys_segments == 0)
-			bio->bi_phys_segments = 2;
+		if (*raid10_bio_ref_ptr(bio) == 0)
+			*raid10_bio_ref_ptr(bio) = 2;
 		else
-			bio->bi_phys_segments++;
+			(*raid10_bio_ref_ptr(bio))++;
 		spin_unlock_irq(&conf->device_lock);
 		/*
 		 * Cannot call generic_make_request directly as that will be
@@ -1262,7 +1268,7 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 	 * writing to those blocks.  This potentially requires several
 	 * writes to write around the bad blocks.  Each set of writes
 	 * gets its own r10_bio with a set of bios attached.  The number
-	 * of r10_bios is recored in bio->bi_phys_segments just as with
+	 * of r10_bios is recorded in a counter just as with
 	 * the read case.
 	 */
 
@@ -1389,10 +1395,10 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 		 */
 		r10_bio->sectors = max_sectors;
 		spin_lock_irq(&conf->device_lock);
-		if (bio->bi_phys_segments == 0)
-			bio->bi_phys_segments = 2;
+		if (*raid10_bio_ref_ptr(bio) == 0)
+			*raid10_bio_ref_ptr(bio) = 2;
 		else
-			bio->bi_phys_segments++;
+			(*raid10_bio_ref_ptr(bio))++;
 		spin_unlock_irq(&conf->device_lock);
 	}
 	sectors_handled = r10_bio->sector + max_sectors -
@@ -1493,7 +1499,7 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 	if (sectors_handled < bio_sectors(bio)) {
 		one_write_done(r10_bio);
 		/* We need another r10_bio.  It has already been counted
-		 * in bio->bi_phys_segments.
+		 * in bio's counter.
 		 */
 		r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
 
@@ -1525,11 +1531,11 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
 	/*
 	 * We might need to issue multiple reads to different devices if there
 	 * are bad blocks around, so we keep track of the number of reads in
-	 * bio->bi_phys_segments.  If this is 0, there is only one r10_bio and
+	 * a counter.  If this is 0, there is only one r10_bio and
 	 * no locking will be needed when the request completes.  If it is
 	 * non-zero, then it is the number of not-completed requests.
 	 */
-	bio->bi_phys_segments = 0;
+	*raid10_bio_ref_ptr(bio) = 0;
 	bio_clear_flag(bio, BIO_SEG_VALID);
 
 	if (bio_data_dir(bio) == READ)
@@ -1567,6 +1573,7 @@ static void raid10_make_request(struct mddev *mddev, struct bio *bio)
 					   (chunk_sects - 1)),
 					  GFP_NOIO, fs_bio_set);
 			bio_chain(split, bio);
+			md_bio_attach_data(mddev, split);
 		} else {
 			split = bio;
 		}
@@ -2667,10 +2674,10 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
 			- mbio->bi_iter.bi_sector;
 		r10_bio->sectors = max_sectors;
 		spin_lock_irq(&conf->device_lock);
-		if (mbio->bi_phys_segments == 0)
-			mbio->bi_phys_segments = 2;
+		if (*raid10_bio_ref_ptr(mbio) == 0)
+			*raid10_bio_ref_ptr(mbio) = 2;
 		else
-			mbio->bi_phys_segments++;
+			(*raid10_bio_ref_ptr(mbio))++;
 		spin_unlock_irq(&conf->device_lock);
 		generic_make_request(bio);
 
@@ -4816,6 +4823,7 @@ static struct md_personality raid10_personality =
 	.start_reshape	= raid10_start_reshape,
 	.finish_reshape	= raid10_finish_reshape,
 	.congested	= raid10_congested,
+	.per_bio_data_size = sizeof(unsigned int),
 };
 
 static int __init raid_init(void)
-- 
2.9.3



* Re: [PATCH 1/5] MD: attach data to each bio
  2017-02-08  9:07   ` Guoqing Jiang
@ 2017-02-08  5:24     ` Shaohua Li
  2017-02-09  0:09     ` Wols Lists
  1 sibling, 0 replies; 16+ messages in thread
From: Shaohua Li @ 2017-02-08  5:24 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: Shaohua Li, linux-raid, khlebnikov, hch, neilb

On Wed, Feb 08, 2017 at 05:07:56PM +0800, Guoqing Jiang wrote:
> 
> 
> On 02/08/2017 12:55 AM, Shaohua Li wrote:
> > Currently MD is rebusing some bio fields. To remove the hack, we attach
> > extra data to each bio. Each personablity can attach extra data to the
> > bios, so we don't need to rebuse bio fields.
> 
> rebuse? Maybe it should be reuse, or abuse.

oops, this is a typo, should be abuse 
> [snip]
> 
> > +static inline void *md_get_per_bio_data(struct bio *bio)
> > +{
> > +	return ((struct md_per_bio_data *)bio->bi_private) + 1;
> > +}
> > +
> > +extern void md_bio_attach_data(struct mddev *mddev, struct bio *bio);
> 
> Why not make raid1_bio_ref_ptr and raid10_bio_ref_ptr macros
> like raid5_get_bio_data? Or add inline before the two funcs since
> both of them are called frequently.

I'll add 'inline'. Actually it doesn't really matter. The compiler will inline
simple functions automatically, and in this case I'm sure the compiler will
do so. On the other hand, 'inline' doesn't guarantee inlining unless you use
__attribute__((always_inline)).
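
If we really wanted to force it, the kernel's spelling of that attribute
is __always_inline, e.g. (just a sketch):

	static __always_inline unsigned int *raid1_bio_ref_ptr(struct bio *bio)
	{
		return md_get_per_bio_data(bio);
	}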

Thanks,
Shaohua


* Re: [PATCH 1/5] MD: attach data to each bio
  2017-02-07 16:55 ` [PATCH 1/5] MD: attach data to each bio Shaohua Li
@ 2017-02-08  9:07   ` Guoqing Jiang
  2017-02-08  5:24     ` Shaohua Li
  2017-02-09  0:09     ` Wols Lists
  2017-02-10  6:08   ` NeilBrown
  1 sibling, 2 replies; 16+ messages in thread
From: Guoqing Jiang @ 2017-02-08  9:07 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: khlebnikov, hch, neilb



On 02/08/2017 12:55 AM, Shaohua Li wrote:
> Currently MD is rebusing some bio fields. To remove the hack, we attach
> extra data to each bio. Each personablity can attach extra data to the
> bios, so we don't need to rebuse bio fields.

rebuse? Maybe it should be reuse, or abuse.

[snip]

> +static inline void *md_get_per_bio_data(struct bio *bio)
> +{
> +	return ((struct md_per_bio_data *)bio->bi_private) + 1;
> +}
> +
> +extern void md_bio_attach_data(struct mddev *mddev, struct bio *bio);

Why not make raid1_bio_ref_ptr and raid10_bio_ref_ptr macros
like raid5_get_bio_data? Or add inline before the two funcs since
both of them are called frequently.

Thanks,
Guoqing


* Re: [PATCH 1/5] MD: attach data to each bio
  2017-02-08  9:07   ` Guoqing Jiang
  2017-02-08  5:24     ` Shaohua Li
@ 2017-02-09  0:09     ` Wols Lists
  1 sibling, 0 replies; 16+ messages in thread
From: Wols Lists @ 2017-02-09  0:09 UTC (permalink / raw)
  To: Guoqing Jiang, Shaohua Li, linux-raid; +Cc: khlebnikov, hch, neilb

On 08/02/17 09:07, Guoqing Jiang wrote:
> 
> 
> On 02/08/2017 12:55 AM, Shaohua Li wrote:
>> Currently MD is rebusing some bio fields. To remove the hack, we attach
>> extra data to each bio. Each personablity can attach extra data to the
>> bios, so we don't need to rebuse bio fields.
> 
> rebuse? Maybe it should be reuse, or abuse.
> 
imho abuse

also
c/personablity/personality/

Cheers,
Wol



* Re: [PATCH 1/5] MD: attach data to each bio
  2017-02-07 16:55 ` [PATCH 1/5] MD: attach data to each bio Shaohua Li
  2017-02-08  9:07   ` Guoqing Jiang
@ 2017-02-10  6:08   ` NeilBrown
  2017-02-10  6:47     ` Shaohua Li
  2017-02-13  7:37     ` Christoph Hellwig
  1 sibling, 2 replies; 16+ messages in thread
From: NeilBrown @ 2017-02-10  6:08 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: khlebnikov, hch

On Tue, Feb 07 2017, Shaohua Li wrote:

> Currently MD is rebusing some bio fields. To remove the hack, we attach
> extra data to each bio. Each personablity can attach extra data to the
> bios, so we don't need to rebuse bio fields.

I must say that I don't really like this approach.
Temporarily modifying ->bi_private and ->bi_end_io seems
.... intrusive.   I suspect it works, but I wonder if it is really
robust in the long term.

How about a different approach..  Your main concern with my first patch
was that it called md_write_start() and md_write_end() much more often,
and these performed atomic ops on "global" variables, particularly
writes_pending.

We could change writes_pending to a per-cpu array which we only count
occasionally when needed.  As writes_pending is updated often and
checked rarely, a per-cpu array which is summed on demand seems
appropriate.

The following patch is an early draft - it doesn't obviously fail and
isn't obviously wrong to me.  There is certainly room for improvement
and there may be bugs.
Next week I'll work on collecting the re-factoring into separate
patches, which are possibly good-to-have anyway.

Thoughts?

Thanks,
NeilBrown


diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8926fb781cdc..883c409902b0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -64,6 +64,8 @@
 #include <linux/raid/md_p.h>
 #include <linux/raid/md_u.h>
 #include <linux/slab.h>
+#include <linux/percpu.h>
+
 #include <trace/events/block.h>
 #include "md.h"
 #include "bitmap.h"
@@ -2250,6 +2252,81 @@ static void export_array(struct mddev *mddev)
 	mddev->major_version = 0;
 }
 
+/*
+ * The percpu writes_pending counters are linked with ->checkers and ->lock.
+ * ->writes_pending can always be decremented without a lock.
+ * It can only be incremented without a lock if ->checkers is 0 and the test+incr
+ * happen in an RCU read-locked region.
+ * ->checkers can only be changed under ->lock protection.
+ * To determine if ->writes_pending is totally zero, a quick sum without
+ * locks can be performed. If this is non-zero, then the result is final.
+ * Otherwise ->checkers must be incremented and synchronize_rcu() called.
+ * Then a sum is calculated under ->lock, and the result is final until
+ * ->checkers is decremented, or the lock is dropped.
+ *
+ */
+
+static bool __writes_pending(struct mddev *mddev)
+{
+	long cnt = 0;
+	int i;
+	for_each_possible_cpu(i)
+		cnt += *per_cpu_ptr(mddev->writes_pending_percpu, i);
+	return cnt != 0;
+}
+
+static bool set_in_sync(struct mddev *mddev)
+{
+
+	WARN_ON_ONCE(!spin_is_locked(&mddev->lock));
+	if (__writes_pending(mddev)) {
+		if (mddev->safemode == 1)
+			mddev->safemode = 0;
+		return false;
+	}
+	if (mddev->in_sync)
+		return false;
+
+	mddev->checkers ++;
+	spin_unlock(&mddev->lock);
+	synchronize_rcu();
+	spin_lock(&mddev->lock);
+	if (mddev->in_sync) {
+		/* someone else set it */
+		mddev->checkers --;
+		return false;
+	}
+
+	if (! __writes_pending(mddev))
+		mddev->in_sync = 1;
+	if (mddev->safemode == 1)
+		mddev->safemode = 0;
+	mddev->checkers --;
+	return mddev->in_sync;
+}
+
+static void inc_writes_pending(struct mddev *mddev)
+{
+	rcu_read_lock();
+	if (mddev->checkers == 0) {
+		__this_cpu_inc(*mddev->writes_pending_percpu);
+		rcu_read_unlock();
+		return;
+	}
+	rcu_read_unlock();
+	/* Need that spinlock */
+	spin_lock(&mddev->lock);
+	this_cpu_inc(*mddev->writes_pending_percpu);
+	spin_unlock(&mddev->lock);
+}
+
+static void zero_writes_pending(struct mddev *mddev)
+{
+	int i;
+	for_each_possible_cpu(i)
+		*per_cpu_ptr(mddev->writes_pending_percpu, i) = 0;
+}
+
 static void sync_sbs(struct mddev *mddev, int nospares)
 {
 	/* Update each superblock (in-memory image), but
@@ -3583,7 +3660,6 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 	mddev->delta_disks = 0;
 	mddev->reshape_backwards = 0;
 	mddev->degraded = 0;
-	spin_unlock(&mddev->lock);
 
 	if (oldpers->sync_request == NULL &&
 	    mddev->external) {
@@ -3596,8 +3672,8 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 		 */
 		mddev->in_sync = 0;
 		mddev->safemode_delay = 0;
-		mddev->safemode = 0;
 	}
+	spin_unlock(&mddev->lock);
 
 	oldpers->free(mddev, oldpriv);
 
@@ -3963,16 +4039,11 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
 			wake_up(&mddev->sb_wait);
 			err = 0;
 		} else /* st == clean */ {
+			err = 0;
 			restart_array(mddev);
-			if (atomic_read(&mddev->writes_pending) == 0) {
-				if (mddev->in_sync == 0) {
-					mddev->in_sync = 1;
-					if (mddev->safemode == 1)
-						mddev->safemode = 0;
-					set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
-				}
-				err = 0;
-			} else
+			if (set_in_sync(mddev))
+				set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
+			else if (!mddev->in_sync)
 				err = -EBUSY;
 		}
 		if (!err)
@@ -4030,15 +4101,9 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
 			if (err)
 				break;
 			spin_lock(&mddev->lock);
-			if (atomic_read(&mddev->writes_pending) == 0) {
-				if (mddev->in_sync == 0) {
-					mddev->in_sync = 1;
-					if (mddev->safemode == 1)
-						mddev->safemode = 0;
-					set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
-				}
-				err = 0;
-			} else
+			if (set_in_sync(mddev))
+				set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
+			else if (!mddev->in_sync)
 				err = -EBUSY;
 			spin_unlock(&mddev->lock);
 		} else
@@ -4993,6 +5058,9 @@ static void md_free(struct kobject *ko)
 		del_gendisk(mddev->gendisk);
 		put_disk(mddev->gendisk);
 	}
+	if (mddev->writes_pending_percpu) {
+		free_percpu(mddev->writes_pending_percpu);
+	}
 
 	kfree(mddev);
 }
@@ -5069,6 +5137,13 @@ static int md_alloc(dev_t dev, char *name)
 	blk_queue_make_request(mddev->queue, md_make_request);
 	blk_set_stacking_limits(&mddev->queue->limits);
 
+	mddev->writes_pending_percpu = alloc_percpu(long);
+	if (!mddev->writes_pending_percpu) {
+		blk_cleanup_queue(mddev->queue);
+		mddev->queue = NULL;
+		goto abort;
+	}
+
 	disk = alloc_disk(1 << shift);
 	if (!disk) {
 		blk_cleanup_queue(mddev->queue);
@@ -5152,7 +5227,7 @@ static void md_safemode_timeout(unsigned long data)
 {
 	struct mddev *mddev = (struct mddev *) data;
 
-	if (!atomic_read(&mddev->writes_pending)) {
+	if (!__writes_pending(mddev)) {
 		mddev->safemode = 1;
 		if (mddev->external)
 			sysfs_notify_dirent_safe(mddev->sysfs_state);
@@ -5358,7 +5433,7 @@ int md_run(struct mddev *mddev)
 	} else if (mddev->ro == 2) /* auto-readonly not meaningful */
 		mddev->ro = 0;
 
-	atomic_set(&mddev->writes_pending,0);
+	zero_writes_pending(mddev);
 	atomic_set(&mddev->max_corr_read_errors,
 		   MD_DEFAULT_MAX_CORRECTED_READ_ERRORS);
 	mddev->safemode = 0;
@@ -5451,7 +5526,6 @@ static int restart_array(struct mddev *mddev)
 			return -EINVAL;
 	}
 
-	mddev->safemode = 0;
 	mddev->ro = 0;
 	set_disk_ro(disk, 0);
 	pr_debug("md: %s switched to read-write mode.\n", mdname(mddev));
@@ -7767,9 +7841,7 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
 		md_wakeup_thread(mddev->sync_thread);
 		did_change = 1;
 	}
-	atomic_inc(&mddev->writes_pending);
-	if (mddev->safemode == 1)
-		mddev->safemode = 0;
+	inc_writes_pending(mddev);
 	if (mddev->in_sync) {
 		spin_lock(&mddev->lock);
 		if (mddev->in_sync) {
@@ -7790,12 +7862,17 @@ EXPORT_SYMBOL(md_write_start);
 
 void md_write_end(struct mddev *mddev)
 {
-	if (atomic_dec_and_test(&mddev->writes_pending)) {
-		if (mddev->safemode == 2)
+	this_cpu_dec(*mddev->writes_pending_percpu);
+	if (mddev->safemode == 2) {
+		if (!__writes_pending(mddev))
 			md_wakeup_thread(mddev->thread);
-		else if (mddev->safemode_delay)
-			mod_timer(&mddev->safemode_timer, jiffies + mddev->safemode_delay);
-	}
+	} else if (mddev->safemode_delay)
+		/* The roundup() ensures this only performs locking once
+		 * every ->safemode_delay jiffies
+		 */
+		mod_timer(&mddev->safemode_timer,
+			  roundup(jiffies, mddev->safemode_delay) + mddev->safemode_delay);
+
 }
 EXPORT_SYMBOL(md_write_end);
 
@@ -8398,7 +8475,7 @@ void md_check_recovery(struct mddev *mddev)
 		test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
 		test_bit(MD_RELOAD_SB, &mddev->flags) ||
 		(mddev->external == 0 && mddev->safemode == 1) ||
-		(mddev->safemode == 2 && ! atomic_read(&mddev->writes_pending)
+		(mddev->safemode == 2 && ! __writes_pending(mddev)
 		 && !mddev->in_sync && mddev->recovery_cp == MaxSector)
 		))
 		return;
@@ -8450,19 +8527,14 @@ void md_check_recovery(struct mddev *mddev)
 				md_reload_sb(mddev, mddev->good_device_nr);
 		}
 
-		if (!mddev->external) {
+		if (!mddev->external && mddev->safemode) {
 			int did_change = 0;
 			spin_lock(&mddev->lock);
-			if (mddev->safemode &&
-			    !atomic_read(&mddev->writes_pending) &&
-			    !mddev->in_sync &&
-			    mddev->recovery_cp == MaxSector) {
-				mddev->in_sync = 1;
+			if (mddev->recovery_cp == MaxSector &&
+			    set_in_sync(mddev)) {
 				did_change = 1;
 				set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
 			}
-			if (mddev->safemode == 1)
-				mddev->safemode = 0;
 			spin_unlock(&mddev->lock);
 			if (did_change)
 				sysfs_notify_dirent_safe(mddev->sysfs_state);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 2a514036a83d..7e41f882d33d 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -404,7 +404,8 @@ struct mddev {
 							 */
 	unsigned int			safemode_delay;
 	struct timer_list		safemode_timer;
-	atomic_t			writes_pending;
+	long				*writes_pending_percpu;
+	int				checkers;	/* # of threads checking writes_pending */
 	struct request_queue		*queue;	/* for plugging ... */
 
 	struct bitmap			*bitmap; /* the bitmap for the device */
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3c7e106c12a2..45d260326efd 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7360,7 +7360,7 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
 		 * we can't wait pending write here, as this is called in
 		 * raid5d, wait will deadlock.
 		 */
-		if (atomic_read(&mddev->writes_pending))
+		if (!mddev->in_sync)
 			return -EBUSY;
 		log = conf->log;
 		conf->log = NULL;



* Re: [PATCH 1/5] MD: attach data to each bio
  2017-02-10  6:08   ` NeilBrown
@ 2017-02-10  6:47     ` Shaohua Li
  2017-02-13  9:49       ` NeilBrown
  2017-02-13  7:37     ` Christoph Hellwig
  1 sibling, 1 reply; 16+ messages in thread
From: Shaohua Li @ 2017-02-10  6:47 UTC (permalink / raw)
  To: NeilBrown; +Cc: Shaohua Li, linux-raid, khlebnikov, hch

On Fri, Feb 10, 2017 at 05:08:54PM +1100, Neil Brown wrote:
> On Tue, Feb 07 2017, Shaohua Li wrote:
> 
> > Currently MD is rebusing some bio fields. To remove the hack, we attach
> > extra data to each bio. Each personablity can attach extra data to the
> > bios, so we don't need to rebuse bio fields.
> 
> I must say that I don't really like this approach.
> Temporarily modifying ->bi_private and ->bi_end_io seems
> .... intrusive.   I suspect it works, but I wonder if it is really
> robust in the long term.
> 
> How about a different approach..  Your main concern with my first patch
> was that it called md_write_start() and md_write_end() much more often,
> and these performed atomic ops on "global" variables, particularly
> writes_pending.
> 
> We could change writes_pending to a per-cpu array which we only count
> occasionally when needed.  As writes_pending is updated often and
> checked rarely, a per-cpu array which is summed on demand seems
> appropriate.
> 
> The following patch is an early draft - it doesn't obviously fail and
> isn't obviously wrong to me.  There is certainly room for improvement
> and there may be bugs.
> Next week I'll work on collecting the re-factoring into separate
> patches, which are possibly good-to-have anyway.

For your first patch, I don't have much concern. It's OK with me. What I don't
like is the bi_phys_segments handling part. The patches add a lot of logic to
handle the reference count. They should work, but I'd say it's not easy to
understand and could be error prone. What we really need is a reference count
for the bio, so let's just add a reference count. That's my logic and it's
simple.

As for modifying bi_private and bi_end_io, I've seen some filesystems use
this approach, at least btrfs. If this is really too intrusive, is cloning a bio
better?
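
To be concrete, a hypothetical sketch of the clone variant, borrowing the
front_pad trick dm uses for its per-bio data (all names here are invented):

	struct md_clone_data {
		struct mddev	*mddev;
	};

	/* set up once: bs = bioset_create(POOL_SIZE, sizeof(struct md_clone_data)); */

	static struct md_clone_data *md_get_clone_data(struct bio *clone)
	{
		/* front_pad places our data immediately before the bio */
		return (void *)((char *)clone - sizeof(struct md_clone_data));
	}

	static void md_clone_endio(struct bio *clone)
	{
		struct bio *orig = clone->bi_private;

		orig->bi_error = clone->bi_error;
		bio_endio(orig);
		bio_put(clone);
	}

	static struct bio *md_clone_bio(struct mddev *mddev, struct bio *bio,
					struct bio_set *bs)
	{
		struct bio *clone = bio_clone_fast(bio, GFP_NOIO, bs);
		struct md_clone_data *d = md_get_clone_data(clone);

		d->mddev = mddev;
		clone->bi_private = bio;
		clone->bi_end_io = md_clone_endio;
		return clone;
	}

This way the original bio's bi_private/bi_end_io are never touched.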

Thanks,
Shaohua

> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 8926fb781cdc..883c409902b0 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -64,6 +64,8 @@
>  #include <linux/raid/md_p.h>
>  #include <linux/raid/md_u.h>
>  #include <linux/slab.h>
> +#include <linux/percpu.h>
> +
>  #include <trace/events/block.h>
>  #include "md.h"
>  #include "bitmap.h"
> @@ -2250,6 +2252,81 @@ static void export_array(struct mddev *mddev)
>  	mddev->major_version = 0;
>  }
>  
> +/*
> + * The percpu writes_pending counters are linked with ->checkers and ->lock.
> + * ->writes_pending can always be decremented without a lock.
> + * It can only be incremented without a lock if ->checkers is 0 and the test+incr
> + * happen in an RCU read-locked region.
> + * ->checkers can only be changed under ->lock protection.
> + * To determine if ->writes_pending is totally zero, a quick sum without
> + * locks can be performed. If this is non-zero, then the result is final.
> + * Otherwise ->checkers must be incremented and synchronize_rcu() called.
> + * Then a sum is calculated under ->lock, and the result is final until
> + * ->checkers is decremented, or the lock is dropped.
> + *
> + */
> +
> +static bool __writes_pending(struct mddev *mddev)
> +{
> +	long cnt = 0;
> +	int i;
> +	for_each_possible_cpu(i)
> +		cnt += *per_cpu_ptr(mddev->writes_pending_percpu, i);
> +	return cnt != 0;
> +}
> +
> +static bool set_in_sync(struct mddev *mddev)
> +{
> +
> +	WARN_ON_ONCE(!spin_is_locked(&mddev->lock));
> +	if (__writes_pending(mddev)) {
> +		if (mddev->safemode == 1)
> +			mddev->safemode = 0;
> +		return false;
> +	}
> +	if (mddev->in_sync)
> +		return false;
> +
> +	mddev->checkers ++;
> +	spin_unlock(&mddev->lock);
> +	synchronize_rcu();
> +	spin_lock(&mddev->lock);
> +	if (mddev->in_sync) {
> +		/* someone else set it */
> +		mddev->checkers --;
> +		return false;
> +	}
> +
> +	if (! __writes_pending(mddev))
> +		mddev->in_sync = 1;
> +	if (mddev->safemode == 1)
> +		mddev->safemode = 0;
> +	mddev->checkers --;
> +	return mddev->in_sync;
> +}
> +
> +static void inc_writes_pending(struct mddev *mddev)
> +{
> +	rcu_read_lock();
> +	if (mddev->checkers == 0) {
> +		__this_cpu_inc(*mddev->writes_pending_percpu);
> +		rcu_read_unlock();
> +		return;
> +	}
> +	rcu_read_unlock();
> +	/* Need that spinlock */
> +	spin_lock(&mddev->lock);
> +	this_cpu_inc(*mddev->writes_pending_percpu);
> +	spin_unlock(&mddev->lock);
> +}
> +
> +static void zero_writes_pending(struct mddev *mddev)
> +{
> +	int i;
> +	for_each_possible_cpu(i)
> +		*per_cpu_ptr(mddev->writes_pending_percpu, i) = 0;
> +}
> +
>  static void sync_sbs(struct mddev *mddev, int nospares)
>  {
>  	/* Update each superblock (in-memory image), but
> @@ -3583,7 +3660,6 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
>  	mddev->delta_disks = 0;
>  	mddev->reshape_backwards = 0;
>  	mddev->degraded = 0;
> -	spin_unlock(&mddev->lock);
>  
>  	if (oldpers->sync_request == NULL &&
>  	    mddev->external) {
> @@ -3596,8 +3672,8 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
>  		 */
>  		mddev->in_sync = 0;
>  		mddev->safemode_delay = 0;
> -		mddev->safemode = 0;
>  	}
> +	spin_unlock(&mddev->lock);
>  
>  	oldpers->free(mddev, oldpriv);
>  
> @@ -3963,16 +4039,11 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
>  			wake_up(&mddev->sb_wait);
>  			err = 0;
>  		} else /* st == clean */ {
> +			err = 0;
>  			restart_array(mddev);
> -			if (atomic_read(&mddev->writes_pending) == 0) {
> -				if (mddev->in_sync == 0) {
> -					mddev->in_sync = 1;
> -					if (mddev->safemode == 1)
> -						mddev->safemode = 0;
> -					set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
> -				}
> -				err = 0;
> -			} else
> +			if (set_in_sync(mddev))
> +				set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
> +			else if (!mddev->in_sync)
>  				err = -EBUSY;
>  		}
>  		if (!err)
> @@ -4030,15 +4101,9 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
>  			if (err)
>  				break;
>  			spin_lock(&mddev->lock);
> -			if (atomic_read(&mddev->writes_pending) == 0) {
> -				if (mddev->in_sync == 0) {
> -					mddev->in_sync = 1;
> -					if (mddev->safemode == 1)
> -						mddev->safemode = 0;
> -					set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
> -				}
> -				err = 0;
> -			} else
> +			if (set_in_sync(mddev))
> +				set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
> +			else if (!mddev->in_sync)
>  				err = -EBUSY;
>  			spin_unlock(&mddev->lock);
>  		} else
> @@ -4993,6 +5058,9 @@ static void md_free(struct kobject *ko)
>  		del_gendisk(mddev->gendisk);
>  		put_disk(mddev->gendisk);
>  	}
> +	if (mddev->writes_pending_percpu) {
> +		free_percpu(mddev->writes_pending_percpu);
> +	}
>  
>  	kfree(mddev);
>  }
> @@ -5069,6 +5137,13 @@ static int md_alloc(dev_t dev, char *name)
>  	blk_queue_make_request(mddev->queue, md_make_request);
>  	blk_set_stacking_limits(&mddev->queue->limits);
>  
> +	mddev->writes_pending_percpu = alloc_percpu(long);
> +	if (!mddev->writes_pending_percpu) {
> +		blk_cleanup_queue(mddev->queue);
> +		mddev->queue = NULL;
> +		goto abort;
> +	}
> +
>  	disk = alloc_disk(1 << shift);
>  	if (!disk) {
>  		blk_cleanup_queue(mddev->queue);
> @@ -5152,7 +5227,7 @@ static void md_safemode_timeout(unsigned long data)
>  {
>  	struct mddev *mddev = (struct mddev *) data;
>  
> -	if (!atomic_read(&mddev->writes_pending)) {
> +	if (!__writes_pending(mddev)) {
>  		mddev->safemode = 1;
>  		if (mddev->external)
>  			sysfs_notify_dirent_safe(mddev->sysfs_state);
> @@ -5358,7 +5433,7 @@ int md_run(struct mddev *mddev)
>  	} else if (mddev->ro == 2) /* auto-readonly not meaningful */
>  		mddev->ro = 0;
>  
> -	atomic_set(&mddev->writes_pending,0);
> +	zero_writes_pending(mddev);
>  	atomic_set(&mddev->max_corr_read_errors,
>  		   MD_DEFAULT_MAX_CORRECTED_READ_ERRORS);
>  	mddev->safemode = 0;
> @@ -5451,7 +5526,6 @@ static int restart_array(struct mddev *mddev)
>  			return -EINVAL;
>  	}
>  
> -	mddev->safemode = 0;
>  	mddev->ro = 0;
>  	set_disk_ro(disk, 0);
>  	pr_debug("md: %s switched to read-write mode.\n", mdname(mddev));
> @@ -7767,9 +7841,7 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
>  		md_wakeup_thread(mddev->sync_thread);
>  		did_change = 1;
>  	}
> -	atomic_inc(&mddev->writes_pending);
> -	if (mddev->safemode == 1)
> -		mddev->safemode = 0;
> +	inc_writes_pending(mddev);
>  	if (mddev->in_sync) {
>  		spin_lock(&mddev->lock);
>  		if (mddev->in_sync) {
> @@ -7790,12 +7862,17 @@ EXPORT_SYMBOL(md_write_start);
>  
>  void md_write_end(struct mddev *mddev)
>  {
> -	if (atomic_dec_and_test(&mddev->writes_pending)) {
> -		if (mddev->safemode == 2)
> +	this_cpu_dec(*mddev->writes_pending_percpu);
> +	if (mddev->safemode == 2) {
> +		if (!__writes_pending(mddev))
>  			md_wakeup_thread(mddev->thread);
> -		else if (mddev->safemode_delay)
> -			mod_timer(&mddev->safemode_timer, jiffies + mddev->safemode_delay);
> -	}
> +	} else if (mddev->safemode_delay)
> +		/* The roundup() ensures this only performs locking once
> +		 * every ->safemode_delay jiffies
> +		 */
> +		mod_timer(&mddev->safemode_timer,
> +			  roundup(jiffies, mddev->safemode_delay) + mddev->safemode_delay);
> +
>  }
>  EXPORT_SYMBOL(md_write_end);
>  
> @@ -8398,7 +8475,7 @@ void md_check_recovery(struct mddev *mddev)
>  		test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
>  		test_bit(MD_RELOAD_SB, &mddev->flags) ||
>  		(mddev->external == 0 && mddev->safemode == 1) ||
> -		(mddev->safemode == 2 && ! atomic_read(&mddev->writes_pending)
> +		(mddev->safemode == 2 && ! __writes_pending(mddev)
>  		 && !mddev->in_sync && mddev->recovery_cp == MaxSector)
>  		))
>  		return;
> @@ -8450,19 +8527,14 @@ void md_check_recovery(struct mddev *mddev)
>  				md_reload_sb(mddev, mddev->good_device_nr);
>  		}
>  
> -		if (!mddev->external) {
> +		if (!mddev->external && mddev->safemode) {
>  			int did_change = 0;
>  			spin_lock(&mddev->lock);
> -			if (mddev->safemode &&
> -			    !atomic_read(&mddev->writes_pending) &&
> -			    !mddev->in_sync &&
> -			    mddev->recovery_cp == MaxSector) {
> -				mddev->in_sync = 1;
> +			if (mddev->recovery_cp == MaxSector &&
> +			    set_in_sync(mddev)) {
>  				did_change = 1;
>  				set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
>  			}
> -			if (mddev->safemode == 1)
> -				mddev->safemode = 0;
>  			spin_unlock(&mddev->lock);
>  			if (did_change)
>  				sysfs_notify_dirent_safe(mddev->sysfs_state);
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 2a514036a83d..7e41f882d33d 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -404,7 +404,8 @@ struct mddev {
>  							 */
>  	unsigned int			safemode_delay;
>  	struct timer_list		safemode_timer;
> -	atomic_t			writes_pending;
> +	long				*writes_pending_percpu;
> +	int				checkers;	/* # of threads checking writes_pending */
>  	struct request_queue		*queue;	/* for plugging ... */
>  
>  	struct bitmap			*bitmap; /* the bitmap for the device */
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 3c7e106c12a2..45d260326efd 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -7360,7 +7360,7 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
>  		 * we can't wait pending write here, as this is called in
>  		 * raid5d, wait will deadlock.
>  		 */
> -		if (atomic_read(&mddev->writes_pending))
> +		if (!mddev->in_sync)
>  			return -EBUSY;
>  		log = conf->log;
>  		conf->log = NULL;




* Re: [PATCH 1/5] MD: attach data to each bio
  2017-02-10  6:08   ` NeilBrown
  2017-02-10  6:47     ` Shaohua Li
@ 2017-02-13  7:37     ` Christoph Hellwig
  2017-02-13  9:32       ` NeilBrown
  1 sibling, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2017-02-13  7:37 UTC (permalink / raw)
  To: NeilBrown; +Cc: Shaohua Li, linux-raid, khlebnikov, hch

On Fri, Feb 10, 2017 at 05:08:54PM +1100, NeilBrown wrote:
> I must say that I don't really like this approach.
> Temporarily modifying ->bi_private and ->bi_end_io seems
> .... intrusive.   I suspect it works, but I wonder if it is really
> robust in the long term.
> 
> How about a different approach..  Your main concern with my first patch
> was that it called md_write_start() and md_write_end() much more often,
> and these performed atomic ops on "global" variables, particularly
> writes_pending.
> 
> We could change writes_pending to a per-cpu array which we only count
> occasionally when needed.  As writes_pending is updated often and
> checked rarely, a per-cpu array which is summed on demand seems
> appropriate.

FYI, I much prefer your original approach, it's much closer to how
the rest of the block stack works. 


* Re: [PATCH 1/5] MD: attach data to each bio
  2017-02-13  7:37     ` Christoph Hellwig
@ 2017-02-13  9:32       ` NeilBrown
  0 siblings, 0 replies; 16+ messages in thread
From: NeilBrown @ 2017-02-13  9:32 UTC (permalink / raw)
  Cc: Shaohua Li, linux-raid, khlebnikov, hch

On Mon, Feb 13 2017, Christoph Hellwig wrote:

> On Fri, Feb 10, 2017 at 05:08:54PM +1100, NeilBrown wrote:
>> I must say that I don't really like this approach.
>> Temporarily modifying ->bi_private and ->bi_end_io seems
>> .... intrusive.   I suspect it works, but I wonder if it is really
>> robust in the long term.
>> 
>> How about a different approach..  Your main concern with my first patch
>> was that it called md_write_start() and md_write_end() much more often,
>> and these performed atomic ops on "global" variables, particular
>> writes_pending.
>> 
>> We could change writes_pending to a per-cpu array which we only count
>> occasionally when needed.  As writes_pending is updated often and
>> checked rarely, a per-cpu array which is summed on demand seems
>> appropriate.
>
> FYI, I much prefer your original approach, it's much closer to how
> the rest of the block stack works. 

I probably wasn't clear, but my intention was to stick with my original
approach and make it more acceptable by removing the extra cost of
cache-line bouncing that Shaohua correctly identified,
i.e. this patch was a preliminary step to improve the original series.

Thanks,
NeilBrown



* Re: [PATCH 1/5] MD: attach data to each bio
  2017-02-10  6:47     ` Shaohua Li
@ 2017-02-13  9:49       ` NeilBrown
  2017-02-13 18:49         ` Shaohua Li
  0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2017-02-13  9:49 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Shaohua Li, linux-raid, khlebnikov, hch

On Thu, Feb 09 2017, Shaohua Li wrote:

> On Fri, Feb 10, 2017 at 05:08:54PM +1100, Neil Brown wrote:
>> On Tue, Feb 07 2017, Shaohua Li wrote:
>> 
>> > Currently MD is rebusing some bio fields. To remove the hack, we attach
>> > extra data to each bio. Each personablity can attach extra data to the
>> > bios, so we don't need to rebuse bio fields.
>> 
>> I must say that I don't really like this approach.
>> Temporarily modifying ->bi_private and ->bi_end_io seems
>> .... intrusive.   I suspect it works, but I wonder if it is really
>> robust in the long term.
>> 
>> How about a different approach..  Your main concern with my first patch
>> was that it called md_write_start() and md_write_end() much more often,
>> and these performed atomic ops on "global" variables, particularly
>> writes_pending.
>> 
>> We could change writes_pending to a per-cpu array which we only count
>> occasionally when needed.  As writes_pending is updated often and
>> checked rarely, a per-cpu array which is summed on demand seems
>> appropriate.
>> 
>> The following patch is an early draft - it doesn't obviously fail and
>> isn't obviously wrong to me.  There is certainly room for improvement
>> and there may be bugs.
>> Next week I'll work on collecting the re-factoring into separate
>> patches, which are possibly good-to-have anyway.
>
> For your first patch, I don't have much concern; it's OK with me. What I
> don't like is the bi_phys_segments handling part. The patches add a lot of
> logic to handle the reference count. They should work, but I'd say they're
> not easy to understand and could be error-prone. What we really need is a
> reference count for the bio, so let's just add a reference count. That's my
> logic, and it's simple.

We already have two reference counts, and you want to add a third one.

bi_phys_segments is currently used for two related purposes.
It counts the number of stripe_heads currently attached to the bio so
that when the count reaches zero:
 1/ ->writes_pending can be decremented
 2/ bio_endio() can be called.

When the code was written, the __bi_remaining counter didn't exist.  Now
it does, and it is integrated with bio_endio(), so the code should become
easier to understand if we just use bio_endio() rather than doing our own
accounting.
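
As a minimal sketch of that direction (the add_stripe_ref() and
stripe_ref_done() helpers are hypothetical names, not existing MD code):

#include <linux/bio.h>

/* Take one extra "completion" for each stripe_head attached to bi. */
static void add_stripe_ref(struct bio *bi)
{
	bio_inc_remaining(bi);
}

/*
 * Drop one reference.  bio_endio() decrements __bi_remaining
 * internally and only runs the real ->bi_end_io once the last holder
 * calls it, so no driver-private counting is needed.
 */
static void stripe_ref_done(struct bio *bi)
{
	bio_endio(bi);
}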

That just leaves '1'.  Instead of decrementing a per-bio refcount and,
only when it reaches zero, decrementing ->writes_pending, we can easily
decrement ->writes_pending directly.  As you pointed out, that comes with
a cost.  If ->writes_pending is changed to a per-cpu array which is
summed on demand, the cost goes away.

Having an extra refcount in the bio just adds a level of indirection
that doesn't (that I can see) provide actual value.

>
> For the modifying bi_private and bi_end_io part, I saw some filesystems are
> using this way, at least btrfs. If this is really intrusive, is cloning a bio
> better?

The bio belongs to the filesystem.  It allocated it and can do whatever
it likes with bi_end_io and bi_private.  I don't think a block device
driver should ever change bi_private or bi_end_io of a bio that it was
passed (if it allocates its own bios, it can of course change those).
I don't think cloning the bio would really help, though you could
probably make something work.
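
If someone did want to try it, the shape would be roughly this (a rough
sketch only; 'md_bs' stands for a hypothetical driver-private bio_set):

static void md_clone_end_io(struct bio *clone)
{
	struct bio *orig = clone->bi_private;

	orig->bi_error = clone->bi_error;	/* propagate the status */
	bio_put(clone);
	bio_endio(orig);	/* the filesystem's end_io runs untouched */
}

static struct bio *md_clone_for_internal_use(struct bio *orig,
					     struct bio_set *md_bs)
{
	struct bio *clone = bio_clone_fast(orig, GFP_NOIO, md_bs);

	if (!clone)
		return NULL;
	/* The clone is the driver's own bio, so repointing these is fine. */
	clone->bi_private = orig;
	clone->bi_end_io = md_clone_end_io;
	return clone;	/* the driver processes this instead of 'orig' */
}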

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/5] MD: attach data to each bio
  2017-02-13  9:49       ` NeilBrown
@ 2017-02-13 18:49         ` Shaohua Li
  2017-02-14  2:40           ` NeilBrown
  0 siblings, 1 reply; 16+ messages in thread
From: Shaohua Li @ 2017-02-13 18:49 UTC (permalink / raw)
  To: NeilBrown; +Cc: Shaohua Li, linux-raid, khlebnikov, hch

On Mon, Feb 13, 2017 at 08:49:33PM +1100, Neil Brown wrote:
> On Thu, Feb 09 2017, Shaohua Li wrote:
> 
> > On Fri, Feb 10, 2017 at 05:08:54PM +1100, Neil Brown wrote:
> >> On Tue, Feb 07 2017, Shaohua Li wrote:
> >> 
> >> > Currently MD is abusing some bio fields. To remove the hack, we attach
> >> > extra data to each bio. Each personality can attach extra data to the
> >> > bios, so we don't need to abuse bio fields.
> >> 
> >> I must say that I don't really like this approach.
> >> Temporarily modifying ->bi_private and ->bi_end_io seems
> >> .... intrusive.   I suspect it works, but I wonder if it is really
> >> robust in the long term.
> >> 
> >> How about a different approach..  Your main concern with my first patch
> >> was that it called md_write_start() and md_write_end() much more often,
> >> and these performed atomic ops on "global" variables, in particular
> >> writes_pending.
> >> 
> >> We could change writes_pending to a per-cpu array which we only count
> >> occasionally when needed.  As writes_pending is updated often and
> >> checked rarely, a per-cpu array which is summed on demand seems
> >> appropriate.
> >> 
> >> The following patch is an early draft - it doesn't obviously fail and
> >> isn't obviously wrong to me.  There is certainly room for improvement,
> >> and there may be bugs.
> >> Next week I'll work on collecting the re-factoring into separate
> >> patches, which are possibly good to have anyway.
> >
> > For your first patch, I don't have much concern; it's OK with me. What I
> > don't like is the bi_phys_segments handling part. The patches add a lot of
> > logic to handle the reference count. They should work, but I'd say they're
> > not easy to understand and could be error-prone. What we really need is a
> > reference count for the bio, so let's just add a reference count. That's my
> > logic, and it's simple.
> 
> We already have two reference counts, and you want to add a third one.
> 
> bi_phys_segments is currently used for two related purposes.
> It counts the number of stripe_heads currently attached to the bio so
> that when the count reaches zero:
>  1/ ->writes_pending can be decremented
>  2/ bio_endio() can be called.
> 
> When the code was written, the __bi_remaining counter didn't exist.  Now
> it does, and it is integrated with bio_endio(), so the code should become
> easier to understand if we just use bio_endio() rather than doing our own
> accounting.
> 
> That just leaves '1'.  Instead of decrementing a per-bio refcount and,
> only when it reaches zero, decrementing ->writes_pending, we can easily
> decrement ->writes_pending directly.  As you pointed out, that comes with
> a cost.  If ->writes_pending is changed to a per-cpu array which is
> summed on demand, the cost goes away.
> 
> Having an extra refcount in the bio just adds a level of indirection
> that doesn't (that I can see) provide actual value.

OK, fair enough. I do think an explicit counter on the driver side would help
us a lot, e.g. we could better control when to end the bio and do something
there (for example the current blk trace, or something we want to add in the
future). But I won't insist for now.

For the patches, can you repost? I think:
- patch 2 missed md_write_start for make_discard_request (sketched below)
- it's unnecessary to zero bi_phys_segments in patch 5, and raid5-cache needs
  to make the same bio_endio change
For the md_write_start optimization, we can do it later.
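
To illustrate the first point only -- this is my understanding of the
missing call, not an actual patch; make_discard_request() here is
raid5's existing function:

static void make_discard_request(struct mddev *mddev, struct bio *bi)
{
	/* The bracketing call patch 2 appears to be missing: */
	md_write_start(mddev, bi);

	/* ... split the discard across the affected stripes ... */

	/*
	 * The matching md_write_end(mddev) then belongs on the
	 * completion path, once the last piece finishes.
	 */
}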

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/5] MD: attach data to each bio
  2017-02-13 18:49         ` Shaohua Li
@ 2017-02-14  2:40           ` NeilBrown
  0 siblings, 0 replies; 16+ messages in thread
From: NeilBrown @ 2017-02-14  2:40 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Shaohua Li, linux-raid, khlebnikov, hch

On Mon, Feb 13 2017, Shaohua Li wrote:

> On Mon, Feb 13, 2017 at 08:49:33PM +1100, Neil Brown wrote:
>> On Thu, Feb 09 2017, Shaohua Li wrote:
>> 
>> > On Fri, Feb 10, 2017 at 05:08:54PM +1100, Neil Brown wrote:
>> >> On Tue, Feb 07 2017, Shaohua Li wrote:
>> >> 
>> >> > Currently MD is abusing some bio fields. To remove the hack, we attach
>> >> > extra data to each bio. Each personality can attach extra data to the
>> >> > bios, so we don't need to abuse bio fields.
>> >> 
>> >> I must say that I don't really like this approach.
>> >> Temporarily modifying ->bi_private and ->bi_end_io seems
>> >> .... intrusive.   I suspect it works, but I wonder if it is really
>> >> robust in the long term.
>> >> 
>> >> How about a different approach..  Your main concern with my first patch
>> >> was that it called md_write_start() and md_write_end() much more often,
>> >> and these performed atomic ops on "global" variables, in particular
>> >> writes_pending.
>> >> 
>> >> We could change writes_pending to a per-cpu array which we only count
>> >> occasionally when needed.  As writes_pending is updated often and
>> >> checked rarely, a per-cpu array which is summed on demand seems
>> >> appropriate.
>> >> 
>> >> The following patch is an early draft - it doesn't obviously fail and
>> >> isn't obviously wrong to me.  There is certainly room for improvement,
>> >> and there may be bugs.
>> >> Next week I'll work on collecting the re-factoring into separate
>> >> patches, which are possibly good to have anyway.
>> >
>> > For your first patch, I don't have much concern; it's OK with me. What I
>> > don't like is the bi_phys_segments handling part. The patches add a lot of
>> > logic to handle the reference count. They should work, but I'd say they're
>> > not easy to understand and could be error-prone. What we really need is a
>> > reference count for the bio, so let's just add a reference count. That's my
>> > logic, and it's simple.
>> 
>> We already have two reference counts, and you want to add a third one.
>> 
>> bi_phys_segments is currently used for two related purposes.
>> It counts the number of stripe_heads currently attached to the bio so
>> that when the count reaches zero:
>>  1/ ->writes_pending can be decremented
>>  2/ bio_endio() can be called.
>> 
>> When the code was written, the __bi_remaining counter didn't exist.  Now
>> it does, and it is integrated with bio_endio(), so the code should become
>> easier to understand if we just use bio_endio() rather than doing our own
>> accounting.
>> 
>> That just leaves '1'.  Instead of decrementing a per-bio refcount and,
>> only when it reaches zero, decrementing ->writes_pending, we can easily
>> decrement ->writes_pending directly.  As you pointed out, that comes with
>> a cost.  If ->writes_pending is changed to a per-cpu array which is
>> summed on demand, the cost goes away.
>> 
>> Having an extra refcount in the bio just adds a level of indirection
>> that doesn't (that I can see) provide actual value.
>
> OK, fair enough. I do think an explicit counter on the driver side would help
> us a lot, e.g. we could better control when to end the bio and do something
> there (for example the current blk trace, or something we want to add in the
> future). But I won't insist for now.
>
> For the patches, can you repost? I think:
> - patch 2 missed md_write_start for make_discard_request (sketched below)
> - it's unnecessary to zero bi_phys_segments in patch 5, and raid5-cache needs
>   to make the same bio_endio change
> For the md_write_start optimization, we can do it later.

Sure. I agree those two changes are needed.  I'll try to send something
in the next day or so.

NeilBrown


>
> Thanks,
> Shaohua

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-02-14  2:40 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
2017-02-07 16:55 [PATCH 0/5] MD: don't abuse bi_phys_segements Shaohua Li
2017-02-07 16:55 ` [PATCH 1/5] MD: attach data to each bio Shaohua Li
2017-02-08  9:07   ` Guoqing Jiang
2017-02-08  5:24     ` Shaohua Li
2017-02-09  0:09     ` Wols Lists
2017-02-10  6:08   ` NeilBrown
2017-02-10  6:47     ` Shaohua Li
2017-02-13  9:49       ` NeilBrown
2017-02-13 18:49         ` Shaohua Li
2017-02-14  2:40           ` NeilBrown
2017-02-13  7:37     ` Christoph Hellwig
2017-02-13  9:32       ` NeilBrown
2017-02-07 16:55 ` [PATCH 2/5] md/raid5: don't abuse bio->bi_phys_segments Shaohua Li
2017-02-07 16:56 ` [PATCH 3/5] md/raid5: change disk limits Shaohua Li
2017-02-07 16:56 ` [PATCH 4/5] md/raid1: don't abuse bio->bi_phys_segments Shaohua Li
2017-02-07 16:56 ` [PATCH 5/5] md/raid10: " Shaohua Li
