* [PATCH v2 0/3] Badblock tracking for gendisks
@ 2015-11-25 18:43 Vishal Verma
  2015-11-25 18:43 ` [PATCH v2 1/3] badblocks: Add core badblock management code Vishal Verma
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Vishal Verma @ 2015-11-25 18:43 UTC
  To: linux-nvdimm
  Cc: Vishal Verma, linux-block, linux-raid, linux-scsi, Jens Axboe,
	NeilBrown, Jeff Moyer

v2:
  - In badblocks_free, make 'page' NULL (patch 1)
  - Move the core badblocks code to a new .c file (patch 1) (Jens)
  - Fix a sizeof usage in disk_alloc_badblocks (patch 2) (Dan)
  - Since disk_alloc_badblocks can fail, check disk->bb for NULL in the
    genhd wrappers (patch 2) (Jeff)
  - Update the md conversion to also use the badblocks init and free
    functions (patch 3)
  - Remove the BB_* macros from md.h as they are now in badblocks.h (patch 3)

Patch 1 copies the badblock management code out of md into a new
file, block/badblocks.c, with its interface in
include/linux/badblocks.h, making it generally available. It follows
the model of common kernel libraries such as linked lists, where
anyone may embed the core data structure in another structure and
use the provided accessor functions to manipulate the data.

Patch 2 adds badblock tracking to gendisks (in preparation for use
by NVDIMM devices). Right now, it is turned on unconditionally; I'd
appreciate comments on whether that is the right approach.

Patch 3 converts md over to use the new badblocks 'library'. I have
done some pretty simple testing on this - created a raid 1 device,
made sure the sysfs entries show up, and can be used to add and view
badblocks. A closer look by the md folks would be nice here.
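
For anyone repeating that test from userspace: records are written to
the sysfs file as "sector length", optionally newline-terminated, and
badblocks_show() emits the same two-column form. A minimal sketch of a
parser for one such record follows. It mirrors the sscanf() logic of
badblocks_store() in patch 1, but parse_badblock_line() is a
hypothetical helper and is not part of this series.

```c
#include <stdio.h>

/*
 * Hypothetical userspace helper (not part of this series): parse one
 * "sector length" record, using the same sscanf() pattern as
 * badblocks_store(). Returns 0 on success, -1 on malformed input.
 */
static inline int parse_badblock_line(const char *line,
				      unsigned long long *sector,
				      int *length)
{
	char newline;

	switch (sscanf(line, "%llu %d%c", sector, length, &newline)) {
	case 3:
		/* A third character is only accepted if it is a newline. */
		if (newline != '\n')
			return -1;
		/* fall through */
	case 2:
		/* Zero or negative lengths are rejected. */
		if (*length <= 0)
			return -1;
		return 0;
	default:
		return -1;
	}
}
```

A record such as "2048 8\n" is accepted, while "2048 0\n" or a line with
only one field is rejected.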


Vishal Verma (3):
  badblocks: Add core badblock management code
  block: Add badblock management for gendisks
  md: convert to use the generic badblocks code

 block/Makefile            |   2 +-
 block/badblocks.c         | 523 ++++++++++++++++++++++++++++++++++++++++++++++
 block/genhd.c             |  81 +++++++
 drivers/md/md.c           | 507 ++------------------------------------------
 drivers/md/md.h           |  40 +---
 include/linux/badblocks.h |  53 +++++
 include/linux/genhd.h     |   6 +
 7 files changed, 687 insertions(+), 525 deletions(-)
 create mode 100644 block/badblocks.c
 create mode 100644 include/linux/badblocks.h

-- 
2.5.0



* [PATCH v2 1/3] badblocks: Add core badblock management code
  2015-11-25 18:43 [PATCH v2 0/3] Badblock tracking for gendisks Vishal Verma
@ 2015-11-25 18:43 ` Vishal Verma
  2015-12-04 23:30   ` James Bottomley
  2015-11-25 18:43 ` [PATCH v2 2/3] block: Add badblock management for gendisks Vishal Verma
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Vishal Verma @ 2015-11-25 18:43 UTC
  To: linux-nvdimm
  Cc: Vishal Verma, linux-block, linux-raid, linux-scsi, Jens Axboe,
	NeilBrown, Jeff Moyer

Take the core badblocks implementation from md, and make it generally
available. This follows the same style as kernel implementations of
linked lists, rb-trees etc, where you can have a structure that can be
embedded anywhere, and accessor functions to manipulate the data.

The only changes in this copy of the code are ones to generalize
function/variable names from md-specific ones. Also add init and free
functions.

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
 block/Makefile            |   2 +-
 block/badblocks.c         | 523 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/badblocks.h |  53 +++++
 3 files changed, 577 insertions(+), 1 deletion(-)
 create mode 100644 block/badblocks.c
 create mode 100644 include/linux/badblocks.h
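
Note for reviewers: each table entry is a single u64 holding a 9-bit
length field (stored as length - 1), a 54-bit start sector, and the
'acknowledged' flag in the top bit. The userspace sketch below restates
that packing with the masks from include/linux/badblocks.h. The bb_*()
function names are hypothetical stand-ins for the BB_MAKE, BB_OFFSET,
BB_LEN and BB_ACK macros, and the asserts are added here only for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Masks as defined in include/linux/badblocks.h in this patch. */
#define BB_LEN_MASK	(0x00000000000001FFULL)
#define BB_OFFSET_MASK	(0x7FFFFFFFFFFFFE00ULL)
#define BB_ACK_MASK	(0x8000000000000000ULL)
#define BB_MAX_LEN	512

/* Hypothetical function form of BB_MAKE(); range checks are illustrative. */
static inline uint64_t bb_make(uint64_t offset, unsigned int len, int ack)
{
	assert(len >= 1 && len <= BB_MAX_LEN);	/* stored as len - 1 in 9 bits */
	assert(offset < (1ULL << 54));		/* start sector has 54 bits */
	return (offset << 9) | (uint64_t)(len - 1) |
	       ((uint64_t)(!!ack) << 63);
}

/* Hypothetical function forms of BB_OFFSET(), BB_LEN() and BB_ACK(). */
static inline uint64_t bb_offset(uint64_t e)
{
	return (e & BB_OFFSET_MASK) >> 9;
}

static inline unsigned int bb_len(uint64_t e)
{
	return (unsigned int)(e & BB_LEN_MASK) + 1;
}

static inline int bb_ack(uint64_t e)
{
	return !!(e & BB_ACK_MASK);
}
```

Since the length field stores length - 1, a single entry covers between
1 and 512 sectors, which is why badblocks_set() splits longer ranges
into several entries.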

diff --git a/block/Makefile b/block/Makefile
index 00ecc97..db5f622 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -8,7 +8,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-iopoll.o blk-lib.o blk-mq.o blk-mq-tag.o \
 			blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
 			genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
-			partitions/
+			badblocks.o partitions/
 
 obj-$(CONFIG_BOUNCE)	+= bounce.o
 obj-$(CONFIG_BLK_DEV_BSG)	+= bsg.o
diff --git a/block/badblocks.c b/block/badblocks.c
new file mode 100644
index 0000000..6e07855
--- /dev/null
+++ b/block/badblocks.c
@@ -0,0 +1,523 @@
+/*
+ * Bad block management
+ *
+ * - Heavily based on MD badblocks code from Neil Brown
+ *
+ * Copyright (c) 2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/badblocks.h>
+#include <linux/seqlock.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/stddef.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+
+/*
+ * We can record which blocks on each device are 'bad' and so just
+ * fail those blocks, or that stripe, rather than the whole device.
+ * Entries in the bad-block table are 64bits wide.  This comprises:
+ * Length of bad-range, in sectors: 0-511 for lengths 1-512
+ * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
+ *  A 'shift' can be set so that larger blocks are tracked and
+ *  consequently larger devices can be covered.
+ * 'Acknowledged' flag - 1 bit. - the most significant bit.
+ *
+ * Locking of the bad-block table uses a seqlock so badblocks_check
+ * might need to retry if it is very unlucky.
+ * We will sometimes want to check for bad blocks in a bi_end_io function,
+ * so we use the write_seqlock_irq variant.
+ *
+ * When looking for a bad block we specify a range and want to
+ * know if any block in the range is bad.  So we binary-search
+ * to the last range that starts at-or-before the given endpoint,
+ * (or "before the sector after the target range")
+ * then see if it ends after the given start.
+ * We return
+ *  0 if there are no known bad blocks in the range
+ *  1 if there are known bad blocks which are all acknowledged
+ * -1 if there are bad blocks which have not yet been acknowledged in metadata.
+ * plus the start/length of the first bad section we overlap.
+ */
+int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
+			sector_t *first_bad, int *bad_sectors)
+{
+	int hi;
+	int lo;
+	u64 *p = bb->page;
+	int rv;
+	sector_t target = s + sectors;
+	unsigned seq;
+
+	if (bb->shift > 0) {
+		/* round the start down, and the end up */
+		s >>= bb->shift;
+		target += (1<<bb->shift) - 1;
+		target >>= bb->shift;
+		sectors = target - s;
+	}
+	/* 'target' is now the first block after the bad range */
+
+retry:
+	seq = read_seqbegin(&bb->lock);
+	lo = 0;
+	rv = 0;
+	hi = bb->count;
+
+	/* Binary search between lo and hi for 'target'
+	 * i.e. for the last range that starts before 'target'
+	 */
+	/* INVARIANT: ranges before 'lo' and at-or-after 'hi'
+	 * are known not to be the last range before target.
+	 * VARIANT: hi-lo is the number of possible
+	 * ranges, and decreases until it reaches 1
+	 */
+	while (hi - lo > 1) {
+		int mid = (lo + hi) / 2;
+		sector_t a = BB_OFFSET(p[mid]);
+
+		if (a < target)
+			/* This could still be the one, earlier ranges
+			 * could not.
+			 */
+			lo = mid;
+		else
+			/* This and later ranges are definitely out. */
+			hi = mid;
+	}
+	/* 'lo' might be the last that started before target, but 'hi' isn't */
+	if (hi > lo) {
+		/* need to check all ranges that end after 's' to see if
+		 * any are unacknowledged.
+		 */
+		while (lo >= 0 &&
+		       BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
+			if (BB_OFFSET(p[lo]) < target) {
+				/* starts before the end, and finishes after
+				 * the start, so they must overlap
+				 */
+				if (rv != -1 && BB_ACK(p[lo]))
+					rv = 1;
+				else
+					rv = -1;
+				*first_bad = BB_OFFSET(p[lo]);
+				*bad_sectors = BB_LEN(p[lo]);
+			}
+			lo--;
+		}
+	}
+
+	if (read_seqretry(&bb->lock, seq))
+		goto retry;
+
+	return rv;
+}
+EXPORT_SYMBOL_GPL(badblocks_check);
+
+/*
+ * Add a range of bad blocks to the table.
+ * This might extend the table, or might contract it
+ * if two adjacent ranges can be merged.
+ * We binary-search to find the 'insertion' point, then
+ * decide how best to handle it.
+ */
+int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
+			int acknowledged)
+{
+	u64 *p;
+	int lo, hi;
+	int rv = 1;
+	unsigned long flags;
+
+	if (bb->shift < 0)
+		/* badblocks are disabled */
+		return 0;
+
+	if (bb->shift) {
+		/* round the start down, and the end up */
+		sector_t next = s + sectors;
+
+		s >>= bb->shift;
+		next += (1<<bb->shift) - 1;
+		next >>= bb->shift;
+		sectors = next - s;
+	}
+
+	write_seqlock_irqsave(&bb->lock, flags);
+
+	p = bb->page;
+	lo = 0;
+	hi = bb->count;
+	/* Find the last range that starts at-or-before 's' */
+	while (hi - lo > 1) {
+		int mid = (lo + hi) / 2;
+		sector_t a = BB_OFFSET(p[mid]);
+
+		if (a <= s)
+			lo = mid;
+		else
+			hi = mid;
+	}
+	if (hi > lo && BB_OFFSET(p[lo]) > s)
+		hi = lo;
+
+	if (hi > lo) {
+		/* we found a range that might merge with the start
+		 * of our new range
+		 */
+		sector_t a = BB_OFFSET(p[lo]);
+		sector_t e = a + BB_LEN(p[lo]);
+		int ack = BB_ACK(p[lo]);
+
+		if (e >= s) {
+			/* Yes, we can merge with a previous range */
+			if (s == a && s + sectors >= e)
+				/* new range covers old */
+				ack = acknowledged;
+			else
+				ack = ack && acknowledged;
+
+			if (e < s + sectors)
+				e = s + sectors;
+			if (e - a <= BB_MAX_LEN) {
+				p[lo] = BB_MAKE(a, e-a, ack);
+				s = e;
+			} else {
+				/* does not all fit in one range,
+				 * make p[lo] maximal
+				 */
+				if (BB_LEN(p[lo]) != BB_MAX_LEN)
+					p[lo] = BB_MAKE(a, BB_MAX_LEN, ack);
+				s = a + BB_MAX_LEN;
+			}
+			sectors = e - s;
+		}
+	}
+	if (sectors && hi < bb->count) {
+		/* 'hi' points to the first range that starts after 's'.
+		 * Maybe we can merge with the start of that range
+		 */
+		sector_t a = BB_OFFSET(p[hi]);
+		sector_t e = a + BB_LEN(p[hi]);
+		int ack = BB_ACK(p[hi]);
+
+		if (a <= s + sectors) {
+			/* merging is possible */
+			if (e <= s + sectors) {
+				/* full overlap */
+				e = s + sectors;
+				ack = acknowledged;
+			} else
+				ack = ack && acknowledged;
+
+			a = s;
+			if (e - a <= BB_MAX_LEN) {
+				p[hi] = BB_MAKE(a, e-a, ack);
+				s = e;
+			} else {
+				p[hi] = BB_MAKE(a, BB_MAX_LEN, ack);
+				s = a + BB_MAX_LEN;
+			}
+			sectors = e - s;
+			lo = hi;
+			hi++;
+		}
+	}
+	if (sectors == 0 && hi < bb->count) {
+		/* we might be able to combine lo and hi */
+		/* Note: 's' is at the end of 'lo' */
+		sector_t a = BB_OFFSET(p[hi]);
+		int lolen = BB_LEN(p[lo]);
+		int hilen = BB_LEN(p[hi]);
+		int newlen = lolen + hilen - (s - a);
+
+		if (s >= a && newlen < BB_MAX_LEN) {
+			/* yes, we can combine them */
+			int ack = BB_ACK(p[lo]) && BB_ACK(p[hi]);
+
+			p[lo] = BB_MAKE(BB_OFFSET(p[lo]), newlen, ack);
+			memmove(p + hi, p + hi + 1,
+				(bb->count - hi - 1) * 8);
+			bb->count--;
+		}
+	}
+	while (sectors) {
+		/* didn't merge (it all).
+		 * Need to add a range just before 'hi'
+		 */
+		if (bb->count >= MAX_BADBLOCKS) {
+			/* No room for more */
+			rv = 0;
+			break;
+		} else {
+			int this_sectors = sectors;
+
+			memmove(p + hi + 1, p + hi,
+				(bb->count - hi) * 8);
+			bb->count++;
+
+			if (this_sectors > BB_MAX_LEN)
+				this_sectors = BB_MAX_LEN;
+			p[hi] = BB_MAKE(s, this_sectors, acknowledged);
+			sectors -= this_sectors;
+			s += this_sectors;
+		}
+	}
+
+	bb->changed = 1;
+	if (!acknowledged)
+		bb->unacked_exist = 1;
+	write_sequnlock_irqrestore(&bb->lock, flags);
+
+	return rv;
+}
+EXPORT_SYMBOL_GPL(badblocks_set);
+
+/*
+ * Remove a range of bad blocks from the table.
+ * This may involve extending the table if we split a region,
+ * but it must not fail.  So if the table becomes full, we just
+ * drop the remove request.
+ */
+int badblocks_clear(struct badblocks *bb, sector_t s, int sectors)
+{
+	u64 *p;
+	int lo, hi;
+	sector_t target = s + sectors;
+	int rv = 0;
+
+	if (bb->shift > 0) {
+		/* When clearing we round the start up and the end down.
+		 * This should not matter as the shift should align with
+		 * the block size and no rounding should ever be needed.
+		 * However it is better to think a block is bad when it
+		 * isn't than to think a block is not bad when it is.
+		 */
+		s += (1<<bb->shift) - 1;
+		s >>= bb->shift;
+		target >>= bb->shift;
+		sectors = target - s;
+	}
+
+	write_seqlock_irq(&bb->lock);
+
+	p = bb->page;
+	lo = 0;
+	hi = bb->count;
+	/* Find the last range that starts before 'target' */
+	while (hi - lo > 1) {
+		int mid = (lo + hi) / 2;
+		sector_t a = BB_OFFSET(p[mid]);
+
+		if (a < target)
+			lo = mid;
+		else
+			hi = mid;
+	}
+	if (hi > lo) {
+		/* p[lo] is the last range that could overlap the
+		 * current range.  Earlier ranges could also overlap,
+		 * but only this one can overlap the end of the range.
+		 */
+		if (BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > target) {
+			/* Partial overlap, leave the tail of this range */
+			int ack = BB_ACK(p[lo]);
+			sector_t a = BB_OFFSET(p[lo]);
+			sector_t end = a + BB_LEN(p[lo]);
+
+			if (a < s) {
+				/* we need to split this range */
+				if (bb->count >= MAX_BADBLOCKS) {
+					rv = -ENOSPC;
+					goto out;
+				}
+				memmove(p+lo+1, p+lo, (bb->count - lo) * 8);
+				bb->count++;
+				p[lo] = BB_MAKE(a, s-a, ack);
+				lo++;
+			}
+			p[lo] = BB_MAKE(target, end - target, ack);
+			/* there is no longer an overlap */
+			hi = lo;
+			lo--;
+		}
+		while (lo >= 0 &&
+		       BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
+			/* This range does overlap */
+			if (BB_OFFSET(p[lo]) < s) {
+				/* Keep the early parts of this range. */
+				int ack = BB_ACK(p[lo]);
+				sector_t start = BB_OFFSET(p[lo]);
+
+				p[lo] = BB_MAKE(start, s - start, ack);
+				/* now low doesn't overlap, so.. */
+				break;
+			}
+			lo--;
+		}
+		/* 'lo' is strictly before, 'hi' is strictly after,
+		 * anything between needs to be discarded
+		 */
+		if (hi - lo > 1) {
+			memmove(p+lo+1, p+hi, (bb->count - hi) * 8);
+			bb->count -= (hi - lo - 1);
+		}
+	}
+
+	bb->changed = 1;
+out:
+	write_sequnlock_irq(&bb->lock);
+	return rv;
+}
+EXPORT_SYMBOL_GPL(badblocks_clear);
+
+/*
+ * Acknowledge all bad blocks in a list.
+ * This only succeeds if ->changed is clear.  It is used by
+ * in-kernel metadata updates
+ */
+void ack_all_badblocks(struct badblocks *bb)
+{
+	if (bb->page == NULL || bb->changed)
+		/* no point even trying */
+		return;
+	write_seqlock_irq(&bb->lock);
+
+	if (bb->changed == 0 && bb->unacked_exist) {
+		u64 *p = bb->page;
+		int i;
+
+		for (i = 0; i < bb->count ; i++) {
+			if (!BB_ACK(p[i])) {
+				sector_t start = BB_OFFSET(p[i]);
+				int len = BB_LEN(p[i]);
+
+				p[i] = BB_MAKE(start, len, 1);
+			}
+		}
+		bb->unacked_exist = 0;
+	}
+	write_sequnlock_irq(&bb->lock);
+}
+EXPORT_SYMBOL_GPL(ack_all_badblocks);
+
+/* sysfs access to bad-blocks list. */
+ssize_t badblocks_show(struct badblocks *bb, char *page, int unack)
+{
+	size_t len;
+	int i;
+	u64 *p = bb->page;
+	unsigned seq;
+
+	if (bb->shift < 0)
+		return 0;
+
+retry:
+	seq = read_seqbegin(&bb->lock);
+
+	len = 0;
+	i = 0;
+
+	while (len < PAGE_SIZE && i < bb->count) {
+		sector_t s = BB_OFFSET(p[i]);
+		unsigned int length = BB_LEN(p[i]);
+		int ack = BB_ACK(p[i]);
+
+		i++;
+
+		if (unack && ack)
+			continue;
+
+		len += snprintf(page+len, PAGE_SIZE-len, "%llu %u\n",
+				(unsigned long long)s << bb->shift,
+				length << bb->shift);
+	}
+	if (unack && len == 0)
+		bb->unacked_exist = 0;
+
+	if (read_seqretry(&bb->lock, seq))
+		goto retry;
+
+	return len;
+}
+EXPORT_SYMBOL_GPL(badblocks_show);
+
+#define DO_DEBUG 1
+
+ssize_t badblocks_store(struct badblocks *bb, const char *page, size_t len,
+			int unack)
+{
+	unsigned long long sector;
+	int length;
+	char newline;
+#ifdef DO_DEBUG
+	/* Allow clearing via sysfs *only* for testing/debugging.
+	 * Normally only a successful write may clear a badblock
+	 */
+	int clear = 0;
+
+	if (page[0] == '-') {
+		clear = 1;
+		page++;
+	}
+#endif /* DO_DEBUG */
+
+	switch (sscanf(page, "%llu %d%c", &sector, &length, &newline)) {
+	case 3:
+		if (newline != '\n')
+			return -EINVAL;
+	case 2:
+		if (length <= 0)
+			return -EINVAL;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+#ifdef DO_DEBUG
+	if (clear) {
+		badblocks_clear(bb, sector, length);
+		return len;
+	}
+#endif /* DO_DEBUG */
+	if (badblocks_set(bb, sector, length, !unack))
+		return len;
+	else
+		return -ENOSPC;
+}
+EXPORT_SYMBOL_GPL(badblocks_store);
+
+int badblocks_init(struct badblocks *bb, int enable)
+{
+	bb->count = 0;
+	if (enable)
+		bb->shift = 0;
+	else
+		bb->shift = -1;
+	bb->page = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (bb->page == NULL) {
+		bb->shift = -1;
+		return -ENOMEM;
+	}
+	seqlock_init(&bb->lock);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(badblocks_init);
+
+void badblocks_free(struct badblocks *bb)
+{
+	kfree(bb->page);
+	bb->page = NULL;
+}
+EXPORT_SYMBOL_GPL(badblocks_free);
diff --git a/include/linux/badblocks.h b/include/linux/badblocks.h
new file mode 100644
index 0000000..9293446
--- /dev/null
+++ b/include/linux/badblocks.h
@@ -0,0 +1,53 @@
+#ifndef _LINUX_BADBLOCKS_H
+#define _LINUX_BADBLOCKS_H
+
+#include <linux/seqlock.h>
+#include <linux/kernel.h>
+#include <linux/stddef.h>
+#include <linux/types.h>
+
+#define BB_LEN_MASK	(0x00000000000001FFULL)
+#define BB_OFFSET_MASK	(0x7FFFFFFFFFFFFE00ULL)
+#define BB_ACK_MASK	(0x8000000000000000ULL)
+#define BB_MAX_LEN	512
+#define BB_OFFSET(x)	(((x) & BB_OFFSET_MASK) >> 9)
+#define BB_LEN(x)	(((x) & BB_LEN_MASK) + 1)
+#define BB_ACK(x)	(!!((x) & BB_ACK_MASK))
+#define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((u64)(!!(ack)) << 63))
+
+/* Bad block numbers are stored sorted in a single page.
+ * 64bits is used for each block or extent.
+ * 54 bits are sector number, 9 bits are extent size,
+ * 1 bit is an 'acknowledged' flag.
+ */
+#define MAX_BADBLOCKS	(PAGE_SIZE/8)
+
+struct badblocks {
+	int count;		/* count of bad blocks */
+	int unacked_exist;	/* there probably are unacknowledged
+				 * bad blocks.  This is only cleared
+				 * when a read discovers none
+				 */
+	int shift;		/* shift from sectors to block size
+				 * a -ve shift means badblocks are
+				 * disabled.*/
+	u64 *page;		/* badblock list */
+	int changed;
+	seqlock_t lock;
+	sector_t sector;
+	sector_t size;		/* in sectors */
+};
+
+int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
+		   sector_t *first_bad, int *bad_sectors);
+int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
+			int acknowledged);
+int badblocks_clear(struct badblocks *bb, sector_t s, int sectors);
+void ack_all_badblocks(struct badblocks *bb);
+ssize_t badblocks_show(struct badblocks *bb, char *page, int unack);
+ssize_t badblocks_store(struct badblocks *bb, const char *page, size_t len,
+			int unack);
+int badblocks_init(struct badblocks *bb, int enable);
+void badblocks_free(struct badblocks *bb);
+
+#endif
-- 
2.5.0



* [PATCH v2 2/3] block: Add badblock management for gendisks
  2015-11-25 18:43 [PATCH v2 0/3] Badblock tracking for gendisks Vishal Verma
  2015-11-25 18:43 ` [PATCH v2 1/3] badblocks: Add core badblock management code Vishal Verma
@ 2015-11-25 18:43 ` Vishal Verma
  2015-12-04 23:33   ` James Bottomley
  2015-11-25 18:43 ` [PATCH v2 3/3] md: convert to use the generic badblocks code Vishal Verma
  2015-12-04 22:53 ` [PATCH v2 0/3] Badblock tracking for gendisks Verma, Vishal L
  3 siblings, 1 reply; 23+ messages in thread
From: Vishal Verma @ 2015-11-25 18:43 UTC
  To: linux-nvdimm
  Cc: Vishal Verma, linux-block, linux-raid, linux-scsi, Jens Axboe,
	NeilBrown, Jeff Moyer

NVDIMM devices, which can behave more like DRAM than like block
devices, may develop bad cache lines, or 'poison'. A block device
exposed by the pmem driver can then consume poison via a read (or
write), and cause a machine check. On platforms without machine
check recovery features, this would mean a crash.

A block device that maintains a runtime list of all sectors known to
have poison can avoid this directly, and the list also provides a path
forward to enabling proper handling/recovery for DAX faults on such a
device.

Use the new badblock management interfaces to add a badblocks list to
gendisks.

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
 block/genhd.c         | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/genhd.h |  6 ++++
 2 files changed, 87 insertions(+)
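
Note on the return convention a caller of disk_check_badblocks() sees:
0 means the range is clean, 1 means it overlaps only acknowledged bad
ranges, and -1 means it overlaps at least one unacknowledged range.
Since disk_set_badblocks() always records entries as acknowledged,
gendisk users should only see 0 or 1 for entries added that way. The
sketch below is a simplified userspace model of the lookup in
badblocks_check(), with the seqlock and the 'shift' handling left out.
bb_check_sketch() is a hypothetical name, and the macros are restated
here (with a uint64_t cast in BB_MAKE) only to keep the example
self-contained.

```c
#include <stdint.h>

/* Entry layout restated from include/linux/badblocks.h (patch 1). */
#define BB_OFFSET(x)	(((x) & 0x7FFFFFFFFFFFFE00ULL) >> 9)
#define BB_LEN(x)	(((x) & 0x00000000000001FFULL) + 1)
#define BB_ACK(x)	(!!((x) & 0x8000000000000000ULL))
#define BB_MAKE(a, l, ack) \
	(((uint64_t)(a) << 9) | ((l) - 1) | ((uint64_t)(!!(ack)) << 63))

/*
 * Hypothetical model of badblocks_check(): 'p' is a table of 'count'
 * entries sorted by start sector. No locking, no shift.
 */
static inline int bb_check_sketch(const uint64_t *p, int count,
				  uint64_t s, int sectors,
				  uint64_t *first_bad, int *bad_sectors)
{
	uint64_t target = s + sectors;	/* first sector after the range */
	int lo = 0, hi = count, rv = 0;

	/* Binary search for the last entry that starts before 'target'. */
	while (hi - lo > 1) {
		int mid = (lo + hi) / 2;

		if (BB_OFFSET(p[mid]) < target)
			lo = mid;
		else
			hi = mid;
	}
	if (hi > lo) {
		/* Walk backwards over every entry that ends after 's'. */
		while (lo >= 0 && BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
			if (BB_OFFSET(p[lo]) < target) {
				/* One unacknowledged overlap makes rv -1. */
				if (rv != -1 && BB_ACK(p[lo]))
					rv = 1;
				else
					rv = -1;
				*first_bad = BB_OFFSET(p[lo]);
				*bad_sectors = (int)BB_LEN(p[lo]);
			}
			lo--;
		}
	}
	return rv;
}
```

Because the walk runs from the highest overlapping entry downwards,
first_bad and bad_sectors end up describing the lowest overlapping
entry.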

diff --git a/block/genhd.c b/block/genhd.c
index 0c706f3..84fd65c 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -20,6 +20,7 @@
 #include <linux/idr.h>
 #include <linux/log2.h>
 #include <linux/pm_runtime.h>
+#include <linux/badblocks.h>
 
 #include "blk.h"
 
@@ -505,6 +506,20 @@ static int exact_lock(dev_t devt, void *data)
 	return 0;
 }
 
+static void disk_alloc_badblocks(struct gendisk *disk)
+{
+	disk->bb = kzalloc(sizeof(*(disk->bb)), GFP_KERNEL);
+	if (!disk->bb) {
+		pr_warn("%s: failed to allocate space for badblocks\n",
+			disk->disk_name);
+		return;
+	}
+
+	if (badblocks_init(disk->bb, 1))
+		pr_warn("%s: failed to initialize badblocks\n",
+			disk->disk_name);
+}
+
 static void register_disk(struct gendisk *disk)
 {
 	struct device *ddev = disk_to_dev(disk);
@@ -609,6 +624,7 @@ void add_disk(struct gendisk *disk)
 	disk->first_minor = MINOR(devt);
 
 	disk_alloc_events(disk);
+	disk_alloc_badblocks(disk);
 
 	/* Register BDI before referencing it from bdev */
 	bdi = &disk->queue->backing_dev_info;
@@ -657,6 +673,11 @@ void del_gendisk(struct gendisk *disk)
 	blk_unregister_queue(disk);
 	blk_unregister_region(disk_devt(disk), disk->minors);
 
+	if (disk->bb) {
+		badblocks_free(disk->bb);
+		kfree(disk->bb);
+	}
+
 	part_stat_set_all(&disk->part0, 0);
 	disk->part0.stamp = 0;
 
@@ -670,6 +691,63 @@ void del_gendisk(struct gendisk *disk)
 }
 EXPORT_SYMBOL(del_gendisk);
 
+/*
+ * The gendisk usage of badblocks does not track acknowledgements for
+ * badblocks. We always assume they are acknowledged.
+ */
+int disk_check_badblocks(struct gendisk *disk, sector_t s, int sectors,
+		   sector_t *first_bad, int *bad_sectors)
+{
+	if (!disk->bb)
+		return 0;
+
+	return badblocks_check(disk->bb, s, sectors, first_bad, bad_sectors);
+}
+EXPORT_SYMBOL(disk_check_badblocks);
+
+int disk_set_badblocks(struct gendisk *disk, sector_t s, int sectors)
+{
+	if (!disk->bb)
+		return 0;
+
+	return badblocks_set(disk->bb, s, sectors, 1);
+}
+EXPORT_SYMBOL(disk_set_badblocks);
+
+int disk_clear_badblocks(struct gendisk *disk, sector_t s, int sectors)
+{
+	if (!disk->bb)
+		return 0;
+
+	return badblocks_clear(disk->bb, s, sectors);
+}
+EXPORT_SYMBOL(disk_clear_badblocks);
+
+/* sysfs access to bad-blocks list. */
+static ssize_t disk_badblocks_show(struct device *dev,
+					struct device_attribute *attr,
+					char *page)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+
+	if (!disk->bb)
+		return 0;
+
+	return badblocks_show(disk->bb, page, 0);
+}
+
+static ssize_t disk_badblocks_store(struct device *dev,
+					struct device_attribute *attr,
+					const char *page, size_t len)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+
+	if (!disk->bb)
+		return 0;
+
+	return badblocks_store(disk->bb, page, len, 0);
+}
+
 /**
  * get_gendisk - get partitioning information for a given device
  * @devt: device to get partitioning information for
@@ -988,6 +1066,8 @@ static DEVICE_ATTR(discard_alignment, S_IRUGO, disk_discard_alignment_show,
 static DEVICE_ATTR(capability, S_IRUGO, disk_capability_show, NULL);
 static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL);
 static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL);
+static DEVICE_ATTR(badblocks, S_IRUGO | S_IWUSR, disk_badblocks_show,
+		disk_badblocks_store);
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 static struct device_attribute dev_attr_fail =
 	__ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store);
@@ -1009,6 +1089,7 @@ static struct attribute *disk_attrs[] = {
 	&dev_attr_capability.attr,
 	&dev_attr_stat.attr,
 	&dev_attr_inflight.attr,
+	&dev_attr_badblocks.attr,
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 	&dev_attr_fail.attr,
 #endif
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 2adbfa6..5563bde 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -162,6 +162,7 @@ struct disk_part_tbl {
 };
 
 struct disk_events;
+struct badblocks;
 
 struct gendisk {
 	/* major, first_minor and minors are input parameters only,
@@ -201,6 +202,7 @@ struct gendisk {
 	struct blk_integrity *integrity;
 #endif
 	int node_id;
+	struct badblocks *bb;
 };
 
 static inline struct gendisk *part_to_disk(struct hd_struct *part)
@@ -421,6 +423,10 @@ extern void add_disk(struct gendisk *disk);
 extern void del_gendisk(struct gendisk *gp);
 extern struct gendisk *get_gendisk(dev_t dev, int *partno);
 extern struct block_device *bdget_disk(struct gendisk *disk, int partno);
+extern int disk_check_badblocks(struct gendisk *disk, sector_t s, int sectors,
+		   sector_t *first_bad, int *bad_sectors);
+extern int disk_set_badblocks(struct gendisk *disk, sector_t s, int sectors);
+extern int disk_clear_badblocks(struct gendisk *disk, sector_t s, int sectors);
 
 extern void set_device_ro(struct block_device *bdev, int flag);
 extern void set_disk_ro(struct gendisk *disk, int flag);
-- 
2.5.0



* [PATCH v2 3/3] md: convert to use the generic badblocks code
  2015-11-25 18:43 [PATCH v2 0/3] Badblock tracking for gendisks Vishal Verma
  2015-11-25 18:43 ` [PATCH v2 1/3] badblocks: Add core badblock management code Vishal Verma
  2015-11-25 18:43 ` [PATCH v2 2/3] block: Add badblock management for gendisks Vishal Verma
@ 2015-11-25 18:43 ` Vishal Verma
  2015-12-01 18:55   ` Shaohua Li
  2015-12-04 22:53 ` [PATCH v2 0/3] Badblock tracking for gendisks Verma, Vishal L
  3 siblings, 1 reply; 23+ messages in thread
From: Vishal Verma @ 2015-11-25 18:43 UTC
  To: linux-nvdimm
  Cc: Vishal Verma, linux-block, linux-raid, linux-scsi, Jens Axboe,
	NeilBrown, Jeff Moyer

Retain badblocks as part of rdev, but use the accessor functions from
include/linux/badblocks.h for all manipulation.

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
 drivers/md/md.c | 507 +++-----------------------------------------------------
 drivers/md/md.h |  40 +----
 2 files changed, 23 insertions(+), 524 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index c702de1..63eab20 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -34,6 +34,7 @@
 
 #include <linux/kthread.h>
 #include <linux/blkdev.h>
+#include <linux/badblocks.h>
 #include <linux/sysctl.h>
 #include <linux/seq_file.h>
 #include <linux/fs.h>
@@ -707,8 +708,7 @@ void md_rdev_clear(struct md_rdev *rdev)
 		put_page(rdev->bb_page);
 		rdev->bb_page = NULL;
 	}
-	kfree(rdev->badblocks.page);
-	rdev->badblocks.page = NULL;
+	badblocks_free(&rdev->badblocks);
 }
 EXPORT_SYMBOL_GPL(md_rdev_clear);
 
@@ -1358,8 +1358,6 @@ static __le32 calc_sb_1_csum(struct mdp_superblock_1 *sb)
 	return cpu_to_le32(csum);
 }
 
-static int md_set_badblocks(struct badblocks *bb, sector_t s, int sectors,
-			    int acknowledged);
 static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_version)
 {
 	struct mdp_superblock_1 *sb;
@@ -1484,7 +1482,7 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
 			count <<= sb->bblog_shift;
 			if (bb + 1 == 0)
 				break;
-			if (md_set_badblocks(&rdev->badblocks,
+			if (badblocks_set(&rdev->badblocks,
 					     sector, count, 1) == 0)
 				return -EINVAL;
 		}
@@ -2226,7 +2224,7 @@ repeat:
 			rdev_for_each(rdev, mddev) {
 				if (rdev->badblocks.changed) {
 					rdev->badblocks.changed = 0;
-					md_ack_all_badblocks(&rdev->badblocks);
+					ack_all_badblocks(&rdev->badblocks);
 					md_error(mddev, rdev);
 				}
 				clear_bit(Blocked, &rdev->flags);
@@ -2352,7 +2350,7 @@ repeat:
 			clear_bit(Blocked, &rdev->flags);
 
 		if (any_badblocks_changed)
-			md_ack_all_badblocks(&rdev->badblocks);
+			ack_all_badblocks(&rdev->badblocks);
 		clear_bit(BlockedBadBlocks, &rdev->flags);
 		wake_up(&rdev->blocked_wait);
 	}
@@ -2944,11 +2942,17 @@ static ssize_t recovery_start_store(struct md_rdev *rdev, const char *buf, size_
 static struct rdev_sysfs_entry rdev_recovery_start =
 __ATTR(recovery_start, S_IRUGO|S_IWUSR, recovery_start_show, recovery_start_store);
 
-static ssize_t
-badblocks_show(struct badblocks *bb, char *page, int unack);
-static ssize_t
-badblocks_store(struct badblocks *bb, const char *page, size_t len, int unack);
-
+/* sysfs access to bad-blocks list.
+ * We present two files.
+ * 'bad-blocks' lists sector numbers and lengths of ranges that
+ *    are recorded as bad.  The list is truncated to fit within
+ *    the one-page limit of sysfs.
+ *    Writing "sector length" to this file adds an acknowledged
+ *    bad block list.
+ * 'unacknowledged-bad-blocks' lists bad blocks that have not yet
+ *    been acknowledged.  Writing to this file adds bad blocks
+ *    without acknowledging them.  This is largely for testing.
+ */
 static ssize_t bb_show(struct md_rdev *rdev, char *page)
 {
 	return badblocks_show(&rdev->badblocks, page, 0);
@@ -3063,14 +3067,7 @@ int md_rdev_init(struct md_rdev *rdev)
 	 * This reserves the space even on arrays where it cannot
 	 * be used - I wonder if that matters
 	 */
-	rdev->badblocks.count = 0;
-	rdev->badblocks.shift = -1; /* disabled until explicitly enabled */
-	rdev->badblocks.page = kmalloc(PAGE_SIZE, GFP_KERNEL);
-	seqlock_init(&rdev->badblocks.lock);
-	if (rdev->badblocks.page == NULL)
-		return -ENOMEM;
-
-	return 0;
+	return badblocks_init(&rdev->badblocks, 0);
 }
 EXPORT_SYMBOL_GPL(md_rdev_init);
 /*
@@ -8348,253 +8345,7 @@ void md_finish_reshape(struct mddev *mddev)
 }
 EXPORT_SYMBOL(md_finish_reshape);
 
-/* Bad block management.
- * We can record which blocks on each device are 'bad' and so just
- * fail those blocks, or that stripe, rather than the whole device.
- * Entries in the bad-block table are 64bits wide.  This comprises:
- * Length of bad-range, in sectors: 0-511 for lengths 1-512
- * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
- *  A 'shift' can be set so that larger blocks are tracked and
- *  consequently larger devices can be covered.
- * 'Acknowledged' flag - 1 bit. - the most significant bit.
- *
- * Locking of the bad-block table uses a seqlock so md_is_badblock
- * might need to retry if it is very unlucky.
- * We will sometimes want to check for bad blocks in a bi_end_io function,
- * so we use the write_seqlock_irq variant.
- *
- * When looking for a bad block we specify a range and want to
- * know if any block in the range is bad.  So we binary-search
- * to the last range that starts at-or-before the given endpoint,
- * (or "before the sector after the target range")
- * then see if it ends after the given start.
- * We return
- *  0 if there are no known bad blocks in the range
- *  1 if there are known bad block which are all acknowledged
- * -1 if there are bad blocks which have not yet been acknowledged in metadata.
- * plus the start/length of the first bad section we overlap.
- */
-int md_is_badblock(struct badblocks *bb, sector_t s, int sectors,
-		   sector_t *first_bad, int *bad_sectors)
-{
-	int hi;
-	int lo;
-	u64 *p = bb->page;
-	int rv;
-	sector_t target = s + sectors;
-	unsigned seq;
-
-	if (bb->shift > 0) {
-		/* round the start down, and the end up */
-		s >>= bb->shift;
-		target += (1<<bb->shift) - 1;
-		target >>= bb->shift;
-		sectors = target - s;
-	}
-	/* 'target' is now the first block after the bad range */
-
-retry:
-	seq = read_seqbegin(&bb->lock);
-	lo = 0;
-	rv = 0;
-	hi = bb->count;
-
-	/* Binary search between lo and hi for 'target'
-	 * i.e. for the last range that starts before 'target'
-	 */
-	/* INVARIANT: ranges before 'lo' and at-or-after 'hi'
-	 * are known not to be the last range before target.
-	 * VARIANT: hi-lo is the number of possible
-	 * ranges, and decreases until it reaches 1
-	 */
-	while (hi - lo > 1) {
-		int mid = (lo + hi) / 2;
-		sector_t a = BB_OFFSET(p[mid]);
-		if (a < target)
-			/* This could still be the one, earlier ranges
-			 * could not. */
-			lo = mid;
-		else
-			/* This and later ranges are definitely out. */
-			hi = mid;
-	}
-	/* 'lo' might be the last that started before target, but 'hi' isn't */
-	if (hi > lo) {
-		/* need to check all range that end after 's' to see if
-		 * any are unacknowledged.
-		 */
-		while (lo >= 0 &&
-		       BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
-			if (BB_OFFSET(p[lo]) < target) {
-				/* starts before the end, and finishes after
-				 * the start, so they must overlap
-				 */
-				if (rv != -1 && BB_ACK(p[lo]))
-					rv = 1;
-				else
-					rv = -1;
-				*first_bad = BB_OFFSET(p[lo]);
-				*bad_sectors = BB_LEN(p[lo]);
-			}
-			lo--;
-		}
-	}
-
-	if (read_seqretry(&bb->lock, seq))
-		goto retry;
-
-	return rv;
-}
-EXPORT_SYMBOL_GPL(md_is_badblock);
-
-/*
- * Add a range of bad blocks to the table.
- * This might extend the table, or might contract it
- * if two adjacent ranges can be merged.
- * We binary-search to find the 'insertion' point, then
- * decide how best to handle it.
- */
-static int md_set_badblocks(struct badblocks *bb, sector_t s, int sectors,
-			    int acknowledged)
-{
-	u64 *p;
-	int lo, hi;
-	int rv = 1;
-	unsigned long flags;
-
-	if (bb->shift < 0)
-		/* badblocks are disabled */
-		return 0;
-
-	if (bb->shift) {
-		/* round the start down, and the end up */
-		sector_t next = s + sectors;
-		s >>= bb->shift;
-		next += (1<<bb->shift) - 1;
-		next >>= bb->shift;
-		sectors = next - s;
-	}
-
-	write_seqlock_irqsave(&bb->lock, flags);
-
-	p = bb->page;
-	lo = 0;
-	hi = bb->count;
-	/* Find the last range that starts at-or-before 's' */
-	while (hi - lo > 1) {
-		int mid = (lo + hi) / 2;
-		sector_t a = BB_OFFSET(p[mid]);
-		if (a <= s)
-			lo = mid;
-		else
-			hi = mid;
-	}
-	if (hi > lo && BB_OFFSET(p[lo]) > s)
-		hi = lo;
-
-	if (hi > lo) {
-		/* we found a range that might merge with the start
-		 * of our new range
-		 */
-		sector_t a = BB_OFFSET(p[lo]);
-		sector_t e = a + BB_LEN(p[lo]);
-		int ack = BB_ACK(p[lo]);
-		if (e >= s) {
-			/* Yes, we can merge with a previous range */
-			if (s == a && s + sectors >= e)
-				/* new range covers old */
-				ack = acknowledged;
-			else
-				ack = ack && acknowledged;
-
-			if (e < s + sectors)
-				e = s + sectors;
-			if (e - a <= BB_MAX_LEN) {
-				p[lo] = BB_MAKE(a, e-a, ack);
-				s = e;
-			} else {
-				/* does not all fit in one range,
-				 * make p[lo] maximal
-				 */
-				if (BB_LEN(p[lo]) != BB_MAX_LEN)
-					p[lo] = BB_MAKE(a, BB_MAX_LEN, ack);
-				s = a + BB_MAX_LEN;
-			}
-			sectors = e - s;
-		}
-	}
-	if (sectors && hi < bb->count) {
-		/* 'hi' points to the first range that starts after 's'.
-		 * Maybe we can merge with the start of that range */
-		sector_t a = BB_OFFSET(p[hi]);
-		sector_t e = a + BB_LEN(p[hi]);
-		int ack = BB_ACK(p[hi]);
-		if (a <= s + sectors) {
-			/* merging is possible */
-			if (e <= s + sectors) {
-				/* full overlap */
-				e = s + sectors;
-				ack = acknowledged;
-			} else
-				ack = ack && acknowledged;
-
-			a = s;
-			if (e - a <= BB_MAX_LEN) {
-				p[hi] = BB_MAKE(a, e-a, ack);
-				s = e;
-			} else {
-				p[hi] = BB_MAKE(a, BB_MAX_LEN, ack);
-				s = a + BB_MAX_LEN;
-			}
-			sectors = e - s;
-			lo = hi;
-			hi++;
-		}
-	}
-	if (sectors == 0 && hi < bb->count) {
-		/* we might be able to combine lo and hi */
-		/* Note: 's' is at the end of 'lo' */
-		sector_t a = BB_OFFSET(p[hi]);
-		int lolen = BB_LEN(p[lo]);
-		int hilen = BB_LEN(p[hi]);
-		int newlen = lolen + hilen - (s - a);
-		if (s >= a && newlen < BB_MAX_LEN) {
-			/* yes, we can combine them */
-			int ack = BB_ACK(p[lo]) && BB_ACK(p[hi]);
-			p[lo] = BB_MAKE(BB_OFFSET(p[lo]), newlen, ack);
-			memmove(p + hi, p + hi + 1,
-				(bb->count - hi - 1) * 8);
-			bb->count--;
-		}
-	}
-	while (sectors) {
-		/* didn't merge (it all).
-		 * Need to add a range just before 'hi' */
-		if (bb->count >= MD_MAX_BADBLOCKS) {
-			/* No room for more */
-			rv = 0;
-			break;
-		} else {
-			int this_sectors = sectors;
-			memmove(p + hi + 1, p + hi,
-				(bb->count - hi) * 8);
-			bb->count++;
-
-			if (this_sectors > BB_MAX_LEN)
-				this_sectors = BB_MAX_LEN;
-			p[hi] = BB_MAKE(s, this_sectors, acknowledged);
-			sectors -= this_sectors;
-			s += this_sectors;
-		}
-	}
-
-	bb->changed = 1;
-	if (!acknowledged)
-		bb->unacked_exist = 1;
-	write_sequnlock_irqrestore(&bb->lock, flags);
-
-	return rv;
-}
+/* Bad block management */
 
 int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 		       int is_new)
@@ -8604,8 +8355,7 @@ int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 		s += rdev->new_data_offset;
 	else
 		s += rdev->data_offset;
-	rv = md_set_badblocks(&rdev->badblocks,
-			      s, sectors, 0);
+	rv = badblocks_set(&rdev->badblocks, s, sectors, 0);
 	if (rv) {
 		/* Make sure they get written out promptly */
 		sysfs_notify_dirent_safe(rdev->sysfs_state);
@@ -8617,101 +8367,6 @@ int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 }
 EXPORT_SYMBOL_GPL(rdev_set_badblocks);
 
-/*
- * Remove a range of bad blocks from the table.
- * This may involve extending the table if we spilt a region,
- * but it must not fail.  So if the table becomes full, we just
- * drop the remove request.
- */
-static int md_clear_badblocks(struct badblocks *bb, sector_t s, int sectors)
-{
-	u64 *p;
-	int lo, hi;
-	sector_t target = s + sectors;
-	int rv = 0;
-
-	if (bb->shift > 0) {
-		/* When clearing we round the start up and the end down.
-		 * This should not matter as the shift should align with
-		 * the block size and no rounding should ever be needed.
-		 * However it is better the think a block is bad when it
-		 * isn't than to think a block is not bad when it is.
-		 */
-		s += (1<<bb->shift) - 1;
-		s >>= bb->shift;
-		target >>= bb->shift;
-		sectors = target - s;
-	}
-
-	write_seqlock_irq(&bb->lock);
-
-	p = bb->page;
-	lo = 0;
-	hi = bb->count;
-	/* Find the last range that starts before 'target' */
-	while (hi - lo > 1) {
-		int mid = (lo + hi) / 2;
-		sector_t a = BB_OFFSET(p[mid]);
-		if (a < target)
-			lo = mid;
-		else
-			hi = mid;
-	}
-	if (hi > lo) {
-		/* p[lo] is the last range that could overlap the
-		 * current range.  Earlier ranges could also overlap,
-		 * but only this one can overlap the end of the range.
-		 */
-		if (BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > target) {
-			/* Partial overlap, leave the tail of this range */
-			int ack = BB_ACK(p[lo]);
-			sector_t a = BB_OFFSET(p[lo]);
-			sector_t end = a + BB_LEN(p[lo]);
-
-			if (a < s) {
-				/* we need to split this range */
-				if (bb->count >= MD_MAX_BADBLOCKS) {
-					rv = -ENOSPC;
-					goto out;
-				}
-				memmove(p+lo+1, p+lo, (bb->count - lo) * 8);
-				bb->count++;
-				p[lo] = BB_MAKE(a, s-a, ack);
-				lo++;
-			}
-			p[lo] = BB_MAKE(target, end - target, ack);
-			/* there is no longer an overlap */
-			hi = lo;
-			lo--;
-		}
-		while (lo >= 0 &&
-		       BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
-			/* This range does overlap */
-			if (BB_OFFSET(p[lo]) < s) {
-				/* Keep the early parts of this range. */
-				int ack = BB_ACK(p[lo]);
-				sector_t start = BB_OFFSET(p[lo]);
-				p[lo] = BB_MAKE(start, s - start, ack);
-				/* now low doesn't overlap, so.. */
-				break;
-			}
-			lo--;
-		}
-		/* 'lo' is strictly before, 'hi' is strictly after,
-		 * anything between needs to be discarded
-		 */
-		if (hi - lo > 1) {
-			memmove(p+lo+1, p+hi, (bb->count - hi) * 8);
-			bb->count -= (hi - lo - 1);
-		}
-	}
-
-	bb->changed = 1;
-out:
-	write_sequnlock_irq(&bb->lock);
-	return rv;
-}
-
 int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 			 int is_new)
 {
@@ -8719,133 +8374,11 @@ int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 		s += rdev->new_data_offset;
 	else
 		s += rdev->data_offset;
-	return md_clear_badblocks(&rdev->badblocks,
+	return badblocks_clear(&rdev->badblocks,
 				  s, sectors);
 }
 EXPORT_SYMBOL_GPL(rdev_clear_badblocks);
 
-/*
- * Acknowledge all bad blocks in a list.
- * This only succeeds if ->changed is clear.  It is used by
- * in-kernel metadata updates
- */
-void md_ack_all_badblocks(struct badblocks *bb)
-{
-	if (bb->page == NULL || bb->changed)
-		/* no point even trying */
-		return;
-	write_seqlock_irq(&bb->lock);
-
-	if (bb->changed == 0 && bb->unacked_exist) {
-		u64 *p = bb->page;
-		int i;
-		for (i = 0; i < bb->count ; i++) {
-			if (!BB_ACK(p[i])) {
-				sector_t start = BB_OFFSET(p[i]);
-				int len = BB_LEN(p[i]);
-				p[i] = BB_MAKE(start, len, 1);
-			}
-		}
-		bb->unacked_exist = 0;
-	}
-	write_sequnlock_irq(&bb->lock);
-}
-EXPORT_SYMBOL_GPL(md_ack_all_badblocks);
-
-/* sysfs access to bad-blocks list.
- * We present two files.
- * 'bad-blocks' lists sector numbers and lengths of ranges that
- *    are recorded as bad.  The list is truncated to fit within
- *    the one-page limit of sysfs.
- *    Writing "sector length" to this file adds an acknowledged
- *    bad block list.
- * 'unacknowledged-bad-blocks' lists bad blocks that have not yet
- *    been acknowledged.  Writing to this file adds bad blocks
- *    without acknowledging them.  This is largely for testing.
- */
-
-static ssize_t
-badblocks_show(struct badblocks *bb, char *page, int unack)
-{
-	size_t len;
-	int i;
-	u64 *p = bb->page;
-	unsigned seq;
-
-	if (bb->shift < 0)
-		return 0;
-
-retry:
-	seq = read_seqbegin(&bb->lock);
-
-	len = 0;
-	i = 0;
-
-	while (len < PAGE_SIZE && i < bb->count) {
-		sector_t s = BB_OFFSET(p[i]);
-		unsigned int length = BB_LEN(p[i]);
-		int ack = BB_ACK(p[i]);
-		i++;
-
-		if (unack && ack)
-			continue;
-
-		len += snprintf(page+len, PAGE_SIZE-len, "%llu %u\n",
-				(unsigned long long)s << bb->shift,
-				length << bb->shift);
-	}
-	if (unack && len == 0)
-		bb->unacked_exist = 0;
-
-	if (read_seqretry(&bb->lock, seq))
-		goto retry;
-
-	return len;
-}
-
-#define DO_DEBUG 1
-
-static ssize_t
-badblocks_store(struct badblocks *bb, const char *page, size_t len, int unack)
-{
-	unsigned long long sector;
-	int length;
-	char newline;
-#ifdef DO_DEBUG
-	/* Allow clearing via sysfs *only* for testing/debugging.
-	 * Normally only a successful write may clear a badblock
-	 */
-	int clear = 0;
-	if (page[0] == '-') {
-		clear = 1;
-		page++;
-	}
-#endif /* DO_DEBUG */
-
-	switch (sscanf(page, "%llu %d%c", &sector, &length, &newline)) {
-	case 3:
-		if (newline != '\n')
-			return -EINVAL;
-	case 2:
-		if (length <= 0)
-			return -EINVAL;
-		break;
-	default:
-		return -EINVAL;
-	}
-
-#ifdef DO_DEBUG
-	if (clear) {
-		md_clear_badblocks(bb, sector, length);
-		return len;
-	}
-#endif /* DO_DEBUG */
-	if (md_set_badblocks(bb, sector, length, !unack))
-		return len;
-	else
-		return -ENOSPC;
-}
-
 static int md_notify_reboot(struct notifier_block *this,
 			    unsigned long code, void *x)
 {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index ab33957..253ad74 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -17,6 +17,7 @@
 
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
+#include <linux/badblocks.h>
 #include <linux/kobject.h>
 #include <linux/list.h>
 #include <linux/mm.h>
@@ -28,13 +29,6 @@
 
 #define MaxSector (~(sector_t)0)
 
-/* Bad block numbers are stored sorted in a single page.
- * 64bits is used for each block or extent.
- * 54 bits are sector number, 9 bits are extent size,
- * 1 bit is an 'acknowledged' flag.
- */
-#define MD_MAX_BADBLOCKS	(PAGE_SIZE/8)
-
 /*
  * MD's 'extended' device
  */
@@ -111,22 +105,7 @@ struct md_rdev {
 	struct kernfs_node *sysfs_state; /* handle for 'state'
 					   * sysfs entry */
 
-	struct badblocks {
-		int	count;		/* count of bad blocks */
-		int	unacked_exist;	/* there probably are unacknowledged
-					 * bad blocks.  This is only cleared
-					 * when a read discovers none
-					 */
-		int	shift;		/* shift from sectors to block size
-					 * a -ve shift means badblocks are
-					 * disabled.*/
-		u64	*page;		/* badblock list */
-		int	changed;
-		seqlock_t lock;
-
-		sector_t sector;
-		sector_t size;		/* in sectors */
-	} badblocks;
+	struct badblocks badblocks;
 };
 enum flag_bits {
 	Faulty,			/* device is known to have a fault */
@@ -174,22 +153,11 @@ enum flag_bits {
 				 */
 };
 
-#define BB_LEN_MASK	(0x00000000000001FFULL)
-#define BB_OFFSET_MASK	(0x7FFFFFFFFFFFFE00ULL)
-#define BB_ACK_MASK	(0x8000000000000000ULL)
-#define BB_MAX_LEN	512
-#define BB_OFFSET(x)	(((x) & BB_OFFSET_MASK) >> 9)
-#define BB_LEN(x)	(((x) & BB_LEN_MASK) + 1)
-#define BB_ACK(x)	(!!((x) & BB_ACK_MASK))
-#define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((u64)(!!(ack)) << 63))
-
-extern int md_is_badblock(struct badblocks *bb, sector_t s, int sectors,
-			  sector_t *first_bad, int *bad_sectors);
 static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors,
 			      sector_t *first_bad, int *bad_sectors)
 {
 	if (unlikely(rdev->badblocks.count)) {
-		int rv = md_is_badblock(&rdev->badblocks, rdev->data_offset + s,
+		int rv = badblocks_check(&rdev->badblocks, rdev->data_offset + s,
 					sectors,
 					first_bad, bad_sectors);
 		if (rv)
@@ -202,8 +170,6 @@ extern int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 			      int is_new);
 extern int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 				int is_new);
-extern void md_ack_all_badblocks(struct badblocks *bb);
-
 struct md_cluster_info;
 
 struct mddev {
-- 
2.5.0



* Re: [PATCH v2 3/3] md: convert to use the generic badblocks code
  2015-11-25 18:43 ` [PATCH v2 3/3] md: convert to use the generic badblocks code Vishal Verma
@ 2015-12-01 18:55   ` Shaohua Li
  2015-12-01 19:52     ` Verma, Vishal L
  0 siblings, 1 reply; 23+ messages in thread
From: Shaohua Li @ 2015-12-01 18:55 UTC (permalink / raw)
  To: Vishal Verma
  Cc: linux-nvdimm, linux-block, linux-raid, linux-scsi, Jens Axboe,
	NeilBrown, Jeff Moyer

On Wed, Nov 25, 2015 at 11:43:33AM -0700, Vishal Verma wrote:
> Retain badblocks as part of rdev, but use the accessor functions from
> include/linux/badblocks for all manipulation.
> 
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> ---
>  drivers/md/md.c | 507 +++-----------------------------------------------------
>  drivers/md/md.h |  40 +----
>  2 files changed, 23 insertions(+), 524 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index c702de1..63eab20 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -34,6 +34,7 @@
>  
>  #include <linux/kthread.h>
>  #include <linux/blkdev.h>
> +#include <linux/badblocks.h>
>  #include <linux/sysctl.h>
>  #include <linux/seq_file.h>
>  #include <linux/fs.h>
> @@ -707,8 +708,7 @@ void md_rdev_clear(struct md_rdev *rdev)
>  		put_page(rdev->bb_page);
>  		rdev->bb_page = NULL;
>  	}
> -	kfree(rdev->badblocks.page);
> -	rdev->badblocks.page = NULL;
> +	badblocks_free(&rdev->badblocks);
>  }

Why does rdev have its own badblocks? The gendisk already has one.


* Re: [PATCH v2 3/3] md: convert to use the generic badblocks code
  2015-12-01 18:55   ` Shaohua Li
@ 2015-12-01 19:52     ` Verma, Vishal L
  0 siblings, 0 replies; 23+ messages in thread
From: Verma, Vishal L @ 2015-12-01 19:52 UTC (permalink / raw)
  To: shli
  Cc: linux-raid, linux-scsi, linux-nvdimm, linux-block, neilb, axboe, jmoyer

On Tue, 2015-12-01 at 10:55 -0800, Shaohua Li wrote:
> On Wed, Nov 25, 2015 at 11:43:33AM -0700, Vishal Verma wrote:
> > Retain badblocks as part of rdev, but use the accessor functions
> > from
> > include/linux/badblocks for all manipulation.
> > 
> > Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> > ---
> >  drivers/md/md.c | 507 +++----------------------------------------
> > -------------
> >  drivers/md/md.h |  40 +----
> >  2 files changed, 23 insertions(+), 524 deletions(-)
> > 
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index c702de1..63eab20 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -34,6 +34,7 @@
> >  
> >  #include <linux/kthread.h>
> >  #include <linux/blkdev.h>
> > +#include <linux/badblocks.h>
> >  #include <linux/sysctl.h>
> >  #include <linux/seq_file.h>
> >  #include <linux/fs.h>
> > @@ -707,8 +708,7 @@ void md_rdev_clear(struct md_rdev *rdev)
> >  		put_page(rdev->bb_page);
> >  		rdev->bb_page = NULL;
> >  	}
> > -	kfree(rdev->badblocks.page);
> > -	rdev->badblocks.page = NULL;
> > +	badblocks_free(&rdev->badblocks);
> >  }
> 
> why does rdev have extra badblocks? the gendisk already had one.

rdev originally had badblocks, and this patch set adds badblocks to
gendisk. It does appear that md's badblock tracking will be a bit
redundant if/once gendisk has badblocks support - see the discussion
here:
https://lists.01.org/pipermail/linux-nvdimm/2015-November/002980.html

	-Vishal


* Re: [PATCH v2 0/3] Badblock tracking for gendisks
  2015-11-25 18:43 [PATCH v2 0/3] Badblock tracking for gendisks Vishal Verma
                   ` (2 preceding siblings ...)
  2015-11-25 18:43 ` [PATCH v2 3/3] md: convert to use the generic badblocks code Vishal Verma
@ 2015-12-04 22:53 ` Verma, Vishal L
  3 siblings, 0 replies; 23+ messages in thread
From: Verma, Vishal L @ 2015-12-04 22:53 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-raid, linux-scsi, linux-block, neilb, axboe, jmoyer

On Wed, 2015-11-25 at 11:43 -0700, Vishal Verma wrote:
> v2:
>   - In badblocks_free, make 'page' NULL (patch 1)
>   - Move the core badblocks code to a new .c file (patch 1) (Jens)
>   - Fix a sizeof usage in disk_alloc_badblocks (patch 2) (Dan)
>   - Since disk_alloc_badblocks can fail, check disk->bb for NULL in
> the
>     genhd wrappers (patch 2) (Jeff)
>   - Update the md conversion to also use the badblocks init and free
>     functions (patch 3)
>   - Remove the BB_* macros from md.h as they are now in badblocks.h
> (patch 3)
> 
> Patch 1 copies badblock management code into a header of its own,
> making it generally available. It follows common libraries of code
> such as linked lists, where anyone may embed a core data structure
> in another place, and use the provided accessor functions to
> manipulate the data.
> 
> Patch 2 adds badblock tracking to gendisks (in preparation for use
> by NVDIMM devices). Right now, it is turned on unconditionally - I'd
> appreciate comments on if that is the right path.
> 
> Patch 3 converts md over to use the new badblocks 'library'. I have
> done some pretty simple testing on this - created a raid 1 device,
> made sure the sysfs entries show up, and can be used to add and view
> badblocks. A closer look by the md folks would be nice here.
> 
> 
> Vishal Verma (3):
>   badblocks: Add core badblock management code
>   block: Add badblock management for gendisks
>   md: convert to use the generic badblocks code
> 

Ping.

Jens, are you ok taking this through the block tree?
Any other comments from anyone else?

Thanks,
	-Vishal


* Re: [PATCH v2 1/3] badblocks: Add core badblock management code
  2015-11-25 18:43 ` [PATCH v2 1/3] badblocks: Add core badblock management code Vishal Verma
@ 2015-12-04 23:30   ` James Bottomley
  2015-12-04 23:58     ` Verma, Vishal L
  0 siblings, 1 reply; 23+ messages in thread
From: James Bottomley @ 2015-12-04 23:30 UTC (permalink / raw)
  To: Vishal Verma
  Cc: linux-nvdimm, linux-block, linux-raid, linux-scsi, Jens Axboe,
	NeilBrown, Jeff Moyer

On Wed, 2015-11-25 at 11:43 -0700, Vishal Verma wrote:
> Take the core badblocks implementation from md, and make it generally
> available. This follows the same style as kernel implementations of
> linked lists, rb-trees etc, where you can have a structure that can be
> embedded anywhere, and accessor functions to manipulate the data.
> 
> The only changes in this copy of the code are ones to generalize
> function/variable names from md-specific ones. Also add init and free
> functions.
> 
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> ---
>  block/Makefile            |   2 +-
>  block/badblocks.c         | 523 ++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/badblocks.h |  53 +++++
>  3 files changed, 577 insertions(+), 1 deletion(-)
>  create mode 100644 block/badblocks.c
>  create mode 100644 include/linux/badblocks.h
> 
> diff --git a/block/Makefile b/block/Makefile
> index 00ecc97..db5f622 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -8,7 +8,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
>  			blk-iopoll.o blk-lib.o blk-mq.o blk-mq-tag.o \
>  			blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
>  			genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
> -			partitions/
> +			badblocks.o partitions/
>  
>  obj-$(CONFIG_BOUNCE)	+= bounce.o
>  obj-$(CONFIG_BLK_DEV_BSG)	+= bsg.o
> diff --git a/block/badblocks.c b/block/badblocks.c
> new file mode 100644
> index 0000000..6e07855
> --- /dev/null
> +++ b/block/badblocks.c
> @@ -0,0 +1,523 @@
> +/*
> + * Bad block management
> + *
> + * - Heavily based on MD badblocks code from Neil Brown
> + *
> + * Copyright (c) 2015, Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#include <linux/badblocks.h>
> +#include <linux/seqlock.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/stddef.h>
> +#include <linux/types.h>
> +#include <linux/slab.h>
> +
> +/*
> + * We can record which blocks on each device are 'bad' and so just
> + * fail those blocks, or that stripe, rather than the whole device.
> + * Entries in the bad-block table are 64bits wide.  This comprises:
> + * Length of bad-range, in sectors: 0-511 for lengths 1-512
> + * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
> + *  A 'shift' can be set so that larger blocks are tracked and
> + *  consequently larger devices can be covered.
> + * 'Acknowledged' flag - 1 bit. - the most significant bit.
> + *
> + * Locking of the bad-block table uses a seqlock so badblocks_check
> + * might need to retry if it is very unlucky.
> + * We will sometimes want to check for bad blocks in a bi_end_io function,
> + * so we use the write_seqlock_irq variant.
> + *
> + * When looking for a bad block we specify a range and want to
> + * know if any block in the range is bad.  So we binary-search
> + * to the last range that starts at-or-before the given endpoint,
> + * (or "before the sector after the target range")
> + * then see if it ends after the given start.
> + * We return
> + *  0 if there are no known bad blocks in the range
> + *  1 if there are known bad block which are all acknowledged
> + * -1 if there are bad blocks which have not yet been acknowledged in metadata.
> + * plus the start/length of the first bad section we overlap.
> + */

This comment should be docbook.
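
For readers following the thread, the 64-bit entry layout that the quoted
comment describes can be sketched in userspace roughly as below. The BB_*
macros are copied from the md.h hunk of patch 3; bb_pack() is a hypothetical
helper added only for illustration, and uint64_t stands in for the kernel's
u64.

```c
#include <stdint.h>

/*
 * Userspace copy of the entry layout from the patch:
 * bits 0-8 hold (length - 1), bits 9-62 hold the start sector,
 * bit 63 is the 'acknowledged' flag.
 */
#define BB_LEN_MASK	(0x00000000000001FFULL)
#define BB_OFFSET_MASK	(0x7FFFFFFFFFFFFE00ULL)
#define BB_ACK_MASK	(0x8000000000000000ULL)
#define BB_MAX_LEN	512
#define BB_OFFSET(x)	(((x) & BB_OFFSET_MASK) >> 9)
#define BB_LEN(x)	(((x) & BB_LEN_MASK) + 1)
#define BB_ACK(x)	(!!((x) & BB_ACK_MASK))
#define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((uint64_t)(!!(ack)) << 63))

/* Hypothetical helper (not in the patch): pack one bad range into an entry. */
static uint64_t bb_pack(uint64_t start, unsigned int len, int ack)
{
	/* len must be in 1..BB_MAX_LEN; longer ranges need several entries */
	return BB_MAKE(start, (uint64_t)len, ack);
}
```

Because the length is stored as length - 1 in nine bits, one entry covers at
most BB_MAX_LEN (512) blocks, which is why badblocks_set splits longer ranges
across several entries.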

> +int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
> +			sector_t *first_bad, int *bad_sectors)
[...]
> +
> +/*
> + * Add a range of bad blocks to the table.
> + * This might extend the table, or might contract it
> + * if two adjacent ranges can be merged.
> + * We binary-search to find the 'insertion' point, then
> + * decide how best to handle it.
> + */

And this one, plus you don't document returns.  It looks like this
function returns 1 on success and zero on failure, which is really
counter-intuitive for the kernel: zero is usually returned on success
and negative error on failure.
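
A minimal sketch of the difference between the two conventions, assuming a
stand-in for badblocks_set that only models its return value. Both function
names here are hypothetical; the choice of -ENOSPC mirrors what
badblocks_store already returns when the set fails.

```c
#include <errno.h>

/*
 * Stand-in for the patch's badblocks_set(): returns 1 on success and
 * 0 when the table has no room. The real function takes a
 * struct badblocks and a sector range; only the return value is modelled.
 */
static int legacy_set(int table_full)
{
	return table_full ? 0 : 1;
}

/* Hypothetical wrapper using the usual kernel style: 0 or negative errno. */
static int set_errno_style(int table_full)
{
	return legacy_set(table_full) ? 0 : -ENOSPC;
}
```

With the second form a caller can write "if (rv) goto err;" and propagate the
error code unchanged, which is what most kernel call sites expect.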

> +int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
> +			int acknowledged)
[...]
> +
> +/*
> + * Remove a range of bad blocks from the table.
> + * This may involve extending the table if we spilt a region,
> + * but it must not fail.  So if the table becomes full, we just
> + * drop the remove request.
> + */

Docbook and document returns.  This time they're the kernel standard of
0 on success and negative error on failure making the convention for
badblocks_set even more counterintuitive.

> +int badblocks_clear(struct badblocks *bb, sector_t s, int sectors)
> +{
[...]
> +#define DO_DEBUG 1

Why have this at all if it's unconditionally defined and always set?

> +ssize_t badblocks_store(struct badblocks *bb, const char *page, size_t len,
> +			int unack)
[...]
> +int badblocks_init(struct badblocks *bb, int enable)
> +{
> +	bb->count = 0;
> +	if (enable)
> +		bb->shift = 0;
> +	else
> +		bb->shift = -1;
> +	bb->page = kmalloc(PAGE_SIZE, GFP_KERNEL);

Why not __get_free_page(GFP_KERNEL)?  The problem with kmalloc of an
exactly known page sized quantity is that the slab tracker for this
requires two contiguous pages for each page because of the overhead.

James




* Re: [PATCH v2 2/3] block: Add badblock management for gendisks
  2015-11-25 18:43 ` [PATCH v2 2/3] block: Add badblock management for gendisks Vishal Verma
@ 2015-12-04 23:33   ` James Bottomley
  2015-12-05  0:17     ` Verma, Vishal L
  0 siblings, 1 reply; 23+ messages in thread
From: James Bottomley @ 2015-12-04 23:33 UTC (permalink / raw)
  To: Vishal Verma
  Cc: linux-nvdimm, linux-block, linux-raid, linux-scsi, Jens Axboe,
	NeilBrown, Jeff Moyer

On Wed, 2015-11-25 at 11:43 -0700, Vishal Verma wrote:
> NVDIMM devices, which can behave more like DRAM rather than block
> devices, may develop bad cache lines, or 'poison'. A block device
> exposed by the pmem driver can then consume poison via a read (or
> write), and cause a machine check. On platforms without machine
> check recovery features, this would mean a crash.
> 
> The block device maintaining a runtime list of all known sectors that
> have poison can directly avoid this, and also provide a path forward
> to enable proper handling/recovery for DAX faults on such a device.
> 
> Use the new badblock management interfaces to add a badblocks list to
> gendisks.
> 
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> ---
>  block/genhd.c         | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/genhd.h |  6 ++++
>  2 files changed, 87 insertions(+)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index 0c706f3..84fd65c 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -20,6 +20,7 @@
>  #include <linux/idr.h>
>  #include <linux/log2.h>
>  #include <linux/pm_runtime.h>
> +#include <linux/badblocks.h>
>  
>  #include "blk.h"
>  
> @@ -505,6 +506,20 @@ static int exact_lock(dev_t devt, void *data)
>  	return 0;
>  }
>  
> +static void disk_alloc_badblocks(struct gendisk *disk)
> +{
> +	disk->bb = kzalloc(sizeof(*(disk->bb)), GFP_KERNEL);
> +	if (!disk->bb) {
> +		pr_warn("%s: failed to allocate space for badblocks\n",
> +			disk->disk_name);
> +		return;
> +	}
> +
> +	if (badblocks_init(disk->bb, 1))
> +		pr_warn("%s: failed to initialize badblocks\n",
> +			disk->disk_name);
> +}
> +
>  static void register_disk(struct gendisk *disk)
>  {
>  	struct device *ddev = disk_to_dev(disk);
> @@ -609,6 +624,7 @@ void add_disk(struct gendisk *disk)
>  	disk->first_minor = MINOR(devt);
>  
>  	disk_alloc_events(disk);
> +	disk_alloc_badblocks(disk);

Why unconditionally do this?  No-one currently uses the interface, but
every disk will now pay the price of an additional structure plus a page
for no benefit.  You should probably either export the initializer for
those who want to use it or, perhaps even better, make it lazily
allocated the first time anyone tries to set a bad block.

If you come up with a really good reason for allocating it
unconditionally, then it should probably be an embedded structure in the
gendisk.
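
The lazy-allocation option could look roughly like the userspace sketch
below. Everything in it is an assumption made for illustration: struct
bb_table and struct disk are much-reduced stand-ins for struct badblocks and
struct gendisk, BB_PAGE_SIZE assumes a 4096-byte page, and malloc/calloc
stand in for the kernel allocators. It also ignores locking, which a real
implementation would need if two contexts can hit the first-use path at once.

```c
#include <stdint.h>
#include <stdlib.h>

#define BB_PAGE_SIZE 4096	/* assumed page size for this sketch */

struct bb_table {		/* hypothetical, reduced struct badblocks */
	int count;
	uint64_t *page;		/* entry array, one page in size */
};

struct disk {			/* hypothetical stand-in for struct gendisk */
	struct bb_table *bb;	/* NULL until a bad block is first recorded */
};

/* Allocate the table on first use. Returns 0 on success, -1 on failure. */
static int disk_bb_get(struct disk *d)
{
	struct bb_table *bb;

	if (d->bb)
		return 0;
	bb = calloc(1, sizeof(*bb));
	if (!bb)
		return -1;
	bb->page = malloc(BB_PAGE_SIZE);
	if (!bb->page) {
		free(bb);
		return -1;
	}
	d->bb = bb;
	return 0;
}

/* Record one packed entry, allocating lazily. No sorting or merging here. */
static int disk_bb_set(struct disk *d, uint64_t entry)
{
	if (disk_bb_get(d))
		return -1;
	if (d->bb->count >= (int)(BB_PAGE_SIZE / sizeof(uint64_t)))
		return -1;
	d->bb->page[d->bb->count++] = entry;
	return 0;
}
```

Disks that never see a bad block then cost only one NULL pointer, and the
check and clear paths can treat a NULL table as "no bad blocks".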

James




* Re: [PATCH v2 1/3] badblocks: Add core badblock management code
  2015-12-04 23:30   ` James Bottomley
@ 2015-12-04 23:58     ` Verma, Vishal L
  2015-12-05  0:06       ` James Bottomley
                         ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Verma, Vishal L @ 2015-12-04 23:58 UTC (permalink / raw)
  To: James.Bottomley, neilb
  Cc: linux-raid, linux-scsi, linux-nvdimm, linux-block, jmoyer, axboe

On Fri, 2015-12-04 at 15:30 -0800, James Bottomley wrote:
[...]
> > + * We return
> > + *  0 if there are no known bad blocks in the range
> > + *  1 if there are known bad block which are all acknowledged
> > + * -1 if there are bad blocks which have not yet been acknowledged
> > in metadata.
> > + * plus the start/length of the first bad section we overlap.
> > + */
> 
> This comment should be docbook.

This applies to all of your comments (and they are all valid): I simply
copied over all this from md. I'm happy to make the changes to comments,
and the other two things (see below) if that's the right thing to do --
I just tried to keep my own changes to the original md badblocks code
minimal.
Would it be better (for review-ability) if I made these changes in a new
patch on top of this, or should I just squash them into this one?

> 
> > +int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
> > +			sector_t *first_bad, int *bad_sectors)
> [...]
> > +
> > +/*
> > + * Add a range of bad blocks to the table.
> > + * This might extend the table, or might contract it
> > + * if two adjacent ranges can be merged.
> > + * We binary-search to find the 'insertion' point, then
> > + * decide how best to handle it.
> > + */
> 
> And this one, plus you don't document returns.  It looks like this
> function returns 1 on success and zero on failure, which is really
> counter-intuitive for the kernel: zero is usually returned on success
> and negative error on failure.
> 
> > +int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
> > +			int acknowledged)
> [...]
> > +
> > +/*
> > + * Remove a range of bad blocks from the table.
> > + * This may involve extending the table if we spilt a region,
> > + * but it must not fail.  So if the table becomes full, we just
> > + * drop the remove request.
> > + */
> 
> Docbook and document returns.  This time they're the kernel standard
> of
> 0 on success and negative error on failure making the convention for
> badblocks_set even more counterintuitive.
> 
> > +int badblocks_clear(struct badblocks *bb, sector_t s, int sectors)
> > +{
> [...]
> > +#define DO_DEBUG 1
> 
> Why have this at all if it's unconditionally defined and always set.

Neil - any reason or anything you had in mind for this? Or is it just an
artifact that can be removed?

> 
> > +ssize_t badblocks_store(struct badblocks *bb, const char *page,
> > size_t len,
> > +			int unack)
> [...]
> > +int badblocks_init(struct badblocks *bb, int enable)
> > +{
> > +	bb->count = 0;
> > +	if (enable)
> > +		bb->shift = 0;
> > +	else
> > +		bb->shift = -1;
> > +	bb->page = kmalloc(PAGE_SIZE, GFP_KERNEL);
> 
> Why not __get_free_page(GFP_KERNEL)?  The problem with kmalloc of an
> exactly known page sized quantity is that the slab tracker for this
> requires two contiguous pages for each page because of the overhead.

Cool, I didn't know about __get_free_page - I can fix this up too.

> 
> James
> 
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/3] badblocks: Add core badblock management code
  2015-12-04 23:58     ` Verma, Vishal L
@ 2015-12-05  0:06       ` James Bottomley
  2015-12-05  0:11         ` Verma, Vishal L
  2015-12-08 21:03       ` NeilBrown
  2015-12-22  5:34       ` NeilBrown
  2 siblings, 1 reply; 23+ messages in thread
From: James Bottomley @ 2015-12-05  0:06 UTC (permalink / raw)
  To: Verma, Vishal L
  Cc: neilb, linux-raid, linux-scsi, linux-nvdimm, linux-block, jmoyer, axboe

On Fri, 2015-12-04 at 23:58 +0000, Verma, Vishal L wrote:
> On Fri, 2015-12-04 at 15:30 -0800, James Bottomley wrote:
> [...]
> > > + * We return
> > > + *  0 if there are no known bad blocks in the range
> > > > + *  1 if there are known bad blocks which are all acknowledged
> > > + * -1 if there are bad blocks which have not yet been acknowledged
> > > in metadata.
> > > + * plus the start/length of the first bad section we overlap.
> > > + */
> > 
> > This comment should be docbook.
> 
> Applicable to all your comments - (and they are all valid), I simply
> copied over all this from md. I'm happy to make the changes to comments,
> and the other two things (see below) if that's the right thing to do --
> I just tried to keep my own changes to the original md badblocks code
> minimal.
> Would it be better (for review-ability) if I made these changes in a new
> patch on top of this, or should I just squash them into this one?

If you were moving it, that might be appropriate.  However, this is
effectively new code because you're not removing the original, so we
should begin at least with a coherent API. (i.e. corrections to the
original patch rather than incremental).

Thanks,

James


> > 
> > > +int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
> > > +			sector_t *first_bad, int *bad_sectors)
> > [...]
> > > +
> > > +/*
> > > + * Add a range of bad blocks to the table.
> > > + * This might extend the table, or might contract it
> > > + * if two adjacent ranges can be merged.
> > > + * We binary-search to find the 'insertion' point, then
> > > + * decide how best to handle it.
> > > + */
> > 
> > And this one, plus you don't document returns.  It looks like this
> > function returns 1 on success and zero on failure, which is really
> > counter-intuitive for the kernel: zero is usually returned on success
> > and negative error on failure.
> > 
> > > +int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
> > > +			int acknowledged)
> > [...]
> > > +
> > > +/*
> > > + * Remove a range of bad blocks from the table.
> > > + * This may involve extending the table if we split a region,
> > > + * but it must not fail.  So if the table becomes full, we just
> > > + * drop the remove request.
> > > + */
> > 
> > Docbook and document returns.  This time they're the kernel standard
> > of
> > 0 on success and negative error on failure making the convention for
> > badblocks_set even more counterintuitive.
> > 
> > > +int badblocks_clear(struct badblocks *bb, sector_t s, int sectors)
> > > +{
> > [...]
> > > +#define DO_DEBUG 1
> > 
> > Why have this at all if it's unconditionally defined and always set.
> 
> Neil - any reason or anything you had in mind for this? Or is it just an
> artifact and can be removed.
> 
> > 
> > > +ssize_t badblocks_store(struct badblocks *bb, const char *page,
> > > size_t len,
> > > +			int unack)
> > [...]
> > > +int badblocks_init(struct badblocks *bb, int enable)
> > > +{
> > > +	bb->count = 0;
> > > +	if (enable)
> > > +		bb->shift = 0;
> > > +	else
> > > +		bb->shift = -1;
> > > +	bb->page = kmalloc(PAGE_SIZE, GFP_KERNEL);
> > 
> > Why not __get_free_page(GFP_KERNEL)?  The problem with kmalloc of an
> > exactly known page sized quantity is that the slab tracker for this
> > requires two contiguous pages for each page because of the overhead.
> 
> Cool, I didn't know about __get_free_page - I can fix this up too.
> 
> > 
> > James
> > 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/3] badblocks: Add core badblock management code
  2015-12-05  0:06       ` James Bottomley
@ 2015-12-05  0:11         ` Verma, Vishal L
  0 siblings, 0 replies; 23+ messages in thread
From: Verma, Vishal L @ 2015-12-05  0:11 UTC (permalink / raw)
  To: James.Bottomley
  Cc: linux-raid, linux-scsi, linux-nvdimm, neilb, linux-block, jmoyer, axboe

On Fri, 2015-12-04 at 16:06 -0800, James Bottomley wrote:
> On Fri, 2015-12-04 at 23:58 +0000, Verma, Vishal L wrote:
> > On Fri, 2015-12-04 at 15:30 -0800, James Bottomley wrote:
> > [...]
> > > > + * We return
> > > > + *  0 if there are no known bad blocks in the range
> > > > > + *  1 if there are known bad blocks which are all acknowledged
> > > > + * -1 if there are bad blocks which have not yet been
> > > > acknowledged
> > > > in metadata.
> > > > + * plus the start/length of the first bad section we overlap.
> > > > + */
> > > 
> > > This comment should be docbook.
> > 
> > Applicable to all your comments - (and they are all valid), I simply
> > copied over all this from md. I'm happy to make the changes to
> > comments,
> > and the other two things (see below) if that's the right thing to do
> > --
> > I just tried to keep my own changes to the original md badblocks
> > code
> > minimal.
> > Would it be better (for review-ability) if I made these changes in a
> > new
> > patch on top of this, or should I just squash them into this one?
> 
> If you were moving it, that might be appropriate.  However, this is
> effectively new code because you're not removing the original, so we
> should begin at least with a coherent API. (i.e. corrections to the
> original patch rather than incremental).
> 

Patch 3 does remove the original code, but yes, I agree. Will send
another version.

Thanks for the review.

	-Vishal

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/3] block: Add badblock management for gendisks
  2015-12-04 23:33   ` James Bottomley
@ 2015-12-05  0:17     ` Verma, Vishal L
  0 siblings, 0 replies; 23+ messages in thread
From: Verma, Vishal L @ 2015-12-05  0:17 UTC (permalink / raw)
  To: James.Bottomley
  Cc: linux-raid, linux-scsi, linux-nvdimm, linux-block, neilb, axboe, jmoyer

On Fri, 2015-12-04 at 15:33 -0800, James Bottomley wrote:
[...]
> >  static void register_disk(struct gendisk *disk)
> >  {
> >  	struct device *ddev = disk_to_dev(disk);
> > @@ -609,6 +624,7 @@ void add_disk(struct gendisk *disk)
> >  	disk->first_minor = MINOR(devt);
> >  
> >  	disk_alloc_events(disk);
> > +	disk_alloc_badblocks(disk);
> 
> Why unconditionally do this?  No-one currently uses the interface, but
> every disk will now pay the price of an additional structure plus a
> page
> for no benefit.  You should probably either export the initializer for
> those who want to use it or, perhaps even better, make it lazily
> allocated the first time anyone tries to set a bad block.
> 
> If you come up with a really good reason for allocating it
> unconditionally, then it should probably be an embedded structure in
> the gendisk.
> 
Agreed - I'll fix for v3.

I'm considering an embedded structure in gendisk (same as md) (why is
this preferred to pointer chasing, especially when this wastes more
space?), and a new exported initializer that is used by anyone who wants
to use gendisk's badblocks.

	-Vishal
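
As an illustration of the lazy-allocation option raised in the review, here is a minimal userspace sketch. It is not code from this series: every toy_* name and TOY_TABLE_BYTES are hypothetical stand-ins for the gendisk and badblocks structures, and calloc() stands in for kernel allocation. The table is only allocated the first time a bad block is recorded, and the read-side wrapper tolerates a NULL pointer.

```c
#include <errno.h>
#include <stdlib.h>

#define TOY_TABLE_BYTES 4096	/* hypothetical table size, one page */

/* Hypothetical stand-in for struct badblocks. */
struct toy_badblocks {
	int count;			/* entries currently in use */
	unsigned long long *table;	/* storage, allocated on demand */
};

/* Hypothetical stand-in for struct gendisk. */
struct toy_disk {
	struct toy_badblocks *bb;	/* NULL until someone needs it */
};

/* Initializer: only disks that want tracking pay for the table. */
int toy_disk_alloc_badblocks(struct toy_disk *disk)
{
	struct toy_badblocks *bb;

	if (disk->bb)
		return 0;		/* already set up */

	bb = calloc(1, sizeof(*bb));
	if (!bb)
		return -ENOMEM;
	bb->table = calloc(1, TOY_TABLE_BYTES);
	if (!bb->table) {
		free(bb);
		return -ENOMEM;
	}
	disk->bb = bb;
	return 0;
}

/* Lazy path: allocate on the first attempt to record a bad sector. */
int toy_disk_set_badblock(struct toy_disk *disk, unsigned long long sector)
{
	int ret = toy_disk_alloc_badblocks(disk);

	if (ret)
		return ret;
	if ((size_t)disk->bb->count >= TOY_TABLE_BYTES / sizeof(*disk->bb->table))
		return -ENOSPC;
	disk->bb->table[disk->bb->count++] = sector;
	return 0;
}

/* Read-side wrapper: a disk without a table simply has no bad blocks. */
int toy_disk_badblock_count(const struct toy_disk *disk)
{
	return disk->bb ? disk->bb->count : 0;
}

void toy_disk_free_badblocks(struct toy_disk *disk)
{
	if (!disk->bb)
		return;
	free(disk->bb->table);
	free(disk->bb);
	disk->bb = NULL;
}
```

With a pointer, a disk that never records a bad block costs only the pointer itself; an embedded structure would avoid the extra allocation and NULL checks at the price of carrying the structure in every disk.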

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/3] badblocks: Add core badblock management code
  2015-12-04 23:58     ` Verma, Vishal L
  2015-12-05  0:06       ` James Bottomley
@ 2015-12-08 21:03       ` NeilBrown
  2015-12-08 21:08         ` Verma, Vishal L
  2015-12-22  5:34       ` NeilBrown
  2 siblings, 1 reply; 23+ messages in thread
From: NeilBrown @ 2015-12-08 21:03 UTC (permalink / raw)
  To: Verma, Vishal L, James.Bottomley
  Cc: linux-raid, linux-scsi, linux-nvdimm, linux-block, jmoyer, axboe

[-- Attachment #1: Type: text/plain, Size: 705 bytes --]

On Sat, Dec 05 2015, Verma, Vishal L wrote:
>> 
>> > +int badblocks_clear(struct badblocks *bb, sector_t s, int sectors)
>> > +{
>> [...]
>> > +#define DO_DEBUG 1
>> 
>> Why have this at all if it's unconditionally defined and always set.
>
> Neil - any reason or anything you had in mind for this? Or is it just an
> artifact and can be removed.

Like the comment says:

	/* Allow clearing via sysfs *only* for testing/debugging.
	 * Normally only a successful write may clear a badblock
	 */

The DO_DEBUG define and ifdefs are documentation identifying bits of
code that should be removed when it all seems to be working.
Maybe now is a good time to remove that code.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/3] badblocks: Add core badblock management code
  2015-12-08 21:03       ` NeilBrown
@ 2015-12-08 21:08         ` Verma, Vishal L
  2015-12-08 21:18           ` Dan Williams
  0 siblings, 1 reply; 23+ messages in thread
From: Verma, Vishal L @ 2015-12-08 21:08 UTC (permalink / raw)
  To: Williams, Dan J, James.Bottomley, neilb
  Cc: linux-raid, linux-scsi, linux-nvdimm, linux-block, jmoyer, axboe

[-- Attachment #1: Type: text/plain, Size: 1154 bytes --]

On Wed, 2015-12-09 at 08:03 +1100, NeilBrown wrote:
> On Sat, Dec 05 2015, Verma, Vishal L wrote:
> > > 
> > > > +int badblocks_clear(struct badblocks *bb, sector_t s, int
> > > > sectors)
> > > > +{
> > > [...]
> > > > +#define DO_DEBUG 1
> > > 
> > > Why have this at all if it's unconditionally defined and always
> > > set.
> > 
> > Neil - any reason or anything you had in mind for this? Or is it
> > just an
> > artifact and can be removed.
> 
> Like the comment says:
> 
> 	/* Allow clearing via sysfs *only* for testing/debugging.
> 	 * Normally only a successful write may clear a badblock
> 	 */
> 
> The DO_DEBUG define and ifdefs are documentation identifying bits of
> code that should be removed when it all seems to be working.
> Maybe now is a good time to remove that code.
> 
Hm, I think it would be nice to continue to have the ability to clear
badblocks using sysfs at least for a while more, as we test the various
error handling paths for NVDIMMS (Dan, thoughts?).

We could either remove it later or (I'm leaning towards) make it a
config option similar to FAIL_MAKE_REQUEST and friends..

	-Vishal

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/3] badblocks: Add core badblock management code
  2015-12-08 21:08         ` Verma, Vishal L
@ 2015-12-08 21:18           ` Dan Williams
  2015-12-08 23:47             ` Verma, Vishal L
  0 siblings, 1 reply; 23+ messages in thread
From: Dan Williams @ 2015-12-08 21:18 UTC (permalink / raw)
  To: Verma, Vishal L
  Cc: James.Bottomley@HansenPartnership.com, neilb, linux-raid,
	linux-scsi, linux-nvdimm, linux-block, jmoyer, axboe

On Tue, Dec 8, 2015 at 1:08 PM, Verma, Vishal L
<vishal.l.verma@intel.com> wrote:
> On Wed, 2015-12-09 at 08:03 +1100, NeilBrown wrote:
>> On Sat, Dec 05 2015, Verma, Vishal L wrote:
>> > >
>> > > > +int badblocks_clear(struct badblocks *bb, sector_t s, int
>> > > > sectors)
>> > > > +{
>> > > [...]
>> > > > +#define DO_DEBUG 1
>> > >
>> > > Why have this at all if it's unconditionally defined and always
>> > > set.
>> >
>> > Neil - any reason or anything you had in mind for this? Or is it
>> > just an
>> > artifact and can be removed.
>>
>> Like the comment says:
>>
>>       /* Allow clearing via sysfs *only* for testing/debugging.
>>        * Normally only a successful write may clear a badblock
>>        */
>>
>> The DO_DEBUG define and ifdefs are documentation identifying bits of
>> code that should be removed when it all seems to be working.
>> Maybe now is a good time to remove that code.
>>
> Hm, I think it would be nice to continue to have the ability to clear
> badblocks using sysfs at least for a while more, as we test the various
> error handling paths for NVDIMMS (Dan, thoughts?).
>
> We could either remove it later or (I'm leaning towards) make it a
> config option similar to FAIL_MAKE_REQUEST and friends..

"later" as in before v4.5-rc1?  We can always carry this debug feature
locally for testing.  We don't want userspace growing ABI attachments
to this capability now that it's more than just md tooling that will
see this.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/3] badblocks: Add core badblock management code
  2015-12-08 21:18           ` Dan Williams
@ 2015-12-08 23:47             ` Verma, Vishal L
  0 siblings, 0 replies; 23+ messages in thread
From: Verma, Vishal L @ 2015-12-08 23:47 UTC (permalink / raw)
  To: Williams, Dan J
  Cc: neilb, linux-block, jmoyer, linux-raid, linux-scsi, axboe,
	linux-nvdimm, James.Bottomley

On Tue, 2015-12-08 at 13:18 -0800, Dan Williams wrote:
> On Tue, Dec 8, 2015 at 1:08 PM, Verma, Vishal L
> <vishal.l.verma@intel.com> wrote:
> > On Wed, 2015-12-09 at 08:03 +1100, NeilBrown wrote:
> > > On Sat, Dec 05 2015, Verma, Vishal L wrote:
> > > > > 
> > > > > > +int badblocks_clear(struct badblocks *bb, sector_t s, int
> > > > > > sectors)
> > > > > > +{
> > > > > [...]
> > > > > > +#define DO_DEBUG 1
> > > > > 
> > > > > Why have this at all if it's unconditionally defined and
> > > > > always
> > > > > set.
> > > > 
> > > > Neil - any reason or anything you had in mind for this? Or is it
> > > > just an
> > > > artifact and can be removed.
> > > 
> > > Like the comment says:
> > > 
> > >       /* Allow clearing via sysfs *only* for testing/debugging.
> > >        * Normally only a successful write may clear a badblock
> > >        */
> > > 
> > > The DO_DEBUG define and ifdefs are documentation identifying bits
> > > of
> > > code that should be removed when it all seems to be working.
> > > Maybe now is a good time to remove that code.
> > > 
> > Hm, I think it would be nice to continue to have the ability to
> > clear
> > badblocks using sysfs at least for a while more, as we test the
> > various
> > error handling paths for NVDIMMS (Dan, thoughts?).
> > 
> > We could either remove it later or (I'm leaning towards) make it a
> > config option similar to FAIL_MAKE_REQUEST and friends..
> 
> "later" as in before v4.5-rc1?  We can always carry this debug feature
> locally for testing.  We don't want userspace growing ABI attachments
> to this capability now that it's more than just md tooling that will
> see this.


Agreed. The following incremental patch removes sysfs support.
All the latest badblocks patches can also be found at:

git://git.kernel.org/pub/scm/linux/kernel/git/vishal/nvdimm.git gendisk-badblocks


8<-----
From 5f0e7ac31d27a132f314106f1db33af22fde03ed Mon Sep 17 00:00:00 2001
From: Vishal Verma <vishal.l.verma@intel.com>
Date: Tue, 8 Dec 2015 16:28:31 -0700
Subject: [PATCH v4 4/3] badblocks: remove support for clearing via sysfs

sysfs support for clearing badblocks was originally meant for testing
only. With the move to generalize the interface, remove this support so
that userspace doesn't start treating this as an ABI.

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
 block/badblocks.c | 15 ---------------
 1 file changed, 15 deletions(-)

diff --git a/block/badblocks.c b/block/badblocks.c
index f0ac279..e5d2a91 100644
--- a/block/badblocks.c
+++ b/block/badblocks.c
@@ -503,16 +503,6 @@ ssize_t badblocks_store(struct badblocks *bb, const char *page, size_t len,
 	int length;
 	char newline;
 
-	/* Allow clearing via sysfs *only* for testing/debugging.
-	 * Normally only a successful write may clear a badblock
-	 */
-	int clear = 0;
-
-	if (page[0] == '-') {
-		clear = 1;
-		page++;
-	}
-
 	switch (sscanf(page, "%llu %d%c", &sector, &length, &newline)) {
 	case 3:
 		if (newline != '\n')
@@ -525,11 +515,6 @@ ssize_t badblocks_store(struct badblocks *bb, const char *page, size_t len,
 		return -EINVAL;
 	}
 
-	if (clear) {
-		badblocks_clear(bb, sector, length);
-		return len;
-	}
-
 	if (badblocks_set(bb, sector, length, !unack))
 		return -ENOSPC;
 	else
-- 
2.5.0
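
For readers following the parsing that remains in badblocks_store() after this patch, the sscanf() call visible in the diff context can be exercised in userspace. The sketch below is an approximation: parse_badblock_line() is a hypothetical helper name, and the "length <= 0" rejection in the two-conversion case is an assumption, since that part of the function is not shown in the hunks above.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse "<sector> <length>" with an optional
 * trailing newline, using the same sscanf() format as the patch context.
 * Returns 0 on success and -EINVAL on malformed input.
 */
int parse_badblock_line(const char *page, unsigned long long *sector,
			int *length)
{
	char newline;

	switch (sscanf(page, "%llu %d%c", sector, length, &newline)) {
	case 3:
		/* A third conversion is only valid if it is the newline. */
		if (newline != '\n')
			return -EINVAL;
		/* fall through */
	case 2:
		/* Assumed check: a range must cover at least one sector. */
		if (*length <= 0)
			return -EINVAL;
		break;
	default:
		/* Fewer than two conversions: not a sector/length pair. */
		return -EINVAL;
	}
	return 0;
}
```

Passing "1024 8\n" yields sector 1024 and length 8, while "1024 8x", "1024", and "abc" are all rejected with -EINVAL.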

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/3] badblocks: Add core badblock management code
  2015-12-04 23:58     ` Verma, Vishal L
  2015-12-05  0:06       ` James Bottomley
  2015-12-08 21:03       ` NeilBrown
@ 2015-12-22  5:34       ` NeilBrown
  2015-12-22 22:13         ` Verma, Vishal L
  2 siblings, 1 reply; 23+ messages in thread
From: NeilBrown @ 2015-12-22  5:34 UTC (permalink / raw)
  To: Verma, Vishal L, James.Bottomley
  Cc: linux-raid, linux-scsi, linux-nvdimm, linux-block, jmoyer, axboe

[-- Attachment #1: Type: text/plain, Size: 1432 bytes --]

On Sat, Dec 05 2015, Verma, Vishal L wrote:

> On Fri, 2015-12-04 at 15:30 -0800, James Bottomley wrote:
> [...]
>> > +ssize_t badblocks_store(struct badblocks *bb, const char *page,
>> > size_t len,
>> > +			int unack)
>> [...]
>> > +int badblocks_init(struct badblocks *bb, int enable)
>> > +{
>> > +	bb->count = 0;
>> > +	if (enable)
>> > +		bb->shift = 0;
>> > +	else
>> > +		bb->shift = -1;
>> > +	bb->page = kmalloc(PAGE_SIZE, GFP_KERNEL);
>> 
>> Why not __get_free_page(GFP_KERNEL)?  The problem with kmalloc of an
>> exactly known page sized quantity is that the slab tracker for this
>> requires two contiguous pages for each page because of the overhead.
>
> Cool, I didn't know about __get_free_page - I can fix this up too.
>

I was reminded of this just recently, so I thought I should clear up the
misunderstanding.

kmalloc(PAGE_SIZE) does *not* incur significant overhead and certainly
does not require two contiguous free pages.
If you "grep kmalloc-4096 /proc/slabinfo" you will note that both
objperslab and pagesperslab are 1.  So one page is used to store each
4096 byte allocation.

To quote the email from Linus which reminded me about this

> If you
> want to allocate a page, and get a pointer, just use "kmalloc()".
> Boom, done!

https://lkml.org/lkml/2015/12/21/605

There probably is a small CPU overhead from using kmalloc, but no memory
overhead.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/3] badblocks: Add core badblock management code
  2015-12-22  5:34       ` NeilBrown
@ 2015-12-22 22:13         ` Verma, Vishal L
  2015-12-22 23:06           ` NeilBrown
  0 siblings, 1 reply; 23+ messages in thread
From: Verma, Vishal L @ 2015-12-22 22:13 UTC (permalink / raw)
  To: James.Bottomley, neilb
  Cc: linux-raid, linux-scsi, linux-nvdimm, linux-block, jmoyer, axboe

[-- Attachment #1: Type: text/plain, Size: 1940 bytes --]

On Tue, 2015-12-22 at 16:34 +1100, NeilBrown wrote:
> On Sat, Dec 05 2015, Verma, Vishal L wrote:
> 
> > On Fri, 2015-12-04 at 15:30 -0800, James Bottomley wrote:
> > [...]
> > > > +ssize_t badblocks_store(struct badblocks *bb, const char *page,
> > > > size_t len,
> > > > +			int unack)
> > > [...]
> > > > +int badblocks_init(struct badblocks *bb, int enable)
> > > > +{
> > > > +	bb->count = 0;
> > > > +	if (enable)
> > > > +		bb->shift = 0;
> > > > +	else
> > > > +		bb->shift = -1;
> > > > +	bb->page = kmalloc(PAGE_SIZE, GFP_KERNEL);
> > > 
> > > Why not __get_free_page(GFP_KERNEL)?  The problem with kmalloc of
> > > an
> > > exactly known page sized quantity is that the slab tracker for
> > > this
> > > requires two contiguous pages for each page because of the
> > > overhead.
> > 
> > Cool, I didn't know about __get_free_page - I can fix this up too.
> > 
> 
> > I was reminded of this just recently, so I thought I should clear up
> > the misunderstanding.
> 
> kmalloc(PAGE_SIZE) does *not* incur significant overhead and certainly
> does not require two contiguous free pages.
> If you "grep kmalloc-4096 /proc/slabinfo" you will note that both
> objperslab and pagesperslab are 1.  So one page is used to store each
> 4096 byte allocation.
> 
> To quote the email from Linus which reminded me about this
> 
> > If you
> > want to allocate a page, and get a pointer, just use "kmalloc()".
> > Boom, done!
> 
> https://lkml.org/lkml/2015/12/21/605
> 
> There probably is a small CPU overhead from using kmalloc, but no
> memory
> overhead.

Thanks Neil.
I just read the rest of that thread - and I'm wondering if we should
change back to kzalloc here.

The one thing __get_free_page gets us is PAGE_SIZE-aligned memory. Do
you think that would be better for this use? (I can't think of any). If
not, I can send out a new version reverting back to kzalloc.

	-Vishal


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/3] badblocks: Add core badblock management code
  2015-12-22 22:13         ` Verma, Vishal L
@ 2015-12-22 23:06           ` NeilBrown
  2015-12-23  0:38             ` Verma, Vishal L
  0 siblings, 1 reply; 23+ messages in thread
From: NeilBrown @ 2015-12-22 23:06 UTC (permalink / raw)
  To: Verma, Vishal L, James.Bottomley
  Cc: linux-raid, linux-scsi, linux-nvdimm, linux-block, jmoyer, axboe

[-- Attachment #1: Type: text/plain, Size: 2395 bytes --]

On Wed, Dec 23 2015, Verma, Vishal L wrote:

> On Tue, 2015-12-22 at 16:34 +1100, NeilBrown wrote:
>> On Sat, Dec 05 2015, Verma, Vishal L wrote:
>> 
>> > On Fri, 2015-12-04 at 15:30 -0800, James Bottomley wrote:
>> > [...]
>> > > > +ssize_t badblocks_store(struct badblocks *bb, const char *page,
>> > > > size_t len,
>> > > > +			int unack)
>> > > [...]
>> > > > +int badblocks_init(struct badblocks *bb, int enable)
>> > > > +{
>> > > > +	bb->count = 0;
>> > > > +	if (enable)
>> > > > +		bb->shift = 0;
>> > > > +	else
>> > > > +		bb->shift = -1;
>> > > > +	bb->page = kmalloc(PAGE_SIZE, GFP_KERNEL);
>> > > 
>> > > Why not __get_free_page(GFP_KERNEL)?  The problem with kmalloc of
>> > > an
>> > > exactly known page sized quantity is that the slab tracker for
>> > > this
>> > > requires two contiguous pages for each page because of the
>> > > overhead.
>> > 
>> > Cool, I didn't know about __get_free_page - I can fix this up too.
>> > 
>> 
>> I was reminded of this just recently, so I thought I should clear up
>> the misunderstanding.
>> 
>> kmalloc(PAGE_SIZE) does *not* incur significant overhead and certainly
>> does not require two contiguous free pages.
>> If you "grep kmalloc-4096 /proc/slabinfo" you will note that both
>> objperslab and pagesperslab are 1.  So one page is used to store each
>> 4096 byte allocation.
>> 
>> To quote the email from Linus which reminded me about this
>> 
>> > If you
>> > want to allocate a page, and get a pointer, just use "kmalloc()".
>> > Boom, done!
>> 
>> https://lkml.org/lkml/2015/12/21/605
>> 
>> There probably is a small CPU overhead from using kmalloc, but no
>> memory
>> overhead.
>
> Thanks Neil.
> I just read the rest of that thread - and I'm wondering if we should
> change back to kzalloc here.
>
> The one thing __get_free_page gets us is PAGE_SIZE-aligned memory. Do
> you think that would be better for this use? (I can't think of any). If
> not, I can send out a new version reverting back to kzalloc.

kzalloc(PAGE_SIZE) will also always return page-aligned memory.
kzalloc returns a void*, __get_free_page returns unsigned long.  For
that reason alone I would prefer kzalloc.

But I'm not necessarily suggesting you change the code.  I just wanted
to clarify a misunderstanding.  You should produce the code that you are
most happy with.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/3] badblocks: Add core badblock management code
  2015-12-22 23:06           ` NeilBrown
@ 2015-12-23  0:38             ` Verma, Vishal L
  0 siblings, 0 replies; 23+ messages in thread
From: Verma, Vishal L @ 2015-12-23  0:38 UTC (permalink / raw)
  To: James.Bottomley, neilb
  Cc: linux-raid, linux-scsi, linux-nvdimm, linux-block, jmoyer, axboe

[-- Attachment #1: Type: text/plain, Size: 2818 bytes --]

On Wed, 2015-12-23 at 10:06 +1100, NeilBrown wrote:
> On Wed, Dec 23 2015, Verma, Vishal L wrote:
> 
> > On Tue, 2015-12-22 at 16:34 +1100, NeilBrown wrote:
> > > On Sat, Dec 05 2015, Verma, Vishal L wrote:
> > > 
> > > > On Fri, 2015-12-04 at 15:30 -0800, James Bottomley wrote:
> > > > [...]
> > > > > > +ssize_t badblocks_store(struct badblocks *bb, const char
> > > > > > *page,
> > > > > > size_t len,
> > > > > > +			int unack)
> > > > > [...]
> > > > > > +int badblocks_init(struct badblocks *bb, int enable)
> > > > > > +{
> > > > > > +	bb->count = 0;
> > > > > > +	if (enable)
> > > > > > +		bb->shift = 0;
> > > > > > +	else
> > > > > > +		bb->shift = -1;
> > > > > > +	bb->page = kmalloc(PAGE_SIZE, GFP_KERNEL);
> > > > > 
> > > > > Why not __get_free_page(GFP_KERNEL)?  The problem with kmalloc
> > > > > of
> > > > > an
> > > > > exactly known page sized quantity is that the slab tracker for
> > > > > this
> > > > > requires two contiguous pages for each page because of the
> > > > > overhead.
> > > > 
> > > > Cool, I didn't know about __get_free_page - I can fix this up
> > > > too.
> > > > 
> > > 
> > > > I was reminded of this just recently, so I thought I should clear
> > > > up the misunderstanding.
> > > 
> > > kmalloc(PAGE_SIZE) does *not* incur significant overhead and
> > > certainly
> > > does not require two contiguous free pages.
> > > If you "grep kmalloc-4096 /proc/slabinfo" you will note that both
> > > objperslab and pagesperslab are 1.  So one page is used to store
> > > each
> > > 4096 byte allocation.
> > > 
> > > To quote the email from Linus which reminded me about this
> > > 
> > > > If you
> > > > want to allocate a page, and get a pointer, just use
> > > > "kmalloc()".
> > > > Boom, done!
> > > 
> > > https://lkml.org/lkml/2015/12/21/605
> > > 
> > > There probably is a small CPU overhead from using kmalloc, but no
> > > memory
> > > overhead.
> > 
> > Thanks Neil.
> > I just read the rest of that thread - and I'm wondering if we should
> > change back to kzalloc here.
> > 
> > The one thing __get_free_page gets us is PAGE_SIZE-aligned memory.
> > Do
> > you think that would be better for this use? (I can't think of any).
> > If
> > not, I can send out a new version reverting back to kzalloc.
> 
> kzalloc(PAGE_SIZE) will also always return page-aligned memory.
> kzalloc returns a void*, __get_free_page returns unsigned long.  For
> that reason alone I would prefer kzalloc.
> 
> But I'm not necessarily suggesting you change the code.  I just wanted
> to clarify a misunderstanding.  You should produce the
> code that you are
> most happy with.


I agree, the typecasting with __get_free_page is pretty ugly. I'll
change it back to kzalloc.

Thanks,
	-Vishal

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/3] Badblock tracking for gendisks
  2015-12-08  2:52 Vishal Verma
@ 2015-12-08  2:54 ` Verma, Vishal L
  0 siblings, 0 replies; 23+ messages in thread
From: Verma, Vishal L @ 2015-12-08  2:54 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-raid, linux-scsi, James.Bottomley, linux-block, neilb,
	axboe, jmoyer

Oops, sorry, should've been PATCH v3..
The contents are right, just the subject line is off.

	-Vishal


On Mon, 2015-12-07 at 19:52 -0700, Vishal Verma wrote:
> v3:
>   - Add kernel-doc style comments to all exported functions in
> badblocks.c (James)
>   - Make return values from badblocks functions consistent with
> themselves
>     and the kernel style. Change the polarity of badblocks_set, and
> update
>     all callers accordingly (James)
>   - In gendisk, don't unconditionally allocate badblocks, export the
> initializer.
>     This also allows the initializer to be a non-void return type, so
> that the
>     badblocks user can act upon failures better (James)
> 
> 
> v2:
>   - In badblocks_free, make 'page' NULL (patch 1)
>   - Move the core badblocks code to a new .c file (patch 1) (Jens)
>   - Fix a sizeof usage in disk_alloc_badblocks (patch 2) (Dan)
>   - Since disk_alloc_badblocks can fail, check disk->bb for NULL in
> the
>     genhd wrappers (patch 2) (Jeff)
>   - Update the md conversion to also use the badblocks init and free
>     functions (patch 3)
>   - Remove the BB_* macros from md.h as they are now in badblocks.h
> (patch 3)
> 
> Patch 1 copies badblock management code into files of its own,
> making it generally available. It follows common libraries of code
> such as linked lists, where anyone may embed a core data structure
> in another place, and use the provided accessor functions to
> manipulate the data.
> 
> Patch 2 adds badblock tracking to gendisks (in preparation for use
> by NVDIMM devices).
> 
> Patch 3 converts md over to use the new badblocks 'library'. I have
> done some pretty simple testing on this - created a raid 1 device,
> made sure the sysfs entries show up, and can be used to add and view
> badblocks. A closer look by the md folks would be nice here.
> 
> Vishal Verma (3):
>   badblocks: Add core badblock management code
>   block: Add badblock management for gendisks
>   md: convert to use the generic badblocks code
> 
>  block/Makefile            |   2 +-
>  block/badblocks.c         | 576
> ++++++++++++++++++++++++++++++++++++++++++++++
>  block/genhd.c             |  76 ++++++
>  drivers/md/md.c           | 516 ++-----------------------------------
> ----
>  drivers/md/md.h           |  40 +---
>  include/linux/badblocks.h |  53 +++++
>  include/linux/genhd.h     |   7 +
>  7 files changed, 741 insertions(+), 529 deletions(-)
>  create mode 100644 block/badblocks.c
>  create mode 100644 include/linux/badblocks.h
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v2 0/3] Badblock tracking for gendisks
@ 2015-12-08  2:52 Vishal Verma
  2015-12-08  2:54 ` Verma, Vishal L
  0 siblings, 1 reply; 23+ messages in thread
From: Vishal Verma @ 2015-12-08  2:52 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Vishal Verma, linux-block, linux-raid, linux-scsi, Jens Axboe,
	NeilBrown, Jeff Moyer, James Bottomley

v3:
  - Add kernel-doc style comments to all exported functions in badblocks.c (James)
  - Make return values from badblocks functions consistent with themselves
    and the kernel style. Change the polarity of badblocks_set, and update
    all callers accordingly (James)
  - In gendisk, don't unconditionally allocate badblocks, export the initializer.
    This also allows the initializer to be a non-void return type, so that the
    badblocks user can act upon failures better (James)


v2:
  - In badblocks_free, make 'page' NULL (patch 1)
  - Move the core badblocks code to a new .c file (patch 1) (Jens)
  - Fix a sizeof usage in disk_alloc_badblocks (patch 2) (Dan)
  - Since disk_alloc_badblocks can fail, check disk->bb for NULL in the
    genhd wrappers (patch 2) (Jeff)
  - Update the md conversion to also use the badblocks init and free
    functions (patch 3)
  - Remove the BB_* macros from md.h as they are now in badblocks.h (patch 3)

Patch 1 copies badblock management code into files of its own,
making it generally available. It follows common libraries of code
such as linked lists, where anyone may embed a core data structure
in another place, and use the provided accessor functions to
manipulate the data.
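
As a rough illustration of that embedding model, here is a minimal
userspace sketch. It is not the code from this series: bb_range,
bb_table, demo_disk and the bb_table_* helpers are hypothetical names,
the fixed-capacity array is a simplification, and the return values
(0 / -1, and 1 for "overlaps") are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in, NOT the kernel's struct badblocks. */
struct bb_range {
	unsigned long long start;	/* first bad sector */
	unsigned long long len;		/* number of bad sectors */
};

struct bb_table {
	struct bb_range *ranges;
	size_t count;
	size_t capacity;
};

/* Any owner can embed the table by value, much like a struct list_head. */
struct demo_disk {
	const char *name;
	struct bb_table bb;
};

/* Returns 0 on success, -1 on zero capacity or allocation failure. */
int bb_table_init(struct bb_table *bb, size_t capacity)
{
	assert(bb != NULL);
	bb->ranges = NULL;
	bb->count = 0;
	bb->capacity = 0;
	if (capacity == 0)
		return -1;
	bb->ranges = calloc(capacity, sizeof(*bb->ranges));
	if (!bb->ranges)
		return -1;
	bb->capacity = capacity;
	return 0;
}

/* Records [start, start + len); -1 if len is 0 or the table is full. */
int bb_table_add(struct bb_table *bb, unsigned long long start,
		 unsigned long long len)
{
	assert(bb != NULL);
	if (len == 0 || bb->count == bb->capacity)
		return -1;
	bb->ranges[bb->count].start = start;
	bb->ranges[bb->count].len = len;
	bb->count++;
	return 0;
}

/* Returns 1 if [start, start + len) overlaps a recorded range, else 0. */
int bb_table_check(const struct bb_table *bb, unsigned long long start,
		   unsigned long long len)
{
	size_t i;

	assert(bb != NULL);
	for (i = 0; i < bb->count; i++) {
		const struct bb_range *r = &bb->ranges[i];

		if (start < r->start + r->len && r->start < start + len)
			return 1;
	}
	return 0;
}

/* Frees the storage and resets the pointer so reuse is detectable. */
void bb_table_free(struct bb_table *bb)
{
	assert(bb != NULL);
	free(bb->ranges);
	bb->ranges = NULL;
	bb->count = 0;
	bb->capacity = 0;
}
```

The owner only ever passes &disk.bb to the helpers, so the same
functions serve any structure that embeds a bb_table.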

Patch 2 adds badblock tracking to gendisks (in preparation for use
by NVDIMM devices).

Patch 3 converts md over to use the new badblocks 'library'. I have
done some pretty simple testing on this - created a raid 1 device,
made sure the sysfs entries show up, and can be used to add and view
badblocks. A closer look by the md folks would be nice here.

Vishal Verma (3):
  badblocks: Add core badblock management code
  block: Add badblock management for gendisks
  md: convert to use the generic badblocks code

 block/Makefile            |   2 +-
 block/badblocks.c         | 576 ++++++++++++++++++++++++++++++++++++++++++++++
 block/genhd.c             |  76 ++++++
 drivers/md/md.c           | 516 ++---------------------------------------
 drivers/md/md.h           |  40 +---
 include/linux/badblocks.h |  53 +++++
 include/linux/genhd.h     |   7 +
 7 files changed, 741 insertions(+), 529 deletions(-)
 create mode 100644 block/badblocks.c
 create mode 100644 include/linux/badblocks.h

-- 
2.5.0



Thread overview: 23+ messages
2015-11-25 18:43 [PATCH v2 0/3] Badblock tracking for gendisks Vishal Verma
2015-11-25 18:43 ` [PATCH v2 1/3] badblocks: Add core badblock management code Vishal Verma
2015-12-04 23:30   ` James Bottomley
2015-12-04 23:58     ` Verma, Vishal L
2015-12-05  0:06       ` James Bottomley
2015-12-05  0:11         ` Verma, Vishal L
2015-12-08 21:03       ` NeilBrown
2015-12-08 21:08         ` Verma, Vishal L
2015-12-08 21:18           ` Dan Williams
2015-12-08 23:47             ` Verma, Vishal L
2015-12-22  5:34       ` NeilBrown
2015-12-22 22:13         ` Verma, Vishal L
2015-12-22 23:06           ` NeilBrown
2015-12-23  0:38             ` Verma, Vishal L
2015-11-25 18:43 ` [PATCH v2 2/3] block: Add badblock management for gendisks Vishal Verma
2015-12-04 23:33   ` James Bottomley
2015-12-05  0:17     ` Verma, Vishal L
2015-11-25 18:43 ` [PATCH v2 3/3] md: convert to use the generic badblocks code Vishal Verma
2015-12-01 18:55   ` Shaohua Li
2015-12-01 19:52     ` Verma, Vishal L
2015-12-04 22:53 ` [PATCH v2 0/3] Badblock tracking for gendisks Verma, Vishal L
2015-12-08  2:52 Vishal Verma
2015-12-08  2:54 ` Verma, Vishal L
